Whenever I talked to academics, they always asked for the same thing. “Do you have a breakdown of characters per season?” Up until this summer, I said no. As of today, that answer is ‘yes.’
Besides the obvious concern of the monumental amount of manual work involved in entering and then maintaining a dataset like this, my concern was how to store the data in an easily coded way. The issues were:
- Characters on multiple shows (aka The Sara Lance Problem)
- Characters with multiple actors (aka The Soap Problem)
- Characters with multiple roles per show (… Sara Lance again)
- Actors with multiple characters per show (The Clone Club Problem)
- Shows that don’t air in ‘traditional’ seasons (The UK Does What It Wants Problem)
Taking all that in, I decided the best way to approach the situation was to track the characters per show. After all, our site is about tracking the characters, which means we don’t need to concern ourselves with actor data quite as much (that’s what IMDb is for, anyway).
In addition, since we have a number of shows that don’t follow the US ‘traditional’ season structure, I opted to track characters per year instead of trying to make things fit per ‘season.’
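To make the per-show, per-year idea concrete, here is a minimal sketch of how such records could be shaped. This is not the site's actual storage format: the class names, the `role` labels, and the sample data are all hypothetical, chosen only to show how a character can hold several roles on one show and appear on more than one show.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Appearance:
    """One character's run in one role on one show (hypothetical structure)."""
    show: str
    role: str  # assumed labels, e.g. "regular", "recurring", "guest"
    years: Set[int] = field(default_factory=set)


@dataclass
class Character:
    name: str
    appearances: List[Appearance] = field(default_factory=list)

    def years_on(self, show: str) -> List[int]:
        """Sorted years the character appeared on a show, across every role."""
        years: Set[int] = set()
        for appearance in self.appearances:
            if appearance.show == show:
                years |= appearance.years
        return sorted(years)


# Placeholder data for illustration only, not real records.
example = Character(
    name="Example Character",
    appearances=[
        Appearance("Show A", "guest", {2012, 2013}),
        Appearance("Show A", "regular", {2014}),
        Appearance("Show B", "regular", {2016, 2017, 2018}),
    ],
)

print(example.years_on("Show A"))
```

Keying everything on the calendar year means a show with irregular series or specials needs no special handling: it simply contributes whichever years it aired.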
Data Collection Woes
The next problem to tackle was how to get the data. You’d think I could just scrape IMDb for everything, but that site is oriented towards actors. When you look up actors on a show, it lists all the characters the actor played on the show, up to a certain point. This caused a problem when I got to animated series, and people voiced multiple characters (like on The Simpsons).
While I was able to collect most of the data from IMDb, web series and animation required me to dig a little deeper. The same was true of rarer and older series that aren’t fully documented on IMDb.
However, since IMDb did have most of what I needed, I found it easiest to open up the full cast page for the show and add all the characters per show.
As of today, all characters have a list on their page to show you what years they appeared for each show.
The show list is clearly expanded now, to keep the data readable and understandable when a character is on multiple shows. It was also needed because some characters have been on TV shows forever.
This data let us generate more useful ‘per year’ information. The This Year feature has been totally revamped and updated for the modern web with a welcoming intro page and the ability to deep dive.
If you look at Characters on Air for 2018, you’ll see all the characters by name but also broken down by show. If you look at shows, you’ll see them broken out by name, by country, or by format.
Representation Per Year
Having that data meant I could easily generate a list of the last 20 years of TV shows and how many characters we’ve had on air.
This data is from September 1, 2019, which means it predates the start of the US television season. That means the odd-looking low of only around 500 characters on air will change as more new shows start and more new characters are introduced.
As I mentioned in the stagnation of representation back in June, numbers tend to jump a little. However, doing all of this data entry meant I cleaned up a lot of incorrect data in general. We went from only 26 new shows in 2019 to 45, but down from 320 shows on air to 312. The only correct extrapolation thus far is the increase in dead characters and cancelled shows.
Now that we have the data, it’s time to study it and understand what it actually means. We can now break down how many regular TV characters there are versus how many guests. We can also look at a show per year and see how inclusive it is or isn’t.
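As a rough illustration of that kind of breakdown, the sketch below counts distinct characters per year and role. It assumes the data can be flattened into `(character, role, year)` tuples. That tuple shape, the function name, and the sample rows are all hypothetical, not how the site actually stores or queries things.

```python
from collections import Counter


def role_counts_by_year(records):
    """Count distinct characters for each (year, role) pair.

    `records` is an iterable of (character, role, year) tuples. A character
    listed twice for the same year and role (say, on two shows) counts once.
    """
    unique = {(year, role, character) for character, role, year in records}
    return dict(Counter((year, role) for year, role, _ in unique))


# Placeholder rows for illustration only, not real records.
sample = [
    ("Character A", "regular", 2018),
    ("Character A", "regular", 2018),  # same character on a second show
    ("Character B", "guest", 2018),
    ("Character C", "guest", 2018),
    ("Character A", "regular", 2019),
]

counts = role_counts_by_year(sample)
print(counts[(2018, "guest")])
```

Deduplicating before counting matters here, because a character who crosses over between shows in the same year would otherwise inflate the total.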
If there’s data you’re looking for, give us a shout and we’ll see what we can do.