By Jeff Hughes (July 28, 2017)

I recently came across the MovieLens database, which has 20+ million ratings of about 27,000 movies. Because I thought that sounded pretty cool, I decided to check out MovieLens itself, and it turns out that’s pretty cool too! It uses a collaborative filtering (machine learning) method to recommend movies to you based on movies you’ve already watched and rated.1 The nice thing is that it shows you the predicted rating for each movie—what the algorithm thinks you would rate the movie. So while going through the initial process of rating movies you’ve seen, you can see it get more and more accurate. I’ve now rated close to 400 movies and it probably knows me better than my closest friends. It actually told me where I had misplaced my keys last week. I think it might actually use incantations of black magic somewhere in the algorithm.

Anyway, the point is that they provide their database online for free, which is great! Kudos to the GroupLens research group for that. So I decided to play around with it and see what interesting information I could pull from it. You can see the code for all of these analyses by clicking on the buttons to the right, or go to the top of the page on the right and select “Show All Code”. Note that I’ve converted the CSV files to R-native RDS files, just to make it a bit quicker to load the data, but the raw data is available here. The code for all these analyses is available on Github. First let’s load some packages we’ll need, and the files for the list of movies, and list of ratings.

library(knitr)
library(psych)
library(stringr)
library(ggplot2)
library(tidyr)
library(dplyr)

movies <- readRDS('movies.rds')
ratings <- readRDS('ratings.rds')

Ratings by Year

One thing I wanted to know is whether there was a pattern for movie ratings across the year of release. Perhaps there was a “golden age” of cinema, so we should see a downward trend. Or perhaps there’s a recency bias, such that people come out of the movie theatre, say “that was awesome!” and rate the movie as 5 stars. In the latter case, we’d expect an upward trend, because people might be relying on their hazy memories of older movies, unless they’ve rewatched them recently.

I start by taking all 20 million ratings, and aggregating them by movie to calculate the mean rating for each (as well as the standard deviation). Below you can see an overall average. (The scale goes from a minimum of 0.5 stars to 5.0 stars.) It’s a little tricky because this is an average of averages, but we can see that movies in this database were on average rated 3.13, with a standard deviation of about .66. And the median number of ratings for a movie was 18, though the range on that is quite impressive.

ratings_summ <- ratings %>%
    group_by(movieId) %>%
    summarise(
        avg_rating=mean(rating, na.rm=TRUE),
        sd=sd(rating, na.rm=TRUE),
        n=n())

# pull titles and genres back in to the data set
ratings_summ <- inner_join(movies, ratings_summ, by='movieId')

ratings_desc <- ratings_summ %>%
    select(avg_rating, sd, n) %>%
    describe() %>%
    select(mean, sd, median, min, max)

rownames(ratings_desc) <- c('Avg. Rating', 'SD', 'Num. Ratings')

kable(
    ratings_desc,
    row.names=TRUE,
    col.names=c('Mean', 'SD', 'Median', 'Min.', 'Max.'),
    align='c',
    digits=2)
Mean SD Median Min. Max.
Avg. Rating 3.13 0.66 3.24 0.5 5.00
SD 0.92 0.32 0.94 0.0 3.18
Num. Ratings 747.84 3085.82 18.00 1.0 67310.00

The database puts the year of release at the end of the title of each movie, so we first need to extract that with a little regex magic. We then aggregate across movies by year, calculating the average rating for all movies in that year, plus the 95% confidence intervals. In the graph below, though, you can barely see the confidence intervals (the gray ribbon behind the line), as it’s quite small and only really visible for the older movies. For this graph, I removed movies before 1915, as there was very little data for those older movies.

ratings_summ$year <- as.numeric(str_match(ratings_summ$title, '\\(([0-9]+)\\)$')[, 2])

by_year <- ratings_summ %>%
    group_by(year) %>%
    summarise(
        rating=mean(avg_rating, na.rm=TRUE),
        se=sd(avg_rating, na.rm=TRUE) / sqrt(sum(n, na.rm=TRUE)),
        n=sum(n, na.rm=TRUE),
        lo=rating - 1.96*se,
        hi=rating + 1.96*se)

# not much data for most years before 1915
by_year <- by_year %>%
    filter(!is.na(year), year >= 1915)

ggplot(by_year, aes(x=year, y=rating)) +
    geom_ribbon(aes(ymin=lo, ymax=hi), alpha=.2) +
    geom_line() +
    labs(x='Year', y='Avg. Rating (0.5 - 5.0)') +
    theme_minimal()

Visually, there doesn’t seem to be a very strong trend, but if we fit a linear model to it, there is actually a slight downward trend across years. This is basically as simple a model as you can get, but it does explain about 30% of the variance, so that’s…interesting. This could suggest that perhaps there’s a nostalgia factor, with people remembering movies as being better the further back in time they were made; or it could mean movies really were better back then; or it could be a selection bias, that older movies that get added to the database tend to be the better ones that people actually remember. It’s hard to really say which it could be. At any rate, it looks like maybe the late 80s to early 90s show the low point of a slump, but in recent years the ratings have been increasing again. I’m not sure what to make of the most recent year of data (2015), but there are fewer movies in the database, so I suspect the data dump happened partway through the year, so maybe the movies early in the year were especially bad.2

summary(lm(rating ~ year, by_year))
## 
## Call:
## lm(formula = rating ~ year, data = by_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28499 -0.04838  0.00564  0.05831  0.32783 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.6711732  0.6783529  11.309  < 2e-16 ***
## year        -0.0022870  0.0003452  -6.626 1.82e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1011 on 99 degrees of freedom
## Multiple R-squared:  0.3072, Adjusted R-squared:  0.3002 
## F-statistic:  43.9 on 1 and 99 DF,  p-value: 1.82e-09

Top Rated Movies

Let’s take a look at what our trusty movie raters think about the best and worst movies, shall we? Here are the top 20:

by_movie <- ratings %>%
    group_by(movieId) %>%
    summarise(
        avg_rating=mean(rating, na.rm=TRUE),
        num_ratings=n(),
        rating_sd=sd(rating, na.rm=TRUE),
        rating_se=rating_sd / sqrt(n())) %>%
    inner_join(movies, by='movieId') %>%
    arrange(desc(avg_rating))

by_movie %>%
    filter(num_ratings > 100) %>%
    select(title, avg_rating, rating_sd, num_ratings) %>%
    head(n=20) %>%
    kable(
        row.names=FALSE,
        col.names=c('Title', 'Avg. Rating', 'SD', 'Num. Ratings'),
        align='lccc',
        digits=c(NA, 2, 2, 0))
Title Avg. Rating SD Num. Ratings
Shawshank Redemption, The (1994) 4.45 0.72 63366
Godfather, The (1972) 4.36 0.84 41355
Usual Suspects, The (1995) 4.33 0.76 47006
Schindler’s List (1993) 4.31 0.83 50054
Godfather: Part II, The (1974) 4.28 0.86 27398
Seven Samurai (Shichinin no samurai) (1954) 4.27 0.84 11611
Rear Window (1954) 4.27 0.75 17449
Band of Brothers (2001) 4.26 0.91 4305
Casablanca (1942) 4.26 0.86 24349
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.26 0.83 6525
One Flew Over the Cuckoo’s Nest (1975) 4.25 0.77 29932
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) 4.25 0.86 23220
Third Man, The (1949) 4.25 0.82 6565
City of God (Cidade de Deus) (2002) 4.24 0.76 12937
Lives of Others, The (Das leben der Anderen) (2006) 4.23 0.73 5720
North by Northwest (1959) 4.23 0.75 15627
Paths of Glory (1957) 4.23 0.82 3568
Fight Club (1999) 4.23 0.87 40106
Double Indemnity (1944) 4.22 0.81 4909
12 Angry Men (1957) 4.22 0.78 12934

This looks about right to me. There are a few on here I haven’t seen, but of the ones I have, these seem pretty much in line with what I would expect. And if they’re in line with my own personal preferences in movies, they must be correct, right? Right? Okay, so of course this is subjective. But the bottom line is…if you haven’t seen any of these movies, go watch them now. Well, finish reading this, and then go watch them.

But let’s see what the worst movies of all time are.

bottom20 <- by_movie %>%
    filter(num_ratings > 100) %>%
    select(title, avg_rating, rating_sd, num_ratings) %>%
    tail(n=20)

# fix a couple non-ASCII characters
bottom20[9, 'title'] <- 'Pokemon 4 Ever (a.k.a. Pokemon 4: The Movie) (2002)'
bottom20[14, 'title'] <- 'Pokemon Heroes (2003)'

bottom20 %>%
    kable(
        row.names=FALSE,
        col.names=c('Title', 'Avg. Rating', 'SD', 'Num. Ratings'),
        align='lccc',
        digits=c(NA, 2, 2, 0))
Title Avg. Rating SD Num. Ratings
Carnosaur 2 (1995) 1.45 1.03 167
Prom Night IV: Deliver Us From Evil (1992) 1.43 1.03 152
Ernest Goes to Africa (1997) 1.40 1.16 223
Turbo: A Power Rangers Movie (1997) 1.39 1.07 715
Faces of Death 4 (1990) 1.38 1.23 179
House of the Dead, The (2003) 1.36 1.01 447
Faces of Death: Fact or Fiction? (1999) 1.35 1.22 124
Faces of Death 5 (1996) 1.34 1.22 157
Pokemon 4 Ever (a.k.a. Pokemon 4: The Movie) (2002) 1.32 1.21 472
Faces of Death 6 (1996) 1.26 1.18 174
Son of the Mask (2005) 1.25 0.95 467
Disaster Movie (2008) 1.25 1.06 397
Carnosaur 3: Primal Species (1996) 1.24 0.97 139
Pokemon Heroes (2003) 1.17 1.16 325
Gigli (2003) 1.16 0.91 701
Barney’s Great Adventure (1998) 1.16 0.89 419
Glitter (2001) 1.12 0.83 685
Bratz: The Movie (2007) 1.11 1.10 180
From Justin to Kelly (2003) 0.97 0.82 426
SuperBabies: Baby Geniuses 2 (2004) 0.84 0.78 209

Personally, the thing I love most about this list is just how many sequels there are. Carnosaur 2 was apparently one of the worst movies of all time, and then they decided to make Carnosaur 3. Faces of Death has four separate movies in the bottom 20 movies of all time. But come on, people—How is Barney’s Great Adventure this low? That movie was a masterpiece.

Ratings by User Tag

Another fun thing about this database is that users can add “tags” to a movie to identify them with different themes or put them in different groups. These tags are completely open-ended, so there are lots of great tags like “wasted potential” and “Ethan Hunt Should Stop Hogging The Screen!” and “the director should be shot”. Those might not be particularly useful for identifying trends in movies, since they are often only applied to one movie. But there are lots of other great user tags. So what tags might you look for to find the best movies? Or the worst?

tags <- readRDS('tags.rds')
tags <- inner_join(tags, by_movie, by='movieId')
tags$tag <- tolower(tags$tag)

all_tags <- tags %>%
    select(movieId, tag, avg_rating, num_ratings) %>%
    distinct() %>%  # only want one tag per movie
    group_by(tag) %>%
    summarise(
        avg_rating=mean(avg_rating),
        num_ratings=sum(num_ratings),
        num_movies=n())

all_tags %>%
    filter(num_ratings > 100, num_movies > 5) %>%
    arrange(desc(avg_rating)) %>%
    head(n=30) %>%
    kable(
        row.names=FALSE,
        col.names=c('Tag', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
        align='lccc',
        digits=c(NA, 2, 0, 0))
Tag Avg. Rating Num. Ratings Num. Movies
afi100 4.13 216653 10
palestine 4.11 5769 9
wildly overrated 4.10 175705 6
pelicula anime 4.10 38706 6
top 10 ever 4.09 233910 7
top 250 imdb 4.08 451455 51
super interesting 4.08 284879 8
great cast excellent 4.07 102723 7
afi 10 (courtroom drama) 4.06 47089 6
zibri studio 4.05 46155 9
afi 4.04 84615 9
emerson must see 4.03 369838 39
claude rains 4.02 47981 6
estrangeinro 4.02 68370 6
wikilens top pick 4.02 8265 6
best picture 4.02 597810 35
spielberg/lucas 4.02 365626 8
great screenplays 4.01 141381 14
imdb top 250 4.01 4080803 322
100 greatest movies 4.01 552716 52
bfi classic 4.00 192813 36
tag 4.00 80829 6
kurosawa 4.00 38897 14
wallace and gromit 4.00 47978 6
criticker top pick 4.00 201480 14
afi 100 3.99 1439167 103
miyazaki 3.99 51638 10
oscar nominee: cinematography 3.99 143218 11
usa film registry 3.99 263060 26
fantastic foreign films 3.99 25420 9

Some of these are fairly straight-forward. The tags having to do with “AFI” refer to the American Film Institute, which puts out a ranking of top 100 movies; and tags related to IMDB top 250 refer to another common list of best movies. But we can also see that anime films, films with Claude Rains, and films by directors like Spielberg, Lucas, Kurosawa, and Miyazaki, all get highly rated. People also seem to like movies about Palestine, Wallace and Gromit, and films nominated for Oscars in cinematography. I do love that the “wildly overrated” tag shows up on highly-rated films, too. Apparently somebody doesn’t agree very much with those ratings. In case you’re curious, here are the overrated films. (I happen to disagree with whoever tagged them as overrated. I think they’re rated just fine the way they are, thank you.)

overrated <- tags %>%
    filter(tag == 'wildly overrated') %>%
    select(title)

overrated[4, 'title'] <- 'Amelie (Fabuleux destin d\'Amelie Poulain, Le) (2001)'

overrated %>%
    kable()

title

Shawshank Redemption, The (1994)
American Beauty (1999)
Magnolia (1999)
Amelie (Fabuleux destin d’Amelie Poulain, Le) (2001) Eternal Sunshine of the Spotless Mind (2004)
Crash (2004)

And just for good measure, here are the tags people gave to the lowest-rated films:

all_tags %>%
    filter(num_ratings > 100, num_movies > 5) %>%
    arrange(desc(avg_rating)) %>%
    tail(n=30) %>%
    kable(
        row.names=FALSE,
        col.names=c('Tag', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
        align='lccc',
        digits=c(NA, 2, 0, 0))
Tag Avg. Rating Num. Ratings Num. Movies
eco-horror 2.53 1318 6
steve guttenberg 2.52 29932 9
adapted from:game 2.52 69929 17
based on a tv show? 2.52 26213 9
based on a cartoon 2.48 14234 9
so bad it’s almost good 2.47 12920 17
golden raspberry (worst actor) 2.47 99215 32
sharks 2.46 8389 7
nostalgia critic 2.45 314931 72
albert pyun 2.42 1252 9
animation remade as live action 2.41 53652 18
flashback overload 2.39 5908 6
subgenre:slasher 2.38 45907 28
shark 2.37 40456 11
chuck norris 2.34 4429 21
franchise beaten to death 2.34 44247 6
bollocks 2.34 3751 11
worst movie ever! 2.31 14327 8
golden raspberry (worst picture) 2.29 57463 25
menahem golan 2.28 13736 27
steven seagal 2.24 33572 25
character:jason vorhees 2.21 13343 10
golden raspberry (worst actress) 2.18 58314 27
full moon entertainment 2.18 2356 20
mst3k 2.16 7228 16
jason vorhees 2.15 9156 7
pokemon 1.96 3991 9
uwe boll 1.86 1716 6
no rec? 1.85 41829 18
imdb bottom 100 1.82 24040 59

Looks like people don’t like Steve Guttenberg, Albert Pyun, Chuck Norris, Menahem Golan, Steven Seagal, and of course, Uwe Boll. People also don’t seem to like film adaptations of games, TV shows, cartoons, and animations remade as live action. Also, apparently the Jason films are really disliked, at least by people who can’t spell the name of the main character (Jason Voorhees).

Ratings by Cast and Crew

One thing that I really wanted to be able to do is take a look at how movies by particular directors, actors, etc. are rated. While this information isn’t in the MovieLens database, it does provide the ID numbers to link movies with IMDB and TMDB, the latter of which has an open API. From there, I was able to pull the full cast and crew credits for almost all the movies in the MovieLens database!

links <- readRDS('links.rds')

# NOTE: This isn't being run because it takes a long time. But if you're interested in doing the same,
# you can run the same function, just give it your API.

# get_cast_crew <- function(ids, api_key) {
#     counter <- 0
#     cast <- data.frame(movie_id=numeric(0), person_id=numeric(0), name=character(0),
#         character=character(0), gender=numeric(0), order=numeric(0), stringsAsFactors=FALSE)
#     crew <- data.frame(movie_id=numeric(0), person_id=numeric(0), name=character(0),
#         department=character(0), job=character(0), gender=numeric(0), stringsAsFactors=FALSE)
#     
#     for (i in 1:length(ids)) {
#         if (!is.na(links[i, 'tmdbId'])) {
#             url <- paste0('https://api.themoviedb.org/3/movie/', ids[i], '/credits?api_key=', api_key, '&language=en-US')
#             rqst <- GET(url)
#             
#             if (rqst$status_code == 200) {
#                 data <- content(rqst)
#                 
#                 cs_cnt <- length(data$cast)
#                 if (cs_cnt > 0) {
#                     cast_tmp <- data.frame(movie_id=numeric(cs_cnt), person_id=numeric(cs_cnt),
#                         name=character(cs_cnt), character=character(cs_cnt),
#                         gender=numeric(cs_cnt), order=numeric(cs_cnt), stringsAsFactors=FALSE)
#                 
#                     for (cs in 1:cs_cnt) {
#                         cast_tmp[cs, 'movie_id'] <- ids[i]
#                         cast_tmp[cs, 'person_id'] <- data$cast[[cs]]$id
#                         cast_tmp[cs, 'name'] <- data$cast[[cs]]$name
#                         cast_tmp[cs, 'character'] <- data$cast[[cs]]$character
#                         cast_tmp[cs, 'gender'] <- data$cast[[cs]]$gender
#                         cast_tmp[cs, 'order'] <- data$cast[[cs]]$order
#                     }
#                     cast <- rbind(cast, cast_tmp)
#                 }
#                 
#                 cr_cnt <- length(data$crew)
#                 if (cr_cnt > 0) {
#                     crew_tmp <- data.frame(movie_id=numeric(cr_cnt), person_id=numeric(cr_cnt),
#                         name=character(cr_cnt), department=character(cr_cnt), job=character(cr_cnt),
#                         gender=numeric(cr_cnt), stringsAsFactors=FALSE)
#                     
#                     for (cr in 1:cr_cnt) {
#                         crew_tmp[cr, 'movie_id'] <- ids[i]
#                         crew_tmp[cr, 'person_id'] <- data$crew[[cr]]$id
#                         crew_tmp[cr, 'name'] <- data$crew[[cr]]$name
#                         crew_tmp[cr, 'department'] <- data$crew[[cr]]$department
#                         crew_tmp[cr, 'job'] <- data$crew[[cr]]$job
#                         crew_tmp[cr, 'gender'] <- data$crew[[cr]]$gender
#                     }
#                     crew <- rbind(crew, crew_tmp)
#                 }
#             }
#             
#             # limit rate
#             if ('x-ratelimit-remaining' %in% names(rqst$headers) &&
#                     as.numeric(rqst$headers$`x-ratelimit-remaining`) == 1) {
#                 
#                 reset_time <- as.POSIXct(as.numeric(rqst$headers$`x-ratelimit-reset`), origin='1970-01-01')
#                 sleep_time <- difftime(reset_time, Sys.time())
#                 if (sleep_time > 0) {
#                     Sys.sleep(sleep_time + 1)
#                 }
#             }
#         }
#     }
#     
#     return(list(cast=cast, crew=crew))
# }
# 
# output <- get_cast_crew(links[1:nrow(links), 'tmdbId'], '<<YOUR API HERE>>')
# saveRDS(output$cast, file='cast.rds')
# saveRDS(output$crew, file='crew.rds')

cast <- readRDS('cast.rds')
crew <- readRDS('crew.rds')

cast <- cast %>%
    inner_join(select(links, tmdbId, movieId), by=c('movie_id'='tmdbId')) %>%
    rename(tmdbId=movie_id)
crew <- crew %>%
    inner_join(select(links, tmdbId, movieId), by=c('movie_id'='tmdbId')) %>%
    rename(tmdbId=movie_id)

cast <- cast %>%
    inner_join(movies, by='movieId')
crew <- crew %>%
    inner_join(movies, by='movieId')

cast <- cast %>%
    inner_join(select(by_movie, -title, -genres), by='movieId')
crew <- crew %>%
    inner_join(select(by_movie, -title, -genres), by='movieId')

Of course, once I realized that I could get the credits for all these movies, I immediately wanted to compare movie ratings by who was the Best Boy for the movie, so we could determine once and for all who the real “Best” Boy is. Unfortunately, the credits only cover the more major roles.3 So I’m a little more limited in the questions I can answer, but there’s still a boatload of information for all these movies. So let’s pick some fun ones!

Best Directors

Let’s start with movie directors. Who are the best directors of all time? We can, of course, easily answer this definitively and objectively, with this data based on completely subjective ratings.

by_director <- crew %>%
    filter(job == 'Director') %>%
    group_by(person_id) %>%
    summarise(
        name=nth(name, 1),
        avg_rating=mean(avg_rating, na.rm=TRUE),
        num_movies=n(),
        num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
    filter(num_ratings > 100, num_movies > 5) %>%
    arrange(desc(avg_rating))

head(by_director, n=20) %>%
    select(name, avg_rating, num_ratings, num_movies) %>%
    kable(
        row.names=FALSE,
        col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
        align='lccc',
        digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
Hayao Miyazaki 4.00 47763 12
Christopher Nolan 3.99 108895 9
Sergio Leone 3.95 34759 6
Shannon Hartman 3.95 726 7
Nick Park 3.94 58864 7
Krzysztof Kieslowski 3.89 21762 9
Masaki Kobayashi 3.87 1013 14
Quentin Tarantino 3.86 168465 9
Frederick Wiseman 3.85 106 15
Satyajit Ray 3.83 1945 16
Wes Anderson 3.81 46341 9
Louis C.K. 3.80 3244 7
Paolo Sorrentino 3.77 453 6
Akira Kurosawa 3.76 33408 30
Park Chan-wook 3.76 9796 7
Preston Sturges 3.75 4125 9
Kaneto Shindô 3.75 244 6
David Fincher 3.74 136303 10
Joel Coen 3.74 154372 18
Alfonso Cuarón 3.73 37711 8

Some of these completely go without saying. Miyazaki, Nolan, Tarantino, Kurosawa, Fincher, Coen—even if you might quibble over which of these belong in which order, they are certainly worthy of being on a Top 20 list. However, I was surprised by Louis C.K. being on there—I guess people really love his comedy. Keep in mind that his directorial credits are primarily for his comedy specials, which probably get rated on a somewhat different scale in relation to more traditional films. That said, his TV show Louie (though not in this database) is pretty great! Shannon Hartman is also a director of many comedy specials; Nick Park is director on the Wallace and Gromit films (which as noted above, people love). It’s important to keep in mind that although great directors make great films, the greatness of a film could be due to something other than the director. These are ratings of the movies, not ratings of the directors. As such, this list is really a list of directors who make movies that people (on average) really like. So yes, that includes the work of Studio Ghibli (Miyazaki), but it also includes the comedy stylings of Louis C.K.

Best Lead Actors

The casting data from TMDB also includes the billing order, so we can take a look at which lead actors are in the top-rated movies:

by_lead <- cast %>%
    filter(num_ratings > 100, order == 0) %>%
    group_by(person_id) %>%
    summarise(
        name=nth(name, 1),
        avg_rating=mean(avg_rating, na.rm=TRUE),
        num_movies=n(),
        num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
    filter(num_movies > 5) %>%
    arrange(desc(avg_rating))

head(by_lead, n=20) %>%
    select(name, avg_rating, num_ratings, num_movies) %>%
    kable(
        row.names=FALSE,
        col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
        align='lccc',
        digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
Toshiro Mifune 3.98 24296 14
Charlie Chaplin 3.91 14510 10
Groucho Marx 3.90 10622 8
Humphrey Bogart 3.87 69578 18
Henry Fonda 3.85 26629 12
William Powell 3.84 7116 9
Buster Keaton 3.83 4106 10
Cary Grant 3.82 55568 24
James Stewart 3.82 65062 19
Gong Li 3.80 5750 6
James Cagney 3.78 2950 7
Takeshi Kitano 3.77 9139 7
Spencer Tracy 3.77 15176 12
Katharine Hepburn 3.77 10599 8
Tony Leung Chiu-Wai 3.76 4548 6
Gary Cooper 3.75 8416 10
Peter O’Toole 3.74 18436 8
Gael García Bernal 3.74 18397 7
Ryan Gosling 3.73 9068 7
Fred Astaire 3.72 6668 11

There’s a definite skew there toward actors in older films, and perhaps that’s because I limited it to those who were the leads in more than five movies (otherwise it tends to skew toward those who had one good movie). But hey, Ryan Gosling is representing the modern era here! My data here suggests that if he were to put on some glasses and a thick mustache, like Groucho Marx, he might improve his career even further.

And here we have the leads in the worst-rated movies:

tail(by_lead, n=20) %>%
    select(name, avg_rating, num_ratings, num_movies) %>%
    kable(
        row.names=FALSE,
        col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
        align='lccc',
        digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
Burt Reynolds 2.73 9170 12
John Candy 2.73 9462 10
Christopher Lambert 2.72 29348 9
Lindsay Lohan 2.71 8990 6
Jennifer Lopez 2.71 23731 9
Macaulay Culkin 2.68 43283 6
Jim Belushi 2.68 2765 7
Dudley Moore 2.67 7627 8
Charles Bronson 2.64 3070 9
Steven Seagal 2.62 19756 10
Jamie Lee Curtis 2.61 9212 7
Kim Basinger 2.60 3181 6
Jean-Claude Van Damme 2.59 27852 18
Martin Lawrence 2.55 10440 12
Freddie Prinze Jr. 2.50 12068 7
Rob Schneider 2.41 6869 6
Chuck Norris 2.41 2795 9
Robert Englund 2.40 10505 6
Madonna 2.35 14504 8
Jim Varney 1.98 7776 7

This is…about what I would expect. Rob Schneider nearing the bottom? Yep. I’m just sad for poor ol’ Jim Varney, who starred in all those “Ernest” movies I watched as a kid, which probably would not at all hold up if I were to rewatch them. I’ll just leave those as a pleasant memory.4

Well, this makes me wonder if actors have an arc to their career. Do they start off poorly, then have a peak somewhere in their career, before fading off into obscurity? Perhaps that’s what they experience with regard to their fame as celebrities, but does that bear out in the ratings of the movies they make over their career? Let’s find out!

To test this, I calculated the average movie ratings for each actor in each year of their career. I then fit a hierarchical linear model (HLM) that gives each actor a random intercept (i.e., the model fits each actor with their own mean level), and then looks at the linear and quadratic effects of year of career. If you’re not familiar with HLM, that’s okay—the main story here is that there’s a slight downward linear trend across an actor’s career, but also evidence for a slight U-shape. It’s a little easier to see in the graph below.

cast$year <- as.numeric(stringr::str_match(cast$title, '\\(([0-9]+)\\)$')[, 2])

# average across all movies for an actor per year
by_person_by_year <- cast %>%
    group_by(person_id, year) %>%
    summarise(
        name=nth(name, 1),
        avg_rating=mean(avg_rating, na.rm=TRUE),
        num_movies=n(),
        num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
    group_by(person_id) %>%
    mutate(year=year - first(year) + 1)  # this resets the year so all actors start at year 1

# hierarchical linear model testing linear and quadratic effects of year, after allowing
# for a random intercept for each actor
library(nlme)
model <- lme(avg_rating ~ year + I(year^2), random=~1|person_id, data=by_person_by_year,
    na.action=na.omit)
summary(model)
## Linear mixed-effects model fit by REML
##  Data: by_person_by_year 
##        AIC      BIC    logLik
##   536064.4 536117.4 -268027.2
## 
## Random effects:
##  Formula: ~1 | person_id
##         (Intercept)  Residual
## StdDev:    0.271587 0.5460772
## 
## Fixed effects: avg_rating ~ year + I(year^2) 
##                 Value    Std.Error     DF   t-value p-value
## (Intercept)  3.207716 0.0016480635 160692 1946.3544       0
## year        -0.005021 0.0002876172 160692  -17.4565       0
## I(year^2)    0.000067 0.0000071770 160692    9.3353       0
##  Correlation: 
##           (Intr) year  
## year      -0.529       
## I(year^2)  0.404 -0.900
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -5.3352075 -0.4792010  0.1220818  0.6284551  3.7568826 
## 
## Number of Observations: 297992
## Number of Groups: 137298
grid <- data.frame(year=1:max(by_person_by_year$year, na.rm=TRUE))
grid$avg_rating <- predict(model, grid, level=0, na.action=na.omit)

# this is just to calculate standard error from an HLM
# code from: http://glmm.wikidot.com/faq
designmat <- model.matrix(formula(model)[-2], grid)
predvar <- diag(designmat %*% vcov(model) %*% t(designmat))
se <- sqrt(predvar)

grid$error_upper <- grid$avg_rating + 1.96 * se
grid$error_lower <- grid$avg_rating - 1.96 * se

ggplot(grid, aes(x=year, y=avg_rating)) +
    geom_ribbon(aes(ymin=error_lower, ymax=error_upper), alpha=.2) +
    geom_line() +
    ylim(0.5, 5) +
    labs(x='Year of Actor\'s Career', y='Avg. Rating (0.5 - 5.0)') +
    theme_minimal()

It looks as though actors tend to start strong in their first few years, but movie ratings tend to dip a little mid-career. If they have a long career, they may get a rise in ratings again. However, keep in mind that this could be a survivorship bias—it may be that better actors are just more likely to have a long career, so as the poorer actors drop out, the average tends to increase. It’s hard to say. The notable thing about the graph is that the dip is still quite slight. There’s much more variation between movies for any given year than there are overall changes in average ratings across an actor’s career.

Best Screenplay

Another group of people we can look at are screenplay writers. Let’s take a look at writers who have written more than five screenplays:

crew %>%
    filter(job == 'Screenplay') %>%
    group_by(person_id) %>%
    summarise(
        name=nth(name, 1),
        avg_rating=mean(avg_rating, na.rm=TRUE),
        num_movies=n(),
        num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
    filter(num_ratings > 100, num_movies > 5) %>%
    arrange(desc(avg_rating)) %>%
    head(n=20) %>%
    select(name, avg_rating, num_ratings, num_movies) %>%
    kable(
        row.names=FALSE,
        col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
        align='lccc',
        digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
Christopher Nolan 4.06 100934 7
Shinobu Hashimoto 4.01 17981 7
Akira Kurosawa 3.96 32752 14
Sergio Leone 3.95 34759 6
Stanley Kubrick 3.94 144282 10
Orson Welles 3.91 24160 6
Enrico Medioli 3.90 4464 6
Ingmar Bergman 3.86 11020 12
Philippa Boyens 3.84 108280 6
Robert Bolt 3.84 23206 6
Hayao Miyazaki 3.83 37829 11
Ennio Flaiano 3.83 5883 8
Leigh Brackett 3.80 54146 7
Michael Wilson 3.80 26231 6
Satyajit Ray 3.80 1925 13
Martin Scorsese 3.79 49849 6
Park Chan-wook 3.79 9367 6
Kenneth Branagh 3.79 25636 6
Luis Buñuel 3.78 6333 11
Krzysztof Kieslowski 3.78 21336 6

Again, a lot of these names should probably be familiar. Having a great screenplay really shows, and these folks have done amazing things. People really love their Nolan films, and it’s no surprise why.

Best Novel Adaptations

One interesting tidbit in the credits is that if a movie is adapted from a book, the author shows up in the credits as well. So we can easily find all the authors who have had really good (and really bad) adaptations of their work:

novel <- crew %>%
    filter(job == 'Novel') %>%
    group_by(person_id) %>%
    summarise(
        name=nth(name, 1),
        avg_rating=mean(avg_rating, na.rm=TRUE),
        num_movies=n(),
        num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
    filter(num_ratings > 100, num_movies >= 3) %>%
    arrange(desc(avg_rating))

head(novel, n=20) %>%
    select(name, avg_rating, num_ratings, num_movies) %>%
    kable(
        row.names=FALSE,
        col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
        align='lccc',
        digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
Bibhutibhushan Bandyopadhyay 4.10 1869 4
Marcel Pagnol 4.01 7002 4
Alan Le May 3.92 3319 3
Jane Austen 3.87 37980 12
Yasuhiko Takiguchi 3.84 520 3
Stieg Larsson 3.83 6914 4
José Giovanni 3.83 116 3
Jumpei Gomikawa 3.82 201 3
Joanne K. Rowling 3.77 72889 10
Joseph Kessel 3.75 315 3
Karen Blixen 3.75 6197 3
Mario Puzo 3.75 68814 3
Eiji Yoshikawa 3.73 339 3
Philip K. Dick 3.72 34806 4
Charlotte Brontë 3.71 3065 3
Bohumil Hrabal 3.68 252 3
Larry McMurtry 3.67 9404 4
Frances Hodgson Burnett 3.67 12800 4
Alberto Moravia 3.66 1693 4
William Shakespeare 3.65 5540 4

I had no idea who the top author was. Turns out that he was a Bengali author who wrote a novel that was later adapted into a trilogy of films frequently heralded as the greatest movies from Bollywood, The Apu Trilogy. Cool stuff! Apparently I’m not really up on my Bollywood films. Some of the others I’m familiar with—Jane Austen should need no introduction, Stieg Larsson and J.K. Rowling have more recent works that have been adapted, and there’s good ol’ Billy Shakespeare rounding out the Top 20. And of course, Mario Puzo wrote The Godfather, so of course his film adaptations are highly rated!

But yeah, let’s look at the authors with the worst film adaptations:

tail(novel, n=20) %>%
    select(name, avg_rating, num_ratings, num_movies) %>%
    kable(
        row.names=FALSE,
        col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
        align='lccc',
        digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
James Michener 2.96 147 3
Friedrich Dürrenmatt 2.94 1516 3
Marvin H. Albert 2.93 109 6
Akira Toriyama 2.91 521 17
Ayn Rand 2.89 308 4
Jules Verne 2.87 11367 13
Joan McLeod 2.85 1304 4
Arthur Hailey 2.85 2800 4
Eric Knight 2.79 177 3
Jack Ketchum 2.78 354 4
Peter Benchley 2.77 24349 5
H. Rider Haggard 2.72 572 7
Stephenie Meyer 2.70 5127 6
Arturo Pérez-Reverte 2.64 3523 4
Dr. Seuss 2.63 1900 3
Arthur Conan Doyle 2.59 752 5
Edgar Rice Burroughs 2.57 5744 5
H.P. Lovecraft 2.43 457 4
Jonathan Swift 2.28 226 3
Sax Rohmer 2.17 296 4

Some of these make me sad. Jules Verne! Arthur Conan Doyle! Jonathan Swift! I love all these authors. But I guess it’s not their fault that somebody made terrible film adaptations of their excellent books. Then, of course, there are other authors like Ayn Rand and Stephenie Meyer…when the source material is that bad, it’s hard to make a good film adaptation, am I right?

Conclusion

There is lots of information in this database. You could take a look, for instance, at which actors do better in leading roles than in supporting roles. Or because each rating is timestamped, you could take a look at how ratings evolve—from when the movie first comes out, to later on in time. TMDB also includes information about the budget for each film, so you could take a look at how much budget influences ratings, or which films got the most bang for their buck. I might do a follow-up on some of these questions, because there’s so much you can do with this data! But for now…this article is done, so you can go watch all those top 20 films now. I highly recommend Seven Samurai and City of God. Two fantastic films. But if you want better recommendations than just the ones from my head, go sign up for MovieLens and let its algorithm recommend for you! (I’m not in any way affiliated with them; I just think they made a really cool website.)

FIN.


  1. They actually have several different algorithms implemented that you can switch between, which is great if you’re nerdy like me and want to see how your recommendations change depending on which information is being used.

  2. The downward linear trend is still there, with about the same strength, if movies for 2015 are removed.

  3. Forgive me for calling probably half the roles in the film industry “minor.” I’m not really sure how it’s decided who gets put in the credits, and why someone’s name would be in the credits of the film but not in the database. I’m assuming there’s some sort of “major/minor” cutoff, but I really don’t know. Some credits in the database include hairstylists and visual effects but not lighting or camera operators, so I don’t have any clue who gets included. The point is I’m not trying to say your job isn’t important, Mr. Best Boy. You may not get the recognition you deserve on The Movie Database, but I’m sure your mother is very proud.

  4. The sad thing is that Varney’s face is so expressive, I can vividly see in my mind his glum expression upon seeing he’s at the bottom of the list. Poor Jim Varney.