I recently came across the MovieLens database, which has 20+ million ratings of about 27,000 movies. Because I thought that sounded pretty cool, I decided to check out MovieLens itself, and it turns out that’s pretty cool too! It uses a collaborative filtering (machine learning) method to recommend movies to you based on movies you’ve already watched and rated.1 The nice thing is that it shows you the predicted rating for each movie—what the algorithm thinks you would rate the movie. So while going through the initial process of rating movies you’ve seen, you can see it get more and more accurate. I’ve now rated close to 400 movies and it probably knows me better than my closest friends. It actually told me where I had misplaced my keys last week. I think it might actually use incantations of black magic somewhere in the algorithm.
Anyway, the point is that they provide their database online for free, which is great! Kudos to the GroupLens research group for that. So I decided to play around with it and see what interesting information I could pull from it. You can see the code for all of these analyses by clicking on the buttons to the right, or go to the top of the page on the right and select “Show All Code”. Note that I’ve converted the CSV files to R-native RDS files, just to make it a bit quicker to load the data, but the raw data is available here. The code for all these analyses is available on Github. First let’s load some packages we’ll need, and the files for the list of movies, and list of ratings.
library(knitr)
library(psych)
library(stringr)
library(ggplot2)
library(tidyr)
library(dplyr)
movies <- readRDS('movies.rds')
ratings <- readRDS('ratings.rds')
One thing I wanted to know is whether there was a pattern for movie ratings across the year of release. Perhaps there was a “golden age” of cinema, so we should see a downward trend. Or perhaps there’s a recency bias, such that people come out of the movie theatre, say “that was awesome!” and rate the movie as 5 stars. In the latter case, we’d expect an upward trend, because people might be relying on their hazy memories of older movies, unless they’ve rewatched them recently.
I start by taking all 20 million ratings, and aggregating them by movie to calculate the mean rating for each (as well as the standard deviation). Below you can see an overall average. (The scale goes from a minimum of 0.5 stars to 5.0 stars.) It’s a little tricky because this is an average of averages, but we can see that movies in this database were on average rated 3.13, with a standard deviation of about .66. And the median number of ratings for a movie was 18, though the range on that is quite impressive.
ratings_summ <- ratings %>%
group_by(movieId) %>%
summarise(
avg_rating=mean(rating, na.rm=TRUE),
sd=sd(rating, na.rm=TRUE),
n=n())
# pull titles and genres back in to the data set
ratings_summ <- inner_join(movies, ratings_summ, by='movieId')
ratings_desc <- ratings_summ %>%
select(avg_rating, sd, n) %>%
describe() %>%
select(mean, sd, median, min, max)
rownames(ratings_desc) <- c('Avg. Rating', 'SD', 'Num. Ratings')
kable(
ratings_desc,
row.names=TRUE,
col.names=c('Mean', 'SD', 'Median', 'Min.', 'Max.'),
align='c',
digits=2)
Mean | SD | Median | Min. | Max. | |
---|---|---|---|---|---|
Avg. Rating | 3.13 | 0.66 | 3.24 | 0.5 | 5.00 |
SD | 0.92 | 0.32 | 0.94 | 0.0 | 3.18 |
Num. Ratings | 747.84 | 3085.82 | 18.00 | 1.0 | 67310.00 |
The database puts the year of release at the end of the title of each movie, so we first need to extract that with a little regex magic. We then aggregate across movies by year, calculating the average rating for all movies in that year, plus the 95% confidence intervals. In the graph below, though, you can barely see the confidence intervals (the gray ribbon behind the line), as it’s quite small and only really visible for the older movies. For this graph, I removed movies before 1915, as there was very little data for those older movies.
ratings_summ$year <- as.numeric(str_match(ratings_summ$title, '\\(([0-9]+)\\)$')[, 2])
by_year <- ratings_summ %>%
group_by(year) %>%
summarise(
rating=mean(avg_rating, na.rm=TRUE),
se=sd(avg_rating, na.rm=TRUE) / sqrt(sum(n, na.rm=TRUE)),
n=sum(n, na.rm=TRUE),
lo=rating - 1.96*se,
hi=rating + 1.96*se)
# not much data for most years before 1915
by_year <- by_year %>%
filter(!is.na(year), year >= 1915)
ggplot(by_year, aes(x=year, y=rating)) +
geom_ribbon(aes(ymin=lo, ymax=hi), alpha=.2) +
geom_line() +
labs(x='Year', y='Avg. Rating (0.5 - 5.0)') +
theme_minimal()
Visually, there doesn’t seem to be a very strong trend, but if we fit a linear model to it, there is actually a slight downward trend across years. This is basically as simple a model as you can get, but it does explain about 30% of the variance, so that’s…interesting. This could suggest that perhaps there’s a nostalgia factor, with people remembering movies as being better the further back in time they were made; or it could mean movies really were better back then; or it could be a selection bias, that older movies that get added to the database tend to be the better ones that people actually remember. It’s hard to really say which it could be. At any rate, it looks like maybe the late 80s to early 90s show the low point of a slump, but in recent years the ratings have been increasing again. I’m not sure what to make of the most recent year of data (2015), but there are fewer movies in the database, so I suspect the data dump happened partway through the year, so maybe the movies early in the year were especially bad.2
summary(lm(rating ~ year, by_year))
##
## Call:
## lm(formula = rating ~ year, data = by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28499 -0.04838 0.00564 0.05831 0.32783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6711732 0.6783529 11.309 < 2e-16 ***
## year -0.0022870 0.0003452 -6.626 1.82e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1011 on 99 degrees of freedom
## Multiple R-squared: 0.3072, Adjusted R-squared: 0.3002
## F-statistic: 43.9 on 1 and 99 DF, p-value: 1.82e-09
Let’s take a look at what our trusty movie raters think about the best and worst movies, shall we? Here are the top 20:
by_movie <- ratings %>%
group_by(movieId) %>%
summarise(
avg_rating=mean(rating, na.rm=TRUE),
num_ratings=n(),
rating_sd=sd(rating, na.rm=TRUE),
rating_se=rating_sd / sqrt(n())) %>%
inner_join(movies, by='movieId') %>%
arrange(desc(avg_rating))
by_movie %>%
filter(num_ratings > 100) %>%
select(title, avg_rating, rating_sd, num_ratings) %>%
head(n=20) %>%
kable(
row.names=FALSE,
col.names=c('Title', 'Avg. Rating', 'SD', 'Num. Ratings'),
align='lccc',
digits=c(NA, 2, 2, 0))
Title | Avg. Rating | SD | Num. Ratings |
---|---|---|---|
Shawshank Redemption, The (1994) | 4.45 | 0.72 | 63366 |
Godfather, The (1972) | 4.36 | 0.84 | 41355 |
Usual Suspects, The (1995) | 4.33 | 0.76 | 47006 |
Schindler’s List (1993) | 4.31 | 0.83 | 50054 |
Godfather: Part II, The (1974) | 4.28 | 0.86 | 27398 |
Seven Samurai (Shichinin no samurai) (1954) | 4.27 | 0.84 | 11611 |
Rear Window (1954) | 4.27 | 0.75 | 17449 |
Band of Brothers (2001) | 4.26 | 0.91 | 4305 |
Casablanca (1942) | 4.26 | 0.86 | 24349 |
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.26 | 0.83 | 6525 |
One Flew Over the Cuckoo’s Nest (1975) | 4.25 | 0.77 | 29932 |
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) | 4.25 | 0.86 | 23220 |
Third Man, The (1949) | 4.25 | 0.82 | 6565 |
City of God (Cidade de Deus) (2002) | 4.24 | 0.76 | 12937 |
Lives of Others, The (Das leben der Anderen) (2006) | 4.23 | 0.73 | 5720 |
North by Northwest (1959) | 4.23 | 0.75 | 15627 |
Paths of Glory (1957) | 4.23 | 0.82 | 3568 |
Fight Club (1999) | 4.23 | 0.87 | 40106 |
Double Indemnity (1944) | 4.22 | 0.81 | 4909 |
12 Angry Men (1957) | 4.22 | 0.78 | 12934 |
This looks about right to me. There are a few on here I haven’t seen, but of the ones I have, these seem pretty much in line with what I would expect. And if they’re in line with my own personal preferences in movies, they must be correct, right? Right? Okay, so of course this is subjective. But the bottom line is…if you haven’t seen any of these movies, go watch them now. Well, finish reading this, and then go watch them.
But let’s see what the worst movies of all time are.
bottom20 <- by_movie %>%
filter(num_ratings > 100) %>%
select(title, avg_rating, rating_sd, num_ratings) %>%
tail(n=20)
# fix a couple non-ASCII characters
bottom20[9, 'title'] <- 'Pokemon 4 Ever (a.k.a. Pokemon 4: The Movie) (2002)'
bottom20[14, 'title'] <- 'Pokemon Heroes (2003)'
bottom20 %>%
kable(
row.names=FALSE,
col.names=c('Title', 'Avg. Rating', 'SD', 'Num. Ratings'),
align='lccc',
digits=c(NA, 2, 2, 0))
Title | Avg. Rating | SD | Num. Ratings |
---|---|---|---|
Carnosaur 2 (1995) | 1.45 | 1.03 | 167 |
Prom Night IV: Deliver Us From Evil (1992) | 1.43 | 1.03 | 152 |
Ernest Goes to Africa (1997) | 1.40 | 1.16 | 223 |
Turbo: A Power Rangers Movie (1997) | 1.39 | 1.07 | 715 |
Faces of Death 4 (1990) | 1.38 | 1.23 | 179 |
House of the Dead, The (2003) | 1.36 | 1.01 | 447 |
Faces of Death: Fact or Fiction? (1999) | 1.35 | 1.22 | 124 |
Faces of Death 5 (1996) | 1.34 | 1.22 | 157 |
Pokemon 4 Ever (a.k.a. Pokemon 4: The Movie) (2002) | 1.32 | 1.21 | 472 |
Faces of Death 6 (1996) | 1.26 | 1.18 | 174 |
Son of the Mask (2005) | 1.25 | 0.95 | 467 |
Disaster Movie (2008) | 1.25 | 1.06 | 397 |
Carnosaur 3: Primal Species (1996) | 1.24 | 0.97 | 139 |
Pokemon Heroes (2003) | 1.17 | 1.16 | 325 |
Gigli (2003) | 1.16 | 0.91 | 701 |
Barney’s Great Adventure (1998) | 1.16 | 0.89 | 419 |
Glitter (2001) | 1.12 | 0.83 | 685 |
Bratz: The Movie (2007) | 1.11 | 1.10 | 180 |
From Justin to Kelly (2003) | 0.97 | 0.82 | 426 |
SuperBabies: Baby Geniuses 2 (2004) | 0.84 | 0.78 | 209 |
Personally, the thing I love most about this list is just how many sequels there are. Carnosaur 2 was apparently one of the worst movies of all time, and then they decided to make Carnosaur 3. Faces of Death has four separate movies in the bottom 20 movies of all time. But come on, people—How is Barney’s Great Adventure this low? That movie was a masterpiece.
Another fun thing about this database is that users can add “tags” to a movie to identify them with different themes or put them in different groups. These tags are completely open-ended, so there are lots of great tags like “wasted potential” and “Ethan Hunt Should Stop Hogging The Screen!” and “the director should be shot”. Those might not be particularly useful for identifying trends in movies, since they are often only applied to one movie. But there are lots of other great user tags. So what tags might you look for to find the best movies? Or the worst?
tags <- readRDS('tags.rds')
tags <- inner_join(tags, by_movie, by='movieId')
tags$tag <- tolower(tags$tag)
all_tags <- tags %>%
select(movieId, tag, avg_rating, num_ratings) %>%
distinct() %>% # only want one tag per movie
group_by(tag) %>%
summarise(
avg_rating=mean(avg_rating),
num_ratings=sum(num_ratings),
num_movies=n())
all_tags %>%
filter(num_ratings > 100, num_movies > 5) %>%
arrange(desc(avg_rating)) %>%
head(n=30) %>%
kable(
row.names=FALSE,
col.names=c('Tag', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Tag | Avg. Rating | Num. Ratings | Num. Movies |
---|---|---|---|
afi100 | 4.13 | 216653 | 10 |
palestine | 4.11 | 5769 | 9 |
wildly overrated | 4.10 | 175705 | 6 |
pelicula anime | 4.10 | 38706 | 6 |
top 10 ever | 4.09 | 233910 | 7 |
top 250 imdb | 4.08 | 451455 | 51 |
super interesting | 4.08 | 284879 | 8 |
great cast excellent | 4.07 | 102723 | 7 |
afi 10 (courtroom drama) | 4.06 | 47089 | 6 |
zibri studio | 4.05 | 46155 | 9 |
afi | 4.04 | 84615 | 9 |
emerson must see | 4.03 | 369838 | 39 |
claude rains | 4.02 | 47981 | 6 |
estrangeinro | 4.02 | 68370 | 6 |
wikilens top pick | 4.02 | 8265 | 6 |
best picture | 4.02 | 597810 | 35 |
spielberg/lucas | 4.02 | 365626 | 8 |
great screenplays | 4.01 | 141381 | 14 |
imdb top 250 | 4.01 | 4080803 | 322 |
100 greatest movies | 4.01 | 552716 | 52 |
bfi classic | 4.00 | 192813 | 36 |
tag | 4.00 | 80829 | 6 |
kurosawa | 4.00 | 38897 | 14 |
wallace and gromit | 4.00 | 47978 | 6 |
criticker top pick | 4.00 | 201480 | 14 |
afi 100 | 3.99 | 1439167 | 103 |
miyazaki | 3.99 | 51638 | 10 |
oscar nominee: cinematography | 3.99 | 143218 | 11 |
usa film registry | 3.99 | 263060 | 26 |
fantastic foreign films | 3.99 | 25420 | 9 |
Some of these are fairly straight-forward. The tags having to do with “AFI” refer to the American Film Institute, which puts out a ranking of top 100 movies; and tags related to IMDB top 250 refer to another common list of best movies. But we can also see that anime films, films with Claude Rains, and films by directors like Spielberg, Lucas, Kurosawa, and Miyazaki, all get highly rated. People also seem to like movies about Palestine, Wallace and Gromit, and films nominated for Oscars in cinematography. I do love that the “wildly overrated” tag shows up on highly-rated films, too. Apparently somebody doesn’t agree very much with those ratings. In case you’re curious, here are the overrated films. (I happen to disagree with whoever tagged them as overrated. I think they’re rated just fine the way they are, thank you.)
overrated <- tags %>%
filter(tag == 'wildly overrated') %>%
select(title)
overrated[4, 'title'] <- 'Amelie (Fabuleux destin d\'Amelie Poulain, Le) (2001)'
overrated %>%
kable()
Shawshank Redemption, The (1994)
American Beauty (1999)
Magnolia (1999)
Amelie (Fabuleux destin d’Amelie Poulain, Le) (2001) Eternal Sunshine of the Spotless Mind (2004)
Crash (2004)
And just for good measure, here are the tags people gave to the lowest-rated films:
all_tags %>%
filter(num_ratings > 100, num_movies > 5) %>%
arrange(desc(avg_rating)) %>%
tail(n=30) %>%
kable(
row.names=FALSE,
col.names=c('Tag', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Tag | Avg. Rating | Num. Ratings | Num. Movies |
---|---|---|---|
eco-horror | 2.53 | 1318 | 6 |
steve guttenberg | 2.52 | 29932 | 9 |
adapted from:game | 2.52 | 69929 | 17 |
based on a tv show? | 2.52 | 26213 | 9 |
based on a cartoon | 2.48 | 14234 | 9 |
so bad it’s almost good | 2.47 | 12920 | 17 |
golden raspberry (worst actor) | 2.47 | 99215 | 32 |
sharks | 2.46 | 8389 | 7 |
nostalgia critic | 2.45 | 314931 | 72 |
albert pyun | 2.42 | 1252 | 9 |
animation remade as live action | 2.41 | 53652 | 18 |
flashback overload | 2.39 | 5908 | 6 |
subgenre:slasher | 2.38 | 45907 | 28 |
shark | 2.37 | 40456 | 11 |
chuck norris | 2.34 | 4429 | 21 |
franchise beaten to death | 2.34 | 44247 | 6 |
bollocks | 2.34 | 3751 | 11 |
worst movie ever! | 2.31 | 14327 | 8 |
golden raspberry (worst picture) | 2.29 | 57463 | 25 |
menahem golan | 2.28 | 13736 | 27 |
steven seagal | 2.24 | 33572 | 25 |
character:jason vorhees | 2.21 | 13343 | 10 |
golden raspberry (worst actress) | 2.18 | 58314 | 27 |
full moon entertainment | 2.18 | 2356 | 20 |
mst3k | 2.16 | 7228 | 16 |
jason vorhees | 2.15 | 9156 | 7 |
pokemon | 1.96 | 3991 | 9 |
uwe boll | 1.86 | 1716 | 6 |
no rec? | 1.85 | 41829 | 18 |
imdb bottom 100 | 1.82 | 24040 | 59 |
Looks like people don’t like Steve Guttenberg, Albert Pyun, Chuck Norris, Menahem Golan, Steven Seagal, and of course, Uwe Boll. People also don’t seem to like film adaptations of games, TV shows, cartoons, and animations remade as live action. Also, apparently the Jason films are really disliked, at least by people who can’t spell the name of the main character (Jason Voorhees).
One thing that I really wanted to be able to do is take a look at how movies by particular directors, actors, etc. are rated. While this information isn’t in the MovieLens database, it does provide the ID numbers to link movies with IMDB and TMDB, the latter of which has an open API. From there, I was able to pull the full cast and crew credits for almost all the movies in the MovieLens database!
links <- readRDS('links.rds')
# NOTE: This isn't being run because it takes a long time. But if you're interested in doing the same,
# you can run the same function, just give it your API.
# get_cast_crew <- function(ids, api_key) {
# counter <- 0
# cast <- data.frame(movie_id=numeric(0), person_id=numeric(0), name=character(0),
# character=character(0), gender=numeric(0), order=numeric(0), stringsAsFactors=FALSE)
# crew <- data.frame(movie_id=numeric(0), person_id=numeric(0), name=character(0),
# department=character(0), job=character(0), gender=numeric(0), stringsAsFactors=FALSE)
#
# for (i in 1:length(ids)) {
# if (!is.na(links[i, 'tmdbId'])) {
# url <- paste0('https://api.themoviedb.org/3/movie/', ids[i], '/credits?api_key=', api_key, '&language=en-US')
# rqst <- GET(url)
#
# if (rqst$status_code == 200) {
# data <- content(rqst)
#
# cs_cnt <- length(data$cast)
# if (cs_cnt > 0) {
# cast_tmp <- data.frame(movie_id=numeric(cs_cnt), person_id=numeric(cs_cnt),
# name=character(cs_cnt), character=character(cs_cnt),
# gender=numeric(cs_cnt), order=numeric(cs_cnt), stringsAsFactors=FALSE)
#
# for (cs in 1:cs_cnt) {
# cast_tmp[cs, 'movie_id'] <- ids[i]
# cast_tmp[cs, 'person_id'] <- data$cast[[cs]]$id
# cast_tmp[cs, 'name'] <- data$cast[[cs]]$name
# cast_tmp[cs, 'character'] <- data$cast[[cs]]$character
# cast_tmp[cs, 'gender'] <- data$cast[[cs]]$gender
# cast_tmp[cs, 'order'] <- data$cast[[cs]]$order
# }
# cast <- rbind(cast, cast_tmp)
# }
#
# cr_cnt <- length(data$crew)
# if (cr_cnt > 0) {
# crew_tmp <- data.frame(movie_id=numeric(cr_cnt), person_id=numeric(cr_cnt),
# name=character(cr_cnt), department=character(cr_cnt), job=character(cr_cnt),
# gender=numeric(cr_cnt), stringsAsFactors=FALSE)
#
# for (cr in 1:cr_cnt) {
# crew_tmp[cr, 'movie_id'] <- ids[i]
# crew_tmp[cr, 'person_id'] <- data$crew[[cr]]$id
# crew_tmp[cr, 'name'] <- data$crew[[cr]]$name
# crew_tmp[cr, 'department'] <- data$crew[[cr]]$department
# crew_tmp[cr, 'job'] <- data$crew[[cr]]$job
# crew_tmp[cr, 'gender'] <- data$crew[[cr]]$gender
# }
# crew <- rbind(crew, crew_tmp)
# }
# }
#
# # limit rate
# if ('x-ratelimit-remaining' %in% names(rqst$headers) &&
# as.numeric(rqst$headers$`x-ratelimit-remaining`) == 1) {
#
# reset_time <- as.POSIXct(as.numeric(rqst$headers$`x-ratelimit-reset`), origin='1970-01-01')
# sleep_time <- difftime(reset_time, Sys.time())
# if (sleep_time > 0) {
# Sys.sleep(sleep_time + 1)
# }
# }
# }
# }
#
# return(list(cast=cast, crew=crew))
# }
#
# output <- get_cast_crew(links[1:nrow(links), 'tmdbId'], '<<YOUR API HERE>>')
# saveRDS(output$cast, file='cast.rds')
# saveRDS(output$crew, file='crew.rds')
cast <- readRDS('cast.rds')
crew <- readRDS('crew.rds')
cast <- cast %>%
inner_join(select(links, tmdbId, movieId), by=c('movie_id'='tmdbId')) %>%
rename(tmdbId=movie_id)
crew <- crew %>%
inner_join(select(links, tmdbId, movieId), by=c('movie_id'='tmdbId')) %>%
rename(tmdbId=movie_id)
cast <- cast %>%
inner_join(movies, by='movieId')
crew <- crew %>%
inner_join(movies, by='movieId')
cast <- cast %>%
inner_join(select(by_movie, -title, -genres), by='movieId')
crew <- crew %>%
inner_join(select(by_movie, -title, -genres), by='movieId')
Of course, once I realized that I could get the credits for all these movies, I immediately wanted to compare movie ratings by who was the Best Boy for the movie, so we could determine once and for all who the real “Best” Boy is. Unfortunately, the credits only cover the more major roles.3 So I’m a little more limited in the questions I can answer, but there’s still a boatload of information for all these movies. So let’s pick some fun ones!
Let’s start with movie directors. Who are the best directors of all time? We can, of course, easily answer this definitively and objectively, with this data based on completely subjective ratings.
by_director <- crew %>%
filter(job == 'Director') %>%
group_by(person_id) %>%
summarise(
name=nth(name, 1),
avg_rating=mean(avg_rating, na.rm=TRUE),
num_movies=n(),
num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
filter(num_ratings > 100, num_movies > 5) %>%
arrange(desc(avg_rating))
head(by_director, n=20) %>%
select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name | Avg. Rating | Num. Ratings | Num. Movies |
---|---|---|---|
Hayao Miyazaki | 4.00 | 47763 | 12 |
Christopher Nolan | 3.99 | 108895 | 9 |
Sergio Leone | 3.95 | 34759 | 6 |
Shannon Hartman | 3.95 | 726 | 7 |
Nick Park | 3.94 | 58864 | 7 |
Krzysztof Kieslowski | 3.89 | 21762 | 9 |
Masaki Kobayashi | 3.87 | 1013 | 14 |
Quentin Tarantino | 3.86 | 168465 | 9 |
Frederick Wiseman | 3.85 | 106 | 15 |
Satyajit Ray | 3.83 | 1945 | 16 |
Wes Anderson | 3.81 | 46341 | 9 |
Louis C.K. | 3.80 | 3244 | 7 |
Paolo Sorrentino | 3.77 | 453 | 6 |
Akira Kurosawa | 3.76 | 33408 | 30 |
Park Chan-wook | 3.76 | 9796 | 7 |
Preston Sturges | 3.75 | 4125 | 9 |
Kaneto Shindô | 3.75 | 244 | 6 |
David Fincher | 3.74 | 136303 | 10 |
Joel Coen | 3.74 | 154372 | 18 |
Alfonso Cuarón | 3.73 | 37711 | 8 |
Some of these completely go without saying. Miyazaki, Nolan, Tarantino, Kurosawa, Fincher, Coen—even if you might quibble over which of these belong in which order, they are certainly worthy of being on a Top 20 list. However, I was surprised by Louis C.K. being on there—I guess people really love his comedy. Keep in mind that his directorial credits are primarily for his comedy specials, which probably get rated on a somewhat different scale in relation to more traditional films. That said, his TV show Louie (though not in this database) is pretty great! Shannon Hartman is also a director of many comedy specials; Nick Park is director on the Wallace and Gromit films (which as noted above, people love). It’s important to keep in mind that although great directors make great films, the greatness of a film could be due to something other than the director. These are ratings of the movies, not ratings of the directors. As such, this list is really a list of directors who make movies that people (on average) really like. So yes, that includes the work of Studio Ghibli (Miyazaki), but it also includes the comedy stylings of Louis C.K.
The casting data from TMDB also includes the billing order, so we can take a look at which lead actors are in the top-rated movies:
by_lead <- cast %>%
filter(num_ratings > 100, order == 0) %>%
group_by(person_id) %>%
summarise(
name=nth(name, 1),
avg_rating=mean(avg_rating, na.rm=TRUE),
num_movies=n(),
num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
filter(num_movies > 5) %>%
arrange(desc(avg_rating))
head(by_lead, n=20) %>%
select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name | Avg. Rating | Num. Ratings | Num. Movies |
---|---|---|---|
Toshiro Mifune | 3.98 | 24296 | 14 |
Charlie Chaplin | 3.91 | 14510 | 10 |
Groucho Marx | 3.90 | 10622 | 8 |
Humphrey Bogart | 3.87 | 69578 | 18 |
Henry Fonda | 3.85 | 26629 | 12 |
William Powell | 3.84 | 7116 | 9 |
Buster Keaton | 3.83 | 4106 | 10 |
Cary Grant | 3.82 | 55568 | 24 |
James Stewart | 3.82 | 65062 | 19 |
Gong Li | 3.80 | 5750 | 6 |
James Cagney | 3.78 | 2950 | 7 |
Takeshi Kitano | 3.77 | 9139 | 7 |
Spencer Tracy | 3.77 | 15176 | 12 |
Katharine Hepburn | 3.77 | 10599 | 8 |
Tony Leung Chiu-Wai | 3.76 | 4548 | 6 |
Gary Cooper | 3.75 | 8416 | 10 |
Peter O’Toole | 3.74 | 18436 | 8 |
Gael García Bernal | 3.74 | 18397 | 7 |
Ryan Gosling | 3.73 | 9068 | 7 |
Fred Astaire | 3.72 | 6668 | 11 |
There’s a definite skew there toward actors in older films, and perhaps that’s because I limited it to those who were the leads in more than five movies (otherwise it tends to skew toward those who had one good movie). But hey, Ryan Gosling is representing the modern era here! My data here suggests that if he were to put on some glasses and a thick mustache, like Groucho Marx, he might improve his career even further.
And here we have the leads in the worst-rated movies:
tail(by_lead, n=20) %>%
select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name | Avg. Rating | Num. Ratings | Num. Movies |
---|---|---|---|
Burt Reynolds | 2.73 | 9170 | 12 |
John Candy | 2.73 | 9462 | 10 |
Christopher Lambert | 2.72 | 29348 | 9 |
Lindsay Lohan | 2.71 | 8990 | 6 |
Jennifer Lopez | 2.71 | 23731 | 9 |
Macaulay Culkin | 2.68 | 43283 | 6 |
Jim Belushi | 2.68 | 2765 | 7 |
Dudley Moore | 2.67 | 7627 | 8 |
Charles Bronson | 2.64 | 3070 | 9 |
Steven Seagal | 2.62 | 19756 | 10 |
Jamie Lee Curtis | 2.61 | 9212 | 7 |
Kim Basinger | 2.60 | 3181 | 6 |
Jean-Claude Van Damme | 2.59 | 27852 | 18 |
Martin Lawrence | 2.55 | 10440 | 12 |
Freddie Prinze Jr. | 2.50 | 12068 | 7 |
Rob Schneider | 2.41 | 6869 | 6 |
Chuck Norris | 2.41 | 2795 | 9 |
Robert Englund | 2.40 | 10505 | 6 |
Madonna | 2.35 | 14504 | 8 |
Jim Varney | 1.98 | 7776 | 7 |
This is…about what I would expect. Rob Schneider nearing the bottom? Yep. I’m just sad for poor ol’ Jim Varney, who starred in all those “Ernest” movies I watched as a kid, which probably would not at all hold up if I were to rewatch them. I’ll just leave those as a pleasant memory.4
Well, this makes me wonder if actors have an arc to their career. Do they start off poorly, then have a peak somewhere in their career, before fading off into obscurity? Perhaps that’s what they experience with regard to their fame as celebrities, but does that bear out in the ratings of the movies they make over their career? Let’s find out!
To test this, I calculated the average movie ratings for each actor in each year of their career. I then fit a hierarchical linear model (HLM) that gives each actor a random intercept (i.e., the model fits each actor with their own mean level), and then looks at the linear and quadratic effects of year of career. If you’re not familiar with HLM, that’s okay—the main story here is that there’s a slight downward linear trend across an actor’s career, but also evidence for a slight U-shape. It’s a little easier to see in the graph below.
cast$year <- as.numeric(stringr::str_match(cast$title, '\\(([0-9]+)\\)$')[, 2])
# average across all movies for an actor per year
by_person_by_year <- cast %>%
group_by(person_id, year) %>%
summarise(
name=nth(name, 1),
avg_rating=mean(avg_rating, na.rm=TRUE),
num_movies=n(),
num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
group_by(person_id) %>%
mutate(year=year - first(year) + 1) # this resets the year so all actors start at year 1
# hierarchical linear model testing linear and quadratic effects of year, after allowing
# for a random intercept for each actor
library(nlme)
model <- lme(avg_rating ~ year + I(year^2), random=~1|person_id, data=by_person_by_year,
na.action=na.omit)
summary(model)
## Linear mixed-effects model fit by REML
## Data: by_person_by_year
## AIC BIC logLik
## 536064.4 536117.4 -268027.2
##
## Random effects:
## Formula: ~1 | person_id
## (Intercept) Residual
## StdDev: 0.271587 0.5460772
##
## Fixed effects: avg_rating ~ year + I(year^2)
## Value Std.Error DF t-value p-value
## (Intercept) 3.207716 0.0016480635 160692 1946.3544 0
## year -0.005021 0.0002876172 160692 -17.4565 0
## I(year^2) 0.000067 0.0000071770 160692 9.3353 0
## Correlation:
## (Intr) year
## year -0.529
## I(year^2) 0.404 -0.900
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -5.3352075 -0.4792010 0.1220818 0.6284551 3.7568826
##
## Number of Observations: 297992
## Number of Groups: 137298
grid <- data.frame(year=1:max(by_person_by_year$year, na.rm=TRUE))
grid$avg_rating <- predict(model, grid, level=0, na.action=na.omit)
# this is just to calculate standard error from an HLM
# code from: http://glmm.wikidot.com/faq
designmat <- model.matrix(formula(model)[-2], grid)
predvar <- diag(designmat %*% vcov(model) %*% t(designmat))
se <- sqrt(predvar)
grid$error_upper <- grid$avg_rating + 1.96 * se
grid$error_lower <- grid$avg_rating - 1.96 * se
ggplot(grid, aes(x=year, y=avg_rating)) +
geom_ribbon(aes(ymin=error_lower, ymax=error_upper), alpha=.2) +
geom_line() +
ylim(0.5, 5) +
labs(x='Year of Actor\'s Career', y='Avg. Rating (0.5 - 5.0)') +
theme_minimal()
It looks as though actors tend to start strong in their first few years, but movie ratings tend to dip a little mid-career. If they have a long career, they may get a rise in ratings again. However, keep in mind that this could be a survivorship bias—it may be that better actors are just more likely to have a long career, so as the poorer actors drop out, the average tends to increase. It’s hard to say. The notable thing about the graph is that the dip is still quite slight. There’s much more variation between movies for any given year than there are overall changes in average ratings across an actor’s career.
Another group of people we can look at are screenplay writers. Let’s take a look at writers who have written more than five screenplays:
crew %>%
filter(job == 'Screenplay') %>%
group_by(person_id) %>%
summarise(
name=nth(name, 1),
avg_rating=mean(avg_rating, na.rm=TRUE),
num_movies=n(),
num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
filter(num_ratings > 100, num_movies > 5) %>%
arrange(desc(avg_rating)) %>%
head(n=20) %>%
select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name | Avg. Rating | Num. Ratings | Num. Movies |
---|---|---|---|
Christopher Nolan | 4.06 | 100934 | 7 |
Shinobu Hashimoto | 4.01 | 17981 | 7 |
Akira Kurosawa | 3.96 | 32752 | 14 |
Sergio Leone | 3.95 | 34759 | 6 |
Stanley Kubrick | 3.94 | 144282 | 10 |
Orson Welles | 3.91 | 24160 | 6 |
Enrico Medioli | 3.90 | 4464 | 6 |
Ingmar Bergman | 3.86 | 11020 | 12 |
Philippa Boyens | 3.84 | 108280 | 6 |
Robert Bolt | 3.84 | 23206 | 6 |
Hayao Miyazaki | 3.83 | 37829 | 11 |
Ennio Flaiano | 3.83 | 5883 | 8 |
Leigh Brackett | 3.80 | 54146 | 7 |
Michael Wilson | 3.80 | 26231 | 6 |
Satyajit Ray | 3.80 | 1925 | 13 |
Martin Scorsese | 3.79 | 49849 | 6 |
Park Chan-wook | 3.79 | 9367 | 6 |
Kenneth Branagh | 3.79 | 25636 | 6 |
Luis Buñuel | 3.78 | 6333 | 11 |
Krzysztof Kieslowski | 3.78 | 21336 | 6 |
Again, a lot of these names should probably be familiar. Having a great screenplay really shows, and these folks have done amazing things. People really love their Nolan films, and it’s no surprise why.
One interesting tidbit in the credits is that if a movie is adapted from a book, the author shows up in the credits as well. So we can easily find all the authors who have had really good (and really bad) adaptations of their work:
novel <- crew %>%
filter(job == 'Novel') %>%
group_by(person_id) %>%
summarise(
name=nth(name, 1),
avg_rating=mean(avg_rating, na.rm=TRUE),
num_movies=n(),
num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
filter(num_ratings > 100, num_movies >= 3) %>%
arrange(desc(avg_rating))
head(novel, n=20) %>%
select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name | Avg. Rating | Num. Ratings | Num. Movies |
---|---|---|---|
Bibhutibhushan Bandyopadhyay | 4.10 | 1869 | 4 |
Marcel Pagnol | 4.01 | 7002 | 4 |
Alan Le May | 3.92 | 3319 | 3 |
Jane Austen | 3.87 | 37980 | 12 |
Yasuhiko Takiguchi | 3.84 | 520 | 3 |
Stieg Larsson | 3.83 | 6914 | 4 |
José Giovanni | 3.83 | 116 | 3 |
Jumpei Gomikawa | 3.82 | 201 | 3 |
Joanne K. Rowling | 3.77 | 72889 | 10 |
Joseph Kessel | 3.75 | 315 | 3 |
Karen Blixen | 3.75 | 6197 | 3 |
Mario Puzo | 3.75 | 68814 | 3 |
Eiji Yoshikawa | 3.73 | 339 | 3 |
Philip K. Dick | 3.72 | 34806 | 4 |
Charlotte Brontë | 3.71 | 3065 | 3 |
Bohumil Hrabal | 3.68 | 252 | 3 |
Larry McMurtry | 3.67 | 9404 | 4 |
Frances Hodgson Burnett | 3.67 | 12800 | 4 |
Alberto Moravia | 3.66 | 1693 | 4 |
William Shakespeare | 3.65 | 5540 | 4 |
I had no idea who the top author was. Turns out that he was a Bengali author who wrote a novel that was later adapted into a trilogy of films frequently heralded as the greatest movies from Bollywood, The Apu Trilogy. Cool stuff! Apparently I’m not really up on my Bollywood films. Some of the others I’m familiar with—Jane Austen should need no introduction, Stieg Larsson and J.K. Rowling have more recent works that have been adapted, and there’s good ol’ Billy Shakespeare rounding out the Top 20. And of course, Mario Puzo wrote The Godfather, so of course his film adaptations are highly rated!
But yeah, let’s look at the authors with the worst film adaptations:
tail(novel, n=20) %>%
select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name | Avg. Rating | Num. Ratings | Num. Movies |
---|---|---|---|
James Michener | 2.96 | 147 | 3 |
Friedrich Dürrenmatt | 2.94 | 1516 | 3 |
Marvin H. Albert | 2.93 | 109 | 6 |
Akira Toriyama | 2.91 | 521 | 17 |
Ayn Rand | 2.89 | 308 | 4 |
Jules Verne | 2.87 | 11367 | 13 |
Joan McLeod | 2.85 | 1304 | 4 |
Arthur Hailey | 2.85 | 2800 | 4 |
Eric Knight | 2.79 | 177 | 3 |
Jack Ketchum | 2.78 | 354 | 4 |
Peter Benchley | 2.77 | 24349 | 5 |
H. Rider Haggard | 2.72 | 572 | 7 |
Stephenie Meyer | 2.70 | 5127 | 6 |
Arturo Pérez-Reverte | 2.64 | 3523 | 4 |
Dr. Seuss | 2.63 | 1900 | 3 |
Arthur Conan Doyle | 2.59 | 752 | 5 |
Edgar Rice Burroughs | 2.57 | 5744 | 5 |
H.P. Lovecraft | 2.43 | 457 | 4 |
Jonathan Swift | 2.28 | 226 | 3 |
Sax Rohmer | 2.17 | 296 | 4 |
Some of these make me sad. Jules Verne! Arthur Conan Doyle! Jonathan Swift! I love all these authors. But I guess it’s not their fault that somebody made terrible film adaptations of their excellent books. Then, of course, there are other authors like Ayn Rand and Stephenie Meyer…when the source material is that bad, it’s hard to make a good film adaptation, am I right?
There is lots of information in this database. You could take a look, for instance, at which actors do better in leading roles than in supporting roles. Or because each rating is timestamped, you could take a look at how ratings evolve—from when the movie first comes out, to later on in time. TMDB also includes information about the budget for each film, so you could take a look at how much budget influences ratings, or which films got the most bang for their buck. I might do a follow-up on some of these questions, because there’s so much you can do with this data! But for now…this article is done, so you can go watch all those top 20 films now. I highly recommend Seven Samurai and City of God. Two fantastic films. But if you want better recommendations than just the ones from my head, go sign up for MovieLens and let its algorithm recommend for you! (I’m not in any way affiliated with them; I just think they made a really cool website.)
FIN.
They actually have several different algorithms implemented that you can switch between, which is great if you’re nerdy like me and want to see how your recommendations change depending on which information is being used.↩
The downward linear trend is still there, with about the same strength, if movies for 2015 are removed.↩
Forgive me for calling probably half the roles in the film industry “minor.” I’m not really sure how it’s decided who gets put in the credits, and why someone’s name would be in the credits of the film but not in the database. I’m assuming there’s some sort of “major/minor” cutoff, but I really don’t know. Some credits in the database include hairstylists and visual effects but not lighting or camera operators, so I don’t have any clue who gets included. The point is I’m not trying to say your job isn’t important, Mr. Best Boy. You may not get the recognition you deserve on The Movie Database, but I’m sure your mother is very proud.↩
The sad thing is that Varney’s face is so expressive, I can vividly see in my mind his glum expression upon seeing he’s at the bottom of the list. Poor Jim Varney.↩