#### By Jeff Hughes (July 28, 2017)

I recently came across the MovieLens database, which has 20+ million ratings of about 27,000 movies. Because I thought that sounded pretty cool, I decided to check out MovieLens itself, and it turns out that’s pretty cool too! It uses a collaborative filtering (machine learning) method to recommend movies to you based on movies you’ve already watched and rated.1 The nice thing is that it shows you the predicted rating for each movie—what the algorithm thinks you would rate the movie. So while going through the initial process of rating movies you’ve seen, you can see it get more and more accurate. I’ve now rated close to 400 movies and it probably knows me better than my closest friends. It actually told me where I had misplaced my keys last week. I think it might actually use incantations of black magic somewhere in the algorithm.

Anyway, the point is that they provide their database online for free, which is great! Kudos to the GroupLens research group for that. So I decided to play around with it and see what interesting information I could pull from it. You can see the code for all of these analyses by clicking on the buttons to the right, or go to the top of the page on the right and select “Show All Code”. Note that I’ve converted the CSV files to R-native RDS files, just to make it a bit quicker to load the data, but the raw data is available here. The code for all these analyses is available on Github. First let’s load some packages we’ll need, and the files for the list of movies, and list of ratings.

library(knitr)
library(psych)
library(stringr)
library(ggplot2)
library(tidyr)
library(dplyr)

ratings <- readRDS('ratings.rds')

## Ratings by Year

One thing I wanted to know is whether there was a pattern for movie ratings across the year of release. Perhaps there was a “golden age” of cinema, so we should see a downward trend. Or perhaps there’s a recency bias, such that people come out of the movie theatre, say “that was awesome!” and rate the movie as 5 stars. In the latter case, we’d expect an upward trend, because people might be relying on their hazy memories of older movies, unless they’ve rewatched them recently.

I start by taking all 20 million ratings, and aggregating them by movie to calculate the mean rating for each (as well as the standard deviation). Below you can see an overall average. (The scale goes from a minimum of 0.5 stars to 5.0 stars.) It’s a little tricky because this is an average of averages, but we can see that movies in this database were on average rated 3.13, with a standard deviation of about .66. And the median number of ratings for a movie was 18, though the range on that is quite impressive.

ratings_summ <- ratings %>%
group_by(movieId) %>%
summarise(
avg_rating=mean(rating, na.rm=TRUE),
sd=sd(rating, na.rm=TRUE),
n=n())

# pull titles and genres back in to the data set
ratings_summ <- inner_join(movies, ratings_summ, by='movieId')

ratings_desc <- ratings_summ %>%
select(avg_rating, sd, n) %>%
describe() %>%
select(mean, sd, median, min, max)

rownames(ratings_desc) <- c('Avg. Rating', 'SD', 'Num. Ratings')

kable(
ratings_desc,
row.names=TRUE,
col.names=c('Mean', 'SD', 'Median', 'Min.', 'Max.'),
align='c',
digits=2)
Mean SD Median Min. Max.
Avg. Rating 3.13 0.66 3.24 0.5 5.00
SD 0.92 0.32 0.94 0.0 3.18
Num. Ratings 747.84 3085.82 18.00 1.0 67310.00

The database puts the year of release at the end of the title of each movie, so we first need to extract that with a little regex magic. We then aggregate across movies by year, calculating the average rating for all movies in that year, plus the 95% confidence intervals. In the graph below, though, you can barely see the confidence intervals (the gray ribbon behind the line), as it’s quite small and only really visible for the older movies. For this graph, I removed movies before 1915, as there was very little data for those older movies.

ratings_summ$year <- as.numeric(str_match(ratings_summ$title, '\$$([0-9]+)\$$')[, 2]) by_year <- ratings_summ %>% group_by(year) %>% summarise( rating=mean(avg_rating, na.rm=TRUE), se=sd(avg_rating, na.rm=TRUE) / sqrt(sum(n, na.rm=TRUE)), n=sum(n, na.rm=TRUE), lo=rating - 1.96*se, hi=rating + 1.96*se) # not much data for most years before 1915 by_year <- by_year %>% filter(!is.na(year), year >= 1915) ggplot(by_year, aes(x=year, y=rating)) + geom_ribbon(aes(ymin=lo, ymax=hi), alpha=.2) + geom_line() + labs(x='Year', y='Avg. Rating (0.5 - 5.0)') + theme_minimal() Visually, there doesn’t seem to be a very strong trend, but if we fit a linear model to it, there is actually a slight downward trend across years. This is basically as simple a model as you can get, but it does explain about 30% of the variance, so that’s…interesting. This could suggest that perhaps there’s a nostalgia factor, with people remembering movies as being better the further back in time they were made; or it could mean movies really were better back then; or it could be a selection bias, that older movies that get added to the database tend to be the better ones that people actually remember. It’s hard to really say which it could be. At any rate, it looks like maybe the late 80s to early 90s show the low point of a slump, but in recent years the ratings have been increasing again. I’m not sure what to make of the most recent year of data (2015), but there are fewer movies in the database, so I suspect the data dump happened partway through the year, so maybe the movies early in the year were especially bad.2 summary(lm(rating ~ year, by_year)) ## ## Call: ## lm(formula = rating ~ year, data = by_year) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.28499 -0.04838 0.00564 0.05831 0.32783 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.6711732 0.6783529 11.309 < 2e-16 *** ## year -0.0022870 0.0003452 -6.626 1.82e-09 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.1011 on 99 degrees of freedom ## Multiple R-squared: 0.3072, Adjusted R-squared: 0.3002 ## F-statistic: 43.9 on 1 and 99 DF, p-value: 1.82e-09 ## Top Rated Movies Let’s take a look at what our trusty movie raters think about the best and worst movies, shall we? Here are the top 20: by_movie <- ratings %>% group_by(movieId) %>% summarise( avg_rating=mean(rating, na.rm=TRUE), num_ratings=n(), rating_sd=sd(rating, na.rm=TRUE), rating_se=rating_sd / sqrt(n())) %>% inner_join(movies, by='movieId') %>% arrange(desc(avg_rating)) by_movie %>% filter(num_ratings > 100) %>% select(title, avg_rating, rating_sd, num_ratings) %>% head(n=20) %>% kable( row.names=FALSE, col.names=c('Title', 'Avg. Rating', 'SD', 'Num. Ratings'), align='lccc', digits=c(NA, 2, 2, 0)) Title Avg. Rating SD Num. Ratings Shawshank Redemption, The (1994) 4.45 0.72 63366 Godfather, The (1972) 4.36 0.84 41355 Usual Suspects, The (1995) 4.33 0.76 47006 Schindler’s List (1993) 4.31 0.83 50054 Godfather: Part II, The (1974) 4.28 0.86 27398 Seven Samurai (Shichinin no samurai) (1954) 4.27 0.84 11611 Rear Window (1954) 4.27 0.75 17449 Band of Brothers (2001) 4.26 0.91 4305 Casablanca (1942) 4.26 0.86 24349 Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.26 0.83 6525 One Flew Over the Cuckoo’s Nest (1975) 4.25 0.77 29932 Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) 4.25 0.86 23220 Third Man, The (1949) 4.25 0.82 6565 City of God (Cidade de Deus) (2002) 4.24 0.76 12937 Lives of Others, The (Das leben der Anderen) (2006) 4.23 0.73 5720 North by Northwest (1959) 4.23 0.75 15627 Paths of Glory (1957) 4.23 0.82 3568 Fight Club (1999) 4.23 0.87 40106 Double Indemnity (1944) 4.22 0.81 4909 12 Angry Men (1957) 4.22 0.78 12934 This looks about right to me. There are a few on here I haven’t seen, but of the ones I have, these seem pretty much in line with what I would expect. And if they’re in line with my own personal preferences in movies, they must be correct, right? Right? Okay, so of course this is subjective. But the bottom line is…if you haven’t seen any of these movies, go watch them now. Well, finish reading this, and then go watch them. But let’s see what the worst movies of all time are. bottom20 <- by_movie %>% filter(num_ratings > 100) %>% select(title, avg_rating, rating_sd, num_ratings) %>% tail(n=20) # fix a couple non-ASCII characters bottom20[9, 'title'] <- 'Pokemon 4 Ever (a.k.a. Pokemon 4: The Movie) (2002)' bottom20[14, 'title'] <- 'Pokemon Heroes (2003)' bottom20 %>% kable( row.names=FALSE, col.names=c('Title', 'Avg. Rating', 'SD', 'Num. Ratings'), align='lccc', digits=c(NA, 2, 2, 0)) Title Avg. Rating SD Num. Ratings Carnosaur 2 (1995) 1.45 1.03 167 Prom Night IV: Deliver Us From Evil (1992) 1.43 1.03 152 Ernest Goes to Africa (1997) 1.40 1.16 223 Turbo: A Power Rangers Movie (1997) 1.39 1.07 715 Faces of Death 4 (1990) 1.38 1.23 179 House of the Dead, The (2003) 1.36 1.01 447 Faces of Death: Fact or Fiction? (1999) 1.35 1.22 124 Faces of Death 5 (1996) 1.34 1.22 157 Pokemon 4 Ever (a.k.a. Pokemon 4: The Movie) (2002) 1.32 1.21 472 Faces of Death 6 (1996) 1.26 1.18 174 Son of the Mask (2005) 1.25 0.95 467 Disaster Movie (2008) 1.25 1.06 397 Carnosaur 3: Primal Species (1996) 1.24 0.97 139 Pokemon Heroes (2003) 1.17 1.16 325 Gigli (2003) 1.16 0.91 701 Barney’s Great Adventure (1998) 1.16 0.89 419 Glitter (2001) 1.12 0.83 685 Bratz: The Movie (2007) 1.11 1.10 180 From Justin to Kelly (2003) 0.97 0.82 426 SuperBabies: Baby Geniuses 2 (2004) 0.84 0.78 209 Personally, the thing I love most about this list is just how many sequels there are. Carnosaur 2 was apparently one of the worst movies of all time, and then they decided to make Carnosaur 3. Faces of Death has four separate movies in the bottom 20 movies of all time. But come on, people—How is Barney’s Great Adventure this low? That movie was a masterpiece. ## Ratings by User Tag Another fun thing about this database is that users can add “tags” to a movie to identify them with different themes or put them in different groups. These tags are completely open-ended, so there are lots of great tags like “wasted potential” and “Ethan Hunt Should Stop Hogging The Screen!” and “the director should be shot”. Those might not be particularly useful for identifying trends in movies, since they are often only applied to one movie. But there are lots of other great user tags. So what tags might you look for to find the best movies? Or the worst? tags <- readRDS('tags.rds') tags <- inner_join(tags, by_movie, by='movieId') tagstag <- tolower(tagstag) all_tags <- tags %>% select(movieId, tag, avg_rating, num_ratings) %>% distinct() %>% # only want one tag per movie group_by(tag) %>% summarise( avg_rating=mean(avg_rating), num_ratings=sum(num_ratings), num_movies=n()) all_tags %>% filter(num_ratings > 100, num_movies > 5) %>% arrange(desc(avg_rating)) %>% head(n=30) %>% kable( row.names=FALSE, col.names=c('Tag', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'), align='lccc', digits=c(NA, 2, 0, 0)) Tag Avg. Rating Num. Ratings Num. Movies afi100 4.13 216653 10 palestine 4.11 5769 9 wildly overrated 4.10 175705 6 pelicula anime 4.10 38706 6 top 10 ever 4.09 233910 7 top 250 imdb 4.08 451455 51 super interesting 4.08 284879 8 great cast excellent 4.07 102723 7 afi 10 (courtroom drama) 4.06 47089 6 zibri studio 4.05 46155 9 afi 4.04 84615 9 emerson must see 4.03 369838 39 claude rains 4.02 47981 6 estrangeinro 4.02 68370 6 wikilens top pick 4.02 8265 6 best picture 4.02 597810 35 spielberg/lucas 4.02 365626 8 great screenplays 4.01 141381 14 imdb top 250 4.01 4080803 322 100 greatest movies 4.01 552716 52 bfi classic 4.00 192813 36 tag 4.00 80829 6 kurosawa 4.00 38897 14 wallace and gromit 4.00 47978 6 criticker top pick 4.00 201480 14 afi 100 3.99 1439167 103 miyazaki 3.99 51638 10 oscar nominee: cinematography 3.99 143218 11 usa film registry 3.99 263060 26 fantastic foreign films 3.99 25420 9 Some of these are fairly straight-forward. The tags having to do with “AFI” refer to the American Film Institute, which puts out a ranking of top 100 movies; and tags related to IMDB top 250 refer to another common list of best movies. But we can also see that anime films, films with Claude Rains, and films by directors like Spielberg, Lucas, Kurosawa, and Miyazaki, all get highly rated. People also seem to like movies about Palestine, Wallace and Gromit, and films nominated for Oscars in cinematography. I do love that the “wildly overrated” tag shows up on highly-rated films, too. Apparently somebody doesn’t agree very much with those ratings. In case you’re curious, here are the overrated films. (I happen to disagree with whoever tagged them as overrated. I think they’re rated just fine the way they are, thank you.) overrated <- tags %>% filter(tag == 'wildly overrated') %>% select(title) overrated[4, 'title'] <- 'Amelie (Fabuleux destin d\'Amelie Poulain, Le) (2001)' overrated %>% kable() ## title Shawshank Redemption, The (1994) American Beauty (1999) Magnolia (1999) Amelie (Fabuleux destin d’Amelie Poulain, Le) (2001) Eternal Sunshine of the Spotless Mind (2004) Crash (2004) And just for good measure, here are the tags people gave to the lowest-rated films: all_tags %>% filter(num_ratings > 100, num_movies > 5) %>% arrange(desc(avg_rating)) %>% tail(n=30) %>% kable( row.names=FALSE, col.names=c('Tag', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'), align='lccc', digits=c(NA, 2, 0, 0)) Tag Avg. Rating Num. Ratings Num. Movies eco-horror 2.53 1318 6 steve guttenberg 2.52 29932 9 adapted from:game 2.52 69929 17 based on a tv show? 2.52 26213 9 based on a cartoon 2.48 14234 9 so bad it’s almost good 2.47 12920 17 golden raspberry (worst actor) 2.47 99215 32 sharks 2.46 8389 7 nostalgia critic 2.45 314931 72 albert pyun 2.42 1252 9 animation remade as live action 2.41 53652 18 flashback overload 2.39 5908 6 subgenre:slasher 2.38 45907 28 shark 2.37 40456 11 chuck norris 2.34 4429 21 franchise beaten to death 2.34 44247 6 bollocks 2.34 3751 11 worst movie ever! 2.31 14327 8 golden raspberry (worst picture) 2.29 57463 25 menahem golan 2.28 13736 27 steven seagal 2.24 33572 25 character:jason vorhees 2.21 13343 10 golden raspberry (worst actress) 2.18 58314 27 full moon entertainment 2.18 2356 20 mst3k 2.16 7228 16 jason vorhees 2.15 9156 7 pokemon 1.96 3991 9 uwe boll 1.86 1716 6 no rec? 1.85 41829 18 imdb bottom 100 1.82 24040 59 Looks like people don’t like Steve Guttenberg, Albert Pyun, Chuck Norris, Menahem Golan, Steven Seagal, and of course, Uwe Boll. People also don’t seem to like film adaptations of games, TV shows, cartoons, and animations remade as live action. Also, apparently the Jason films are really disliked, at least by people who can’t spell the name of the main character (Jason Voorhees). ## Ratings by Cast and Crew One thing that I really wanted to be able to do is take a look at how movies by particular directors, actors, etc. are rated. While this information isn’t in the MovieLens database, it does provide the ID numbers to link movies with IMDB and TMDB, the latter of which has an open API. From there, I was able to pull the full cast and crew credits for almost all the movies in the MovieLens database! links <- readRDS('links.rds') # NOTE: This isn't being run because it takes a long time. But if you're interested in doing the same, # you can run the same function, just give it your API. # get_cast_crew <- function(ids, api_key) { # counter <- 0 # cast <- data.frame(movie_id=numeric(0), person_id=numeric(0), name=character(0), # character=character(0), gender=numeric(0), order=numeric(0), stringsAsFactors=FALSE) # crew <- data.frame(movie_id=numeric(0), person_id=numeric(0), name=character(0), # department=character(0), job=character(0), gender=numeric(0), stringsAsFactors=FALSE) # # for (i in 1:length(ids)) { # if (!is.na(links[i, 'tmdbId'])) { # url <- paste0('https://api.themoviedb.org/3/movie/', ids[i], '/credits?api_key=', api_key, '&language=en-US') # rqst <- GET(url) # # if (rqststatus_code == 200) {
#                 data <- content(rqst)
#
#                 cs_cnt <- length(data$cast) # if (cs_cnt > 0) { # cast_tmp <- data.frame(movie_id=numeric(cs_cnt), person_id=numeric(cs_cnt), # name=character(cs_cnt), character=character(cs_cnt), # gender=numeric(cs_cnt), order=numeric(cs_cnt), stringsAsFactors=FALSE) # # for (cs in 1:cs_cnt) { # cast_tmp[cs, 'movie_id'] <- ids[i] # cast_tmp[cs, 'person_id'] <- data$cast[[cs]]$id # cast_tmp[cs, 'name'] <- data$cast[[cs]]$name # cast_tmp[cs, 'character'] <- data$cast[[cs]]$character # cast_tmp[cs, 'gender'] <- data$cast[[cs]]$gender # cast_tmp[cs, 'order'] <- data$cast[[cs]]$order # } # cast <- rbind(cast, cast_tmp) # } # # cr_cnt <- length(data$crew)
#                 if (cr_cnt > 0) {
#                     crew_tmp <- data.frame(movie_id=numeric(cr_cnt), person_id=numeric(cr_cnt),
#                         name=character(cr_cnt), department=character(cr_cnt), job=character(cr_cnt),
#                         gender=numeric(cr_cnt), stringsAsFactors=FALSE)
#
#                     for (cr in 1:cr_cnt) {
#                         crew_tmp[cr, 'movie_id'] <- ids[i]
#                         crew_tmp[cr, 'person_id'] <- data$crew[[cr]]$id
#                         crew_tmp[cr, 'name'] <- data$crew[[cr]]$name
#                         crew_tmp[cr, 'department'] <- data$crew[[cr]]$department
#                         crew_tmp[cr, 'job'] <- data$crew[[cr]]$job
#                         crew_tmp[cr, 'gender'] <- data$crew[[cr]]$gender
#                     }
#                     crew <- rbind(crew, crew_tmp)
#                 }
#             }
#
#             # limit rate
#             if ('x-ratelimit-remaining' %in% names(rqst$headers) && # as.numeric(rqst$headers$x-ratelimit-remaining) == 1) { # # reset_time <- as.POSIXct(as.numeric(rqst$headers$x-ratelimit-reset), origin='1970-01-01') # sleep_time <- difftime(reset_time, Sys.time()) # if (sleep_time > 0) { # Sys.sleep(sleep_time + 1) # } # } # } # } # # return(list(cast=cast, crew=crew)) # } # # output <- get_cast_crew(links[1:nrow(links), 'tmdbId'], '<<YOUR API HERE>>') # saveRDS(output$cast, file='cast.rds')
# saveRDS(outputcrew, file='crew.rds') cast <- readRDS('cast.rds') crew <- readRDS('crew.rds') cast <- cast %>% inner_join(select(links, tmdbId, movieId), by=c('movie_id'='tmdbId')) %>% rename(tmdbId=movie_id) crew <- crew %>% inner_join(select(links, tmdbId, movieId), by=c('movie_id'='tmdbId')) %>% rename(tmdbId=movie_id) cast <- cast %>% inner_join(movies, by='movieId') crew <- crew %>% inner_join(movies, by='movieId') cast <- cast %>% inner_join(select(by_movie, -title, -genres), by='movieId') crew <- crew %>% inner_join(select(by_movie, -title, -genres), by='movieId') Of course, once I realized that I could get the credits for all these movies, I immediately wanted to compare movie ratings by who was the Best Boy for the movie, so we could determine once and for all who the real “Best” Boy is. Unfortunately, the credits only cover the more major roles.3 So I’m a little more limited in the questions I can answer, but there’s still a boatload of information for all these movies. So let’s pick some fun ones! ### Best Directors Let’s start with movie directors. Who are the best directors of all time? We can, of course, easily answer this definitively and objectively, with this data based on completely subjective ratings. by_director <- crew %>% filter(job == 'Director') %>% group_by(person_id) %>% summarise( name=nth(name, 1), avg_rating=mean(avg_rating, na.rm=TRUE), num_movies=n(), num_ratings=sum(num_ratings, na.rm=TRUE)) %>% filter(num_ratings > 100, num_movies > 5) %>% arrange(desc(avg_rating)) head(by_director, n=20) %>% select(name, avg_rating, num_ratings, num_movies) %>% kable( row.names=FALSE, col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'), align='lccc', digits=c(NA, 2, 0, 0)) Name Avg. Rating Num. Ratings Num. Movies Hayao Miyazaki 4.00 47763 12 Christopher Nolan 3.99 108895 9 Sergio Leone 3.95 34759 6 Shannon Hartman 3.95 726 7 Nick Park 3.94 58864 7 Krzysztof Kieslowski 3.89 21762 9 Masaki Kobayashi 3.87 1013 14 Quentin Tarantino 3.86 168465 9 Frederick Wiseman 3.85 106 15 Satyajit Ray 3.83 1945 16 Wes Anderson 3.81 46341 9 Louis C.K. 3.80 3244 7 Paolo Sorrentino 3.77 453 6 Akira Kurosawa 3.76 33408 30 Park Chan-wook 3.76 9796 7 Preston Sturges 3.75 4125 9 Kaneto Shindô 3.75 244 6 David Fincher 3.74 136303 10 Joel Coen 3.74 154372 18 Alfonso Cuarón 3.73 37711 8 Some of these completely go without saying. Miyazaki, Nolan, Tarantino, Kurosawa, Fincher, Coen—even if you might quibble over which of these belong in which order, they are certainly worthy of being on a Top 20 list. However, I was surprised by Louis C.K. being on there—I guess people really love his comedy. Keep in mind that his directorial credits are primarily for his comedy specials, which probably get rated on a somewhat different scale in relation to more traditional films. That said, his TV show Louie (though not in this database) is pretty great! Shannon Hartman is also a director of many comedy specials; Nick Park is director on the Wallace and Gromit films (which as noted above, people love). It’s important to keep in mind that although great directors make great films, the greatness of a film could be due to something other than the director. These are ratings of the movies, not ratings of the directors. As such, this list is really a list of directors who make movies that people (on average) really like. So yes, that includes the work of Studio Ghibli (Miyazaki), but it also includes the comedy stylings of Louis C.K. ### Best Lead Actors The casting data from TMDB also includes the billing order, so we can take a look at which lead actors are in the top-rated movies: by_lead <- cast %>% filter(num_ratings > 100, order == 0) %>% group_by(person_id) %>% summarise( name=nth(name, 1), avg_rating=mean(avg_rating, na.rm=TRUE), num_movies=n(), num_ratings=sum(num_ratings, na.rm=TRUE)) %>% filter(num_movies > 5) %>% arrange(desc(avg_rating)) head(by_lead, n=20) %>% select(name, avg_rating, num_ratings, num_movies) %>% kable( row.names=FALSE, col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'), align='lccc', digits=c(NA, 2, 0, 0)) Name Avg. Rating Num. Ratings Num. Movies Toshiro Mifune 3.98 24296 14 Charlie Chaplin 3.91 14510 10 Groucho Marx 3.90 10622 8 Humphrey Bogart 3.87 69578 18 Henry Fonda 3.85 26629 12 William Powell 3.84 7116 9 Buster Keaton 3.83 4106 10 Cary Grant 3.82 55568 24 James Stewart 3.82 65062 19 Gong Li 3.80 5750 6 James Cagney 3.78 2950 7 Takeshi Kitano 3.77 9139 7 Spencer Tracy 3.77 15176 12 Katharine Hepburn 3.77 10599 8 Tony Leung Chiu-Wai 3.76 4548 6 Gary Cooper 3.75 8416 10 Peter O’Toole 3.74 18436 8 Gael García Bernal 3.74 18397 7 Ryan Gosling 3.73 9068 7 Fred Astaire 3.72 6668 11 There’s a definite skew there toward actors in older films, and perhaps that’s because I limited it to those who were the leads in more than five movies (otherwise it tends to skew toward those who had one good movie). But hey, Ryan Gosling is representing the modern era here! My data here suggests that if he were to put on some glasses and a thick mustache, like Groucho Marx, he might improve his career even further. And here we have the leads in the worst-rated movies: tail(by_lead, n=20) %>% select(name, avg_rating, num_ratings, num_movies) %>% kable( row.names=FALSE, col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'), align='lccc', digits=c(NA, 2, 0, 0)) Name Avg. Rating Num. Ratings Num. Movies Burt Reynolds 2.73 9170 12 John Candy 2.73 9462 10 Christopher Lambert 2.72 29348 9 Lindsay Lohan 2.71 8990 6 Jennifer Lopez 2.71 23731 9 Macaulay Culkin 2.68 43283 6 Jim Belushi 2.68 2765 7 Dudley Moore 2.67 7627 8 Charles Bronson 2.64 3070 9 Steven Seagal 2.62 19756 10 Jamie Lee Curtis 2.61 9212 7 Kim Basinger 2.60 3181 6 Jean-Claude Van Damme 2.59 27852 18 Martin Lawrence 2.55 10440 12 Freddie Prinze Jr. 2.50 12068 7 Rob Schneider 2.41 6869 6 Chuck Norris 2.41 2795 9 Robert Englund 2.40 10505 6 Madonna 2.35 14504 8 Jim Varney 1.98 7776 7 This is…about what I would expect. Rob Schneider nearing the bottom? Yep. I’m just sad for poor ol’ Jim Varney, who starred in all those “Ernest” movies I watched as a kid, which probably would not at all hold up if I were to rewatch them. I’ll just leave those as a pleasant memory.4 Well, this makes me wonder if actors have an arc to their career. Do they start off poorly, then have a peak somewhere in their career, before fading off into obscurity? Perhaps that’s what they experience with regard to their fame as celebrities, but does that bear out in the ratings of the movies they make over their career? Let’s find out! To test this, I calculated the average movie ratings for each actor in each year of their career. I then fit a hierarchical linear model (HLM) that gives each actor a random intercept (i.e., the model fits each actor with their own mean level), and then looks at the linear and quadratic effects of year of career. If you’re not familiar with HLM, that’s okay—the main story here is that there’s a slight downward linear trend across an actor’s career, but also evidence for a slight U-shape. It’s a little easier to see in the graph below. castyear <- as.numeric(stringr::str_match(cast$title, '\$$([0-9]+)\$$$')[, 2])

# average across all movies for an actor per year
by_person_by_year <- cast %>%
group_by(person_id, year) %>%
summarise(
name=nth(name, 1),
avg_rating=mean(avg_rating, na.rm=TRUE),
num_movies=n(),
num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
group_by(person_id) %>%
mutate(year=year - first(year) + 1)  # this resets the year so all actors start at year 1

# hierarchical linear model testing linear and quadratic effects of year, after allowing
# for a random intercept for each actor
library(nlme)
model <- lme(avg_rating ~ year + I(year^2), random=~1|person_id, data=by_person_by_year,
na.action=na.omit)
summary(model)
## Linear mixed-effects model fit by REML
##  Data: by_person_by_year
##        AIC      BIC    logLik
##   536064.4 536117.4 -268027.2
##
## Random effects:
##  Formula: ~1 | person_id
##         (Intercept)  Residual
## StdDev:    0.271587 0.5460772
##
## Fixed effects: avg_rating ~ year + I(year^2)
##                 Value    Std.Error     DF   t-value p-value
## (Intercept)  3.207716 0.0016480635 160692 1946.3544       0
## year        -0.005021 0.0002876172 160692  -17.4565       0
## I(year^2)    0.000067 0.0000071770 160692    9.3353       0
##  Correlation:
##           (Intr) year
## year      -0.529
## I(year^2)  0.404 -0.900
##
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max
## -5.3352075 -0.4792010  0.1220818  0.6284551  3.7568826
##
## Number of Observations: 297992
## Number of Groups: 137298
grid <- data.frame(year=1:max(by_person_by_year$year, na.rm=TRUE)) grid$avg_rating <- predict(model, grid, level=0, na.action=na.omit)

# this is just to calculate standard error from an HLM
# code from: http://glmm.wikidot.com/faq
designmat <- model.matrix(formula(model)[-2], grid)
predvar <- diag(designmat %*% vcov(model) %*% t(designmat))
se <- sqrt(predvar)

grid$error_upper <- grid$avg_rating + 1.96 * se
grid$error_lower <- grid$avg_rating - 1.96 * se

ggplot(grid, aes(x=year, y=avg_rating)) +
geom_ribbon(aes(ymin=error_lower, ymax=error_upper), alpha=.2) +
geom_line() +
ylim(0.5, 5) +
labs(x='Year of Actor\'s Career', y='Avg. Rating (0.5 - 5.0)') +
theme_minimal()

It looks as though actors tend to start strong in their first few years, but movie ratings tend to dip a little mid-career. If they have a long career, they may get a rise in ratings again. However, keep in mind that this could be a survivorship bias—it may be that better actors are just more likely to have a long career, so as the poorer actors drop out, the average tends to increase. It’s hard to say. The notable thing about the graph is that the dip is still quite slight. There’s much more variation between movies for any given year than there are overall changes in average ratings across an actor’s career.

### Best Screenplay

Another group of people we can look at are screenplay writers. Let’s take a look at writers who have written more than five screenplays:

crew %>%
filter(job == 'Screenplay') %>%
group_by(person_id) %>%
summarise(
name=nth(name, 1),
avg_rating=mean(avg_rating, na.rm=TRUE),
num_movies=n(),
num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
filter(num_ratings > 100, num_movies > 5) %>%
arrange(desc(avg_rating)) %>%
select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
Christopher Nolan 4.06 100934 7
Shinobu Hashimoto 4.01 17981 7
Akira Kurosawa 3.96 32752 14
Sergio Leone 3.95 34759 6
Stanley Kubrick 3.94 144282 10
Orson Welles 3.91 24160 6
Enrico Medioli 3.90 4464 6
Ingmar Bergman 3.86 11020 12
Philippa Boyens 3.84 108280 6
Robert Bolt 3.84 23206 6
Hayao Miyazaki 3.83 37829 11
Ennio Flaiano 3.83 5883 8
Leigh Brackett 3.80 54146 7
Michael Wilson 3.80 26231 6
Satyajit Ray 3.80 1925 13
Martin Scorsese 3.79 49849 6
Park Chan-wook 3.79 9367 6
Kenneth Branagh 3.79 25636 6
Luis Buñuel 3.78 6333 11
Krzysztof Kieslowski 3.78 21336 6

Again, a lot of these names should probably be familiar. Having a great screenplay really shows, and these folks have done amazing things. People really love their Nolan films, and it’s no surprise why.

### Best Novel Adaptations

One interesting tidbit in the credits is that if a movie is adapted from a book, the author shows up in the credits as well. So we can easily find all the authors who have had really good (and really bad) adaptations of their work:

novel <- crew %>%
filter(job == 'Novel') %>%
group_by(person_id) %>%
summarise(
name=nth(name, 1),
avg_rating=mean(avg_rating, na.rm=TRUE),
num_movies=n(),
num_ratings=sum(num_ratings, na.rm=TRUE)) %>%
filter(num_ratings > 100, num_movies >= 3) %>%
arrange(desc(avg_rating))

select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
Bibhutibhushan Bandyopadhyay 4.10 1869 4
Marcel Pagnol 4.01 7002 4
Alan Le May 3.92 3319 3
Jane Austen 3.87 37980 12
Yasuhiko Takiguchi 3.84 520 3
Stieg Larsson 3.83 6914 4
José Giovanni 3.83 116 3
Jumpei Gomikawa 3.82 201 3
Joanne K. Rowling 3.77 72889 10
Joseph Kessel 3.75 315 3
Karen Blixen 3.75 6197 3
Mario Puzo 3.75 68814 3
Eiji Yoshikawa 3.73 339 3
Philip K. Dick 3.72 34806 4
Charlotte Brontë 3.71 3065 3
Bohumil Hrabal 3.68 252 3
Larry McMurtry 3.67 9404 4
Frances Hodgson Burnett 3.67 12800 4
Alberto Moravia 3.66 1693 4
William Shakespeare 3.65 5540 4

I had no idea who the top author was. Turns out that he was a Bengali author who wrote a novel that was later adapted into a trilogy of films frequently heralded as the greatest movies from Bollywood, The Apu Trilogy. Cool stuff! Apparently I’m not really up on my Bollywood films. Some of the others I’m familiar with—Jane Austen should need no introduction, Stieg Larsson and J.K. Rowling have more recent works that have been adapted, and there’s good ol’ Billy Shakespeare rounding out the Top 20. And of course, Mario Puzo wrote The Godfather, so of course his film adaptations are highly rated!

But yeah, let’s look at the authors with the worst film adaptations:

tail(novel, n=20) %>%
select(name, avg_rating, num_ratings, num_movies) %>%
kable(
row.names=FALSE,
col.names=c('Name', 'Avg. Rating', 'Num. Ratings', 'Num. Movies'),
align='lccc',
digits=c(NA, 2, 0, 0))
Name Avg. Rating Num. Ratings Num. Movies
James Michener 2.96 147 3
Friedrich Dürrenmatt 2.94 1516 3
Marvin H. Albert 2.93 109 6
Akira Toriyama 2.91 521 17
Ayn Rand 2.89 308 4
Jules Verne 2.87 11367 13
Joan McLeod 2.85 1304 4
Arthur Hailey 2.85 2800 4
Eric Knight 2.79 177 3
Jack Ketchum 2.78 354 4
Peter Benchley 2.77 24349 5
H. Rider Haggard 2.72 572 7
Stephenie Meyer 2.70 5127 6
Arturo Pérez-Reverte 2.64 3523 4
Dr. Seuss 2.63 1900 3
Arthur Conan Doyle 2.59 752 5
Edgar Rice Burroughs 2.57 5744 5
H.P. Lovecraft 2.43 457 4
Jonathan Swift 2.28 226 3
Sax Rohmer 2.17 296 4

Some of these make me sad. Jules Verne! Arthur Conan Doyle! Jonathan Swift! I love all these authors. But I guess it’s not their fault that somebody made terrible film adaptations of their excellent books. Then, of course, there are other authors like Ayn Rand and Stephenie Meyer…when the source material is that bad, it’s hard to make a good film adaptation, am I right?