By Jeff Hughes (July 22, 2017)

Predicting things with varying amounts of data is a common problem. If you are comparing two products, one of which has a 4.5 / 5 rating based on 3 reviews, and another with a 4.0 / 5 rating based on 1200 reviews, which is the better product? On the one hand, in an objective sense, 4.5 is higher than 4.0. On the other hand, one of those ratings is based on much less data. Maybe the three people who reviewed the first product all just happened to love it. Or maybe those ratings were given by friends of the seller!

In cases like these, it is important to weigh the data itself (i.e., the rating) by the amount or quality of the data you have. Implicit in this is that when we have little data, we should be more uncertain about the true value—and as such, we need to incorporate this uncertainty into our predictions. This is a situation where Bayesian models are an excellent choice.

So let’s talk about one of these scenarios in baseball. I am not a baseball fanatic. When I was young, I did enjoy keeping up with baseball. But to be honest, I think I got more enjoyment from updating the team statistics on my Excel spreadsheet than I did actually watching the games.1 But I haven’t really followed baseball in 15 years or more. That said, baseball is a game for nerds. There are lots of great statistics you can use in baseball—and people in the stands who actually sit there meticulously keeping track of them. The following post is inspired by a post on baseball batting averages by David Robinson on empirical Bayes, but applied to a different context.

The Problem

Let’s say we want to evaluate major league pitchers. One key statistic used for pitchers is their ERA: Earned Run Average. This is calculated by taking the number of earned runs2 they allowed while pitching and dividing it by the number of innings pitched.3 This value is then multiplied by 9 to extend the average over 9 innings. So essentially, the lower a pitcher’s ERA, the better a pitcher they are—they allow the other team to get fewer runs.

Pitchers already vary in the number of innings they pitch. Like the example of product reviews, an ERA calculated over fewer innings is likely to be a more unstable estimate of a pitcher’s true ability. So this would be a great case for using a Bayesian approach. However, let’s make it a bit more challenging still. What I’d like to do is take a pitcher’s ERA over just their first year of pitching, and see how well we can predict what their overall ERA will be over the rest of their entire career.4 This is quite a challenge—how accurately can we predict a pitcher’s career based on the dozen or so innings they pitched as a rookie? Well, let’s try it and find out.

Exploring the Data

I first load some packages and the data. You can see the code for this analysis by clicking on the buttons to the right, or go to the top of the page on the right and select “Show All Code”. The raw code is also available on Github. I’m using Sean Lahman’s baseball database, an excellent source of data with statistics all the way from 1871 to 2016. However, I’m going to select only data from 1945 onwards, as the game itself has evolved in ways that might introduce a lot more complexity. That still gives us plenty of data to work with.

library(knitr)
library(ggplot2)
library(dplyr)
theme_set(theme_light() + theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank()))

master <- read.csv('Master.csv', stringsAsFactors=FALSE)
pitching <- read.csv('Pitching.csv', stringsAsFactors=FALSE)

pitching <- pitching %>%
    filter(yearID >= 1945, IPouts > 0) %>%
    group_by(playerID) %>%
    mutate(year=yearID - first(yearID) + 1) %>%
    ungroup()

Below you can see a plot of pitchers’ ERAs over each year of their career. You can see a couple things: First off, most pitchers have a career spanning less than 20 years. Second, while the average ERA stays relatively stable (as noted by the loess line in blue), there is much more variability in the early years. However, this may be due in part to calculating averages over fewer innings pitched. It’s easy to have an absurdly high (or low) ERA if you’ve only pitched a single inning!

ggplot(pitching, aes(x=year, y=ERA)) +
    geom_jitter(alpha=.1) +
    geom_smooth() +
    ylim(0, 100) +
    labs(x='Year of Career') +
    theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())

Now I’ll pull out the first year of data for each pitcher, and calculate their ERA over that year. On average, pitchers pitched about 40 innings in their first year, plus or minus around 46. However, there’s some skew to the data, so the graph below might give you a better picture.

first_year <- pitching %>%
    filter(year == 1)

first_year <- first_year %>%
    group_by(playerID) %>%
    summarize(ER=sum(ER), IP=(sum(IPouts) / 3)) %>%
    mutate(ERA=9 * ER/IP)

first_year <- master %>%
    select(playerID, nameFirst, nameLast) %>%
    inner_join(first_year, by='playerID') %>%
    arrange(desc(ERA))

cat('Mean:', mean(first_year$IP), '\nStandard Deviation:', sd(first_year$IP))
## Mean: 38.39035 
## Standard Deviation: 46.57904
ggplot(first_year, aes(x=IP)) +
    geom_histogram() +
    labs(x='Innings Pitched in First Year', y='Count')

So let’s make sure we actually have an issue here first. If we have an issue with minimal data for some players, we should see that the people with really high or really low ERAs tend to have little data (few IP). People with more data should tend to fall somewhere in the middle instead of at the extremes. Here are the 10 pitchers with the highest ERA:

first_year %>%
    select(nameFirst, nameLast, ER, IP, ERA) %>%
    head(n=10) %>%
    kable(
        row.names=FALSE,
        col.names=c('First Name', 'Last Name', 'ER', 'IP', 'ERA'),
        align='llccc',
        digits=c(NA, NA, 0, 2, 1))
First Name Last Name ER IP ERA
Joe Cleary 7 0.33 189.0
Jeff Ridgway 7 0.33 189.0
Tom Qualters 6 0.33 162.0
Fritz Fisher 4 0.33 108.0
Gene Bearden 3 0.33 81.0
Ricky Pickett 6 0.67 81.0
Terry Wilshusen 3 0.33 81.0
Manny Alexander 5 0.67 67.5
Larry Harlow 5 0.67 67.5
Akeel Morris 5 0.67 67.5

And here are the pitchers with the lowest ERA:

first_year %>%
    select(nameFirst, nameLast, ER, IP, ERA) %>%
    tail(n=10) %>%
    kable(
        row.names=FALSE,
        col.names=c('First Name', 'Last Name', 'ER', 'IP', 'ERA'),
        align='llccc',
        digits=c(NA, NA, 0, 2, 1))
First Name Last Name ER IP ERA
Fred Wenz 0 1.00 0
Ron Willis 0 3.00 0
Bill Wilson 0 1.00 0
Glenn Wilson 0 1.00 0
Josh Wilson 0 1.00 0
Dewayne Wise 0 1.67 0
Jason Wood 0 1.00 0
Ray Yochim 0 1.00 0
Jeff Zaske 0 5.00 0
Todd Zeile 0 1.00 0

As you can see, they all have very little data. One pitcher, Jeff Zaske, surprisingly pitched 5 innings without an earned run, but the rest are around 1 inning or less. Obviously, this isn’t too helpful for figuring out who the best pitchers are.

Constructing the Bayesian Model

To solve this problem, I am going to use empirical Bayes to estimate a “true 1st-year ERA” for pitchers. This is then applied to all pitchers in our data set to essentially “pull in” pitchers with little data toward the mean. The general idea is this: If we had absolutely no information about a pitcher, our best guess would be the mean. If we have just a little information about a pitcher, we can use that information, but we should still weight that along with the mean so that our predictions get pulled in closer to the mean. The more data we have, the more weight we give to that data and the less weight we give to the mean in our prediction.

ERA can be considered a rate variable, modelled by a Poisson process where earned runs occur at a certain rate per inning pitched. The prior probability for the rate can be modelled by a Gamma distribution, and because I’m using an empirical Bayes approach, I’m going to use the data to determine the hyperparameters of that Gamma distribution. I’m actually going to grab a subset of the data in order to provide a better estimate—I’ll restrict the data to those who have pitched more than 20 innings. This just ensures that we get rid of some of those extremes at either end. Here’s the distribution for those 20-inning-plus pitchers:

first_year_filtered <- first_year %>%
    filter(IP > 20)
ggplot(first_year_filtered, aes(x=ERA)) +
    geom_histogram() +
    labs(y='Count')

We can take the mean and variance of that distribution to determine the shape (\(\alpha_0\)) and rate (\(\beta_0\)) for our Gamma distribution. (If you’re not familiar with Bayesian approaches, that’s fine. The bottom line is that I’m using the data to figure out a general “average” ERA.)

cat('Mean:', mean(first_year_filtered$ERA), '\nVariance:', var(first_year_filtered$ERA))
## Mean: 4.314126 
## Variance: 2.314021
# shape / rate = mean
# shape / rate^2 = variance
# therefore, mean*rate - variance*rate^2 = 0
# we then solve for the roots of this equation to get the rate
rate <- as.numeric(polyroot(c(-var(first_year_filtered$ERA), mean(first_year_filtered$ERA))))
shape <- mean(first_year_filtered$ERA) * rate

If we assume the likelihood is generated from a Poisson distribution, then Gamma and Poisson are what are known as conjugate priors, which means we get another Gamma distribution out at the end. That just makes the math a little easier; if we wanted to model this differently, you could also use Markov Chain Monte Carlo (MCMC) methods to sample from the posterior distribution. I’ll return to that at the end when talking about some ways to extend this example. However, right now here’s what I’m doing: I just calculated a prior Gamma distribution above based on the data. Then, for each pitcher in the data set, I combine the prior distribution and their actual data to create a posterior Gamma distribution for each pitcher. The conjugate priors make this relationship nice and simple; the posterior shapes and rates for each pitcher (\(i\)) are calculated as follows:

\(\alpha_i = \alpha_0 + ER_i\)

\(\beta_i = \beta_0 + IP_i\)

This means we get a full probability distribution for each player, giving us the likely values for each player’s true ERA. I could leave that there if I wanted to, but given that I want to use this to predict their future career ERA, it’s useful to also grab some “best estimate” for each player. Below I calculate the median value; however, you could just as easily calculate the mean, or some other estimate of your choosing.

first_year$post_ERA_shape <- shape + first_year$ER
first_year$post_ERA_rate <- rate + first_year$IP
first_year$post_ERA_mdn <- qgamma(.5, shape=first_year$post_ERA_shape, rate=first_year$post_ERA_rate) * 9

first_year <- first_year %>%
    arrange(desc(post_ERA_mdn))

After sorting by this median estimate, then, here are the estimates for the pitchers with the highest ERA:

first_year %>%
    select(nameFirst, nameLast, ER, IP, ERA, post_ERA_mdn) %>%
    head(n=10) %>%
    kable(
        row.names=FALSE,
        col.names=c('First Name', 'Last Name', 'ER', 'IP', 'ERA', 'Median Est. ERA'),
        align='llcccc',
        digits=c(NA, NA, 0, 2, 1, 1))
First Name Last Name ER IP ERA Median Est. ERA
Joe Cleary 7 0.33 189.0 93.0
Jeff Ridgway 7 0.33 189.0 93.0
Tom Qualters 6 0.33 162.0 82.6
Fritz Fisher 4 0.33 108.0 61.9
Ricky Pickett 6 0.67 81.0 59.7
Manny Alexander 5 0.67 67.5 52.2
Larry Harlow 5 0.67 67.5 52.2
Akeel Morris 5 0.67 67.5 52.2
Jay Sborz 5 0.67 67.5 52.2
Gene Bearden 3 0.33 81.0 51.6

And here are the pitchers with the lowest estimates:

first_year %>%
    select(nameFirst, nameLast, ER, IP, ERA, post_ERA_mdn) %>%
    tail(n=10) %>%
    kable(
        row.names=FALSE,
        col.names=c('First Name', 'Last Name', 'ER', 'IP', 'ERA', 'Median Est. ERA'),
        align='llcccc',
        digits=c(NA, NA, 0, 2, 1, 1))
First Name Last Name ER IP ERA Median Est. ERA
Mike Hinckley 0 13.67 0.0 1.3
Jeremy Hernandez 0 14.33 0.0 1.2
Joel Johnston 1 22.33 0.4 1.2
Dave Ford 0 15.00 0.0 1.2
Bob McClure 0 15.33 0.0 1.1
Joba Chamberlain 1 24.00 0.4 1.1
Mike Norris 0 16.67 0.0 1.0
Fernando Valenzuela 0 17.67 0.0 1.0
Karl Spooner 0 18.00 0.0 1.0
Kevin Siegrist 2 39.67 0.5 0.9

Notice that for the highest ERAs, the estimates have been brought down, while for the lowest ERAs, the estimates have been pushed up. They’ve all been “regressed toward the mean” at least in part. But also notice that the predictions for the best pitchers (low ERA) are no longer based on people with practically no data. We’ve ended up with better estimates, because those pitchers who had 0 ERA with practically no data have been pushed up toward the mean. (On the high end, this doesn’t work so well, because ERA has a lower limit of 0, but no upper limit. But we’ve still brought in those high estimates quite a bit.)

Below I’ve taken a random sample of 20 pitchers in the data, and graphed their actual first-year ERA with blue dots. The intensity of the blue indicates how much data we had (i.e., how many innings they pitched). Alongside that, I’ve also graphed our Bayesian estimates for each pitcher, with a gray dot at the median, and error bars indicating the 95% credible interval (the middle 95% of the probability distribution). Notice that the pitchers for which we had little data (light blue) have had their estimates pulled toward the mean to a greater extent, and the probability distributions are very wide; for pitchers for which we had more data (dark blue), the estimate does not change as much, and the error bars are much smaller. In other words, we can be more confident about their true ERA.

set.seed(123)
sample_ERA <- first_year[sample(1:nrow(first_year), 20), ]
sample_ests <- data.frame(
    ERA=sample_ERA$ERA,
    IP=sample_ERA$IP,
    lower=qgamma(.025, shape=sample_ERA$post_ERA_shape, rate=sample_ERA$post_ERA_rate)*9,
    median=qgamma(.5, shape=sample_ERA$post_ERA_shape, rate=sample_ERA$post_ERA_rate)*9,
    upper=qgamma(.975, shape=sample_ERA$post_ERA_shape, rate=sample_ERA$post_ERA_rate)*9
)
sample_ests <- sample_ests %>%
    arrange(desc(median)) %>%
    mutate(k=1:20)

ggplot(sample_ests, aes(y=k)) +
    geom_point(aes(x=median), color='#595959') +
    geom_errorbarh(aes(x=median, xmin=lower, xmax=upper), color='#595959') +
    geom_point(aes(x=ERA, color=IP)) +
    scale_color_gradient(low='#97fcfa', high='#002dfc') +
    labs(x='ERA', y='Pitcher') +
    theme(axis.text.y=element_blank(), axis.ticks.y=element_blank(), panel.grid.major.y=element_blank(),
        panel.grid.minor.y=element_blank())

Predicting Career ERA

Okay, the real test of this model is to use it to predict pitchers’ ERA over the rest of their career. As I said earlier, this is a challenging task, given that I am trying to use people’s performance in their rookie year to predict their performance up to 20 years later! This is also a relatively simplistic model: I’m implicitly assuming that pitchers’ performance does not change over their career; that there is no difference between pitchers across the decades, across teams, etc. The only bits of information I’ve used is the ER in their first year, and the number of innings they pitched.

Below I’ve calculated the career ERA for all pitchers in the data set, then provided two estimates. Our baseline is the “naive estimate”, which simply uses their actual ERA from their first year as the prediction. The comparison is the “Bayesian estimate”, which is the median value we estimated for each pitcher. In the table below, I have calculated the squared error between the actual career ERA and these two estimates, and then sorted it based on the difference in the squared error.

Note that one complication here is that in some cases, the career ERA score is not based on much more data than the first-year ERA score was! Because there is such variability in the length of career, some pitchers’ career ERAs might not be much of a “true score” and might be very noisy estimates. I do try to restrict this a little by restricting to those who had pitched at least five innings after their first year, but in some cases the values to predict have almost as much noise as the predictions. However, this is more of a proof of concept (I don’t recommend using this simplistic a model for Sabermetrics!), so I’d rather err on the side of including more data in the career scores.

First, here are the top 20 cases, for which the Bayesian estimate did very well compared to baseline:

career <- pitching %>%
    filter(year != 1)

career <- career %>%
    group_by(playerID) %>%
    summarize(ER_crr=sum(ER), IP_crr=(sum(IPouts) / 3)) %>%
    filter(IP_crr > 5) %>%
    mutate(ERA_crr=9 * ER_crr/IP_crr)

career <- career %>%
    inner_join(first_year, by='playerID') %>%
    arrange(desc(ERA_crr))

career$naive_se <- (career$ERA_crr - career$ERA)^2
career$bayes_se <- (career$ERA_crr - career$post_ERA_mdn)^2
career$diff <- career$naive_se - career$bayes_se

career %>%
    arrange(desc(diff)) %>%
    select(nameFirst, nameLast, ERA_crr, IP_crr, ERA, IP, post_ERA_mdn, naive_se, bayes_se, diff) %>%
    head(n=20) %>%
    kable(
        row.names=FALSE,
        col.names=c('First Name', 'Last Name', 'Career ERA', 'Career IP', 'First Year ERA', 'First Year IP', 'Bayesian Estimate', 'Naive SE', 'Bayes SE', 'Difference in SE'),
        align='llcccccccc',
        digits=c(NA, NA, 1, 2, 1, 2, 1, 2, 2, 2))
First Name Last Name Career ERA Career IP First Year ERA First Year IP Bayesian Estimate Naive SE Bayes SE Difference in SE
Jeff Ridgway 3.7 9.67 189.0 0.33 93.0 34327.15 7962.48 26364.67
Tom Qualters 4.6 52.33 162.0 0.33 82.6 24761.13 6079.04 18682.09
Gene Bearden 3.9 788.00 81.0 0.33 51.6 5939.95 2270.85 3669.10
Joe Winkelsas 7.7 7.00 54.0 0.33 41.2 2142.37 1124.27 1018.10
Joey Eischen 3.6 295.67 54.0 0.67 44.8 2544.05 1697.84 846.21
Sandy Rosario 3.9 48.33 54.0 1.00 46.8 2508.97 1836.50 672.47
Vito Valentinetti 4.5 256.00 54.0 1.00 46.8 2446.77 1783.34 663.43
Whitey Wietelmann 8.1 6.67 54.0 1.00 46.8 2106.81 1494.96 611.85
Pete Craig 7.7 16.33 48.6 1.67 44.9 1671.64 1380.26 291.38
Sam Dyson 2.8 198.67 40.5 0.67 37.3 1420.63 1188.97 231.66
Jorge Rondon 14.0 18.00 0.0 1.00 11.7 196.00 5.47 190.53
Randy Miller 10.3 7.00 40.5 0.67 37.3 912.90 729.24 183.66
Red Witt 4.1 227.67 40.5 1.33 38.4 1324.14 1177.60 146.54
Sean Nolin 5.4 30.00 40.5 1.33 38.4 1232.01 1090.82 141.19
Aaron Brooks 6.7 55.33 43.9 2.67 42.1 1384.31 1255.15 129.16
Mike Rochford 12.4 8.00 0.0 2.33 6.2 153.14 37.60 115.54
Josh Wall 22.5 8.00 4.8 5.67 7.2 314.54 233.10 81.44
Gary Jones 9.0 14.00 0.0 2.00 7.1 81.00 3.75 77.25
Brian Looney 18.9 6.67 3.0 6.00 5.5 252.81 179.88 72.93
Joe Trimble 8.2 19.67 0.0 2.00 7.1 67.85 1.38 66.47

Of course, the model still doesn’t do fantastic: Most of these cases are those cases where there was scant data to work from in the first year. For the top case, our Bayesian estimate still predicted an ERA of 93! That’s still…excessive. But much less excessive than the naive estimate of 189.

Let’s look at the cases where the model performed poorly:

career %>%
    arrange(desc(diff)) %>%
    select(nameFirst, nameLast, ERA_crr, IP_crr, ERA, IP, post_ERA_mdn, naive_se, bayes_se, diff) %>%
    tail(n=20) %>%
    kable(
        row.names=FALSE,
        col.names=c('First Name', 'Last Name', 'Career ERA', 'Career IP', 'First Year ERA', 'First Year IP', 'Bayesian Estimate', 'Naive SE', 'Bayes SE', 'Difference in SE'),
        align='llcccccccc',
        digits=c(NA, NA, 1, 2, 1, 2, 1, 2, 2, 2))
First Name Last Name Career ERA Career IP First Year ERA First Year IP Bayesian Estimate Naive SE Bayes SE Difference in SE
Willis Roberts 4.6 258.33 13.5 1.33 19.2 79.23 212.77 -133.53
Windy McCall 4.1 252.33 20.2 1.33 24.0 259.62 394.29 -134.67
Ray Shore 8.1 61.33 18.0 1.00 23.3 98.59 233.39 -134.79
Fernando Rodriguez 4.3 231.00 27.0 0.67 29.8 515.94 651.83 -135.89
Gino Minutelli 5.2 39.67 9.0 1.00 17.5 14.30 150.82 -136.52
Bob Reynolds 3.1 253.33 20.2 1.33 24.0 295.66 438.43 -142.77
Paul Fletcher 6.8 14.67 0.0 0.33 20.6 45.56 191.79 -146.23
Al Closter 6.7 35.00 0.0 0.33 20.6 44.70 193.58 -148.88
Rick Waits 4.2 1426.00 9.0 1.00 17.5 22.59 175.61 -153.03
Tim Scott 4.1 313.00 9.0 1.00 17.5 23.89 179.23 -155.33
Bunky Stewart 5.9 186.33 18.0 1.00 23.3 145.42 302.99 -157.57
Tim Stoddard 3.9 728.67 9.0 1.00 17.5 25.60 183.86 -158.25
Fred Beene 3.6 287.00 9.0 1.00 17.5 29.09 193.02 -163.93
Josias Manzanillo 4.7 341.00 18.0 1.00 23.3 177.65 348.80 -171.15
Cal McLish 3.8 1524.00 18.0 1.00 23.3 202.22 382.91 -180.69
Rich Yett 4.9 414.00 27.0 0.33 30.9 486.87 674.88 -188.01
Danny Coombs 4.1 392.67 27.0 0.33 30.9 526.39 721.27 -194.88
Jeff Fulchino 4.9 178.00 0.0 0.33 20.6 23.56 247.90 -224.34
Erik Bedard 4.0 1303.00 13.5 0.67 22.3 90.53 337.19 -246.66
Billy Wagner 2.3 902.67 0.0 0.33 20.6 5.35 334.37 -329.02

Again, these are cases where we have little data to work from. However, in these cases, the first-year ERAs were much less extreme; it appears as though the model pulled them up a little too much. This could suggest that ERA tends to decline a little after the first year, which is not something the model accounts for. Notably, however, the differences between the naive and Bayesian estimates are much worse on the end where the naive estimate did more poorly: When the naive estimate does badly, it does really badly. The differences in the squared error terms (the “Difference in SE” column) are much smaller in those cases where the Bayesian estimate does more poorly.

Here’s the final challenge: How do the two estimates do overall? I calculate the root mean squared error (RMSE) below. We want these numbers to be as low as possible:

cat('Naive Estimate RMSE:', sqrt(mean((career$ERA_crr - career$ERA)^2)), '\nBayesian estimate RMSE:', sqrt(mean((career$ERA_crr - career$post_ERA_mdn)^2)))
## Naive Estimate RMSE: 5.979427 
## Bayesian estimate RMSE: 5.291347

Hooray! We’ve made a modest improvement over the naive estimate. It might not be anything to write home about, but this can be interpreted as an improvement of about .69 in our estimates of the ERA. Our Bayesian estimate gets us, on average, .69 earned runs per game closer to the actual career ERA. All this from just their first year of data.

Here I create a similar plot, sampling 20 random pitchers from our career data set. In red, I’ve plotted the actual career ERA. In blue, once again, is the first-year ERA (which was also the naive estimate). And in gray, with error bars, is our Bayesian estimate. It still looks like the model got thrown off on some of those points with very little data, but in general, the Bayesian estimate is closer to the actual career rate. (Also, what’s with that one red dot way up there? Turns out that’s a pitcher who had a great rookie year, but didn’t pitch much afterwards, so his career ERA is a pretty bad estimate.)

set.seed(222)
sample_ERA2 <- career[sample(1:nrow(career), 20), ]
sample_ests2 <- data.frame(
    ERA_crr=sample_ERA2$ERA_crr,
    ERA=sample_ERA2$ERA,
    IP=sample_ERA2$IP,
    lower=qgamma(.025, shape=sample_ERA2$post_ERA_shape, rate=sample_ERA2$post_ERA_rate)*9,
    median=qgamma(.5, shape=sample_ERA2$post_ERA_shape, rate=sample_ERA2$post_ERA_rate)*9,
    upper=qgamma(.975, shape=sample_ERA2$post_ERA_shape, rate=sample_ERA2$post_ERA_rate)*9
)
sample_ests2 <- sample_ests2 %>%
    arrange(desc(median)) %>%
    mutate(k=1:20)

ggplot(sample_ests2, aes(y=k)) +
    geom_point(aes(x=median), color='#595959') +
    geom_errorbarh(aes(x=median, xmin=lower, xmax=upper), color='#595959') +
    geom_point(aes(x=ERA_crr), color='red') +
    geom_point(aes(x=ERA, color=IP)) +
    scale_color_gradient(low='#97fcfa', high='#002dfc') +
    labs(x='ERA', y='Pitcher') +
    theme(axis.text.y=element_blank(), axis.ticks.y=element_blank(), panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())

What is this model missing? (Or: Can we predict the greats with this model?)

As noted above, this Bayesian model is pretty simple. Looking at the estimates of the best rookie pitchers, it’s not going to do very well in terms of predicting the greatest pitchers of all time. Keep in mind that pitchers are evaluated on more than just ERA, for one thing. Some pitchers are skilled at getting strikeouts, while other pitchers might be better at getting batters to hit nice, easy fly balls that get caught. But to create a more complex model, we might want to add some additional variables.

First, it might be important to model change over time. This could come in two ways. The data I used spans baseball from 1945 to the present; it’s possible that the strategies used by pitchers, batters, and managers has changed over the years in ways that affects a pitcher’s ERA. This would mostly have the effect of adding noise to the data, unless there are really sharp changes (perhaps a result of rule changes). The second aspect to the time aspect is that pitchers’ skills may change over their career. Perhaps pitchers get better over time with experience, or perhaps age and potential for injury lead to decreasing performance.

There are other characteristics of pitchers that might be important for a more complete model. Pitchers in the National League also bat, while pitchers in the American League do not. (They have a “designated hitter” instead.) This could mean that there is a different selection criteria for what makes a good pitcher in these two leagues. NL managers might prioritize a decent pitcher who can also bat well, while AL managers don’t need to consider batting performance. The easiest way to account for this would be to create two Gamma distributions, one for each league, and update them separately. Other characteristics could include left- or right-handedness, the team a pitcher is on, any injuries that occur, etc. Pitchers can also specialize: some pitchers are starters, while others are relief pitchers or closers, and so on. I don’t even pretend to understand the complexity of what goes into that decision, but it seems reasonable to expect that the sort of skills that makes one an effective closing pitcher might differ from those that makes one a good starter.

The way to account for all these variables in a Bayesian framework would be to model them in some way: adding a linear trend for performance across career, for example, or creating separate priors for league, specialization, or handedness. The downside to this is that the complexity can mean that conjugate priors are no longer appropriate, if the priors are not all Gamma distributions. However, there are great packages that can handle these complex models using MCMC sampling: R2jags and RStan in R, and PyMC/PyMC3 in Python. They can get around much of the difficulties that arise when trying to model complex Bayesian problems, and I highly recommend them.

In summary, this is a nice, simple model that used empirical Bayes and pitchers’ first-year performance to predict their career performance. It was not a dramatic, Earth-shattering improvement, but we did gain some predictive accuracy over the naive estimate. With a more complex model, we might be able to increase that accuracy even further. However, this shows a fairly simple way of accounting for cases where the amount of information we have varies across our data. This could be useful in a variety of applications: ranking products according to ratings, determining the quality of Reddit comments, predicting top Facebook posts, etc. It allows us to pull in data closer to the average (and adjust our level of uncertainty) when there is little information, and update our predictions as more and more data is gathered. That’s a powerful tool, and it’s where Bayesian models shine.

Now let’s go outside and throw the ball around, shall we?


  1. Yes, I was a data nerd even from a young age.

  2. An earned run means there was no error on the play.

  3. Innings pitched is calculated based on the number of outs. So if a pitcher pitches two outs in the first inning, then gets replaced, they have pitched “0.67” innings. This complicates the analysis quite a bit—a pitcher could pitch against 15 batters, get zero outs, let in 12 runs, and then get replaced, and an IP of 0 would be recorded. If a summary statistic was recorded that tallied up the number of batters a pitcher faced, the problem could be modeled more intuitively as a Binomial process. But that would require going back to raw data to tally up the runs scored during each batter in each game.

  4. I subtract out the first year’s data from the career ERA calculation so we’re not double-counting data.