This year’s National League Cy Young race was pretty much a toss-up, with each of Jake Arrieta, Zack Greinke, and Clayton Kershaw putting up numbers we haven’t seen in a decade or more.
By now we know that Arrieta wins the award, but being the Cubs homer I am, I started digging into the data a few weeks ago in attempt to show that Arrieta should win the award. However, as is often the case when walking into an analysis with preconcieved notions of its findings, I was left unable to make my case with a straight face.
Unable to confidently make the case that any of the contenders were more deserving of the award than their peers, I decided to turn my work into an article highlighting the historic years each of them had. Unfortunately, the article never wound up published, but you can still read it here, though it’s obviously outdated now.
Since I tend to use this site more for technical posts, it seemed like a good idea to walk through a couple pieces of my work – if you’re interested in everything, I’ve pushed it up to GitHub.
Preprocessing
In order to show the stats I cared about and their progression throughout each pitcher’s season, I needed to do some preprocessing of the data. Specifically, I needed to calculate a variety of statistics that are not included in the game logs from Baseball Reference.
After loading the dataset and transforming the innings pitched (IP) field to a numeric value, you’ll see a fairly large section of code in the notebook that looks like this:
Here we’re adding new, cumulative statistics to each pitcher’s DataFrame (e.g. we can easily say what Arrieta’s ERA was after his fourth start, or what his batting average on balls in-play (BABIP) was in the second half of the season).
Visualizing their seasons
Now that we have various statistics on a rolling basis, we need a way to compare their performances throughout the season. Thankfully, this is a perfect use case for small multiples, which is a technique meant specifically for comparison.
To do so, we can create a dictionary where each pitcher is a key, and the value
is another dictionary containing that pitcher’s DataFrame, as well as a color
and line style which we’ll use in our plot. Then, we’ll create a grid of empty
subplots, which will be populated by looping through our PITCHERS
dictionary.
The resulting output is a 3 x 5 grid of charts, where each row corresponds to a pitcher, and each column is a statistic.
Again, this technique is meant for comparing different dimensions (people, cities, departments, etc.) against one another.
For instance, looking down the left-most column, we can see that batters put the ball in play (IP%) about equally against Arrieta and Greinke, but less so against Kershaw. Looking down the far right column, we can see that Kershaw was put in play less often due to his stronger ability to strike hitters out (K%).
Comparing batted ball exit velocity
With PITCHf/x installed in every MLB park, we can also look at data around each pitch made throughout the season. Baseball Savant is a great source of this data.
Since it still wasn’t clear who should win the award after looking at a variety of stats, it seemed interesting to answer the most basic question: Which pitcher was hit harder? We know there’s a significant relationship between a batted ball’s exit velocity and its likelihood to wind up a hit, so this should give us some indication of who was the more difficult pitcher to face.
Looking at the observed distributions of their batted ball exit velocity doesn’t tell us much – Arrieta’s mean exit velocity was 85.0 MPH, Greinke’s 88.4, and Kershaw’s 84.9. Those numbers are pretty close – so close that we shouldn’t assume they’re statistically significant, so let’s test that using the bootstrap.
With bootstrapping, we generate N random samples of our dataset (typically 1,000 or 10,000). Since we’re interested in speaking about the “average” batted ball exit velocity, we take the mean of each random sample, resulting in an approximation of the mean’s true distribution. From there, we can look at the 95% confidence intervals to test for significance.
While the above chart doesn’t explicitly show their 95% confidence intervals, it’s pretty clear that Greinke’s mean exit velocity is significant when compared to Arrieta and Kershaw – allowing us to say that, on average, Greinke was hit harder throughout the season than both Arrieta and Kershaw. We cannot confidently say there was a difference in exit velocity when comparing Arrieta and Kershaw to each other though.
The chart above is especially interesting in the context of our small multiples charts. In particular, that Greinke had the lowest ERA, batting average on balls in play (BABIP), and extra base hit rate (XBH%) of the three, despite allowing harder contact. This suggests that Greinke received a bit more help from his defense than Arrieta and Kershaw.
If you’re interested in more analysis on the season each of these three had, Dave Cameron at FanGraphs has an excellent write-up explaining the rationale behind his vote.
Hope you’ve enjoyed the post, and let me know if you have any questions.