U.S. Open Data — Gathering and Understanding the Data from 2018 Shinnecock

After losing in a playoff to make it out of the local qualifying for the 2018 US Open at Shinnecock, I’m stuck at my apartment watching everyone struggle, wondering how much I’d be struggling if I was there myself.

Besides on TV, the US Open website offers some other way to follow what players are doing. As shown here, they very generously give us information on everyone’s shots on different holes. We’re able to see where people hit the ball, on which shot, and what their resulting score was on the hole. For example, why in the world did Tony Finau, the currently second ranked longest hitter on tour, hit it short off the first tee, leave himself 230 yards to the hole where he makes bogey?

Why didn’t Tony rip D?

One of the cool things these images show is the groupings of all the shots on a hole, like the tee shots here. And when I see very specific and interactive data like we have here, I know it comes from somewhere that I’m able to see myself. So I figured I should grab that data and do some cluster analysis on different holes to see if there are certain spots that players like to hit it.

Here, I’ll go through the data we have, what the values and the numbers mean, and also the code I wrote to eat up the data and display the graphs. Once I have this part going, I’ll be able to perform further analysis to most things that come to mind.

Any questions, comments, concerns, trash talking, get in touch: twitter, contact.

Current Posts

Using Clustering Algorithms to Analyze Golf Shots

Finding the data

First step was to search for where the data for the hole insights page was coming from. As always, open the dev tools, click on the network tab, and find what’s getting called with a pretty name.


The file itself is quite dense and has all the information, which is really cool! It has IDs for all the players, all the shots they have on the hole, which include the starting distance from the flag and the ending distance to the flag.

First off, we’re given a list of Ps, meaning an array of player information, like this:

It looks like we have first name, player’s ID, whether or not they’re an amateur, last name, nationality, scoreboard name. The important part of this information is the ID, where we’ll be able to match players to shots.

Next, we’re given a few stats on the hole for the day:

Which match up nicely with what we’re shown on the image

Birdies, Pars, and Bogeys, oh my


See that HP tag above? That’s where the information on the shots players have hit are stored. Each shot in the JSON is listed like the following, which is Tiger’s tee shot on number 10.

Going through acronyms again, we have distance to pin, distance of shot itself, is the shot in the hole, X coordinate, Y coordinate, ZID which I don’t know what that means.

The first step for me was to go through and make sure I understand what those coordinates mean. To do this, I took a little time to try to match up the X and Y coordinates of the shot ending points, and where they tell us the flag is located.

In order to calculate the distance of the tee shot, or rather, the location of the flag, we’re going to do some math using the Pythagorean Theorem!

Neato, the math checks out. In terms of clustering though, it doesn’t make much of a difference other than knowing that we have correct data.

Cleaning and formatting

I keep saying that the main part of data work is gathering and formatting and cleaning the data before throwing the variables into a function that’s already been written. And that’s the case here too. The goal is to have an array of [x, y] coordinates for the result of the tee shot.

Clean looking code, and gets us the differences we’re looking for.

Scikit-Learn Cluster Algorithms

I’ll go a little further than only understanding the data, and also show the code that I use to find the clusters of the data.

To do this, we take that locs array and throw it into the sklearn algorithms.

Now that’s super easy. The resulting x_label variables are arrays of the same length as the locs variable, meaning 156 since there are 156 players. The number at each location in the array is what group the shot belongs to. So for example, if we’re using the K Means algorithm with 3 groupings, the array has values of 0, 1, and 2, like the following:

Graph time

The code that I’ll use to show the resulting clusters is the following.

I’m going to hide the cluster results until part two which will have the pretty pictures, but now I’ll show a graph that just lists the points.

To do this, all I’m going to do is set num_clusters = 1 and plotting_labels = km_labels, then it’ll spit out a graph with no heading, no x or y labels, and a bunch of blue dotted dots.

Blue dotted dots

Do you see some clusters?? I do, and now we want to see how those algorithms above find them. Look for more posts to come using this data.

Like this:

Like Loading…