Similarity Score Methodology

Introduction

“Buddy Hield is the next Stephen Curry”
“Brandon Ingram is a poor man’s Kevin Durant”
“Andrew Wiggins’ upside is Carmelo Anthony, but his floor is James Posey”

So often when we talk about NBA players, we do it through comparisons to other players, and with good reason: comparisons are a quick way to convey a lot of information about a player. For example, if I tell you that a player had a Box Plus/Minus of 7.8 last season, you might get a vague idea of how good he is. If I then tell you that player averaged 19.5 points per game, you might have a slightly better idea, but it’s still far from the full picture. But if I claim that this player is the next Chris Paul, it immediately brings to mind not only how good he is, but also his strengths, weaknesses, and overall playing style. Maybe your mind also queues up a mental highlight reel of Chris Paul-like plays, for good measure. A comparison quickly gives a complete picture of a player that would otherwise require slowly digesting each number in his stat line.

Also, they’re lots of fun!

We’ve created our own set of similarity scores to make these comparisons mathematically, for the purpose of evaluating prospects coming out of college. Our goal is to produce a useful complement to our PNSP model: where the PNSP model answers the question, “How valuable will this player be?”, our similarity scores aim to answer the question, “Who will this player be like?”

For a glimpse of some player comparisons, you can check out Similarity Scores for 2016 NBA Draftees, here.

Data

Our dataset consists of players who entered the league between 1997 and 2016. International players, high school players, and players with incomplete college statistics are excluded. Since record-keeping has become more reliable in recent years, our data contains more players from more recent draft classes. This is only an issue in that it limits the size of our training data; if anything, skewing more recent may help better capture current trends in the NBA.

Methodology

We started by dividing players by position, so that similarity scores could be calculated differently for different types of players. After all, the things you care about when deciding whether two big men are similar are very different from the things you care about when comparing two point guards. Rather than grouping by intuition, we used k-means clustering to form groups based on box-score statistics. This technique lets the data determine how players should be grouped, potentially identifying players who fit best in a group other than their nominal position. Ultimately, though, the three clusters we created ended up corresponding intuitively to three player archetypes: primary ball handler, wing player, and big man. Below is a chart showing these clusters, visualized with Principal Component Analysis (PCA). PCA is a statistical procedure with many interesting uses; here, it’s just a visualization tool, so the important thing to know is that it combines our many original variables into two new ones, which let us view as much of our data as possible in two dimensions. To find a player’s coordinates on this chart, use our table below.
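To make this step concrete, here is a minimal sketch in Python using scikit-learn. The file name, column names, and preprocessing choices are placeholder assumptions for illustration, not our exact pipeline; the key ideas are standardizing the per-40 box-score features, running k-means with k = 3, and projecting onto two principal components for plotting.

```python
# Minimal clustering sketch. "college_box_scores.csv" and its column names are
# hypothetical stand-ins for our per-40-minute college box-score data.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

college_stats = pd.read_csv("college_box_scores.csv")
features = college_stats.drop(columns=["player", "season"])

# Standardize so no single statistic dominates the distance calculation.
X = StandardScaler().fit_transform(features)

# Three clusters, matching the three archetypes described above.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
college_stats["archetype"] = kmeans.fit_predict(X)

# Project onto two principal components purely for the visualization.
coords = PCA(n_components=2).fit_transform(X)
college_stats["pc1"], college_stats["pc2"] = coords[:, 0], coords[:, 1]
```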

After bucketing players into these archetypes, we created a matrix of every possible player combination within each archetype. Because we only paired players within the same position group, wing players are only compared to other wing players, big men to other bigs, and primary ball handlers to other primary ball handlers. For each pair, we calculated the similarity score by taking the absolute differences between the players in each statistic and physical measurement, then weighting and summing those differences and adjusting by strength of schedule. We used player-level basic box-score statistics (points, rebounds, assists, etc.) adjusted per 40 minutes, age/experience, and physical measurements (e.g. body fat, wingspan, and vertical). Box-score statistics were adjusted by strength of schedule to account for level of competition and more precisely capture the differences in statistics produced by college players. For example, Stephen Curry at Davidson faced a much different slate of opponents than Devin Booker at Kentucky. Even though both shot roughly 40 percent from three, they did so against differing levels of competition and under different circumstances. By adjusting for strength of schedule, we make players like Curry and Booker “less similar,” since their respective paths to shooting 40 percent from three were different.

Below are the mathematical formulas for calculating the college similarity score and NBA similarity score for a given pair of players. w_{1}, w_{2},...,w_{n} represent the weights corresponding to each statistic p_{1}, p_{2},...,p_{n}. The college similarity score is the sum of the weighted (w_{i}) differences in each box-score statistic p_{i} between player x and player y, adjusted by e raised to the absolute value of the difference between the two players’ strength of schedule divided by fifty. We divide by 50 to dampen the impact of the strength-of-schedule adjustment. Age/experience and physical measurements are not adjusted for strength of schedule. NBA similarity scores do not include age/experience or physical measurements and are not adjusted for strength of schedule; they are simply the sum of the weighted (w_{i}) differences in each statistic p_{i} between player x and player y.

\displaystyle SS_{college} = e^{\frac{1}{50}\ |SOS_{x}-SOS_{y}|} \sum_{i = 1}^{10} |w_{i}p_{i,x}-w_{i}p_{i,y}| + \sum_{i = 11}^{14} |w_{i}p_{i,x}-w_{i}p_{i,y}|

\displaystyle SS_{nba} = \sum_{i = 1}^{10} |w_{i}p_{i,x}-w_{i}p_{i,y}|
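In code, the formulas above translate roughly as follows. This is a sketch that assumes NumPy arrays in which the first ten entries are the box-score statistics and the last four are age/experience and the physical measurements; the function names and input layout are illustrative, not our production implementation.

```python
import numpy as np

def college_similarity(p_x, p_y, sos_x, sos_y, weights):
    """Raw (pre-percentile) college similarity score for players x and y.

    p_x, p_y : length-14 arrays (ten box-score stats, then age/experience
               and physical measurements)
    sos_x/y  : strength-of-schedule values for each player's college slate
    weights  : length-14 array of statistic weights for this archetype
    """
    diffs = np.abs(weights * p_x - weights * p_y)
    sos_factor = np.exp(np.abs(sos_x - sos_y) / 50.0)
    # Only the ten box-score terms are scaled by the SOS factor.
    return sos_factor * diffs[:10].sum() + diffs[10:].sum()

def nba_similarity(p_x, p_y, weights):
    """Raw NBA similarity score for length-10 arrays of box-score stats."""
    return np.abs(weights * p_x - weights * p_y).sum()
```

Note that smaller raw values indicate more similar players, since the score is a sum of absolute differences; the percentile step described below converts these raw values to the 0-100 scale used in our results.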

To get the weights for each statistic, we used a stochastic optimization algorithm known as simulated annealing. For each position archetype, we searched for the set of college and NBA weights such that the college similarity scores matched up most closely with the NBA similarity scores, as measured by the adjusted R-squared (percent of variance explained) between the two sets of scores. Note that each position archetype has its own set of weights. For further details on simulated annealing, see https://en.wikipedia.org/wiki/Simulated_annealing.
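Below is a simplified sketch of that search for one archetype. It assumes a helper, score_fit, that computes college and NBA similarity scores for every player pair in the archetype and returns the adjusted R-squared between the two sets; that helper, the cooling schedule, and the perturbation step are illustrative choices rather than our exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def anneal_weights(score_fit, n_college=14, n_nba=10,
                   n_iter=20000, t_start=1.0, t_end=1e-3):
    """Search for college + NBA weights that maximize score_fit (adjusted R^2)."""
    w = np.ones(n_college + n_nba)            # start from uniform weights
    current_fit = score_fit(w[:n_college], w[n_college:])
    best_w, best_fit = w.copy(), current_fit
    for step in range(n_iter):
        # Geometric cooling schedule from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / n_iter)
        # Propose a small random perturbation of one weight (kept non-negative).
        proposal = w.copy()
        i = rng.integers(len(w))
        proposal[i] = max(0.0, proposal[i] + rng.normal(scale=0.1))
        fit = score_fit(proposal[:n_college], proposal[n_college:])
        # Always accept improvements; accept worse moves with a temperature-
        # dependent probability so the search can escape local optima.
        if fit > current_fit or rng.random() < np.exp((fit - current_fit) / temp):
            w, current_fit = proposal, fit
            if fit > best_fit:
                best_w, best_fit = proposal.copy(), fit
    return best_w, best_fit
```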

Once we calculated the raw similarity score for each player combination, we converted it to the percentile the combination falls in within its position grouping. This gives us a relative, more interpretable value between 0 and 100.
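The percentile conversion itself is simple; here is a small sketch with hypothetical column names, assuming that lower raw scores mean more similar players (so the most similar pairs land near 100, as in the examples below).

```python
import pandas as pd

# Hypothetical pairwise raw scores; in practice this table holds every pair
# within each archetype.
pairs = pd.DataFrame({
    "archetype": ["wing", "wing", "wing", "big", "big"],
    "raw_score": [3.2, 7.9, 1.4, 5.0, 2.1],
})

# Rank descending within each archetype so the smallest raw score (the most
# similar pair) gets a percentile near 100.
pairs["similarity"] = (
    pairs.groupby("archetype")["raw_score"]
         .rank(pct=True, ascending=False)
         .mul(100)
)
```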

Results / Interpretation

The similarity score algorithm gives us some interesting takeaways. First off, as one might expect, superstars often do not have many comparable players. For example, Kevin Durant’s highest-ranked comparable players were Anthony Bennett (81.5), Jabari Parker (81.0), and Keith Van Horn (79.7); contrast that with a role player like Otto Porter Jr., whose top comparable players (Derrick Brown, Kelly Oubre, Stanley Johnson, and Mike Miller) all received scores around 99. This makes sense, as there are more role players in the NBA than superstars. But some superstars do have high comparison scores, such as Russell Westbrook, whose most similar players included Marquis Teague (99.8), Tyler Ennis (98.5), and D.J. Strawberry (98.3). Conversely, there are role players without many comparable players, such as Kelly Olynyk, who had only four players score 80 or higher in similarity.

It is important to remember, when interpreting these scores, that the 0 to 100 similarity values are percentiles and are therefore uniformly distributed between 0 and 100. The interpretation is simple: Kevin Durant and Anthony Bennett’s similarity score, for example, can be read as, “Kevin Durant and Anthony Bennett rank in the 81st percentile for player similarity among wings.”

In addition to the similarity scores themselves, the optimized weights for each player archetype tell us about important differences between position groupings and which college statistics translate to the NBA. For bigs, 3-point percentage was the most heavily weighted statistic, while wings had a more even spread of weights across variables; this makes sense, since wings often contribute more uniformly across the box score. For primary ball handlers, free throw percentage, assists, and steals were the highest-weighted variables. The differences in which variables are most heavily weighted by position grouping reflect some of the differences within each group. For instance, there are defensive point guards (e.g. Ricky Rubio and Patrick Beverley) who sit atop the league in steals, playmaking point guards (e.g. Ricky Rubio, again, and John Wall) who lead in assists, and efficient scoring point guards like Stephen Curry and Damian Lillard who shoot the ball well and thus likely have a high free throw percentage.

To help you better understand how we would use similarity scores in prospect evaluation, let’s take a look at the Minnesota Timberwolves’ fifth overall pick in the 2016 NBA Draft, Kris Dunn. Prior to the draft, Dunn profiled as an older player with good size, a good motor, playmaking ability, and defensive prowess, but questionable shooting ability and decision making. Perhaps the biggest question surrounding Dunn’s NBA potential was whether he could become an efficient scorer. While Dunn shot 37.2 percent from three his junior year at Providence, the questions around his shot-making stemmed from his inconsistent shooting form and poor free throw shooting (69.5 percent that year). If we look at Dunn’s ten most similar college players (in the table below), we can get a better idea of his likelihood of becoming a good NBA shooter. The first thing we notice is that, while this is certainly an impressive group of comparable players, a majority of them have not become good NBA shooters. Still, most have carved out productive (even extremely productive, e.g. John Wall) NBA careers, which bodes well for Dunn. Based solely on this list of comparable players, we would not expect Kris Dunn to become a good NBA shooter, but rather (most likely) an effective distributor and defender, eerily similar to a point guard already in Minnesota.

Conclusion

The similarity score algorithm is another tool that can be used to evaluate college basketball prospects. Coupled with the PNSP model and scouting, it lets us start to paint a broader picture of how a given player’s game will translate to the NBA level.

Check out Similarity Scores for 2016 NBA Draftees, here.

Written by Marc Richards and Jack Werner