As with many things in sports analytics, it started with Bill James.
James was looking for a way to translate a baseball team’s runs scored and runs allowed into wins and losses. He noticed that the relationship could be expressed as follows:
(W / L) = (RS^2 / RA^2)
The above could then be used to express a team’s won-lost percentage as a function of runs scored and runs allowed:
WL% = RS^2 / (RS^2 + RA^2)
Because of the squared terms in the equation above, James chose to dub this the Pythagorean formula. James found that when he used this formula to try to predict a team’s win total given its runs scored and runs allowed, he would usually get within plus or minus four wins of the team’s actual win total.
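A minimal sketch of the formula in Python may help make it concrete. The function names, the 162-game default, and the run totals in the example are my own illustrative assumptions, not figures from James.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Won-lost percentage implied by runs scored and runs allowed."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)


def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected win total over a season of `games` games (162 is an assumption)."""
    return games * pythagorean_win_pct(runs_scored, runs_allowed, exponent)


# Hypothetical team: 800 runs scored, 700 runs allowed.
print(round(pythagorean_win_pct(800, 700), 3))
print(round(pythagorean_wins(800, 700), 1))
```

A team that scores exactly as many runs as it allows comes out at .500, which is a quick sanity check on the structure.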
As it turns out, James’ formula is actually a logit model:
ln(WL% / (1 – WL%)) = β1 × ln(RS / RA)
Solving the expression above for WL% yields:
WL% = RS^β1 / (RS^β1 + RA^β1)
Where — in the case of baseball — β1 has a value of approximately 2.
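For readers who want the intermediate steps, the algebra can be sketched as follows. Here p is my shorthand for WL%; it is not notation James used.

```latex
\begin{align*}
\ln\frac{p}{1-p} &= \beta_1 \ln\frac{RS}{RA} \\
\frac{p}{1-p} &= \left(\frac{RS}{RA}\right)^{\beta_1} \\
p &= \frac{(RS/RA)^{\beta_1}}{1 + (RS/RA)^{\beta_1}}
   = \frac{RS^{\beta_1}}{RS^{\beta_1} + RA^{\beta_1}}
\end{align*}
```

The second line is the ratio of wins to losses, since p / (1 - p) = W / L, so setting β1 = 2 recovers James' original relationship.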
Of course, James’ finding eventually bled over into other sports*, including basketball.
* The person often credited with starting this movement into other sports is former STATS researcher and current Houston Rockets general manager Daryl Morey, although that narrative may be apocryphal. Morey definitely did do work in this area, but whether or not he was the first is up for debate.
One of the earliest online references I could find for using James’ structure to predict team wins in the NBA was Dean Oliver’s old website Journal of Basketball Studies. Oliver recommended an exponent of 16.5 to convert points scored and points allowed into wins:
E(W) = G × (PS^16.5 / (PS^16.5 + PA^16.5))
In the first edition of Pro Basketball Prospectus (2002), John Hollinger used a simple linear model to predict team wins:
E(W) = (G / 2) + 2.7 * (Avg. Pt. Diff)
But by the 2005 edition of his annual — now titled Pro Basketball Forecast — Hollinger settled on a Pythagorean model with an exponent of 16.5, the same as Oliver.
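To see how the two approaches compare on a single team, here is a rough sketch in Python. The function names are mine, and the per-game scoring averages in the example are hypothetical values chosen only for illustration.

```python
def linear_expected_wins(avg_point_diff, games=82):
    """Simple linear model: half the schedule plus 2.7 wins per point of differential."""
    return games / 2 + 2.7 * avg_point_diff


def pythagorean_expected_wins(points_scored, points_allowed, games=82, exponent=16.5):
    """Pythagorean model with a basketball-sized exponent."""
    ratio = (points_scored / points_allowed) ** exponent
    return games * ratio / (1.0 + ratio)


# Hypothetical team: 103.0 points scored and 96.6 points allowed per game.
ps, pa = 103.0, 96.6
print(round(linear_expected_wins(ps - pa), 1))
print(round(pythagorean_expected_wins(ps, pa), 1))
```

Both functions return exactly half the schedule for a team with a point differential of zero, but they diverge as the differential grows because the Pythagorean curve depends on the ratio of points rather than the difference.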
When I launched Basketball-Reference.com back in 2004, I found that an exponent of 14 produced the lowest root mean-square error (RMSE) across all team seasons in the NBA:
E(W) = G × (PS^14 / (PS^14 + PA^14))
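One way an exponent like this could be chosen is a simple grid search that keeps the candidate with the lowest RMSE. The sketch below is an assumption about the general approach, not a record of the actual procedure or data. The team seasons are synthetic: their win totals are generated from an exponent of 14, so the search should recover that value.

```python
import math


def pythagorean_wins(points_scored, points_allowed, games, exponent):
    """Expected wins from a Pythagorean model with the given exponent."""
    ratio = (points_scored / points_allowed) ** exponent
    return games * ratio / (1.0 + ratio)


def rmse(actual, predicted):
    """Root mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def best_exponent(seasons, candidates):
    """Return the candidate exponent with the lowest RMSE.

    Each season is a tuple: (points_scored, points_allowed, games, actual_wins).
    """
    actual = [wins for _, _, _, wins in seasons]

    def score(exponent):
        predicted = [pythagorean_wins(ps, pa, g, exponent) for ps, pa, g, _ in seasons]
        return rmse(actual, predicted)

    return min(candidates, key=score)


# Synthetic seasons (NOT real data): wins are generated with exponent 14.
scoring = [(103.0, 96.6), (95.2, 99.1), (100.4, 100.9), (108.3, 101.5)]
synthetic = [(ps, pa, 82, pythagorean_wins(ps, pa, 82, 14.0)) for ps, pa in scoring]

candidates = [10.0 + 0.5 * i for i in range(21)]  # 10.0, 10.5, ..., 20.0
print(best_exponent(synthetic, candidates))
```

With real team seasons the minimum would not be an exact zero, and a finer grid or a numerical optimizer would be a natural refinement.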
The table below presents the RMSEs (per 82 games) for these three models by decade, as well as across all seasons:
|Seasons|Simple Linear|Exponent = 14|Exponent = 16.5|
|---|---|---|---|
|1947-1949|7.29|4.95|6.58|
|1947-2013|3.64|3.24|3.68|
A few notes about these results:
- An exponent of 14 is superior in five decades (1940s, 1950s, 1970s, 1990s, and 2010s).
- The simple linear model is superior in two decades (1960s and 1980s).
- The simple linear model and an exponent of 14 produce essentially the same result in the 2000s.
- An exponent of 16.5 does not produce the lowest RMSE in any decade.
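For anyone who wants to reproduce this kind of comparison, here is one plausible way to compute an RMSE "per 82 games". The rescaling step (multiplying each team's error by 82 divided by its games played) is my assumption about how seasons of different lengths are put on a common footing, and the numbers in the example are made up.

```python
import math


def rmse_per_82(actual_wins, predicted_wins, games_played, scale_to=82):
    """RMSE of win predictions, with each error rescaled to an 82-game season."""
    squared_errors = []
    for actual, predicted, games in zip(actual_wins, predicted_wins, games_played):
        # Assumed normalization: express the miss as if the season had 82 games.
        error = (actual - predicted) * scale_to / games
        squared_errors.append(error ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: two teams, both in 82-game seasons.
print(round(rmse_per_82([50, 30], [48.0, 33.0], [82, 82]), 2))
```

Under this convention, a one-win miss in a 41-game season counts the same as a two-win miss in an 82-game season.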
From the above it’s clear that although an exponent of 14 does the best overall, it does not do as well in extreme eras. Here’s a chart of average points scored per game by decade:
Since the shot clock was not introduced until the 1954-55 season, scoring levels were way down for the 1940s and the first half of the 1950s. On the other hand, scoring levels were never higher than they were in the 1960s, as the average team scored 115 points per game.* Not coincidentally, these three decades have the highest RMSEs when using an exponent of 14.
* To help put that into perspective, the last NBA team to average at least 115 points per game was the Golden State Warriors way back in the 1991-92 season.
To try to fix this problem I decided to blend the best of the simple linear model and the Pythagorean method by incorporating average point differential into a logit model. This resulted in the following formula to predict team wins:
E(W) = G * (1 / (1 + e^(-0.13959 * (Avg. Pt. Diff))))
For example, the 2012-13 San Antonio Spurs had an average point differential of +6.40 points. The predicted win total for the Spurs using the model above is:
E(W) = 82 * (1 / (1 + e^(-0.13959 * 6.40))) = 58.2
In this case the Spurs actually won 58 games, so the formula was right on.
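The same calculation can be sketched in a few lines of Python. The function and constant names are my own; the coefficient and the Spurs' point differential come from the text above.

```python
import math

LOGIT_COEF = 0.13959  # coefficient on average point differential, from the formula above


def logit_expected_wins(avg_point_diff, games=82, coef=LOGIT_COEF):
    """Expected wins from the logit model on average point differential."""
    return games / (1.0 + math.exp(-coef * avg_point_diff))


# 2012-13 San Antonio Spurs: average point differential of +6.40.
print(round(logit_expected_wins(6.40), 1))  # 58.2
```

Because the model depends only on the point differential and not on the scoring level, a team with a differential of zero is predicted to win exactly half its games in any era.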
Here’s how the logit model incorporating average point differential performs compared to a Pythagorean model with an exponent of 14:
|Seasons|Logit (Avg. Pt. Diff)|Exponent = 14|
|---|---|---|
|1947-1949|3.45|4.95|
|1947-2013|3.12|3.24|
As you can see, the RMSEs in the two lowest scoring decades (1940s and 1950s) and two highest scoring decades (1960s and 1980s) have been reduced, in some cases dramatically so. And with the minor exception of the 1970s, the RMSEs in the other decades were almost identical.
So what is the upshot of all of this? I think there are two important things to note:
- Unless there is a major change in the way the game is played, using a Pythagorean model with an exponent of 14 is just fine for the modern era, in particular because the structure is already ingrained in the minds of most analysts.
- Websites or studies with a historical bent should use a different structure, something like the logit model above that incorporates average point differential. That’s not to say the model above is the best — there may be another model that’s superior — but it’s clearly better than the basic Pythagorean model in extreme eras.