Saturday, November 24, 2012

The Death of an Advanced Statistic: Plus Minus Injected with Dirty Box Score Stats

Recently, the highly respected adjusted plus-minus model by Jeremias Engelmann was tossed out in favor of a hybrid method that uses not just plus-minus but conventional box score stats too (along with height and, as the author has indicated, more non-conventional stats like charges in the future). You can see the new stats and layout here, but the old stats are lost in the internet's foggy past.

For decades, quantifying individual basketball players was limited to the simple box score, and for most of that time the focus was mainly on the major counting stats of points, rebounds, and assists, with blocks, steals, and field-goal percentage (technically a stat derived from field goal attempts and makes) as the secondary stats. There are also tertiary stats: free throws including percentage, three pointers, offensive and defensive rebounds on their own, fouls, minutes, and turnovers. What's obvious, however, is how arbitrary some of the stats are. No reasonable person would conclude you can fully account for individual defense from rebounds, blocks, steals, and fouls (though some do anyway). And why don't passes that lead to a player drawing a shooting foul count as assists? There's also a family of pseudo box score stats, ones that aren't included in traditional box scores: charges, hockey assists, and-1's, dunks made, shots challenged, etc.

There are obviously problems with looking only at the box score, yet the thinking of the basketball community was dominated by the primary stats, especially scoring, and to lesser extents the secondary and tertiary stats. Having the greatest amount of an arbitrary stat was tantamount to being the undisputed greatest. Your name was etched forever by collecting the most assists in a specified time period, and that was that. You could have a fluke game or perform in the luckiest circumstances, grabbing rebounds against an undersized player with a hurt ankle, but it didn't matter as long as you hit a nice round number.

The basketball statistics revolution took these numbers further, delving into them and creating complicated outputs like the all-in-one measures of PER, Win Shares, and Wins Produced. There are also clever but intuitive measures like true-shooting percentage, which is like field-goal percentage but adjusted for free throws and three pointers, and rebounding percentage, which is how many rebounds you grab out of how many were available. The latter is simpler to interpret than total rebounds because it has no bias toward high-pace games or opponents who keep missing and creating more opportunities. But those complicated stats are still dependent on the box score, and as such they have blind spots. There are guys like Jose Calderon or Carlos Boozer who are great at collecting stats but don't seem to help their team win, because their weaknesses aren't recorded by the box score. Then there are guys who don't really tally up big numbers but appear to be very valuable, like Shane Battier and Nick Collison. Coaches and their greatest cheerleaders of conventional wisdom, most NBA "analysts", complained that people focused too much on stats and lost the important details, the nitty-gritty like a good screen, boxing out, proper double teaming, or the effect of floor spacing. Overall, box score models worked pretty well and aligned with previous thinking because they relied on what people were focusing on anyway, but the coaches were right. The game is more nuanced.
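To make the true-shooting idea concrete, here is a minimal sketch. It assumes the commonly published formula, which counts a free-throw attempt as 0.44 of a shooting possession; the stat line at the bottom is made up purely for illustration.

```python
def field_goal_pct(fgm, fga):
    """Plain field-goal percentage: makes over attempts."""
    return fgm / fga


def true_shooting_pct(points, fga, fta):
    """Points per two 'shooting possessions'.

    Assumption: a free-throw attempt counts as 0.44 of a possession,
    the weight used in the commonly published formula.
    """
    return points / (2.0 * (fga + 0.44 * fta))


# Hypothetical stat line (not a real game): 6-of-10 from the field with
# two threes (14 points) plus 6-of-8 at the line (6 points).
fg = field_goal_pct(6, 10)
ts = true_shooting_pct(20, 10, 8)
print(f"FG% = {fg:.3f}, TS% = {ts:.3f}")
```

The gap between the two numbers is the credit for threes and free throws that plain field-goal percentage ignores.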

Plus minus is simple. It's about how your team scores and gives up points when you're on the court versus off the court. It's a direct measure of what people really want to know, winning, and it bypasses the middlemen of the box score stats. The NBA game is a billion-dollar business, but it all reduces to a score at the end of a game. You can estimate who accounts for a win by who takes the shots, but this is imprecise, like measuring the amount of rainfall by how many clouds are in the sky. Plus minus focuses on the only thing that matters: outscoring your opponent.
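A rough sketch of the raw on/off calculation, under some assumptions: the game is split into stints, each with a set of players on the court, points for and against, and a possession count. The stint format, the field names, and all the numbers are hypothetical.

```python
def net_rating(stints):
    """Point margin per 100 possessions over a list of stints."""
    margin = sum(s["team_pts"] - s["opp_pts"] for s in stints)
    possessions = sum(s["poss"] for s in stints)
    return 100.0 * margin / possessions


def on_off(stints, player):
    """Team net rating with the player on the court minus with him off.

    Assumes the player has at least one stint on and one stint off.
    """
    on = [s for s in stints if player in s["on_court"]]
    off = [s for s in stints if player not in s["on_court"]]
    return net_rating(on) - net_rating(off)


# Made-up stints for two hypothetical players, "A" and "B".
STINTS = [
    {"on_court": {"A", "B"}, "team_pts": 24, "opp_pts": 20, "poss": 20},
    {"on_court": {"A"}, "team_pts": 12, "opp_pts": 10, "poss": 10},
    {"on_court": {"B"}, "team_pts": 9, "opp_pts": 11, "poss": 10},
]

for name in ("A", "B"):
    print(name, round(on_off(STINTS, name), 1))
```

Nothing here accounts for who else was on the floor, which is exactly the weakness the adjusted versions try to address.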

There are obviously problems with plus minus. The first is who else is on the court. If you're a fifth Beatle in a great starting lineup, you'll seem better than you are, while feasting against weak competition in garbage time doesn't indicate greatness. But with modern computers an adjustment is possible. This is the typical adjusted plus minus model. The second problem is harder to get around: some players tend to play only with certain players, and if a starting center and a backup center only ever replace each other, it's impossible to estimate their individual values. As a result, the plus minus numbers, even when adjusted, are noisy, with strange answers and extreme values. A technique to reduce the variation and the extreme values is ridge regression, where instead of just minimizing the sum of the squared errors, the size of the coefficients is also penalized. (The coefficients are basically the player ratings, or an adjusted plus minus.) Engelmann's model, using ridge regression, became popular for its intriguing and reasonable results. It was the latest generation in basketball statistics.
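The two-center problem and the ridge fix can be sketched on a toy example. This is not Engelmann's model, just the textbook closed form, ratings = (X'X + lambda*I)^-1 X'y, on made-up stints in which a starting center and a backup only ever replace each other while a third teammate always plays. The stints, the margins, and the lambda value are all assumptions for illustration.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ridge(rows, margins, lam):
    """Ridge ratings: minimize squared error plus lam * sum(rating^2)."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, margins)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical stints. Columns: [starting center, backup center, teammate].
# The centers never share the floor and the teammate never sits.
STINTS = [[1, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 1]]
MARGINS = [6.0, -2.0, 4.0, 0.0]  # made-up point margins per stint

try:
    ridge(STINTS, MARGINS, lam=0.0)  # ordinary least squares
except ValueError as err:
    print("no penalty:", err)

ratings = ridge(STINTS, MARGINS, lam=1.0)
print("with penalty:", [round(r, 2) for r in ratings])  # roughly [2.57, -1.43, 1.14]
```

Without the penalty there is no unique answer, because the teammate's column is just the sum of the two centers' columns. With it, the ratings come out finite and pulled toward zero. A real model would have rows for both teams' players across many thousands of stints, and the penalty strength would have to be chosen somehow, typically by checking out-of-sample prediction.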

However, that's been replaced by a method now called xRAPM that uses both adjusted plus minus and a box score metric. With more information prediction can obviously be improved, but the problem is that a new way to look at the game is being discarded for what people have essentially been doing for decades. It's not completely rejected because it's a hybrid model, but the basketball community already has enough statistics based on the box score. The new model isn't perfect, and it shares the flaws of other box score metrics: Al Jefferson, for example, rates as a positive on defense because he grabs rebounds and picks up blocks, even though he forgets to defend. Adjusted plus minus is a new perspective, and now the most prominent publicly available version is gone.

There's also an issue in how models are evaluated. As I've written before with Wins Produced, the problem with testing individual player models by the outcome of a team event is that you're somewhat arbitrarily assigning a value to a discrete event like a rebound. You're also falling into the trap of accidentally testing the efficacy of offensive and defensive efficiency, which is already known: the points you score per possession, and the number of your possessions along with the opponent's totals, will predict wins with 95% accuracy over a season. You can calculate your offensive and defensive efficiency with box score stats, so creating a model with box score stats to predict wins is in effect building a model for team efficiency. But that's the problem: it's team efficiency. Who gets the credit for a rebound? The player who defends the star player well, causing a miss, or the player rebounding? How much credit should each have? Obviously, there is no concrete answer, and it depends on the context of how the shot was taken and where the rebound was.

Players who are good at collecting box score stats have generally been overrated for years. This is compounded by the fact that a player without a team is useless, and a fit or a role is just as important as who the player is. Teams know this, and the smart ones at least won't let a bunch of stat-grabbers on the court at the same time without, say, a competent defender who doesn't call for the ball each possession. Without caring about team synergy you get the Washington Wizards of the past couple seasons.

One method to test the validity of a model is to see how a team does when a player is removed for various reasons, from injuries to a trade. If those rebounds can't be grabbed by someone else, you'll be able to see it; if the scoring is indeed beneficial, the offense should suffer; and if a glue guy is actually important structural glue, the numbers should bear it out.

Fortunately, sciences work with incremental improvements and fits and starts. Someone could fill in the void with a different model. The data are available. A new technique could shed light on notoriously hard to track players like rookies, fringe players with few minutes, and the outliers. I've already looked at how Jose Calderon breaks the Wins Produced model, and I'm planning a companion piece not on one player but on the now common slightly undersized power forwards who can score and pick up rebounds but offer little resistance on defense and rely on put-backs for their high-percentage plays. Perhaps this new xRAPM method needs to be tested, with its weaknesses located and battered mercilessly. It's the only way toward growth.

Monday, November 12, 2012

Defense, Offense, and Pace: An Evaluation

A common lament about fast-paced, high-scoring teams is they have a ceiling; you can't win unless you focus on defense. But is there any truth behind team wins and pace? And does pushing the pace even help the offense?

Looking at seasons 1999-2000 to last season -- focusing on the post-Jordan (Bulls) era because it's as good a border as any and ignoring the weird lockout transitional season of 1999 -- there's a fairly sizable number of seasons with which to work. Pace is the number of possessions per game. Offensive efficiency is points scored per 100 possessions, and it's needed to compare fast teams like the Nash-led Phoenix Suns with slow ones like Billups' snail-slow Pistons. Playing slow doesn't mean you're a bad offensive team. The same is true of defensive efficiency, where people regularly conflate pace with ability: slow-paced teams are often called great defensively when they're mediocre, and this happens even with professionals analyzing basketball on TV. The Phoenix Suns never had a chance at being respected as a defensive unit because with their quick scoring the other team had more opportunities to score. But there is a further question of the correlation of defensive and offensive efficiency with pace, and the plots below explore this question.
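A small sketch of why per-possession numbers matter, using two made-up teams (the point and possession totals are hypothetical, chosen only to show the effect):

```python
def efficiency(points, possessions):
    """Points per 100 possessions (scored for offense, allowed for defense)."""
    return 100.0 * points / possessions


# Hypothetical per-game averages, not real teams.
fast_allowed, fast_pace = 105.0, 100.0  # fast team: allows more raw points
slow_allowed, slow_pace = 96.0, 88.0    # slow team: allows fewer raw points

fast_def = efficiency(fast_allowed, fast_pace)
slow_def = efficiency(slow_allowed, slow_pace)
print(f"fast team: {fast_def:.1f} allowed per 100 possessions")
print(f"slow team: {slow_def:.1f} allowed per 100 possessions")
```

The slow team gives up nine fewer points a night but is the worse defense per possession, which is the conflation described above.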

The mess of dots is an indication of a weak association of defense and pace, if there's any association. The fastest paced team was Golden State in 2010, a crazy Don Nelson team that typically played without a center and scored 109 a game ... while giving up 112. The 2008 Nuggets were the second fastest, while the 2000 Kings ranked third. The Phoenix Suns, who are more notable for five of the seven best offensive seasons, rank behind teams like the 2008 Pacers or the 2000 Magic in pace. But the third best offensive season? The Roy-led 2009 Blazers, obviously. And as another surprise the Bobcats weren't even the worst offensive team: that label of honor belongs to the forgotten 2003 Nuggets, who had decent defense.

The slowest team is the Portland Blazers in 2004, which featured Zach Randolph and a traded-midseason Rasheed Wallace. The Pistons, however, were only sixth slowest, as the Jazz, Knicks, and Grizzlies beat them out. As for defense, the three best seasons all occurred in 2004 with the Duncan Spurs edging out the Wallace-brothers Pistons and the Artest Pacers. Perhaps a better method is to adjust efficiency for the season, but it's fine when looking at pace because the question is how teams respond when playing faster or slower. A pair of svelte shooting guards were featured on the two worst defensive teams: Kevin Martin and the 2009 Kings, and Ray Allen and the 2006 Sonics.


Putting the two components together in basketball-reference's SRS, which is basically how much you outscore your opponents adjusted for the strength of schedule, there's even less of an association. The outlier at the bottom is the 2012 Bobcats, arguably the worst team ever. What's amazing is that 397 out of 398 teams were between -11 and +9, and the Bobcats were 3.3 standard deviations below the mean. If we assume a normal distribution, the Bobcats are at roughly the 0.05th percentile. You'd expect one team that bad out of every 2000 teams, which in a 30 team league is a once every 67 years event. They were truly historically bad. The strongest team by SRS? Strangely, the 2009 Cavs followed closely by the 2000 Lakers.
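The back-of-the-envelope arithmetic can be checked with the standard normal tail. This sketch assumes only what the paragraph assumes: a normal distribution of SRS, a team 3.3 standard deviations below the mean, and 30 teams a season.

```python
import math


def normal_lower_tail(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


p = normal_lower_tail(-3.3)        # chance a given team is this bad or worse
teams_per_event = 1.0 / p          # how many teams until you expect one
years_per_event = teams_per_event / 30.0

print(f"tail probability: {100 * p:.3f}%")
print(f"about one in {teams_per_event:.0f} teams, or every {years_per_event:.0f} years")
```

The unrounded tail is a shade under 0.05%, so the wait comes out a couple of years longer than the rounded figure, which doesn't change the conclusion.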

Plotting three variables at once in the figure below, there still doesn't appear to be any strong pattern with pace, defense, and offense. The color is used for pace where the fastest teams have the darkest color. Some of the ghostly points are hard to see with the background, but they're fewer in number anyway. What's important is the distribution of the dark points as they appear to be all over the place. Graphs are fine for data exploration, but an easy inquiry can be made into the correlation of both with pace via linear regression.

The results of the linear regression tests are in the table below. It appears that from 2000 to 2012, there's a statistically significant correlation between offensive efficiency and pace, where a higher pace goes with a better offense. The same holds, with a stronger response, for defensive efficiency, where a higher number means more points allowed. It is not, however, true of SRS, which is a strong proxy for wins. The coefficient was negative, but it's fairly small and the p-value is 0.15. That means that if pace and SRS were truly unrelated, you'd still see a slope at least this large about 15 percent of the time, so the result isn't statistically significant. The R^2 values indicate that very little of the variation in offensive and defensive efficiency is explained by pace; even though pace is significant, the effect is quite small. An R^2 of 1 is ideal, meaning a perfect fit, and anything below 0.10 is very low.


|  | Coefficient | Intercept | p-value | R^2 |
| --- | --- | --- | --- | --- |
| Offensive efficiency | 0.2769 | 80.22 | 0.00019 | 0.03469 |
| Defensive efficiency | 0.4414 | 65.16 | 8.9*10^-10 | 0.09056 |
| SRS | -0.1262 | 11.55 | 0.146 | 0.005337 |
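To get a feel for how small these effects are, the fitted lines can be read as efficiency = intercept + coefficient * pace. The sketch below plugs in the coefficients from the table; the five-possession bump is an arbitrary choice for illustration.

```python
# (coefficient, intercept) pairs copied from the regression table.
FITS = {
    "Offensive efficiency": (0.2769, 80.22),
    "Defensive efficiency": (0.4414, 65.16),
    "SRS": (-0.1262, 11.55),
}


def predict(name, pace):
    """Fitted value at a given pace (possessions per game)."""
    slope, intercept = FITS[name]
    return intercept + slope * pace


def pace_effect(name, extra_possessions):
    """Change in the fitted value for a change in pace."""
    return FITS[name][0] * extra_possessions


# Arbitrary illustration: speed a team up by five possessions a game.
for name in FITS:
    print(f"{name}: {pace_effect(name, 5.0):+.2f}")
```

Five extra possessions is a big stylistic change, yet the fitted lines move offense by about a point and a half and defense by a bit over two points per 100 possessions, and with R^2 values this low the scatter around those lines swamps the shift.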

There's a small correlation of offense and defense with pace where the faster you play, the better your offense is and the worse your defense is. But overall, there's no proof a higher pace means you win fewer games. You can line up and slice the data in a multitude of ways, however; what's important is the interpretation. A correlation is simply a correlation. One explanation is that coaches who play faster prefer smaller players, and smaller players result in a poorer defense -- the Don Nelson effect. Pace could be correlated with another important explanatory variable. For example, business managers live longer than the average citizen, but that's because they make more money on average and have access to better health care.

A further question is, does defense win championships? The average defensive efficiency of NBA champions since 2000 is +4.0, and the average offensive efficiency is only +2.6. Perhaps it's easier to win with defense because it's more consistent or they're able to control the offense of elite teams more than offensive teams are able to attack elite defenses. A study from Neil Paine of basketball-reference found that defense indeed is more important than offense in winning a championship, though not by a huge amount. Maybe a better study is seeing which teams outperform their regular season results in the playoffs, but that's a different study entirely. For now, there's a weak positive association of pace and offense, and a weak negative association of pace and defense. Point differential, i.e. win percentage? No pattern there -- teams that play fast don't appear to lose more often.

Sunday, November 4, 2012

Lakers' Season Start: Whose Head to Place on a Pole

Only a couple games into the season, people are nonetheless drawing conclusions, and obviously there are some scary and strange conclusions out there. Take a random sample of two to three consecutive games during a season, and you can find some really strange statistics like Andre Miller leading the league in points, Jordan in blocks, Manute Bol the three-point sniper, or Jamaal Tinsley the shotblocking king. Last season, people were excited to see DeMar DeRozan's development as an outside shooter, but by the end of the season he had slumped to 26% from the three-point line. Are we really going to live in a world where Brandon Jennings is the assist leader, Durant in rebounds, Hawes in blocks, and Chandler at 100% from the field? Of course not.

The Lakers are being trashed right now for not winning anything, even a preseason game, while the hiccups are becoming serious from a Nash injury to Bryant's dead foot. With a new offense and an expectation of greatness from the mere presence of the Canadian overlord, the fact that they aren't scoring 140 points a game has Lakers fans readying with pitchforks to storm the castle. My prediction of the Lakers' success this year looks dumber by the minute, and I should have included more of a penalty for team fit and coherence along with Dwight's injury. But in actuality, looking at the stupidly small three game sample, their offense has been fine; it's their defense.


|  | Los Angeles Lakers | League average 2013 | League average 2011 |
| --- | --- | --- | --- |
| Offensive efficiency | 104.4 | 102.3 | 107.3 |
| Defensive efficiency | 114.4 | 102.3 | 107.3 |
| Pace | 93.2 | 94.0 | 92.1 |
| TS% | 56.8 | 52.0 | 54.1 |
| Effective field-goal % | 53.9 | 48.0 | 49.8 |
| FT/FGA | 0.267 | 0.214 | 0.229 |
| FT % | 63.7 | 74.5 | 76.3 |
| Turnover % | 18.7 | 14.5 | 13.4 |
| Offensive rebound % | 31.9 | 26.5 | 26.4 |

I included the data from the 2011 season because lockout seasons typically have reduced, messier offenses. The Lakers' offense is ranked 11th (according to basketball-reference), but looking closer the problem is turnovers. Their shooting percentages are stellar, and although their free-throw percentage is low (because of Dwight Howard) they're getting to the line a lot (also because of Dwight Howard), and shooting 63.7% is still better than an average possession shooting from the field. They're also offensively rebounding very well (again, Dwight Howard), pulling their numbers further up. The turnovers, however, are stunting the offense, which should be among the top three or four in the league. With Nash at the helm and Bryant still a major force, someone who can take a large number of shots without losing the ball, this shouldn't happen. An easy explanation is that they're trying to incorporate new players onto the team and they're using a new offensive system, and that's reasonable enough. Strangely though, the Heat's defense is rated as the worst, an indication we shouldn't put faith into early season numbers, but we can analyze how the Lakers are losing.
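The free-throw claim is easy to check with the numbers in the table. This sketch assumes a trip to the line is two shots and treats effective field-goal percentage times two as points per field-goal attempt; it ignores and-1's, turnovers, and offensive rebounds.

```python
def points_per_two_shot_trip(ft_pct):
    """Expected points from two free throws at the given percentage."""
    return 2.0 * ft_pct


def points_per_field_goal_attempt(efg_pct):
    """Expected points per shot; eFG% already weights threes at 1.5."""
    return 2.0 * efg_pct


# Values taken from the table above (as fractions).
lakers_ft_trip = points_per_two_shot_trip(0.637)
league_shot = points_per_field_goal_attempt(0.480)
lakers_shot = points_per_field_goal_attempt(0.539)

print(f"Lakers two-shot trip:       {lakers_ft_trip:.3f} points")
print(f"League-average shot (2013): {league_shot:.3f} points")
print(f"Lakers shot:                {lakers_shot:.3f} points")
```

Even at 63.7%, a two-shot trip is worth more than the Lakers' own hot shooting from the field, let alone a league-average attempt.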

The defense is a complete wreck. They're second from last in the league giving up an effective field-goal percentage of 52.0 while never forcing turnovers (second from the bottom in that respect too). They're also letting the opponent get to the line at a high rate. In summary, their defense is failing in multiple respects, and it's hard to blame that on their schedule: Dallas was without Nowitzki, Portland's relying on rookies, and the Clippers are the only formidable team.

So who's to blame? Mike Brown is the obvious scapegoat for fans, but in terms of players Kobe Bryant's name is coming up too often.



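|  | Steve Nash | Kobe Bryant | Pau Gasol | Dwight Howard |
| --- | --- | --- | --- | --- |
| Points/36 mins | 6.5 | 28.8 | 14.8 | 21.3 |
| Assists/36 mins | 5.8 | 1.2 | 3.3 | 2.9 |
| Rebound % | 7.3 | 6.8 | 18.5 | 17.8 |
| Turnover % | 20.0 | 18.9 | 9.7 | 15.9 |
| TS% | 37.5 | 71.3 | 48.8 | 61.4 |
| PER | 7.4 | 26.1 | 19.6 | 24.5 |
| Win Shares/48 mins | -0.056 | 0.160 | 0.121 | 0.162 |
| +/- | -3 | +1 | +1 | -18 |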


No one saw this coming, but as much as people want Pau Gasol executed for his shooting percentages, he has the lowest turnover % of the foursome by a landslide, and he's even outrebounding Howard. Dwight's basic stats seem alright until you get to his plus/minus: -18 in three games, and considering their bench this is impressive. Nash has been complete trash, not even hitting his jumpers, and he'll be out a few games, but at least the Lakers played decently when he was on the court. Kobe Bryant, however, has a sterling record: an amazing 71.3 TS% and a plus/minus of +1, which on an 0-3 team is not easy to tally when you play big minutes. His turnovers, however, are historically high, and it would be silly to assume both he and Nash will continue to turn the ball over like they have been. Even if just those two control the turnovers, and Nash shoots near his career averages, the offense should be spectacular. Controlling turnovers will also help the defense by limiting opponents' transition plays.

Some people will try to blame Kobe's defense since he's been playing hurt, and it's hard to find evidence to support or refute that conclusion, especially so early in the season. Going deeper into the numbers, when Kobe Bryant is off the court the opponents average 130 points per 100 possessions (their offensive efficiency), which, compared to the 114 the Lakers allow overall and to the league average, is an absurdly high number. During one stretch the Blazers killed the Lakers on the offensive boards, which isn't Kobe Bryant's job to prevent, but even when that's excluded from the numbers it's 120 points per 100 possessions. I will stress that one should not hold these plus/minus numbers with any confidence because it's only been three games, but the evidence available suggests Kobe isn't the problem on defense.

There's another player pushing through an injury: Dwight Howard, who's replacing Andrew Bynum, not exactly a Defensive Player of the Year candidate. He's still healing from his back injury, so his recoveries are slow and opposing guards are zipping to the basket unimpeded. Last year the Lakers were middle of the pack in defense with the Magic barely ahead, but the Howard drama zapped both the team and the center of the effort necessary to dominate; the season before, the Magic were ranked third despite really only possessing one good defensive player: Dwight. Given that the Lakers are decent without Howard, if Defensive Player of the Year Howard shows up at a game and the Lakers control their turnovers, this team should easily be one of the best in the league. He's only 26 and should regain his mojo eventually, but we don't know when that will be, and Los Angeles will have to wait patiently.