Last week Josh Hermsmeyer introduced us to air yards per target (aYPT), which, aside from charting differences, is identical to Pro Football Focus’ aDOT. However, he was able to further break down receiving into its constituent components: air yards, whether the ball was caught, and, if caught, the yards created after the catch.
Using these constituent components, plus some other important variables from the RotoViz Screener,[1] I created a machine learning model that predicts year N+1 PPR points per game for the wide receiver position[2] with a cross-validated R-squared of 0.93 on in-sample data and 0.60 on out-of-sample data. In other words, this new information from Hermsmeyer allows us to make a massive improvement over previous wide receiver models.
I’ll break down my methodology, the logic behind it, and then of course list the 2016 PPR projections for the WR position.
The data set I built the model with consists of data from 2009 to present. To be eligible for the model, the receiver had to have either 30 or more targets, or play in 8+ games in year N, and then play in at least 8 games in year N+1. The goal is to take the model inputs and predict the following season’s PPR points per game for each WR that meets the criteria.
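As a rough illustration, the eligibility rule above can be sketched in Python. The `Season` record, the `eligible_pairs` helper, and the sample rows are hypothetical stand-ins, not the actual RotoViz data or code.

```python
from dataclasses import dataclass


@dataclass
class Season:
    player: str
    year: int
    games: int
    targets: int


def eligible_pairs(seasons, min_year=2009):
    """Pair each year-N season with the same player's year-N+1 season.

    Year N needs 30+ targets or 8+ games; year N+1 needs 8+ games.
    """
    by_key = {(s.player, s.year): s for s in seasons}
    pairs = []
    for s in seasons:
        if s.year < min_year:
            continue
        nxt = by_key.get((s.player, s.year + 1))
        if nxt is None:
            continue  # no following season to predict
        qualifies_year_n = s.targets >= 30 or s.games >= 8
        if qualifies_year_n and nxt.games >= 8:
            pairs.append((s, nxt))
    return pairs


# Made-up seasons for illustration only.
seasons = [
    Season("WR A", 2014, 16, 130),
    Season("WR A", 2015, 12, 100),
    Season("WR B", 2014, 5, 35),   # qualifies in year N on targets...
    Season("WR B", 2015, 6, 40),   # ...but plays too few games in year N+1
    Season("WR C", 2014, 4, 12),   # fails both year-N thresholds
    Season("WR C", 2015, 16, 90),
]
pairs = eligible_pairs(seasons)
```

Only the first player's 2014 to 2015 pair survives here, which is the shape of training row the model would consume: year-N inputs alongside a year-N+1 target.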
Building, Training, and Testing the Model
To identify the optimal model inputs, I first used a couple of screening methods to whittle the variables down to a subset with the most predictive potential. This gave me a subset of 17 potential inputs. From there, I used some techniques to narrow the list down further, mainly by eliminating highly correlated inputs. As a result, on the first pass I ended up with the following list:
- Weight (WT)
- Age (AGE)
- NFL Draft Position (Draft)
- PPR points per game the prior year[3] (PPR.PPG.N)
- Receiving TD Rate (reTDRT)
- Receiving average margin[4] (reAVGMGN)
- Rushing attempt market share (ruATTMS)
- Air yards per game (aYPG)
- Completed air yards per target (CaYPT)
- Incomplete air yards per target (IaYPT)
- Yards after catch per target (YACPT)
The nice thing about this is that the only two variables that were even remotely close to being highly correlated were PPR points per game and air yards per game.[5] Even then, air yards don’t completely describe PPR points. Everything else was mostly independent of each other, meaning essentially every variable was adding information to the model.
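One simple way to check for highly correlated inputs is to compute pairwise Pearson correlations and flag any pair above a cutoff. The sketch below assumes a 0.8 threshold and uses made-up numbers. The article does not say which screening technique or cutoff was actually used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Illustrative values only, not real player data.
columns = {
    "PPR.PPG.N": [8.0, 12.0, 15.0, 18.0, 21.0],
    "aYPG": [60.0, 85.0, 100.0, 115.0, 125.0],
    "WT": [200.0, 185.0, 215.0, 190.0, 205.0],
}
flagged = correlated_pairs(columns)
```

In this toy data only the PPR.PPG.N and aYPG pair gets flagged, mirroring the situation described above. Whether to drop one of a flagged pair is a judgment call that depends on how much unique information each input carries.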
From there I clustered the wide receivers using all of the variables except AGE and reAVGMGN. I left out AGE because I averaged each receiver’s stats over the seasons in which he played at least 8 games, and I didn’t want to average over AGE.[6] I also chose not to include reAVGMGN, because the quality of the team a player plays on can vary greatly throughout a career.
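The averaging step that feeds the clustering could look something like the sketch below. It covers only the feature preparation (per-player averages over seasons with 8+ games, with AGE and reAVGMGN left out). The row layout and the `clustering_features` helper are assumptions for illustration, and the clustering algorithm itself is not shown because the article does not name it.

```python
from collections import defaultdict

# Inputs deliberately kept out of the clustering features.
EXCLUDED = {"AGE", "reAVGMGN"}


def clustering_features(rows, min_games=8):
    """Average each player's stats over seasons with at least min_games."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        if row["games"] < min_games:
            continue  # skip seasons that are too short to be representative
        counts[row["player"]] += 1
        for name, value in row["stats"].items():
            if name not in EXCLUDED:
                sums[row["player"]][name] += value
    return {
        player: {name: total / counts[player] for name, total in stats.items()}
        for player, stats in sums.items()
    }


# Hypothetical seasons for one player.
rows = [
    {"player": "WR A", "games": 16,
     "stats": {"aYPG": 100.0, "YACPT": 3.0, "AGE": 23, "reAVGMGN": -1.0}},
    {"player": "WR A", "games": 6,
     "stats": {"aYPG": 40.0, "YACPT": 1.0, "AGE": 24, "reAVGMGN": 2.0}},
    {"player": "WR A", "games": 14,
     "stats": {"aYPG": 120.0, "YACPT": 4.0, "AGE": 25, "reAVGMGN": 0.5}},
]
features = clustering_features(rows)
```

The 6-game season is ignored, so the player's profile is the average of the other two seasons. Each resulting row is one point that a clustering algorithm would then group into player types.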
Using the 11 variables listed plus the clusters as inputs, I then used an ensemble of machine learning models to come up with a predictive model for PPR points per game. The cross-validated R-squared on in-sample data was 0.835.
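To make "ensemble" and "cross-validated R-squared" concrete, here is a deliberately tiny sketch. It averages two toy models (a one-variable least-squares line and a constant mean) and scores the blend with k-fold cross-validation. The actual models in the ensemble are not named in this article, so the fitters, the fold scheme, and the data below are all assumptions.

```python
def fit_line(xs, ys):
    """One-variable least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_mean(xs, ys):
    """Baseline that always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_ensemble(xs, ys, fitters=(fit_line, fit_mean)):
    """Average the predictions of several independently fitted models."""
    models = [fit(xs, ys) for fit in fitters]
    return lambda x: sum(m(x) for m in models) / len(models)


def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def cross_validated_r2(xs, ys, k=5, fit=fit_ensemble):
    """Predict every point from a model that never saw it, then score."""
    preds = [None] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = model(xs[i])
    return r_squared(ys, preds)


# Made-up prior-year and next-year PPR points per game.
xs = [6.0, 8.5, 9.0, 11.0, 12.5, 13.0, 15.0, 16.5, 18.0, 21.0]
ys = [7.0, 8.0, 10.5, 10.0, 13.0, 12.0, 14.5, 15.0, 19.0, 18.5]
cv_r2 = cross_validated_r2(xs, ys, k=5)
```

The key idea is that every prediction comes from a model trained without that row, which is why a cross-validated score is usually lower, and more honest, than a plain back-fit score.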
However, when I looked at the 2016 predicted results, a few things struck me as odd. Odell Beckham, Jr. was projected as the 6th WR and Dez Bryant was way down in the mid 30s. I figured this would get people up in arms about why a few players were so far off from ADP. Then it hit me: I also needed to incorporate ADP into the model! That’s because ADP incorporates a wisdom of the crowds (WOTC) element that can’t be determined through a player’s stats alone. Bryant is the perfect example. We know he is better than a WR in the 30s, because last year his stats suffered due to injuries to both himself and Tony Romo. How do you model that statistically? With WOTC!
Once I added in positional ADP, the projections made more sense and the predictions got better. This time the cross-validated R-squared was 0.93 on in-sample data. For a random subset of data[8] that I withheld, the out-of-sample R-squared was a solid 0.60. The RMSE on the out-of-sample data was 3.67, meaning about 95 percent of players will end up with PPR points per game within +/- 6.6 points of their projection.[9] It’s also interesting to note that these 13 variables let us backfit, or explain, 93 percent of the variance in past PPR points per game. These models are a big performance improvement over most regression models, which give a back-fit R-squared of around 0.5-0.6 for WRs but only reach the 0.3-0.4 range on out-of-sample data.
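For readers who want to reproduce the error band, the calculation is just the RMSE scaled by 1.96, which assumes the prediction errors are roughly normal and centered on zero. The helper names below are illustrative. Note that plugging in the quoted RMSE of 3.67 gives roughly 7.2 rather than 6.6 (a 6.6 band would correspond to an RMSE near 3.37), so treat the band as approximate.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def interval_half_width(rmse_value, z=1.96):
    """Approximate 95 percent band, assuming normal, unbiased errors."""
    return z * rmse_value


# Using the out-of-sample RMSE quoted in the article.
half_width = interval_half_width(3.67)
```

A projection of 15.0 PPR points per game with this band would read as "somewhere between about 7.8 and 22.2," which is a useful reminder of how wide the uncertainty on any single player remains.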
We can also use one of the algorithms to look at the importance of each variable in making these predictions.
Notice the clusters really do help the model’s predictions (this is validated by the R-squared and RMSE values of the model without the cluster term, both of which were far worse). The clusters are the most important input because they best describe the type of player we are looking at, with all of these stats rolled into one. The model then figures out how the individual stats interact with each other and with each of the player types. I’ll get more into the clusters in a future article. Notice also that Pos.ADP significantly helped the model, so that WOTC effect was very important. Finally, it’s cool to see that the different air yards stats (air yards per game, completed air yards per target, incomplete air yards per target, and yards after catch per target) were more important in making predictions than things like reTDRT. This is a tip of the cap to Hermsmeyer for pulling those individual stats together and really breaking down the components of receiving.
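The article does not say which algorithm produced the importance ranking. One model-agnostic way to get a similar ranking is permutation importance: shuffle one input at a time and measure how much the prediction error grows. The sketch below swaps in that technique with a toy model and invented rows, so it illustrates the idea rather than the method used here.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a model over a set of rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, features, repeats=20, seed=0):
    """Average increase in MSE when each feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importance = {}
    for feature in features:
        increases = []
        for _ in range(repeats):
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - baseline)
        importance[feature] = sum(increases) / repeats
    return importance


def toy_model(row):
    # Hypothetical model that relies on air yards per game and ignores weight.
    return 0.12 * row["aYPG"]


# Invented rows for illustration only.
rows = [
    {"aYPG": 60.0, "WT": 200.0},
    {"aYPG": 85.0, "WT": 185.0},
    {"aYPG": 100.0, "WT": 215.0},
    {"aYPG": 115.0, "WT": 190.0},
    {"aYPG": 125.0, "WT": 205.0},
    {"aYPG": 140.0, "WT": 195.0},
]
targets = [toy_model(r) for r in rows]
importance = permutation_importance(toy_model, rows, targets, ["aYPG", "WT"])
```

Shuffling an input the model ignores leaves the error unchanged, so its importance is zero, while shuffling an input the model leans on makes the error jump. Ranking inputs by that jump gives a chart comparable in spirit to the one discussed above.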
2016 PPR Projections
Here are the projections for 2016 as given by the model:
|Odell Beckham Jr.||18.53||0.082||-1.2||198||23||12||21.3||121.9||0||5.5||6.2||3.8||2||1.D|
A few things to note:
- Visually the projections look pretty solid. The big three hold three of the top four spots, for example, and Dez Bryant is now at WR20 instead of buried in the mid 30s. I suspect this is still too low, but hey, we also didn’t expect him to have 9.9 PPR points per game last year.
- Players like Jordy Nelson, Kelvin Benjamin, Kevin White, and Breshad Perriman aren’t listed due to injury. Rookies also aren’t listed, but you can find rookie WR projections here.
- The model expects receivers who consistently get targets but suffered in efficiency, such as Randall Cobb and Jarvis Landry, to rebound. That makes sense, since volume is more predictive of future success than efficiency is.
- I believe ensembling these results even further with our staff projections from the Projection Machine, as well as with composite projections from fantasyfootballanalytics.net, may yield even stronger R-squared and RMSE values.
- Stefon Diggs is a RotoViz favorite, and the model loves him too.
- Anquan Boldin might be the Detroit receiver to target.
I’ll continue to work to improve this model and make it even more predictive – but the performance of this model should be quite strong. Let me know in the comments what else you notice from the projections, or ideas to improve the model.
1. Which I’ll get to in a minute.
2. For WRs with 8+ games or 30+ targets the prior year.
3. From here the inputs will be from the prior year.
4. The amount a team was ahead or behind by when the player was targeted.
5. Volume, baby.
6. This is mostly to see how career arcs play out for the clusters, but that’s for a later article.
7. I chose this instead of k-means for a number of reasons, not the least of which being that it gave the best results.
8. From the same time frame.
9. To calculate, just take 1.96 times the RMSE.