Man, I Love Football

Rohit Pasumarthi, Tejas Vij, Manan Bhatt

Introduction

In football and sports betting, a multi-billion-dollar industry, predicting winners is extremely important. Sports bettors often have to rely on instinct when deciding which team to back, but we want to replace instinct with a machine learning model that gives us a better prediction of who wins each game.

Our first step in this tutorial was to tidy the data we collected from Pro Football Focus. We went through six CSV files, three for offense and three for defense, and removed the columns that we believe won't help us predict scores. The next step was to visualize the data, choosing specific stats that help us understand what information we should feed to our machine learning model.

The final step before getting scores is to build the machine learning model. Using stats gathered from previous weeks, we built a model that predicts scores for a given week. For the home team, the model takes offensive stats (rushing yards, passing yards, and touchdowns scored) and defensive stats (rushing yards, passing yards, and touchdowns allowed). It takes the same stats for the away team.

Getting Started with the Data

We make use of Python 3 along with a few imported libraries: pandas, numpy, matplotlib, seaborn, and more.

Preprocessing the Data

The three offensive CSVs we chose came from Pro Football Focus Offensive Stats. You can gather whichever year's data you would like by replacing the four digits at the end of the link. For “Team Offense”, “Passing Offense”, and “Rushing Offense”, you can press “Share & Export” and select “Get table as CSV (for Excel)”. Copy the information from there into a CSV file you want to read from. You can do the same for the Pro Football Focus Defensive Stats.

Reading the Data

We read the data with pandas' .read_csv(). This left the header as the first row of data, so we used that first row as the column names and then dropped the leftover copy of the header names.
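The read step can be sketched as follows. This is a minimal example, assuming the export carries a grouping line above the real column names, which is why the real names land in the first data row. The inline CSV text, team names, and numbers are made up for illustration; in practice you would pass the path of your own CSV file to read_csv.

```python
import io

import pandas as pd

# Made-up stand-in for an exported CSV: the first line is a grouping header,
# the second line holds the real column names.
raw_csv = io.StringIO(
    ",,Tot Yds & TO,Tot Yds & TO\n"
    "Rk,Tm,Yds,Ply\n"
    "1,Team A,5000,800\n"
    "2,Team B,4500,780\n"
)

# By default the grouping line becomes the header, so the real column
# names end up in the first row of data.
offense = pd.read_csv(raw_csv)

# Use that first row as the column names, then drop it from the data.
offense.columns = offense.iloc[0]
offense = offense.iloc[1:].reset_index(drop=True)
offense.columns.name = None

# The values were read as text, so convert the numeric columns.
offense[["Yds", "Ply"]] = offense[["Yds", "Ply"]].apply(pd.to_numeric)
print(offense)
```

If your export has exactly one extra line at the top, passing header=1 to read_csv is a shorter way to get the same column names.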

Tidying and Modifying the Data

By following the steps above, you will have six dataframes holding a lot of information, much of which we can remove. Tidying strips out what we don't need. The code below shows which columns we remove from the offense and defense dataframes.

We removed nine columns from the original offensive stat dataframe because our research showed high levels of deviation in them that would distort our predicted scores. Stats like penalties, rank, first downs, first down percentage, and fumbles lost were categories we felt were already covered by the other stats we are crunching. Stats like turnover percentage and sack percentage are better taken from the opposing defense's numbers. The stats we need to focus on are: 'Team', 'Games', 'Total Points For', 'Total Yards', 'Total Plays', 'Total Yards/Play', 'Total Turnovers', 'Passing Completions', 'Passing Attempts', 'Passing Yards', 'Passing TD', 'Passing Int', 'Net Yards/Attempt', 'Rushing Attempts', 'Total Rushing Yards', 'Rushing TDs', 'Yards/Attempt'.
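Dropping unwanted columns could look like the sketch below. The frame is made up, and its column names are placeholders rather than the exact names in the exported table, so adjust the list to match your own CSV.

```python
import pandas as pd

# Made-up two-team frame; the column names are placeholders and the
# numbers are not real stats.
offense = pd.DataFrame({
    "Rank": [1, 2],
    "Team": ["Team A", "Team B"],
    "Games": [12, 12],
    "Total Points For": [360, 300],
    "Fumbles Lost": [4, 7],
    "First Downs": [270, 250],
    "Penalties": [70, 85],
})

# Columns we do not want to feed to the model.
unwanted = ["Rank", "Fumbles Lost", "First Downs", "Penalties"]

# errors="ignore" skips any listed name that is missing from the frame,
# which helps when the six tables do not share every column.
offense_tidy = offense.drop(columns=unwanted, errors="ignore")
print(offense_tidy.columns.tolist())
```

drop returns a new dataframe, so the original stays available if you later decide a removed stat is worth keeping.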

Similar to the offense dataframe we previously covered, we tidied up the data we need to use by focusing on the stats we will feed to our machine learning model.

With passing accounted for, it is just as important to tidy up the running game for NFL teams. We did that by removing rank, expected points (which we are attempting to work out ourselves), and the longest run a team had.

To get to the model itself, we need to make sure our defense data is also tidied up. Our machine learning model takes both offense and defense into account to predict the outcome of a given week's matchups. Here, we want the points allowed, yards allowed, total defensive snaps, yards allowed per play, total turnovers created, net yards per rush, and yards per attempt allowed by defenses.

To figure out who can or cannot hold up against the pass offenses of the NFL, we use the pass defense stats we gathered from Pro Football Focus. Here, we can see the passing stats that defenses allowed. We use this information in the machine learning model because not every pass offense can produce the same numbers every week, and the strength of the opposing defense matters.

As with the passing stats, we tidy the rushing stats so we can gather the information we need to feed our machine learning model.

Exploratory Data Analysis

With the data tidied, we want to visualize it to make sure we understand what we are working with. This tutorial is meant to show how we came up with our machine learning model, and it started with the visualizations below.

Points Allowed Per Game vs Points Scored Per Game

In our first visualization, we look at how teams stack up in Points Allowed Per Game vs Points Scored Per Game. What we want to see here is the quadrant in which each team is placed. The bottom right is the best (it is also where most of the current playoff contenders are), and the top left is where teams have allowed the most points without scoring nearly as many (these tend to be teams picking higher in the draft).
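The quadrant reading can be made concrete without a plot. The sketch below assumes points scored on the x-axis and points allowed on the y-axis (which matches the bottom right being best) and assumes the league averages are the dividing lines. The function name, team names, and numbers are all made up.

```python
def quadrant(points_scored, points_allowed, avg_scored, avg_allowed):
    """Name the quadrant on a chart with points scored on the x-axis
    and points allowed on the y-axis."""
    horizontal = "right" if points_scored >= avg_scored else "left"
    vertical = "top" if points_allowed >= avg_allowed else "bottom"
    return f"{vertical} {horizontal}"


# Made-up per-game numbers: (points scored, points allowed).
teams = {
    "Team A": (28.0, 18.0),
    "Team B": (17.0, 27.0),
    "Team C": (26.0, 26.0),
    "Team D": (18.0, 19.0),
}

# Assumption: the league averages split the chart into four quadrants.
avg_scored = sum(scored for scored, _ in teams.values()) / len(teams)
avg_allowed = sum(allowed for _, allowed in teams.values()) / len(teams)

placement = {
    name: quadrant(scored, allowed, avg_scored, avg_allowed)
    for name, (scored, allowed) in teams.items()
}
print(placement)
```

Team A scores a lot and allows little, so it lands in the bottom right. Team B does the opposite and lands in the top left.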

Defensive Passing Efficiency vs Offensive Passing Efficiency

Our second visualization shows Defensive Passing Efficiency vs Offensive Passing Efficiency among NFL teams. A high Offensive Passing Efficiency tells us a team tends to take advantage of its pass game. The Defensive Passing Efficiency tells us how teams match up on the other side of the ball. Teams with a lower Defensive Passing Efficiency tend to let their opponents use their pass game efficiently. Your team wants to be in the top right corner. If it is in the bottom left, it needs help in both its offensive and defensive pass games.

Defensive Rushing Efficiency vs Offensive Rushing Efficiency

Similar to the visualization we saw with passing, we do the same with the run game, comparing Defensive Rushing Efficiency vs Offensive Rushing Efficiency. Teams that run the ball efficiently and also defend the run well are found in the bottom right corner of the graph. Teams that don't have a productive run offense and, literally and figuratively, get "run" on are found in the top left corner. It seems the Houston Texans are by far the worst team when it comes to having a run game and defending the run.

Run Plays vs Pass Plays

While the line may not make much sense at first glance, teams that run the ball more than other teams sit on the top left side of the line, and teams that on average pass the ball more sit on the bottom right side. This graph will make a lot more sense once we feed it into our regression and machine learning model.

How Dominant is Jonathan Taylor?

In this analysis, we compared each team's rushing stats with those of the NFL's rushing leader. We found that the Jags, Rams, Bucs, Giants, Falcons, Steelers, Raiders, Jets, Dolphins, and Texans each have fewer total rushing yards than Jonathan Taylor alone, the kind of surprising stat ESPN might show you.

Predicting Winners

Play Style Per Team

This dataframe contains the percentage of run plays and pass plays each offense calls, as well as the number of run plays and pass plays each defense faces. We calculated these stats during our data visualization, and all of this information will be fed to our model to predict winners.
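The play-style percentages can be computed as shown in this sketch. It assumes the split uses rushing attempts and passing attempts only (so sacks and scrambles are ignored). The function name and the numbers are made up.

```python
def play_style(rush_attempts, pass_attempts):
    """Return (rush share, pass share) of a team's plays as fractions."""
    total = rush_attempts + pass_attempts
    if total == 0:
        raise ValueError("a team needs at least one play")
    return rush_attempts / total, pass_attempts / total


# Made-up season totals for one offense.
rush_pct, pass_pct = play_style(rush_attempts=400, pass_attempts=600)
print(f"run {rush_pct:.0%}, pass {pass_pct:.0%}")
```

The same function works for a defense if you pass in the rushing and passing attempts it faced.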

The Formula

We developed a formula which weights efficiency of both offense and defense and factors in turnover percentage. We used this formula to calculate values, which will help us simulate who wins each game.

The formula is as follows: value = ((Rush Percent × Rush Efficiency) + (Pass Percent × Pass Efficiency)) - ((Rush Percent × RushDef Efficiency) - (Pass Percent × PassDef Efficiency)) - (rank × (Offensive Turnover% - Defensive Turnover%) + 1.0)
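As a sketch, the formula can be written as a small function that follows it term by term, including the minus sign between the two defensive terms. The function and parameter names are hypothetical, the inputs are made up, and the percentages are assumed to be fractions between 0 and 1.

```python
def team_value(rush_pct, rush_eff, pass_pct, pass_eff,
               rush_def_eff, pass_def_eff, rank,
               off_turnover_pct, def_turnover_pct):
    """Evaluate the value formula exactly as it is written in the text."""
    offense = rush_pct * rush_eff + pass_pct * pass_eff
    defense = rush_pct * rush_def_eff - pass_pct * pass_def_eff
    turnovers = rank * (off_turnover_pct - def_turnover_pct) + 1.0
    return offense - defense - turnovers


# Made-up inputs for a single team.
value = team_value(
    rush_pct=0.4, rush_eff=4.5,
    pass_pct=0.6, pass_eff=6.5,
    rush_def_eff=4.0, pass_def_eff=6.0,
    rank=10,
    off_turnover_pct=0.12, def_turnover_pct=0.10,
)
print(round(value, 3))
```

With these inputs the offensive term is 5.7, the defensive term is -2.0, and the turnover term is 1.2, so the value is 5.7 + 2.0 - 1.2 = 6.5.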

Win Probabilities

We can now use these values to predict the likelihood of one team beating another. This new dataframe shows the probability of team X beating team Y. When X and Y are the same team, the probability is 0.5.
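The text does not spell out how two values become a probability, so the sketch below uses one simple mapping as an assumption: team X's share of the combined value. It is chosen only because it gives 0.5 when X and Y are the same team. The function name and values are made up, and the ratio only makes sense for positive values.

```python
def win_probability(value_x, value_y):
    """Assumed mapping (not taken from the text): X's share of the
    combined value of both teams."""
    if value_x <= 0 or value_y <= 0:
        raise ValueError("this simple ratio needs positive values")
    return value_x / (value_x + value_y)


# Made-up values for three teams.
values = {"Team A": 6.0, "Team B": 4.0, "Team C": 2.0}

# matrix[x][y] is the assumed probability that x beats y.
matrix = {
    x: {y: win_probability(vx, vy) for y, vy in values.items()}
    for x, vx in values.items()
}
print(matrix["Team A"]["Team B"])
```

Under this mapping the probability of X beating Y and the probability of Y beating X always add up to 1.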

With this new dataframe, we can plot each team's probability of winning against every other team. This shows which teams opponents have the highest win probability against (the teams that give away the most free wins) and which teams opponents have the lowest win probability against (the teams it takes an upset to beat).

Based on this graph, we can see that the Falcons, the Panthers, the Bears, the Lions, the Texans, the Jaguars, the Giants, the Jets, and the Washington Football Team are the easiest to beat. We can also see that the Cardinals, the Bills, the Patriots, the Cowboys, the Packers, the Rams, the Chiefs, the Chargers, and the Buccaneers are the hardest teams to beat.

Least Squares Regression

Getting our Training Data

Now it's time to do some machine learning! The first thing we did here was split our dataframe into training and test data: 75% of our data is in training, and 25% is in test. We also set the value column of the training set as the Y values to be predicted.
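A 75/25 split can be sketched with pandas alone. The frame below is made up, and random_state is fixed only so the example gives the same split on every run.

```python
import pandas as pd

# Made-up frame with one feature column and the value column to predict.
data = pd.DataFrame({
    "feature": range(20),
    "value": [2.0 * i for i in range(20)],
})

# Draw 75% of the rows at random for training.
train = data.sample(frac=0.75, random_state=0)

# Everything that was not drawn becomes the test set.
test = data.drop(train.index)

print(len(train), len(test))
```

Because the test rows are whatever is left after sampling, no row can end up in both sets.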

The Regression

We pass the magic formula we created before into the ols function provided by the statsmodels.formula.api module. We evaluate this formula using values from our training data and fit a least squares regression model to them.
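To show what a least squares fit does, the sketch below solves a tiny problem with NumPy's lstsq. This is a stand-in that solves the same kind of least squares problem, not the statsmodels call used in this tutorial. The data and column names are made up, and the value column is built from known coefficients so the fit can be checked.

```python
import numpy as np

# Made-up training data, generated exactly from value = 1 + 2*a + 3*b.
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
value = 1.0 + 2.0 * a + 3.0 * b

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(a), a, b])

# Least squares picks the coefficients that minimize the sum of
# squared differences between X @ coef and value.
coef, *_ = np.linalg.lstsq(X, value, rcond=None)

predicted = X @ coef
print(np.round(coef, 3))
```

Because the made-up data has no noise, the fit recovers the intercept 1 and the slopes 2 and 3. Real stats will leave some error behind.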

What does this Regression tell us?

The R^2 value of 0.888 means that 88.8% of the variation in the value can be explained by variation in the columns (Rush Efficiency, Rush Percent, etc.). If we predict our test values, how would they come out?
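For reference, R^2 is one minus the ratio of the squared prediction errors to the squared deviations of the actual values from their mean. A minimal sketch with made-up numbers (the function name is hypothetical):

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after prediction.
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total squared variation around the mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


# Made-up values, not results from our model.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]

score = r_squared(actual, predicted)
print(score)
```

Here the leftover squared error is 0.5 and the total variation is 5.0, so R^2 is 1 - 0.5 / 5.0 = 0.9.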

Let's Check Accuracy!

So now that we have our prediction, let's see how it compares with the actual values.

So it looks like our predictions are overshooting a bit, which is fine. It isn't by too much, so we can say this OLS Model is a good predictor of score values.
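One way to put a number on "overshooting a bit" is the mean signed error, shown next to the mean absolute error in the sketch below. The function names and values are made up and are not the results of our model.

```python
def mean_error(actual, predicted):
    """Average signed error; a positive result means overshooting."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average size of the error, ignoring its direction."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up test values and predictions.
actual = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.5, 6.5, 9.0]

bias = mean_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(bias, mae)
```

A positive mean error that is small next to the values themselves matches the picture of predictions that run slightly high.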

Can we use our formulas and models to simulate games and predict winners?

Below, we do exactly that! This function takes in both teams, uses the values calculated with our magic formula from earlier, and predicts the winner and the score.
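A stripped-down sketch of such a function is shown here. It only picks the team with the higher value and reports the gap between the two values, because turning values into a score depends on the fitted model. The function name, the tie rule (ties go to the home team), and the values are assumptions made for illustration.

```python
def simulate_game(home, away, values):
    """Pick the team with the higher value and report the value gap."""
    home_value = values[home]
    away_value = values[away]
    # Assumption: a tie in value goes to the home team.
    winner = home if home_value >= away_value else away
    return winner, abs(home_value - away_value)


# Made-up values for three teams.
values = {"Team A": 6.5, "Team B": 5.0, "Team C": 7.25}

winner, gap = simulate_game("Team A", "Team B", values)
print(winner, gap)
```

A function like this is fully deterministic: the same two teams always produce the same winner.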

Now we read in the matchups from a CSV file and drop the columns we don't need. This feeds every game from the current week of football into the simulation, so we can see which of the completed games we predicted correctly and which we didn't.

The simulation correctly predicted wins for the Colts, Cowboys, Bills, Texans, Packers, 49ers, and Bengals! However, it incorrectly predicted wins for the Chargers, Cardinals (I'm sure NOBODY saw that coming!), Titans, Jets, and Buccaneers. You win some and lose some in ML, and this model needs some random variability to account for upsets. We also need to take more factors into account, like injuries. That being said, no model can be exact, so this is a good place to start!
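Tallying the picks listed above gives the hit rate for the week:

```python
# Picks listed above: predicted winners that did and did not win.
correct = ["Colts", "Cowboys", "Bills", "Texans", "Packers", "49ers", "Bengals"]
incorrect = ["Chargers", "Cardinals", "Titans", "Jets", "Buccaneers"]

total = len(correct) + len(incorrect)
accuracy = len(correct) / total

print(f"{len(correct)} of {total} correct ({accuracy:.1%})")
# prints: 7 of 12 correct (58.3%)
```

Seven of twelve is better than a coin flip, but one week of games is a very small sample.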

Conclusion and Further Exploration

At the conclusion of this tutorial, we believe our code is well written, well documented, and reproducible, and that it helps the reader understand the tutorial. The data we gathered, the methodology we used, the visualizations we made, and the machine learning model we produced are all meant to give readers and users an idea of how future games can be predicted. While there is a lot of data being crunched, we hope our tutorial was beneficial to you.

While we are able to predict many games ahead, our machine learning model makes some "controversial calls." Controversial calls do make the NFL fun and interesting, because on any given week a last-place team can beat one of the best teams in the league. In further exploration, it would be interesting to gather more data from more sources. We believe our machine learning model is great, but nothing is perfect the first time. We hope to keep improving our model after our CMSC320 class, and possibly make some money (legally, of course)!