In football and sports betting, predicting winners matters a great deal: it is a multi-billion-dollar industry. Sports bettors often rely on instinct to decide which team to back, but we want to replace instinct with a machine learning model that gives a better prediction of who wins a game.
Our first step in this tutorial was to tidy the data we collected from Pro Football Focus. We went through six CSV files, three for offense and three for defense, and removed the columns we believe won't help us predict scores. The next step was to visualize the data, choosing specific stats that help us understand what information we should feed to our machine learning model.
The final step before getting scores is to build the machine learning model. Using stats gathered from previous weeks, we built the model to predict scores for a given week. For the home team it takes offensive stats (rushing yards, passing yards, and touchdowns scored) and defensive stats (rushing yards, passing yards, and touchdowns allowed), along with the same stats for the away team.
We make use of Python 3 along with a few imported libraries: pandas, numpy, matplotlib, seaborn, and more.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stat
import sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import seaborn as sns
import warnings
header_row = 0
The three offensive CSVs we chose came from Pro Football Focus Offensive Stats. You can gather whichever year's data you would like by replacing the four digits at the end of the URL. For “Team Offense”, “Passing Offense”, and “Rushing Offense”, press “Share & Export” and select “Get table as CSV (for Excel)”, then copy the output into a CSV file you want to read from. You can do the same for the Pro Football Focus Defensive Stats.
offense_stats = pd.read_csv('offense_stats.csv')
defense_stats = pd.read_csv('defense_stats.csv')
passing_stats = pd.read_csv('passing_stats.csv')
passdef_stats = pd.read_csv('passdef_stats.csv')
rushing_stats = pd.read_csv('rushing_stats.csv')
rushdef_stats = pd.read_csv('rushdef_stats.csv')
offense_stats.columns = offense_stats.iloc[header_row]
offense_stats = offense_stats.drop(header_row)
defense_stats.columns = defense_stats.iloc[header_row]
defense_stats = defense_stats.drop(header_row)
We read each file with .read_csv(). For the offense and defense files, the real header ended up as the first data row, so we set the column names from that row and then dropped the row itself.
After following the steps above, you will have six dataframes holding a lot of information, much of which we can remove. The code below shows which columns we drop from the offense and defense dataframes.
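If an export carries an extra grouping row above the real header, an alternative to promoting the first row by hand is to tell `read_csv` which line holds the column names. This is a minimal sketch, assuming a file laid out like the in-memory sample below; the sample layout is made up for illustration and is not the actual export.

```python
import io
import pandas as pd

# Hypothetical two-line header: a grouping row, then the real column names.
sample_csv = io.StringIO(
    ",,,Passing,Rushing\n"
    "Rk,Tm,G,Yds,Yds\n"
    "1,Arizona Cardinals,13,3279,1593\n"
)

# header=1 uses the second line (zero-based index 1) as the column names
# and ignores the line above it, so no promote-and-drop step is needed.
sample = pd.read_csv(sample_csv, header=1)
print(sample.columns.tolist())
print(sample.dtypes)
```

One practical difference: read this way, numeric columns are parsed as numbers, whereas promoting a data row to the header leaves the remaining values as strings that need `.astype(float)` later.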
offense_stats = pd.read_csv('offense_stats.csv') # Offensive Stats
offense_stats.columns = offense_stats.iloc[header_row]
offense_stats = offense_stats.drop(header_row)
offense_stats = offense_stats.drop(columns=['Pen'])
offense_stats = offense_stats.drop(columns=['TO%'])
offense_stats = offense_stats.drop(columns=['Sc%'])
offense_stats = offense_stats.drop(columns=['Rk'])
offense_stats = offense_stats.drop(columns=['1stPy'])
offense_stats = offense_stats.iloc[: , :-2]
offense_stats = offense_stats.drop(columns=['1stD'])
offense_stats = offense_stats.drop(columns=['FL'])
offense_stats.columns = ['Team', 'Games','Total Points For','Total Yards', 'Total Plays', 'Total Yards/Play', 'Total Turnovers', 'Passing Completions', 'Passing Attempts', 'Passing Yards',\
'Passing TD', 'Passing Int', 'Net Yards/Attempt', 'Rushing Attempts', 'Total Rushing Yards', 'Rushing TDs', 'Yards/Attempt']
offense_stats = offense_stats.drop([33, 34,35])
offense_stats = offense_stats.sort_values(by=['Team']).reset_index(drop=True)
offense_stats.head()
| | Team | Games | Total Points For | Total Yards | Total Plays | Total Yards/Play | Total Turnovers | Passing Completions | Passing Attempts | Passing Yards | Passing TD | Passing Int | Net Yards/Attempt | Rushing Attempts | Total Rushing Yards | Rushing TDs | Yards/Attempt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Arizona Cardinals | 13 | 366 | 4872 | 845 | 5.8 | 13 | 304 | 420 | 3279 | 22 | 10 | 7.3 | 394 | 1593 | 21 | 4.0 |
1 | Atlanta Falcons | 13 | 245 | 4113 | 805 | 5.1 | 20 | 307 | 461 | 2929 | 17 | 14 | 6.0 | 318 | 1184 | 8 | 3.7 |
2 | Baltimore Ravens | 13 | 304 | 5044 | 921 | 5.5 | 20 | 304 | 467 | 3161 | 17 | 14 | 6.1 | 407 | 1883 | 14 | 4.6 |
3 | Buffalo Bills | 13 | 363 | 4978 | 853 | 5.8 | 18 | 330 | 502 | 3405 | 28 | 12 | 6.5 | 329 | 1573 | 13 | 4.8 |
4 | Carolina Panthers | 13 | 257 | 4038 | 838 | 4.8 | 23 | 262 | 447 | 2612 | 11 | 17 | 5.5 | 359 | 1426 | 15 | 4.0 |
We removed nine columns from the original offensive dataframe because, based on our research, they showed high levels of deviation that would distort our predicted scores. Penalties, rank, first downs, first downs by penalty (1stPy), and fumbles lost were categories we felt were already covered by other stats we are crunching. Turnover percentage (TO%) and scoring percentage (Sc%) are better captured by numbers from the opposing defense. The stats we focus on are: 'Team', 'Games', 'Total Points For', 'Total Yards', 'Total Plays', 'Total Yards/Play', 'Total Turnovers', 'Passing Completions', 'Passing Attempts', 'Passing Yards', 'Passing TD', 'Passing Int', 'Net Yards/Attempt', 'Rushing Attempts', 'Total Rushing Yards', 'Rushing TDs', 'Yards/Attempt'
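The repeated single-column `drop` calls can also be written as one call with a list of labels. The sketch below uses a toy frame with placeholder values (the team names and numbers are not from the real export) to show the idea.

```python
import pandas as pd

# Toy frame standing in for the raw offensive table (placeholder values).
raw = pd.DataFrame({
    "Rk": [1, 2],
    "Tm": ["Team A", "Team B"],
    "PF": [300, 250],
    "Pen": [80, 75],
    "Sc%": [40.0, 30.0],
    "TO%": [8.0, 12.0],
})

# One drop call with a list of labels; errors="ignore" skips any label
# that is not present (here "1stPy") instead of raising a KeyError.
unwanted = ["Pen", "TO%", "Sc%", "Rk", "1stPy"]
tidy = raw.drop(columns=unwanted, errors="ignore")
print(tidy.columns.tolist())
```

Keeping the unwanted labels in one list also makes it easier to see at a glance which stats were excluded.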
passing_stats = pd.read_csv('passing_stats.csv') # Passing Stats
passing_stats = passing_stats.drop(columns=['4QC'])
passing_stats = passing_stats.drop(columns=['GWD'])
passing_stats = passing_stats.drop(columns=['Sk%'])
passing_stats = passing_stats.drop(columns=['Rk'])
passing_stats = passing_stats.drop(columns=['EXP'])
passing_stats = passing_stats.drop(columns=['Yds.1'])
passing_stats = passing_stats.drop(columns=['Int%'])
passing_stats = passing_stats.drop(columns=['TD%'])
passing_stats = passing_stats.drop(columns=['Rate'])
passing_stats = passing_stats.drop(columns=['Sk'])
passing_stats = passing_stats.drop(columns=['Lng'])
passing_stats = passing_stats.rename(columns={"Tm": "Team", "G": "Games", "Ply": "Total Ply", "Y/P":"Total Y/P", 'Cmp':'Completions', 'Cmp%':'Completion %'})
passing_stats.rename(columns={passing_stats.columns[5]: "Total Yards" }, inplace = True)
passing_stats = passing_stats.drop([32, 33,34])
passing_stats = passing_stats.sort_values(by=['Team']).reset_index(drop=True)
passing_stats.head()
| | Team | Games | Completions | Att | Completion % | Total Yards | TD | Int | Y/A | AY/A | Y/C | Y/G | NY/A | ANY/A |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Arizona Cardinals | 13.0 | 304.0 | 420.0 | 72.4 | 3279.0 | 22.0 | 10.0 | 8.4 | 8.4 | 11.6 | 252.2 | 7.3 | 7.2 |
1 | Atlanta Falcons | 13.0 | 307.0 | 461.0 | 66.6 | 2929.0 | 17.0 | 14.0 | 6.8 | 6.1 | 10.2 | 225.3 | 6.0 | 5.4 |
2 | Baltimore Ravens | 13.0 | 304.0 | 467.0 | 65.1 | 3161.0 | 17.0 | 14.0 | 7.3 | 6.7 | 11.2 | 243.2 | 6.1 | 5.6 |
3 | Buffalo Bills | 13.0 | 330.0 | 502.0 | 65.7 | 3405.0 | 28.0 | 12.0 | 7.1 | 7.1 | 10.8 | 261.9 | 6.5 | 6.5 |
4 | Carolina Panthers | 13.0 | 262.0 | 447.0 | 58.6 | 2612.0 | 11.0 | 17.0 | 6.3 | 5.1 | 10.7 | 200.9 | 5.5 | 4.3 |
As with the offense dataframe, we tidied the passing data down to the stats we will feed to our machine learning model.
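The `drop([32, 33, 34])` calls remove rows by position. Presumably these are non-team summary rows at the bottom of each export; that is an assumption, since the raw files are not shown here. A position-independent sketch filters by label instead. The label `"League Total"` and the summed figure are hypothetical.

```python
import pandas as pd

# Toy table; "League Total" is a hypothetical label for a summary row.
table = pd.DataFrame({
    "Tm": ["Buffalo Bills", "Arizona Cardinals", "League Total"],
    "Yds": [3405.0, 3279.0, 6684.0],
})

# Assumption: adjust this set to whatever summary labels your export uses.
summary_labels = {"League Total"}

# Filter by label rather than by row position, then sort and renumber
# so that every table ends up in the same team order.
teams_only = (
    table[~table["Tm"].isin(summary_labels)]
    .sort_values(by="Tm")
    .reset_index(drop=True)
)
print(teams_only)
```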
rushing_stats = pd.read_csv('rushing_stats.csv')
rushing_stats = rushing_stats.drop(columns=['Rk'])
rushing_stats = rushing_stats.drop(columns=['EXP'])
rushing_stats = rushing_stats.drop(columns=['Lng'])
rushing_stats = rushing_stats.rename(columns={"Tm": "Team", "G": "Games"})
rushing_stats.rename(columns={rushing_stats.columns[3]: "Total Yards" }, inplace = True)
rushing_stats = rushing_stats.drop([32,33,34])
rushing_stats = rushing_stats.sort_values(by=['Team']).reset_index(drop=True)
rushing_stats.head()
| | Team | Games | Att | Total Yards | TD | Y/A | Y/G | Fmb |
---|---|---|---|---|---|---|---|---|
0 | Arizona Cardinals | 13.0 | 394.0 | 1593.0 | 21.0 | 4.0 | 122.5 | 25.0 |
1 | Atlanta Falcons | 13.0 | 318.0 | 1184.0 | 8.0 | 3.7 | 91.1 | 18.0 |
2 | Baltimore Ravens | 13.0 | 407.0 | 1883.0 | 14.0 | 4.6 | 144.8 | 16.0 |
3 | Buffalo Bills | 13.0 | 329.0 | 1573.0 | 13.0 | 4.8 | 121.0 | 19.0 |
4 | Carolina Panthers | 13.0 | 359.0 | 1426.0 | 15.0 | 4.0 | 109.7 | 15.0 |
With passing handled, we tidy the running game for NFL teams the same way. We removed rank, expected points (which is what we are attempting to predict), and the longest run each team had.
defense_stats = pd.read_csv('defense_stats.csv', index_col=False) # Defensive Stats
defense_stats.columns = defense_stats.iloc[header_row]
defense_stats = defense_stats.drop(header_row)
defense_stats = defense_stats.drop(columns= ["FL"])
defense_stats = defense_stats.drop(columns= ["1stD"])
defense_stats = defense_stats.drop(columns= ["TO%"])
defense_stats = defense_stats.drop(columns= ["Sc%"])
defense_stats = defense_stats.drop(columns= ["EXP"])
defense_stats = defense_stats.drop(columns= ["Pen"])
defense_stats = defense_stats.drop(columns= ["1stPy"])
defense_stats = defense_stats.drop(columns= ["Int"])
defense_stats = defense_stats.drop(columns= ["TD"])
defense_stats = defense_stats.drop(columns= ["Cmp"])
defense_stats = defense_stats.drop(columns= ["Att"])
# defense_stats = defense_stats.drop(columns= defense_stats.columns[[16]], axis=1)
defense_stats = defense_stats.iloc[:,~defense_stats.columns.duplicated()]
defense_stats = defense_stats.rename(columns={"Tm": "Team", "G": "Games", "Yds": "Total Yards Allowed", "Ply": "Total Ply", "NY/P":"Net Yards per Pass Allowed", "Y/A":"Rushing Yards Allowed"\
, "TO":"Total TOs", "PA":"Points Allowed"})
defense_stats = defense_stats.drop([33, 34,35])
defense_stats = defense_stats.sort_values(by=['Team']).reset_index(drop=True)
defense_stats.head()
| | Rk | Team | Games | Points Allowed | Total Yards Allowed | Total Ply | Y/P | Total TOs | NY/A | Rushing Yards Allowed |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | Arizona Cardinals | 13 | 254 | 4181 | 789 | 5.3 | 23 | 5.7 | 4.7 |
1 | 28 | Atlanta Falcons | 13 | 353 | 4739 | 848 | 5.6 | 15 | 6.6 | 4.2 |
2 | 10 | Baltimore Ravens | 13 | 284 | 4570 | 773 | 5.9 | 11 | 7.0 | 3.9 |
3 | 3 | Buffalo Bills | 13 | 229 | 3756 | 787 | 4.8 | 26 | 5.3 | 4.1 |
4 | 8 | Carolina Panthers | 13 | 282 | 3809 | 783 | 4.9 | 15 | 5.6 | 4.1 |
Before we get to the model itself, we need to tidy the defense as well. Our machine learning model takes both offense and defense into account when predicting the outcome of a given week's matchups. Here we keep each defense's rank along with points allowed, yards allowed, total defensive snaps, yards allowed per play, total turnovers created, net yards per pass attempt allowed, and rushing yards per attempt allowed.
passdef_stats = pd.read_csv('passdef_stats.csv') # Passing Defense Stats
passdef_stats = passdef_stats.drop(columns= ["Rk"])
passdef_stats = passdef_stats.drop(columns= ["QBHits"])
passdef_stats = passdef_stats.drop(columns= ["PD"])
passdef_stats = passdef_stats.drop(columns= ["Rate"])
passdef_stats = passdef_stats.drop(columns= ["Sk"])
passdef_stats = passdef_stats.drop(columns= ["Yds.1"])
passdef_stats = passdef_stats.drop(columns= ["TFL"])
passdef_stats = passdef_stats.drop(columns= ["Sk%"])
passdef_stats = passdef_stats.drop(columns= ["EXP"])
passdef_stats = passdef_stats.drop([32,33, 34])
passdef_stats = passdef_stats.sort_values(by=['Tm']).reset_index(drop=True)
passdef_stats.head()
| | Tm | G | Cmp | Att | Cmp% | Yds | TD | TD% | Int | Int% | Y/A | AY/A | Y/C | Y/G | NY/A | ANY/A |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Arizona Cardinals | 13.0 | 288.0 | 442.0 | 65.2 | 2728.0 | 19.0 | 4.3 | 12.0 | 2.7 | 6.6 | 6.3 | 10.2 | 209.8 | 5.7 | 5.4 |
1 | Atlanta Falcons | 13.0 | 321.0 | 468.0 | 68.6 | 3217.0 | 26.0 | 5.6 | 8.0 | 1.7 | 7.2 | 7.5 | 10.5 | 247.5 | 6.6 | 7.0 |
2 | Baltimore Ravens | 13.0 | 281.0 | 465.0 | 60.4 | 3459.0 | 21.0 | 4.5 | 6.0 | 1.3 | 7.8 | 8.2 | 13.0 | 266.1 | 7.0 | 7.4 |
3 | Buffalo Bills | 13.0 | 245.0 | 417.0 | 58.8 | 2334.0 | 10.0 | 2.4 | 16.0 | 3.8 | 5.9 | 4.7 | 10.0 | 179.5 | 5.3 | 4.1 |
4 | Carolina Panthers | 13.0 | 254.0 | 383.0 | 66.3 | 2310.0 | 18.0 | 4.7 | 8.0 | 2.1 | 6.7 | 6.7 | 10.1 | 177.7 | 5.6 | 5.6 |
To figure out who can or cannot hold up against the pass offenses of the NFL, we use the pass defense stats we gathered from Pro Football Focus. These are the stats each defense has allowed. We feed this information to the machine learning model because no pass offense produces the same numbers every week, and the strength of the opposing defense matters.
rushdef_stats = pd.read_csv('rushdef_stats.csv')
rushdef_stats = rushdef_stats.drop(columns= ["EXP"])
rushdef_stats = rushdef_stats.drop(columns= ["Rk"])
rushdef_stats = rushdef_stats.drop([32,33, 34])
rushdef_stats = rushdef_stats.sort_values(by=['Tm']).reset_index(drop=True)
rushdef_stats.head()
| | Tm | G | Att | Yds | TD | Y/A | Y/G |
---|---|---|---|---|---|---|---|
0 | Arizona Cardinals | 13.0 | 312.0 | 1453.0 | 8.0 | 4.7 | 111.8 |
1 | Atlanta Falcons | 13.0 | 364.0 | 1522.0 | 12.0 | 4.2 | 117.1 |
2 | Baltimore Ravens | 13.0 | 282.0 | 1111.0 | 10.0 | 3.9 | 85.5 |
3 | Buffalo Bills | 13.0 | 347.0 | 1422.0 | 14.0 | 4.1 | 109.4 |
4 | Carolina Panthers | 13.0 | 368.0 | 1499.0 | 11.0 | 4.1 | 115.3 |
As with the passing stats, we tidy the rushing defense stats so we have the information we need to feed our machine learning model.
With the data tidied, we want to visualize it to make sure we understand what we are working with. This tutorial is meant to show how we arrived at our machine learning model, and that started with the visualizations below.
offense_stats['Points Per Game'] = offense_stats['Total Points For'].astype(float) / offense_stats['Games'].astype(float)
ppg = offense_stats['Points Per Game']
defense_stats['Points Allowed Per Game'] = defense_stats['Points Allowed'].astype(float) / defense_stats['Games'].astype(float)
papg = defense_stats['Points Allowed Per Game']
plt.scatter(offense_stats['Points Per Game'], defense_stats['Points Allowed Per Game'])
plt.xlabel("Points Per Game")
plt.ylabel("Points Allowed Per Game")
m, b = np.polyfit(offense_stats['Points Per Game'].head(32), defense_stats['Points Allowed Per Game'], 1)
plt.plot(offense_stats['Points Per Game'], m*offense_stats['Points Per Game'] + b)
for i, row in offense_stats.iterrows():
    plt.annotate(row.Team, (ppg[i], papg[i]))
Our first visualization shows how teams stack up in Points Allowed Per Game versus Points Scored Per Game. What we want to see is the quadrant in which each team falls. The bottom right is the best (and is where most of the current playoff contenders are), while the top left holds teams that allow the most points without scoring nearly as many (these tend to be the teams picking near the top of the draft).
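The quadrants are read by eye in this tutorial. As one possible way to make them explicit, the sketch below labels teams by which side of the sample mean they fall on for each axis. Splitting at the mean is an assumption made for illustration, and only three teams (with point totals taken from the tables above) are used to keep it short.

```python
import pandas as pd

# Points for and against over 13 games, taken from the tables above.
sample = pd.DataFrame({
    "Team": ["Arizona Cardinals", "Atlanta Falcons", "Buffalo Bills"],
    "Points Per Game": [366 / 13, 245 / 13, 363 / 13],
    "Points Allowed Per Game": [254 / 13, 353 / 13, 229 / 13],
})

# Hypothetical rule: split the plot at the sample mean of each axis.
x_mid = sample["Points Per Game"].mean()
y_mid = sample["Points Allowed Per Game"].mean()


def quadrant(row):
    """Return the plot quadrant for one team as 'top/bottom left/right'."""
    horizontal = "right" if row["Points Per Game"] >= x_mid else "left"
    vertical = "top" if row["Points Allowed Per Game"] >= y_mid else "bottom"
    return f"{vertical} {horizontal}"


sample["Quadrant"] = sample.apply(quadrant, axis=1)
print(sample[["Team", "Quadrant"]])
```

With all 32 teams in the frame, the same function would tag every point on the scatter plot.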
passing_stats['Passing Efficiency'] = ((passing_stats["Completion %"].astype(float) * (passing_stats["Total Yards"].astype(float)/passing_stats["Att"].astype(float))) + \
(passing_stats["Completion %"].astype(float) * (passing_stats["TD"].astype(float)/passing_stats["Att"].astype(float))*100*6) - \
(passing_stats["Completion %"].astype(float) * (passing_stats["Int"].astype(float)/passing_stats["Att"].astype(float))*100*3)) / 100
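# Note: the divisor here is the team's own offensive attempts (passing_stats["Att"]),
# not the attempts its defense faced (passdef_stats["Att"]); the values shown later rely on this.
passdef_stats['Defensive Passing Efficiency'] = (passdef_stats["Int%"].astype(float))*3 - \
(passdef_stats["TD%"].astype(float)/passing_stats["Att"].astype(float)*100)*(6) - \
(passdef_stats["Y/C"].astype(float)/passing_stats["Att"].astype(float)*100*3)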
plt.scatter(passing_stats['Passing Efficiency'], passdef_stats['Defensive Passing Efficiency'])
plt.xlabel("Passing Efficiency")
plt.ylabel("Defensive Passing Efficiency")
m, b = np.polyfit(passing_stats['Passing Efficiency'].head(32), passdef_stats['Defensive Passing Efficiency'], 1)
plt.plot(passing_stats['Passing Efficiency'], m*passing_stats['Passing Efficiency'] + b)
for i, row in passing_stats.iterrows():
    plt.annotate(row.Team, (passing_stats['Passing Efficiency'][i], passdef_stats['Defensive Passing Efficiency'][i]))
Our second visualization compares Defensive Passing Efficiency with Offensive Passing Efficiency across NFL teams. A high Passing Efficiency tells us a team tends to take advantage of its pass game. Defensive Passing Efficiency tells us how teams match up on the other side of the ball: teams with a lower Defensive Passing Efficiency tend to let their opponents pass efficiently. Your team wants to be in the top right corner; a team in the bottom left needs help in both its offensive and defensive pass game.
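To make the offensive Passing Efficiency calculation concrete, here is the same arithmetic as the dataframe expression, written as a small standalone function and applied to the Arizona Cardinals row from the passing table. The function name is introduced here for illustration only.

```python
def passing_efficiency(cmp_pct, yards, td, interceptions, attempts):
    """Offensive passing efficiency, mirroring the dataframe expression."""
    yards_term = cmp_pct * (yards / attempts)
    td_term = cmp_pct * (td / attempts) * 100 * 6          # touchdowns weighted by 6
    int_term = cmp_pct * (interceptions / attempts) * 100 * 3  # interceptions weighted by 3
    return (yards_term + td_term - int_term) / 100


# Arizona Cardinals: 72.4% completions, 3279 yards, 22 TD, 10 Int, 420 attempts.
arizona = passing_efficiency(
    cmp_pct=72.4, yards=3279.0, td=22.0, interceptions=10.0, attempts=420.0
)
print(round(arizona, 6))
```

The result agrees with the PassEfficiency value of 23.235229 that appears for Arizona in the combined dataframe later on.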
rushing_stats['Rushing Efficiency'] = (rushing_stats["Y/A"].astype(float))\
+ (rushing_stats["TD"].astype(float) /rushing_stats["Att"].astype(float)) * 100*6\
- (rushing_stats["Fmb"].astype(float) /rushing_stats["Att"].astype(float)) * 100*3
rushdef_stats['Defensive Rushing Efficiency'] = (rushdef_stats["Y/A"].astype(float)) - ((rushdef_stats["TD"].astype(float) /rushdef_stats["Att"].astype(float)) * 100)*6
plt.scatter(rushing_stats['Rushing Efficiency'], rushdef_stats['Defensive Rushing Efficiency'])
plt.xlabel("Rushing Efficiency")
plt.ylabel("Defensive Rushing Efficiency")
m, b = np.polyfit(rushing_stats['Rushing Efficiency'].head(32), rushdef_stats['Defensive Rushing Efficiency'], 1)
plt.plot(rushing_stats['Rushing Efficiency'], m*rushing_stats['Rushing Efficiency'] + b)
for i, row in rushing_stats.iterrows():
    plt.annotate(row.Team, (rushing_stats['Rushing Efficiency'][i], rushdef_stats['Defensive Rushing Efficiency'][i]))
As with passing, we build the same visualization for the run game, comparing Defensive Rushing Efficiency with Offensive Rushing Efficiency. Teams that run the ball efficiently and also defend the run well are found in the bottom right corner of the graph. Teams that lack a productive run offense and, literally and figuratively, get "run" on are found in the top left corner. The Houston Texans appear to be by far the worst team at both running the ball and defending the run.
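The two rushing metrics can be checked by hand in the same way. The sketch below restates the dataframe expressions as plain functions (the function names are illustrative) and applies them to the Houston Texans' offensive row and the Arizona Cardinals' defensive row from the tables above.

```python
def rushing_efficiency(yards_per_attempt, td, fumbles, attempts):
    """Offensive rushing efficiency, mirroring the dataframe expression."""
    return (
        yards_per_attempt
        + (td / attempts) * 100 * 6       # touchdowns weighted by 6
        - (fumbles / attempts) * 100 * 3  # fumbles weighted by 3
    )


def defensive_rushing_efficiency(yards_per_attempt, td, attempts):
    """Defensive rushing efficiency: yards per attempt minus weighted TD rate allowed."""
    return yards_per_attempt - (td / attempts) * 100 * 6


# Houston Texans offense: 3.3 Y/A, 6 TD, 16 fumbles, 310 attempts.
houston_offense = rushing_efficiency(3.3, 6.0, 16.0, 310.0)

# Arizona Cardinals defense: 4.7 Y/A allowed, 8 TD allowed, 312 attempts faced.
arizona_defense = defensive_rushing_efficiency(4.7, 8.0, 312.0)

print(round(houston_offense, 6), round(arizona_defense, 6))
```

Houston's negative offensive figure comes from the fumble penalty outweighing the yards and touchdown terms combined.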
# Run vs Pass
offense_stats['Total Plays(Only Run and Pass'] = offense_stats['Passing Attempts'].astype(float) + offense_stats['Rushing Attempts'].astype(float)
offense_stats['Pass Percent'] = offense_stats['Passing Attempts'].astype(float)/offense_stats['Total Plays(Only Run and Pass'].astype(float)
offense_stats['Run Percent'] = offense_stats['Rushing Attempts'].astype(float)/offense_stats['Total Plays(Only Run and Pass'].astype(float)
plt.scatter(offense_stats['Pass Percent'], offense_stats['Run Percent'])
plt.xlabel("Pass Percent")
plt.ylabel("Run Percent")
m, b = np.polyfit(offense_stats['Pass Percent'].head(32), offense_stats['Run Percent'], 1)
plt.plot(offense_stats['Pass Percent'], m*offense_stats['Pass Percent'] + b)
for i, row in offense_stats.iterrows():
    plt.annotate(row.Team, (offense_stats['Pass Percent'][i], offense_stats['Run Percent'][i]))
The fitted line may not seem to say much at first: because only runs and passes are counted, each team's two percentages sum to 1, so every point sits on the same line. What the graph does show is where each team falls along it, with teams that run more than others toward the top left and teams that pass more toward the bottom right. This graph will make a lot more sense once we put it into our regression and machine learning model.
# 1348 is the rushing-yard total of the league's leading rusher (Jonathan Taylor) used for the comparison below
jt_df = rushing_stats.loc[rushing_stats['Total Yards'] < 1348]
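Since Pass Percent and Run Percent are both shares of the same run-plus-pass total, a straight-line fit through them should come out as y = 1 - x. The short check below uses the passing and rushing attempts of the first three teams in the offense table.

```python
import numpy as np

# Passing and rushing attempts for Arizona, Atlanta, and Baltimore.
passing_attempts = np.array([420.0, 461.0, 467.0])
rushing_attempts = np.array([394.0, 318.0, 407.0])

total = passing_attempts + rushing_attempts
pass_percent = passing_attempts / total
run_percent = rushing_attempts / total

# Because the two shares sum to 1, a degree-1 fit recovers
# slope -1 and intercept 1 (up to floating-point error).
slope, intercept = np.polyfit(pass_percent, run_percent, 1)
print(slope, intercept)
```

So the line itself carries no information; what matters is each team's position along it.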
jt_df
| | Team | Games | Att | Total Yards | TD | Y/A | Y/G | Fmb | Rushing Efficiency |
---|---|---|---|---|---|---|---|---|---|
1 | Atlanta Falcons | 13.0 | 318.0 | 1184.0 | 8.0 | 3.7 | 91.1 | 18.0 | 1.813208 |
12 | Houston Texans | 13.0 | 310.0 | 1008.0 | 6.0 | 3.3 | 77.5 | 16.0 | -0.570968 |
14 | Jacksonville Jaguars | 13.0 | 288.0 | 1335.0 | 11.0 | 4.6 | 102.7 | 19.0 | 7.725000 |
16 | Las Vegas Raiders | 13.0 | 288.0 | 1100.0 | 11.0 | 3.8 | 84.6 | 18.0 | 7.966667 |
18 | Los Angeles Rams | 13.0 | 311.0 | 1264.0 | 8.0 | 4.1 | 97.2 | 10.0 | 9.887781 |
19 | Miami Dolphins | 13.0 | 311.0 | 1030.0 | 9.0 | 3.3 | 79.2 | 22.0 | -0.558521 |
23 | New York Giants | 13.0 | 303.0 | 1225.0 | 8.0 | 4.0 | 94.2 | 16.0 | 4.000000 |
24 | New York Jets | 13.0 | 277.0 | 1094.0 | 9.0 | 3.9 | 84.2 | 11.0 | 11.481227 |
26 | Pittsburgh Steelers | 13.0 | 308.0 | 1149.0 | 8.0 | 3.7 | 88.4 | 15.0 | 4.674026 |
29 | Tampa Bay Buccaneers | 13.0 | 292.0 | 1248.0 | 14.0 | 4.3 | 96.0 | 11.0 | 21.765753 |
With this analysis, we wanted to compare teams' rushing totals against the NFL's rushing leader. We can conclude that the Jags, Rams, Bucs, Giants, Falcons, Steelers, Raiders, Jets, Dolphins, and Texans each have fewer total rushing yards than Jonathan Taylor alone, a surprising stat of the kind ESPN might show you.
df = pd.DataFrame(offense_stats['Team'])
df['RushPercent'] = offense_stats['Run Percent']
df['RushEfficiency'] = rushing_stats['Rushing Efficiency']
df['PassPercent'] = offense_stats['Pass Percent']
df['PassEfficiency'] = passing_stats['Passing Efficiency']
df['DRushEfficiency'] = rushdef_stats['Defensive Rushing Efficiency']
df['DPassEfficiency'] = passdef_stats['Defensive Passing Efficiency']
df['DefenseRank'] = defense_stats['Rk']
df['OffensiveTO%'] = offense_stats['Total Turnovers'].astype(float)/offense_stats['Total Plays'].astype(float)
df['DefensiveTO%'] = defense_stats['Total TOs'].astype(float)/defense_stats['Total Ply'].astype(float)
df.insert(10, "Value", 1)
df.head()
| | Team | RushPercent | RushEfficiency | PassPercent | PassEfficiency | DRushEfficiency | DPassEfficiency | DefenseRank | OffensiveTO% | DefensiveTO% | Value |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Arizona Cardinals | 0.484029 | 16.944162 | 0.515971 | 23.235229 | -10.684615 | -5.328571 | 4 | 0.015385 | 0.029151 | 1 |
1 | Atlanta Falcons | 0.408216 | 1.813208 | 0.591784 | 12.899597 | -15.580220 | -9.021475 | 28 | 0.024845 | 0.017689 | 1 |
2 | Baltimore Ravens | 0.465675 | 13.445209 | 0.534325 | 12.770473 | -17.376596 | -10.232762 | 10 | 0.021716 | 0.014230 | 1 |
3 | Buffalo Bills | 0.395909 | 11.182979 | 0.604091 | 21.732042 | -20.107493 | 2.555378 | 3 | 0.021102 | 0.033037 | 1 |
4 | Carolina Panthers | 0.445409 | 16.534819 | 0.554591 | 5.390676 | -13.834783 | -6.787248 | 8 | 0.027446 | 0.019157 | 1 |
This dataframe contains each offense's share of run and pass plays, the rushing and passing efficiencies on both sides of the ball, each defense's rank, and the offensive and defensive turnover rates. We calculated these stats during our data visualization, and all of this information will be fed to our model to predict winners.
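The column assignments above line rows up by index, which works because every source frame was sorted by team name and re-indexed first. A hedged alternative, not what the code above does, is to join on the team name so that row order no longer matters. The sketch uses two toy frames with efficiencies copied from the tables in this tutorial.

```python
import pandas as pd

# Two toy frames listed in different row orders.
rushing = pd.DataFrame({
    "Team": ["Atlanta Falcons", "Arizona Cardinals"],
    "Rushing Efficiency": [1.813208, 16.944162],
})
passing = pd.DataFrame({
    "Team": ["Arizona Cardinals", "Atlanta Falcons"],
    "Passing Efficiency": [23.235229, 12.899597],
})

# Merging on the team name aligns rows by key instead of by position;
# validate="one_to_one" raises if a team appears twice in either frame.
combined = rushing.merge(passing, on="Team", how="inner", validate="one_to_one")
print(combined)
```

For the defensive tables, the team column would first need to be renamed from `Tm` to `Team` (or passed via `left_on`/`right_on`).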
We developed a formula which weights efficiency of both offense and defense and factors in turnover percentage. We used this formula to calculate values, which will help us simulate who wins each game.
The formula is as follows:

$$
\text{Value} = \big(\text{RushPercent} \cdot \text{RushEfficiency} + \text{PassPercent} \cdot \text{PassEfficiency}\big) - \big(\text{RushPercent} \cdot \text{DRushEfficiency} - \text{PassPercent} \cdot \text{DPassEfficiency}\big) - \big(\text{DefenseRank} \cdot (\text{OffensiveTO\%} - \text{DefensiveTO\%}) + 1.0\big)
$$
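As a sanity check on the formula, the sketch below evaluates it for a single team using the Arizona Cardinals' inputs as displayed in the dataframe above. The function and argument names are illustrative; because the displayed inputs are rounded to six decimals, the result matches the dataframe's Value only to a few decimal places.

```python
def team_value(rush_pct, rush_eff, pass_pct, pass_eff,
               d_rush_eff, d_pass_eff, defense_rank, off_to, def_to):
    """Evaluate the value formula for one team."""
    offense = rush_pct * rush_eff + pass_pct * pass_eff
    defense = rush_pct * d_rush_eff - pass_pct * d_pass_eff
    turnovers = defense_rank * (off_to - def_to) + 1.0
    return offense - defense - turnovers


# Arizona Cardinals row (rounded values as displayed in the dataframe).
arizona_value = team_value(
    rush_pct=0.484029, rush_eff=16.944162,
    pass_pct=0.515971, pass_eff=23.235229,
    d_rush_eff=-10.684615, d_pass_eff=-5.328571,
    defense_rank=4, off_to=0.015385, def_to=0.029151,
)
print(round(arizona_value, 3))
```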
df['Value'] = ((df['RushPercent'] * df['RushEfficiency']) + (df['PassPercent'] * df['PassEfficiency'])) - ((df['RushPercent'] * df['DRushEfficiency']) - (df['PassPercent'] * df['DPassEfficiency'])) - (df['DefenseRank'].astype(float) * (df['OffensiveTO%'] - df['DefensiveTO%']) + 1.0)
df
| | Team | RushPercent | RushEfficiency | PassPercent | PassEfficiency | DRushEfficiency | DPassEfficiency | DefenseRank | OffensiveTO% | DefensiveTO% | Value |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Arizona Cardinals | 0.484029 | 16.944162 | 0.515971 | 23.235229 | -10.684615 | -5.328571 | 4 | 0.015385 | 0.029151 | 21.667515 |
1 | Atlanta Falcons | 0.408216 | 1.813208 | 0.591784 | 12.899597 | -15.580220 | -9.021475 | 28 | 0.024845 | 0.017689 | 8.194912 |
2 | Baltimore Ravens | 0.465675 | 13.445209 | 0.534325 | 12.770473 | -17.376596 | -10.232762 | 10 | 0.021716 | 0.014230 | 14.634055 |
3 | Buffalo Bills | 0.395909 | 11.182979 | 0.604091 | 21.732042 | -20.107493 | 2.555378 | 3 | 0.021102 | 0.033037 | 26.095793 |
4 | Carolina Panthers | 0.445409 | 16.534819 | 0.554591 | 5.390676 | -13.834783 | -6.787248 | 8 | 0.027446 | 0.019157 | 11.686067 |
5 | Chicago Bears | 0.487316 | 5.321918 | 0.512684 | 7.360391 | -17.128571 | -14.615625 | 24 | 0.027813 | 0.013871 | 5.886256 |
6 | Cincinnati Bengals | 0.457143 | 15.079545 | 0.542857 | 24.079514 | -18.425597 | -6.749282 | 17 | 0.025894 | 0.020656 | 23.635434 |
7 | Cleveland Browns | 0.488918 | 17.800000 | 0.511082 | 14.376714 | -17.138938 | -8.730612 | 14 | 0.017478 | 0.019925 | 19.002153 |
8 | Dallas Cowboys | 0.422145 | 7.878689 | 0.577855 | 21.550491 | -18.000000 | 0.623952 | 12 | 0.020179 | 0.032847 | 22.890181 |
9 | Denver Broncos | 0.447570 | 17.457143 | 0.552430 | 17.011493 | -9.104823 | -5.094444 | 2 | 0.018405 | 0.020487 | 17.475842 |
10 | Detroit Lions | 0.402581 | 7.484615 | 0.597419 | 11.436870 | -17.597810 | -9.032397 | 29 | 0.022277 | 0.016726 | 10.373204 |
11 | Green Bay Packers | 0.428390 | 5.875148 | 0.571610 | 27.425721 | -13.089577 | -4.535920 | 7 | 0.012270 | 0.027743 | 20.316645 |
12 | Houston Texans | 0.422343 | -0.570968 | 0.577657 | 9.506925 | -27.652941 | -5.312264 | 30 | 0.025940 | 0.024648 | 12.822191 |
13 | Indianapolis Colts | 0.474010 | 23.115666 | 0.525990 | 21.102125 | -10.406832 | -5.752941 | 9 | 0.019277 | 0.036432 | 23.117910 |
14 | Jacksonville Jaguars | 0.380952 | 7.725000 | 0.619048 | 4.739989 | -23.055703 | -8.579487 | 27 | 0.032010 | 0.007238 | 7.680290 |
15 | Kansas City Chiefs | 0.384342 | 9.129630 | 0.615658 | 20.523971 | -14.880519 | -2.987283 | 6 | 0.026528 | 0.028931 | 19.039124 |
16 | Las Vegas Raiders | 0.367347 | 7.966667 | 0.632653 | 15.893520 | -18.100000 | -9.038710 | 31 | 0.020859 | 0.014994 | 12.730422 |
17 | Los Angeles Chargers | 0.378079 | 21.012378 | 0.621921 | 24.301259 | -22.851020 | -4.621782 | 26 | 0.017878 | 0.021403 | 27.914547 |
18 | Los Angeles Rams | 0.396178 | 9.887781 | 0.603822 | 28.982943 | -22.648673 | -1.679747 | 18 | 0.017370 | 0.021940 | 28.458764 |
19 | Miami Dolphins | 0.383005 | -0.558521 | 0.616995 | 13.657060 | -14.992605 | -5.137126 | 13 | 0.023725 | 0.023310 | 9.779691 |
20 | Minnesota Vikings | 0.428741 | 6.062050 | 0.571259 | 25.204557 | -15.468067 | -6.958004 | 25 | 0.012791 | 0.018648 | 18.800785 |
21 | New England Patriots | 0.490956 | 16.831579 | 0.509044 | 20.303787 | -6.282493 | -0.505584 | 1 | 0.020050 | 0.032541 | 20.438639 |
22 | New Orleans Saints | 0.485388 | 14.309424 | 0.514612 | 20.440346 | -14.104154 | -5.144444 | 11 | 0.019729 | 0.020606 | 20.672709 |
23 | New York Giants | 0.391473 | 4.000000 | 0.608527 | 9.590505 | -11.904348 | -4.620382 | 21 | 0.021197 | 0.021493 | 8.256796 |
24 | New York Jets | 0.354673 | 11.481227 | 0.645327 | 8.717179 | -31.680905 | -10.157143 | 32 | 0.030600 | 0.010551 | 12.737657 |
25 | Philadelphia Eagles | 0.520860 | 19.563107 | 0.479140 | 14.849248 | -17.910112 | -9.514512 | 16 | 0.015971 | 0.015719 | 21.070359 |
26 | Pittsburgh Steelers | 0.378378 | 4.674026 | 0.621622 | 16.249581 | -18.013699 | -6.454150 | 22 | 0.017773 | 0.015606 | 13.625938 |
27 | San Francisco 49ers | 0.486486 | 18.585714 | 0.513514 | 20.210088 | -24.102367 | -10.461654 | 20 | 0.022388 | 0.020253 | 24.730429 |
28 | Seattle Seahawks | 0.439060 | 13.530100 | 0.560940 | 24.162094 | -12.700000 | -9.779058 | 5 | 0.013889 | 0.013845 | 18.584389 |
29 | Tampa Bay Buccaneers | 0.341520 | 21.765753 | 0.658480 | 27.559247 | -15.749254 | -1.298046 | 19 | 0.018349 | 0.029940 | 29.324844 |
30 | Tennessee Titans | 0.494019 | 19.554237 | 0.505981 | 12.149064 | -20.573379 | -5.817021 | 15 | 0.024055 | 0.021924 | 21.995745 |
31 | Washington Football Team | 0.465432 | 2.608488 | 0.534568 | 16.074651 | -13.880795 | -10.419630 | 23 | 0.025000 | 0.017456 | 9.524130 |
We can now use these values to estimate the likelihood of one team beating another. The new dataframe shows, for each row team X and column team Y, the probability of X beating Y, computed as X's value divided by the sum of both values. When X and Y are the same team, the probability is 0.5.
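For a single matchup, the calculation reduces to one ratio. The sketch below uses the Value figures for Arizona and Atlanta from the dataframe above; the function name is introduced here for illustration. Note that the ratio only behaves like a probability when both values are positive, which holds for all 32 teams in this data.

```python
def win_probability(value_x, value_y):
    """Team X's share of the combined value of X and Y."""
    return value_x / (value_x + value_y)


arizona, atlanta = 21.667515, 8.194912

p_arizona = win_probability(arizona, atlanta)
p_atlanta = win_probability(atlanta, arizona)

print(round(p_arizona, 6), round(p_atlanta, 6))
```

The two results sum to 1, and a team matched against itself gets exactly 0.5.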
teams = df["Team"].tolist()
rows = []
for value_x in df["Value"]:
    # Probability that team X beats each team Y: X's share of the combined value
    rows.append([value_x / (value_x + value_y) for value_y in df["Value"]])
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
new_df = pd.DataFrame(rows, index=teams, columns=teams)
new_df
| | Arizona Cardinals | Atlanta Falcons | Baltimore Ravens | Buffalo Bills | Carolina Panthers | Chicago Bears | Cincinnati Bengals | Cleveland Browns | Dallas Cowboys | Denver Broncos | ... | New Orleans Saints | New York Giants | New York Jets | Philadelphia Eagles | Pittsburgh Steelers | San Francisco 49ers | Seattle Seahawks | Tampa Bay Buccaneers | Tennessee Titans | Washington Football Team |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Arizona Cardinals | 0.500000 | 0.725578 | 0.596875 | 0.453644 | 0.649631 | 0.786372 | 0.478280 | 0.532768 | 0.486280 | 0.553543 | ... | 0.511748 | 0.724077 | 0.629775 | 0.506986 | 0.613924 | 0.466993 | 0.538298 | 0.424917 | 0.496241 | 0.694658 |
Atlanta Falcons | 0.274422 | 0.500000 | 0.358970 | 0.238983 | 0.412199 | 0.581977 | 0.257456 | 0.301316 | 0.263628 | 0.319231 | ... | 0.283879 | 0.498119 | 0.391491 | 0.280022 | 0.375554 | 0.248894 | 0.306017 | 0.218416 | 0.271439 | 0.462492 |
Baltimore Ravens | 0.403125 | 0.641030 | 0.500000 | 0.359296 | 0.556003 | 0.713150 | 0.382395 | 0.435069 | 0.389989 | 0.455749 | ... | 0.414483 | 0.639297 | 0.534642 | 0.409867 | 0.517836 | 0.371758 | 0.440540 | 0.332903 | 0.399512 | 0.605760 |
Buffalo Bills | 0.546356 | 0.761017 | 0.640704 | 0.500000 | 0.690696 | 0.815951 | 0.524737 | 0.578647 | 0.532720 | 0.598917 | ... | 0.557978 | 0.759646 | 0.671993 | 0.553274 | 0.656965 | 0.513432 | 0.584057 | 0.470868 | 0.542628 | 0.732618 |
Carolina Panthers | 0.350369 | 0.587801 | 0.443997 | 0.309304 | 0.500000 | 0.665027 | 0.330849 | 0.380800 | 0.337980 | 0.400731 | ... | 0.361141 | 0.585977 | 0.478472 | 0.356756 | 0.461681 | 0.320900 | 0.386055 | 0.284950 | 0.346955 | 0.550965 |
Chicago Bears | 0.213628 | 0.418023 | 0.286850 | 0.184049 | 0.334973 | 0.500000 | 0.199388 | 0.236506 | 0.204551 | 0.251958 | ... | 0.221630 | 0.416194 | 0.316059 | 0.218360 | 0.301671 | 0.192256 | 0.240544 | 0.167170 | 0.211113 | 0.381967 |
Cincinnati Bengals | 0.521720 | 0.742544 | 0.617605 | 0.475263 | 0.669151 | 0.800612 | 0.500000 | 0.554333 | 0.508009 | 0.574914 | ... | 0.533433 | 0.741103 | 0.649805 | 0.528688 | 0.634315 | 0.488680 | 0.559818 | 0.446286 | 0.517967 | 0.712779 |
Cleveland Browns | 0.467232 | 0.698684 | 0.564931 | 0.421353 | 0.619200 | 0.763494 | 0.445667 | 0.500000 | 0.453595 | 0.520921 | ... | 0.478947 | 0.697098 | 0.598685 | 0.474194 | 0.582386 | 0.434508 | 0.505557 | 0.393200 | 0.463491 | 0.666128 |
Dallas Cowboys | 0.513720 | 0.736372 | 0.610011 | 0.467280 | 0.662020 | 0.795449 | 0.491991 | 0.546405 | 0.500000 | 0.567066 | ... | 0.525451 | 0.734909 | 0.642480 | 0.520698 | 0.626851 | 0.480678 | 0.551909 | 0.438383 | 0.509963 | 0.706175 |
Denver Broncos | 0.446457 | 0.680769 | 0.544251 | 0.401083 | 0.599269 | 0.748042 | 0.425086 | 0.479079 | 0.432934 | 0.500000 | ... | 0.458100 | 0.679131 | 0.578412 | 0.453374 | 0.561892 | 0.414058 | 0.484629 | 0.373410 | 0.442745 | 0.647254 |
Detroit Lions | 0.323751 | 0.558657 | 0.414808 | 0.284439 | 0.470242 | 0.637980 | 0.305017 | 0.353126 | 0.311851 | 0.372480 | ... | 0.334125 | 0.556801 | 0.448845 | 0.329899 | 0.432232 | 0.295502 | 0.358221 | 0.261303 | 0.320468 | 0.521336 |
Green Bay Packers | 0.483912 | 0.712576 | 0.581294 | 0.437741 | 0.634841 | 0.775359 | 0.462245 | 0.516716 | 0.470218 | 0.537584 | ... | 0.495657 | 0.711033 | 0.614645 | 0.490894 | 0.598559 | 0.451009 | 0.522265 | 0.409267 | 0.480158 | 0.680835 |
Houston Texans | 0.371769 | 0.610084 | 0.467005 | 0.329467 | 0.523178 | 0.685369 | 0.351701 | 0.402905 | 0.359041 | 0.423202 | ... | 0.382810 | 0.608293 | 0.501654 | 0.378319 | 0.484805 | 0.341446 | 0.408264 | 0.304225 | 0.368264 | 0.573794 |
Indianapolis Colts | 0.516193 | 0.738289 | 0.612363 | 0.469745 | 0.664232 | 0.797055 | 0.494465 | 0.548857 | 0.502475 | 0.569494 | ... | 0.527919 | 0.736833 | 0.644751 | 0.523168 | 0.629164 | 0.483150 | 0.554356 | 0.440822 | 0.512437 | 0.708225 |
Jacksonville Jaguars | 0.261699 | 0.483792 | 0.344186 | 0.227388 | 0.396579 | 0.566120 | 0.245253 | 0.287841 | 0.251232 | 0.305305 | ... | 0.270881 | 0.481913 | 0.376154 | 0.267134 | 0.360472 | 0.236968 | 0.292419 | 0.207547 | 0.258804 | 0.446414 |
Kansas City Chiefs | 0.467715 | 0.699093 | 0.565409 | 0.421827 | 0.619658 | 0.763845 | 0.446147 | 0.500486 | 0.454077 | 0.521406 | ... | 0.479432 | 0.697508 | 0.599152 | 0.474679 | 0.582859 | 0.434986 | 0.506043 | 0.393663 | 0.463974 | 0.666560 |
Las Vegas Raiders | 0.370093 | 0.608374 | 0.465217 | 0.327882 | 0.521386 | 0.683818 | 0.350065 | 0.401178 | 0.357389 | 0.421450 | ... | 0.381115 | 0.606580 | 0.499858 | 0.376631 | 0.483011 | 0.339833 | 0.406530 | 0.302707 | 0.366595 | 0.572037 |
Los Angeles Chargers | 0.562997 | 0.773054 | 0.656063 | 0.516837 | 0.704902 | 0.825855 | 0.541505 | 0.594981 | 0.549448 | 0.614988 | ... | 0.574524 | 0.771731 | 0.686667 | 0.569860 | 0.671984 | 0.530241 | 0.600327 | 0.487681 | 0.559294 | 0.745607 |
Los Angeles Rams | 0.567741 | 0.776423 | 0.660406 | 0.521657 | 0.708902 | 0.828614 | 0.546294 | 0.599625 | 0.554223 | 0.619550 | ... | 0.579237 | 0.775115 | 0.690807 | 0.574586 | 0.676226 | 0.535048 | 0.604950 | 0.492506 | 0.564048 | 0.749252 |
Miami Dolphins | 0.310988 | 0.544084 | 0.400581 | 0.272601 | 0.455595 | 0.624264 | 0.292673 | 0.339787 | 0.299349 | 0.358815 | ... | 0.321147 | 0.542217 | 0.434318 | 0.317007 | 0.417835 | 0.283386 | 0.344791 | 0.250091 | 0.307775 | 0.506619 |
Minnesota Vikings | 0.464581 | 0.696436 | 0.562311 | 0.418758 | 0.616685 | 0.761565 | 0.443036 | 0.497337 | 0.450956 | 0.518262 | ... | 0.476289 | 0.694844 | 0.596123 | 0.471539 | 0.579793 | 0.431892 | 0.502894 | 0.390661 | 0.460843 | 0.663754 |
New England Patriots | 0.485407 | 0.713800 | 0.582751 | 0.439215 | 0.636228 | 0.776400 | 0.463734 | 0.518211 | 0.471710 | 0.539072 | ... | 0.497153 | 0.712261 | 0.616062 | 0.492391 | 0.599997 | 0.452492 | 0.523758 | 0.410716 | 0.481653 | 0.682135 |
New Orleans Saints | 0.488252 | 0.716121 | 0.585517 | 0.442022 | 0.638859 | 0.778370 | 0.466567 | 0.521053 | 0.474549 | 0.541900 | ... | 0.500000 | 0.714589 | 0.618751 | 0.495237 | 0.602727 | 0.455315 | 0.526598 | 0.413474 | 0.484496 | 0.684598 |
New York Giants | 0.275923 | 0.501881 | 0.360703 | 0.240354 | 0.414023 | 0.583806 | 0.258897 | 0.302902 | 0.265091 | 0.320869 | ... | 0.285411 | 0.500000 | 0.393285 | 0.281541 | 0.377320 | 0.250303 | 0.307617 | 0.219703 | 0.272929 | 0.464363 |
New York Jets | 0.370225 | 0.608509 | 0.465358 | 0.328007 | 0.521528 | 0.683941 | 0.350195 | 0.401315 | 0.357520 | 0.421588 | ... | 0.381249 | 0.606715 | 0.500000 | 0.376764 | 0.483153 | 0.339960 | 0.406667 | 0.302827 | 0.366726 | 0.572176 |
Philadelphia Eagles | 0.493014 | 0.719978 | 0.590133 | 0.446726 | 0.643244 | 0.781640 | 0.471312 | 0.525806 | 0.479302 | 0.546626 | ... | 0.504763 | 0.718459 | 0.623236 | 0.500000 | 0.607280 | 0.460044 | 0.531345 | 0.418102 | 0.489256 | 0.688698 |
Pittsburgh Steelers | 0.386076 | 0.624446 | 0.482164 | 0.343035 | 0.538319 | 0.698329 | 0.365685 | 0.417614 | 0.373149 | 0.438108 | ... | 0.397273 | 0.622680 | 0.516847 | 0.392720 | 0.500000 | 0.355246 | 0.423030 | 0.317245 | 0.382518 | 0.588592 |
San Francisco 49ers | 0.533007 | 0.751106 | 0.628242 | 0.486568 | 0.679100 | 0.807744 | 0.511320 | 0.565492 | 0.519322 | 0.585942 | ... | 0.544685 | 0.749697 | 0.660040 | 0.539956 | 0.644754 | 0.500000 | 0.570946 | 0.457503 | 0.529263 | 0.721960 |
Seattle Seahawks | 0.461702 | 0.693983 | 0.559460 | 0.415943 | 0.613945 | 0.759456 | 0.440182 | 0.494443 | 0.448091 | 0.515371 | ... | 0.473402 | 0.692383 | 0.593333 | 0.468655 | 0.576970 | 0.429054 | 0.500000 | 0.387908 | 0.457968 | 0.661166 |
Tampa Bay Buccaneers | 0.575083 | 0.781584 | 0.667097 | 0.529132 | 0.715050 | 0.832830 | 0.553714 | 0.606800 | 0.561617 | 0.626590 | ... | 0.586526 | 0.780297 | 0.697173 | 0.581898 | 0.682755 | 0.542497 | 0.612092 | 0.500000 | 0.571405 | 0.754842 |
Tennessee Titans | 0.503759 | 0.728561 | 0.600488 | 0.457372 | 0.653045 | 0.788887 | 0.482033 | 0.536509 | 0.490037 | 0.557255 | ... | 0.515504 | 0.727071 | 0.633274 | 0.510744 | 0.617482 | 0.470737 | 0.542032 | 0.428595 | 0.500000 | 0.697837 |
Washington Football Team | 0.305342 | 0.537508 | 0.394240 | 0.267382 | 0.449035 | 0.618033 | 0.287221 | 0.333872 | 0.293825 | 0.352746 | ... | 0.315402 | 0.535637 | 0.427824 | 0.311302 | 0.411408 | 0.278040 | 0.338834 | 0.245158 | 0.302163 | 0.500000 |
32 rows × 32 columns
With this new data frame, we can plot each team's probability of winning against every other team. The plot shows which opponents yield the highest win probability (the teams that give away the most free wins) and which yield the lowest (the teams it would take an upset to beat).
from matplotlib.pyplot import figure
figure(figsize=(32, 16), dpi=80)
for i in range(len(df.Team.unique())):
    plt.plot(df.Team.unique(), new_df.iloc[i], label=df.Team[i])
plt.title('Winning Percentage Against Team')
plt.xlabel('Team')
plt.ylabel('Winning Percent')
plt.legend()
Based on this graph, we can see that the Falcons, the Panthers, the Bears, the Lions, the Texans, the Jaguars, the Giants, the Jets, and the Washington Football Team are the easiest to beat. We can also see that the Cardinals, the Bills, the Patriots, the Cowboys, the Packers, the Rams, the Chiefs, the Chargers, and the Buccaneers are the hardest teams to beat.
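The same reading can be checked numerically instead of by eye. The sketch below is a minimal illustration, not the notebook's own code: it copies three teams' entries from the table above and averages, for each opponent, the win probability the other teams have against it. The function name beatability is hypothetical.

```python
# Win probabilities copied from three rows of the table above:
# win_prob[team][opponent] is the row team's chance of beating the column team.
win_prob = {
    "Baltimore Ravens": {"Buffalo Bills": 0.359296, "Chicago Bears": 0.713150},
    "Buffalo Bills": {"Baltimore Ravens": 0.640704, "Chicago Bears": 0.815951},
    "Chicago Bears": {"Baltimore Ravens": 0.286850, "Buffalo Bills": 0.184049},
}


def beatability(win_prob):
    """Average probability that the other teams beat each opponent."""
    by_opponent = {}
    for team, row in win_prob.items():
        for opponent, p in row.items():
            by_opponent.setdefault(opponent, []).append(p)
    return {opp: sum(ps) / len(ps) for opp, ps in by_opponent.items()}


scores = beatability(win_prob)
# Highest average first: the easiest opponent to beat leads the list.
ranking = sorted(scores, key=scores.get, reverse=True)
for opp in ranking:
    print(f"{opp}: {scores[opp]:.3f}")
```

With the full 32-team data frame, the equivalent would be a column-wise average that leaves out the 0.5 diagonal.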
Now it's time to do some machine learning! First we split our data frame into training and test data: 75% of our data goes to training and 25% to test. We also add a Value column to the training features that holds the actual Y values, since the formula interface needs the target in the same data frame as the predictors.
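# Features are every column except the target (Value) and the team name
X = df.drop(columns=['Value', 'Team'])
Y = df['Value']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.25, random_state=0)
# Attach the target so the formula API can find it; assign returns a new frame
X_train = X_train.assign(Value=Y_train)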
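To make the split sizes concrete, here is a minimal standard-library sketch of a 75/25 holdout over 32 teams. It is not scikit-learn's implementation and will not reproduce the same shuffle as random_state=0; the helper name split_75_25 and the placeholder team names are assumptions for illustration.

```python
import math
import random


def split_75_25(items, test_size=0.25, seed=0):
    """Shuffle items and hold out ceil(test_size * n) of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    # Everything after the held-out slice is training data.
    return shuffled[n_test:], shuffled[:n_test]


teams = [f"team_{i}" for i in range(32)]  # placeholder names, not real data
train, test = split_75_25(teams)
print(len(train), len(test))
```

With 32 teams this leaves 24 for training and 8 for testing, which matches the 24 observations reported in the regression summary.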
We pass the magic formula we created earlier to the ols function from the statsmodels.formula.api library, and fit the least squares regression model on our training data.
from statsmodels.formula.api import ols
formula_str = "Value ~ ((RushPercent * RushEfficiency) + (PassPercent * PassEfficiency)) - ((RushPercent * DRushEfficiency) - (PassPercent * DPassEfficiency)) - (DefenseRank * (OffensiveTO% - DefensiveTO%))"
mod = ols(formula=formula_str, data=X_train).fit()
warnings.filterwarnings('ignore')
mod.summary()
Dep. Variable: | Value | R-squared: | 0.888 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.857 |
Method: | Least Squares | F-statistic: | 28.56 |
Date: | Mon, 20 Dec 2021 | Prob (F-statistic): | 5.80e-08 |
Time: | 13:13:08 | Log-Likelihood: | -54.120 |
No. Observations: | 24 | AIC: | 120.2 |
Df Residuals: | 18 | BIC: | 127.3 |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -3.7730 | 21.186 | -0.178 | 0.861 | -48.284 | 40.738 |
RushEfficiency | -1.2920 | 0.963 | -1.342 | 0.196 | -3.314 | 0.730 |
RushPercent:RushEfficiency | 4.0381 | 2.221 | 1.818 | 0.086 | -0.629 | 8.705 |
PassPercent | 7.7221 | 36.425 | 0.212 | 0.834 | -68.803 | 84.247 |
PassEfficiency | -1.3890 | 1.249 | -1.112 | 0.281 | -4.014 | 1.236 |
PassPercent:PassEfficiency | 3.6349 | 2.177 | 1.670 | 0.112 | -0.939 | 8.208 |
Omnibus: | 3.125 | Durbin-Watson: | 1.351 |
---|---|---|---|
Prob(Omnibus): | 0.210 | Jarque-Bera (JB): | 2.494 |
Skew: | 0.778 | Prob(JB): | 0.287 |
Kurtosis: | 2.725 | Cond. No. | 1.99e+03 |
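The R^2 value of 0.888 means that 88.8% of the variation in Value across the 24 training teams is explained by the fitted terms. Note that the formula interface treats * and - as term operators rather than arithmetic: a * b expands to a + b + a:b, and - removes terms, which is why the summary lists only five predictors and none of the defensive or turnover columns. If we predict our test values, how do they come out?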
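As a quick check on how the summary's numbers relate, the sketch below computes R^2 from residuals and derives adjusted R^2 from the values reported above (R^2 = 0.888, 24 observations, 5 predictors). The function names are ours, not part of statsmodels.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R^2 for the number of predictors relative to sample size."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values taken from the OLS summary above.
adj = adjusted_r_squared(0.888, n_obs=24, n_predictors=5)
print(round(adj, 3))  # 0.857, matching the Adj. R-squared in the summary
```

The gap between 0.888 and 0.857 is the price of fitting five predictors to only 24 teams.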
predicted_model = mod.predict(X_test)
Now that we have our predictions, let's see how they fare against the actual values.
warnings.filterwarnings('ignore')
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted')
# distplot is deprecated in recent seaborn; kdeplot draws the same density curves
sns.kdeplot(x=Y_test, label="Actual", ax=ax)
sns.kdeplot(x=predicted_model, label="Linear Regression Predictions", ax=ax)
ax.legend()
It looks like our predictions overshoot a bit, but not by much. With only 8 teams in the test set this is a rough check, but it suggests the OLS model is a reasonable predictor of score values.
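The overshoot seen in the density plot can also be put into numbers. The sketch below is one way to do that with the standard library: the mean error shows the direction of the bias, while MAE and RMSE show its size. The sample values are made up for illustration and are not the notebook's results; error_summary is a hypothetical helper name.

```python
import math


def error_summary(actual, predicted):
    """Return mean error (bias), mean absolute error, and RMSE."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,  # positive means overshooting
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }


# Made-up numbers purely for illustration, not results from the model.
actual = [20.0, 15.0, 25.0, 10.0]
predicted = [22.0, 16.0, 24.0, 13.0]
summary = error_summary(actual, predicted)
print(summary)
```

In the notebook, the same function could be called with Y_test and predicted_model in place of the two lists.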
Next, we simulate games! This function takes two team names, looks up each team's value from our magic formula, and prints the predicted winner and score.
def simulate_game(name1, name2):
    print("It's the " + str(name1) + " vs the " + str(name2) + "!")
    # Look up each team's value; the higher value wins
    val1 = df[df["Team"] == name1]["Value"].item()
    val2 = df[df["Team"] == name2]["Value"].item()
    if val1 > val2:
        print("The " + str(name1) + " will beat the " + str(name2) + "!")
        print("The score is " + str(round(val1)) + " " + str(round(val2)))
    elif val2 > val1:
        print("The " + str(name2) + " will beat the " + str(name1) + "!")
        print("The score is " + str(round(val2)) + " " + str(round(val1)))
    else:
        print("The " + str(name1) + " and the " + str(name2) + " will tie!")
        print("The score is " + str(round(val1)) + " " + str(round(val2)))
    print('\n')
Now we read in the matchups from a CSV file and drop the columns we don't need. This feeds every game from the current week into the simulation, so we can check which of the games already played were predicted correctly.
weekmatchups = pd.read_csv('week15matchups.csv')
weekmatchups = weekmatchups.drop(columns=["AwayTD", "HomeTD"])
for idx, matchups in weekmatchups.iterrows():
    # After the drop, the first two columns hold the two team names
    simulate_game(matchups.iloc[0], matchups.iloc[1])
It's the Kansas City Chiefs vs the Los Angeles Chargers!
The Los Angeles Chargers will beat the Kansas City Chiefs!
The score is 28 19

It's the Las Vegas Raiders vs the Cleveland Browns!
The Cleveland Browns will beat the Las Vegas Raiders!
The score is 19 13

It's the New England Patriots vs the Indianapolis Colts!
The Indianapolis Colts will beat the New England Patriots!
The score is 23 20

It's the Dallas Cowboys vs the New York Giants!
The Dallas Cowboys will beat the New York Giants!
The score is 23 8

It's the Arizona Cardinals vs the Detroit Lions!
The Arizona Cardinals will beat the Detroit Lions!
The score is 22 10

It's the Washington Football Team vs the Philadelphia Eagles!
The Philadelphia Eagles will beat the Washington Football Team!
The score is 21 10

It's the Carolina Panthers vs the Buffalo Bills!
The Buffalo Bills will beat the Carolina Panthers!
The score is 26 12

It's the Tennessee Titans vs the Pittsburgh Steelers!
The Tennessee Titans will beat the Pittsburgh Steelers!
The score is 22 14

It's the Houston Texans vs the Jacksonville Jaguars!
The Houston Texans will beat the Jacksonville Jaguars!
The score is 13 8

It's the New York Jets vs the Miami Dolphins!
The New York Jets will beat the Miami Dolphins!
The score is 13 10

It's the Green Bay Packers vs the Baltimore Ravens!
The Green Bay Packers will beat the Baltimore Ravens!
The score is 20 15

It's the Atlanta Falcons vs the San Francisco 49ers!
The San Francisco 49ers will beat the Atlanta Falcons!
The score is 25 8

It's the Cincinnati Bengals vs the Denver Broncos!
The Cincinnati Bengals will beat the Denver Broncos!
The score is 24 17

It's the Seattle Seahawks vs the Los Angeles Rams!
The Los Angeles Rams will beat the Seattle Seahawks!
The score is 28 19

It's the New Orleans Saints vs the Tampa Bay Buccaneers!
The Tampa Bay Buccaneers will beat the New Orleans Saints!
The score is 29 21

It's the Minnesota Vikings vs the Chicago Bears!
The Minnesota Vikings will beat the Chicago Bears!
The score is 19 6
The simulation correctly predicted wins for the Colts, Cowboys, Bills, Texans, Packers, 49ers, and Bengals: 7 of the 12 games played so far. However, it incorrectly predicted wins for the Chargers, Cardinals (I'm sure NOBODY saw that coming!), Titans, Jets, and Buccaneers. You win some and lose some in ML, so this model needs some random variability to account for upsets. We also need to take more factors into account, like injuries. That being said, no model can be exact, so this is a good place to start!
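One simple way to add that variability is to perturb both teams' values with random noise and count how often the underdog comes out ahead. The sketch below is only an illustration of the idea, not part of our model: the function name upset_probability is hypothetical, the noise standard deviation of 7 points is an arbitrary assumption, and the inputs 28 and 19 are the rounded Chargers and Chiefs values from the output above.

```python
import random


def upset_probability(val_fav, val_dog, noise_sd=7.0, trials=10_000, seed=0):
    """Estimate how often the lower-valued team outscores the favorite
    when independent Gaussian noise is added to both values."""
    rng = random.Random(seed)
    upsets = 0
    for _ in range(trials):
        fav = val_fav + rng.gauss(0, noise_sd)
        dog = val_dog + rng.gauss(0, noise_sd)
        if dog > fav:
            upsets += 1
    return upsets / trials


# noise_sd is an assumed value; it would need to be estimated from real score spreads.
print(upset_probability(28, 19))
```

A larger noise_sd makes upsets more frequent, so that parameter would have to be fitted to historical results before the probabilities mean anything.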
At the conclusion of this tutorial, we believe our code is well written, well documented, and reproducible, and that it helps the reader understand the tutorial. We hope the data we gathered, the methodology we used, the visualizations we made, and the machine learning model we produced give readers an idea of how future games can be predicted. While there is a lot of data being crunched, we hope our tutorial was beneficial to you.
While we are able to predict many games ahead, our machine learning model makes some "controversial calls." Upsets are part of what makes the NFL fun and interesting: on any given week, a last-place team can beat one of the best teams in the league. In further exploration, it would be interesting to gather more data from a wider range of sources. We believe our machine learning model is a great start, but nothing is perfect the first time. We hope to keep improving the model after our CMSC320 class, and possibly make some money (legally, of course)!