Calculating Pythagorean Wins for NFL Teams Using Python

Around the mid-point of the 2020 NFL Season, I started seeing posts and tweets about predictions of total wins for NFL Teams this season. That was something I had always wanted to look into, and as it turns out, it’s actually rather easy to do with the right data.

In my research to find how to predict total wins, I found the Pythagorean Expectation. The Pythagorean Expectation is based on a Win Ratio originally calculated for baseball:

\[\textit{Win Ratio} = \frac{(\textit{runs scored})^2}{(\textit{runs scored})^2 + (\textit{runs allowed})^2}\]

You can then multiply the Win Ratio by the number of games played, or the number of games still to play, to get the theoretical projected wins.
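As a quick illustration of the Win Ratio, here is a minimal sketch in plain Python. The function name win_ratio and the run totals are made up for this example and do not describe any real team.

```python
def win_ratio(runs_scored, runs_allowed, exponent=2):
    """Pythagorean win ratio: scored^exp / (scored^exp + allowed^exp)."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)


# Hypothetical team: 800 runs scored, 700 runs allowed, 162-game season
ratio = win_ratio(800, 700)
projected_wins = ratio * 162
print(round(ratio, 3), round(projected_wins, 1))
```

A team that scores exactly as many runs as it allows comes out at a ratio of 0.5, which is a handy sanity check.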

For Professional Football, the exponent of 2.37 was originally used by Football Outsiders. You can then multiply by 16 to get the projected number of wins for a full season:

\[\textit{Pythagorean Wins} = \frac{(\textit{Points For})^{2.37}}{(\textit{Points For})^{2.37} + (\textit{Points Against})^{2.37}} \times 16\]

The exponent of 2.37 is, of course, a static exponent. In 2011, Football Outsiders introduced an equation (again originally derived for baseball) to calculate a more dynamic exponent. This dynamic exponent is different for each team with some teams scoring more points than others. The equation originally used by Football Outsiders to find this exponent is the following:

\[\textit{Exponent} = 1.50 \times \log_{10}\left( \frac{\textit{Points For} + \textit{Points Against}}{\textit{Games Played}}\right)\]
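To get a feel for the size of this exponent, here is a small sketch. It uses a base-10 logarithm, matching the np.log10 call in the code later in this post. The function name dynamic_exponent and the point totals are assumptions for illustration only.

```python
import math


def dynamic_exponent(points_for, points_against, games_played):
    """1.5 times the base-10 log of combined points per game."""
    points_per_game = (points_for + points_against) / games_played
    return 1.5 * math.log10(points_per_game)


# Hypothetical team: 300 points scored and 250 allowed through 11 games,
# i.e. 50 combined points per game
print(dynamic_exponent(300, 250, 11))
```

With 50 combined points per game the exponent lands at roughly 2.55, a little above the static 2.37. Games with more total scoring push it higher.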

If you calculate the exponent for each team based on the points they scored and the points they’ve given up, you can then plug that exponent back into the Pythagorean Wins Formula:

\[\textit{Pythagorean Wins} = \frac{(\textit{Points For})^{\textit{Exponent}}}{(\textit{Points For})^{\textit{Exponent}} + (\textit{Points Against})^{\textit{Exponent}}} \times \textit{Games Remaining}\]
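Putting the two formulas together, a full-season projection is the wins already banked plus the Pythagorean Wins over the remaining games. The sketch below assumes a 16-game season. The function name projected_total_wins and the inputs are hypothetical.

```python
import math


def projected_total_wins(points_for, points_against, games_played,
                         current_wins, season_games=16):
    """Current wins plus Pythagorean wins over the games still to play."""
    exponent = 1.5 * math.log10((points_for + points_against) / games_played)
    scored = points_for ** exponent
    allowed = points_against ** exponent
    ratio = scored / (scored + allowed)
    games_remaining = season_games - games_played
    return current_wins + ratio * games_remaining


# Hypothetical team: 6 wins after 11 games, 300 points for, 250 against
print(projected_total_wins(300, 250, 11, current_wins=6))
```

Note that the projection only covers the remaining games. Wins already on the board are taken as given rather than re-estimated.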

Now that we have our equations, we just need the right data to calculate the projected wins for an NFL Team in Python. I get most of my data from nflfastr. Of course, all of that code is written in R, and this blog post is for Python. Thankfully, the maintainers of the nflfastr package regularly update the data from their package and post it on GitHub, which is where we can grab that information.

Now, if you go to the GitHub data page, you can find two different types of data that the nflfastr package scrapes off the NFL website, as seen in Figure 1.

Figure 1: Screencap of nflfastr data repository

Under the “data” folder you will find all of the NFL Play-by-Play data by season. Under the “schedules” folder is a smaller set of files that shows game results along with other information like the weekday a game was played, the time, the score of the game, the stadium, etc. This is the data that we will pull from their repository to calculate Pythagorean Wins.

Figure 2: Screen Cap of Schedules Data

However, there is one small issue. For schedules, the files that are uploaded to the GitHub repository are rds files, as seen in Figure 2. These files aren’t exactly pandas friendly out of the box, but thankfully there is a package for that called pyreadr. To use that package, simply run pip install pyreadr, as shown on the linked documentation page.

Now let’s get started coding!

First off, we have our basic imports.

import pandas as pd
import numpy as np
import pyreadr

from decimal import Decimal

We will need pandas, numpy, and pyreadr. I also import Decimal to clean up numerical results that weren’t easily fixed by using round(). We will see that later in this post.

Next we need to grab the data and turn it into a dataframe that we can use for our calculations:

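import os

url = "https://github.com/guga31bb/nflfastR-data/blob/master/schedules/sched_2020.rds?raw=true"
dst_path = "./data/sched_2020.rds"
# Make sure the target folder exists before downloading into it
os.makedirs(os.path.dirname(dst_path), exist_ok=True)
dst_path_again = pyreadr.download_file(url, dst_path)
# read_r returns a dict-like object; an RDS file holds a single object under the key None
result = pyreadr.read_r(dst_path)
df = result[None]
df.head()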

If you’ve used pandas before to grab data off a website, you know that you can typically just pull hosted csv or text files directly from a url. However, pyreadr requires a file on your local machine. So we set the url and dst_path variables, then use pyreadr’s download_file function to fetch the file from the web and save it at the dst_path location. We can then call read_r, which returns a dictionary-like object, and pull our dataframe out of it with the key None (an RDS file holds a single unnamed object).

Having done that, you should see the following output for 2020:

Figure 3: Raw Dataframe from RDS file

The dataframe currently holds the entire 2020 regular season schedule for the NFL. However, we really just want to look at the games that have already been played. Games that have not been played have NA values for the home_score, away_score, and home_result columns, but we really only need to filter on one column to get a dataframe with just the games played.

To do that we use the notnull method on any one of those columns, then do a check of the resulting dataframe using tail to see the end of the dataframe.

df = df[df.home_score.notnull()]
df.tail()

Figure 4: Filtered Dataframe of Games Played
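If you want to try the filter without downloading anything, here is a minimal sketch on a toy schedule. The column names match the schedule data, but the rows and scores are made up.

```python
import numpy as np
import pandas as pd

# Toy schedule: two played games and one that has not been played yet
toy = pd.DataFrame({
    "home_team": ["PIT", "KC", "BUF"],
    "away_team": ["BAL", "DEN", "NE"],
    "home_score": [19.0, 43.0, np.nan],   # NaN stands in for R's NA
    "away_score": [14.0, 16.0, np.nan],
})

# Keep only the rows where a home score has been recorded
played = toy[toy.home_score.notnull()]
print(len(toy), len(played))
```

The unplayed game drops out, leaving only rows that can contribute points and wins.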

As of this writing, Week 13 has not yet started, so the last game played was the Baltimore-Pittsburgh game, which was played on a Wednesday due to COVID concerns.

Now we have the scores for each game in 2020, so how do we get the total points for and the total points against for each team? That will require a for-loop with some data manipulation. Let’s break that apart.

Our loop will calculate total points for and total points against for each team in the filtered dataframe, the total games played, the total games left to play, as well as the total wins so far. Let’s further break down how we’re going to do each calculation for each team.

First we will create a list of the 32 teams, and then initialize several empty lists that will be populated throughout the loop for each team.

teams = list(set(df.home_team.values))
wins = []
py_wins = []
total_points_for_list = []
total_points_against_list = []

Now we will iterate over each team in the dataframe and begin our calculations:

for team in teams:

    # .copy() so that adding the 'win' column does not raise a SettingWithCopyWarning
    data_home = df[df.home_team == team].copy()
    data_home['win'] = np.where(data_home.home_result > 0, 1, 0)
    home_count = data_home.home_team.count()
    win_home = data_home.win.sum()
    points_for_home = data_home.home_score.sum()
    points_against_home = data_home.away_score.sum()

In this first code block in the loop, we want to get all the points for, the points against, and the number of wins at home for each team. To do that we perform the following steps:

  1. Filter the dataframe to include data for each team when they play a home game
  2. Create a new column called “win” that takes information from the home_result column and gives us a “1” if the team won at home or a “0” if it did not. This can be done using np.where(), which is a very nice feature of numpy for segmenting your data. You can see from the original dataframe that a negative number in the home_result column denotes a loss. Note that a tie (a home_result of 0) also gets a “0” here.
  3. Get a count of the total number of home games played so far in the dataframe (you could use the len() function here as well).
  4. Calculate the sum of the win column to get total number of wins at home
  5. Calculate the total points for scored at home
  6. Calculate the total points against scored at home
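To see step 2 in isolation, here is a small sketch of np.where() on made-up home_result values. The extra "tie" column is my own addition for illustration and is not part of the loop in this post.

```python
import numpy as np
import pandas as pd

# Made-up home_result values: positive = home win, negative = home loss, 0 = tie
toy_home = pd.DataFrame({"home_result": [7, -3, 0, 14]})

# 1 where the home team won, 0 otherwise (so a tie is scored as 0)
toy_home["win"] = np.where(toy_home.home_result > 0, 1, 0)

# Hypothetical extra flag, in case you want to track ties separately
toy_home["tie"] = np.where(toy_home.home_result == 0, 1, 0)

print(toy_home)
print("home wins:", toy_home.win.sum())
```

Summing the win column then gives the number of home wins, exactly as in step 4.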

Since we did that for when a team plays at home, we now perform nearly the same calculations for when the team plays away from home:

    # .copy() for the same reason as the home dataframe above
    data_away = df[df.away_team == team].copy()
    data_away['win'] = np.where(data_away.home_result < 0, 1, 0)
    away_count = data_away.away_team.count()
    win_away = data_away.win.sum()
    points_for_away = data_away.away_score.sum()
    points_against_away = data_away.home_score.sum()

The only differences here are, of course, filtering on the away_team column for our new dataframe, and treating a negative home result as a win for our away team. Everything else is essentially the same.

Now we can finish out our calculations in the loop for each team and start populating the empty lists we initialized right before we started our loop.

    total_points_for = int(points_for_home + points_for_away)
    total_points_for_list.append(total_points_for)

    total_points_against = int(points_against_home + points_against_away)
    total_points_against_list.append(total_points_against)

In this code block we take the home points for and away points for to get the total points for scored by the team and populate that list. We do the same for total points against.

    total_games = home_count + away_count
    total_games_left = 16 - total_games
    total_wins = win_home + win_away
    wins.append(total_wins)

In this code block, we can calculate the total games played by each team so far. With a 16 game season, we can then calculate the total games left. Then we get the total wins from the count of home wins and away wins we calculated earlier.

Now the fun part!

    exponent = 1.5 * np.log10((total_points_for + total_points_against)/total_games)

This is the calculation for the dynamic exponent that was mentioned earlier. If you want to use the static exponent instead, you can just set exponent = 2.37, the value taken from the Wiki link above.

With the exponent, we can now calculate the projected remaining Pythagorean wins for the season and add those to the py_wins list we initialized earlier:

    pythagorean_wins = round(Decimal(total_games_left*(total_points_for**exponent)/((total_points_for**exponent) + (total_points_against**exponent))), 2)
    py_wins.append(pythagorean_wins)
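The round(Decimal(...), 2) pattern may look unusual, so here is a minimal sketch of what it does on a single value. The number is just a stand-in for a projected-wins float.

```python
from decimal import Decimal

value = 3.14159  # stand-in for a projected-wins float

# Rounding a Decimal to 2 places returns another Decimal with exactly 2 decimals
rounded = round(Decimal(value), 2)
print(rounded)        # 3.14
print(type(rounded))  # <class 'decimal.Decimal'>

# Convert back to float if you need ordinary float arithmetic later
as_float = float(rounded)
```

The result is a Decimal object rather than a float, which is why it always displays with exactly two decimal places.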

And that is it for our for loop. Here is the full code block for that loop:

teams = list(set(df.home_team.values))
wins = []
py_wins = []
total_points_for_list = []
total_points_against_list = []

for team in teams:

    data_home = df[df.home_team == team].copy()
    data_home['win'] = np.where(data_home.home_result > 0, 1, 0)
    home_count = data_home.home_team.count()
    win_home = data_home.win.sum()
    points_for_home = data_home.home_score.sum()
    points_against_home = data_home.away_score.sum()

    data_away = df[df.away_team == team].copy()
    data_away['win'] = np.where(data_away.home_result < 0, 1, 0)
    away_count = data_away.away_team.count()
    win_away = data_away.win.sum()
    points_for_away = data_away.away_score.sum()
    points_against_away = data_away.home_score.sum()

    total_points_for = int(points_for_home + points_for_away)
    total_points_for_list.append(total_points_for)
    
    total_points_against = int(points_against_home + points_against_away)
    total_points_against_list.append(total_points_against)
    
    total_games = home_count + away_count
    total_games_left = 16 - total_games
    total_wins = win_home + win_away
    wins.append(total_wins)
    
    # Dynamic Exponent from Football Outsiders.  Static exponent would be 2.37 from Wiki
    # https://www.footballoutsiders.com/dvoa-ratings/2011/week-13-dvoa-ratings
    exponent = 1.5 * np.log10((total_points_for + total_points_against)/total_games)

    pythagorean_wins = round(Decimal(total_games_left*(total_points_for**exponent)/((total_points_for**exponent) + (total_points_against**exponent))), 2)
    py_wins.append(pythagorean_wins)
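As an aside, the per-team totals from the loop can also be computed without looping, by stacking the home and away views of each game and grouping by team. This is only a sketch of an alternative, not the approach used in this post. The game results below are made up, and the names per_team and summary are hypothetical.

```python
import pandas as pd

# Made-up results for three teams, using the column names from the schedule data
games = pd.DataFrame({
    "home_team": ["PIT", "BAL", "KC"],
    "away_team": ["BAL", "KC", "PIT"],
    "home_score": [19, 20, 35],
    "away_score": [14, 34, 9],
})
games["home_result"] = games.home_score - games.away_score

# One row per team per game, seen from that team's point of view
home = pd.DataFrame({
    "team": games.home_team,
    "points_for": games.home_score,
    "points_against": games.away_score,
    "win": (games.home_result > 0).astype(int),
})
away = pd.DataFrame({
    "team": games.away_team,
    "points_for": games.away_score,
    "points_against": games.home_score,
    "win": (games.home_result < 0).astype(int),
})
per_team = pd.concat([home, away], ignore_index=True)

# Totals per team: games played, wins, points for, points against
summary = per_team.groupby("team").agg(
    games_played=("win", "size"),
    wins=("win", "sum"),
    points_for=("points_for", "sum"),
    points_against=("points_against", "sum"),
)
print(summary)
```

From there the exponent and projected wins can be computed as new columns on summary instead of appending to lists.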

Now we just want to create a dataframe with our newly populated lists to eventually make a nice styled dataframe.

projected_wins = pd.DataFrame(
    list(zip(teams, total_points_for_list, total_points_against_list, py_wins, wins)),
    columns=['Team', 'Points For', 'Points Against', 'Projected_Wins', 'Current_Wins'],
)
projected_wins['Total_Projected_Wins'] = projected_wins.Projected_Wins + projected_wins.Current_Wins

projected_wins.head()

First we create the dataframe from our populated lists with appropriate column names. Then we create a new column with the total projected wins for the season and check our new dataframe.

Figure 5: New Dataframe with Projected Wins

Now we have each team listed with Points For, Points Against, Projected Wins left for the rest of the season, the Current Wins, and the Total Projected Wins for the season.

That is pretty much it except that I like to make my dataframes “pretty” using the style method which allows us to add a little color to our dataframes.

First let’s create two separate dataframes for the NFC and AFC.

afc = ['PIT', 'KC', 'BAL', 'BUF', 'TEN', 'MIA', 'IND', 'LV', 'CLE', 'NE', 'LAC', 'CIN', 'DEN', 'HOU', 'JAX', 'NYJ']
nfc = ['NO', 'GB', 'SEA', 'LA', 'ARI', 'TB', 'CHI', 'SF', 'MIN', 'PHI', 'DET', 'CAR', 'ATL', 'WAS', 'NYG', 'DAL']
afc_wins = projected_wins[projected_wins.Team.isin(afc)].sort_values('Total_Projected_Wins', ascending = False).reset_index(drop = True)
nfc_wins = projected_wins[projected_wins.Team.isin(nfc)].sort_values('Total_Projected_Wins', ascending = False).reset_index(drop = True)

There is a lot going on here:

  1. Create two different lists for AFC and NFC teams
  2. Create a separate dataframe for each conference by using isin() to keep only the teams that appear in that conference’s list
  3. Then we sort the new dataframes by Total Projected Wins in descending order, so the teams with the most projected wins are listed first.
  4. Then reset the index.

afc_wins = afc_wins[['Team', 'Total_Projected_Wins']]
afc_wins.columns = ['Team', 'Wins']

nfc_wins = nfc_wins[['Team', 'Total_Projected_Wins']]
nfc_wins.columns = ['Team', 'Wins']

In this code block, we further filter down our dataframe to just include two columns, and then we rename those columns.
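Here is the same filter, sort, reset and rename sequence on a tiny made-up dataframe, written as a single chain. The win totals are invented and rename() is just one possible way to relabel the column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Team": ["KC", "NO", "PIT", "GB"],
    "Total_Projected_Wins": [13.1, 12.4, 13.6, 11.9],  # made-up numbers
})
afc_teams = ["KC", "PIT"]

afc_toy = (
    toy[toy.Team.isin(afc_teams)]                          # keep AFC teams only
    .sort_values("Total_Projected_Wins", ascending=False)  # best record first
    .reset_index(drop=True)                                # renumber rows from 0
    .rename(columns={"Total_Projected_Wins": "Wins"})      # shorter column label
)
print(afc_toy)
```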

Now it’s time to style!

But first, we need some color, and thankfully the folks at nflfastr have also given us the hex codes for each NFL team which we can store as a dictionary:

COLORS = {'ARI':'#97233F','ATL':'#A71930','BAL':'#241773','BUF':'#00338D','CAR':'#0085CA','CHI':'#00143F',
          'CIN':'#FB4F14','CLE':'#FB4F14','DAL':'#B0B7BC','DEN':'#002244','DET':'#046EB4','GB':'#24423C',
          'HOU':'#C9243F','IND':'#003D79','JAX':'#136677','KC':'#CA2430','LA':'#002147','LAC':'#2072BA',
          'MIA':'#0091A0','MIN':'#4F2E84','NE':'#0A2342','NO':'#A08A58','NYG':'#192E6C','NYJ':'#203731',
          'LV':'#C4C9CC','PHI':'#014A53','PIT':'#FFC20E','SEA':'#7AC142','SF':'#C9243F','TB':'#D40909',
          'TEN':'#4095D1','WAS':'#FFC20F'}

We then create a function that applies these colors to each team’s row in both of the conference dataframes.

def highlight_cols(s, coldict):
    # s is one row of the dataframe (axis=1); every cell in the row gets that team's color
    color = coldict.get(s['Team'], '')
    return ['background-color: {}'.format(color) if color else '' for _ in s]

Finally we can style the AFC dataframe using that function:

(afc_wins.style
.set_caption('Projected Wins for AFC Teams')
.hide_index()  # on pandas 2.0+ use .hide(axis='index') instead
.apply(highlight_cols, coldict=COLORS, axis=1)
.set_properties(**{'color':'white'})
)

Again, lots going on here:

  1. Use the style method on our dataframe to get things started
  2. Use set_caption to set the title for our styled dataframe
  3. Hide the index for our final output
  4. Apply the function to add Team colors by row
  5. Set the color of the text in the dataframe to white so that it can contrast against the different team colors
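One caveat on step 5: white text can be hard to read on the lighter team colors in the dictionary, such as the PIT gold or the LV silver. A possible refinement, sketched below, is to pick black or white text based on how bright the background is. The function name text_color_for, the brightness weighting, and the threshold of 150 are all assumptions for illustration. None of this is part of the styling code in this post.

```python
def text_color_for(background_hex, threshold=150):
    """Return 'white' or 'black' depending on how bright the background is."""
    hex_digits = background_hex.lstrip('#')
    red, green, blue = (int(hex_digits[i:i + 2], 16) for i in (0, 2, 4))
    # Common perceived-brightness weighting; the threshold is an arbitrary choice
    brightness = 0.299 * red + 0.587 * green + 0.114 * blue
    return 'black' if brightness > threshold else 'white'


print(text_color_for('#002244'))  # dark navy used for DEN in the dictionary
print(text_color_for('#FFC20E'))  # bright gold used for PIT in the dictionary
```

The returned value could then be emitted as a 'color: ...' rule from a second row-wise styling function, in place of the blanket white set in step 5.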

We now do the same thing for the NFC dataframe:

def highlight_cols(s, coldict):
    # Same row-wise helper: s is one row, coldict maps team abbreviation to hex color
    color = coldict.get(s['Team'], '')
    return ['background-color: {}'.format(color) if color else '' for _ in s]

(nfc_wins.style
.set_caption('Projected Wins for NFC Teams')
.hide_index()  # on pandas 2.0+ use .hide(axis='index') instead
.apply(highlight_cols, coldict=COLORS, axis=1)
.set_properties(**{'color':'white'})
)

We should now have two very nice styled dataframes that we can use for blog posts, tweets, or whatever you would like.

And that’s it! In this post we covered several topics:

  1. Grabbing R-generated files from the web using pyreadr
  2. Manipulating NFL schedule data to calculate various metrics including total points for and against as well as projected remaining wins and total projected wins
  3. Styling a dataframe by team color.