Forecasting Oil Futures with SARIMAX: A Simple and “Crude” Approach

Source: finance.yahoo.com

Introduction

A few weeks ago, I published a post on predicting avocado prices that, I thought, turned out pretty well using the SARIMAX algorithm in the Statsmodels library in Python. Shortly afterward, while on a trip to Houston, I was talking about that blog post with a friend of mine who works in the oil drilling industry. He suggested that I take a look at West Texas Intermediate (WTI) crude oil prices and see what kind of forecasts I could make from them. Challenge accepted!! So what follows is my rather crude (pun intended) and simplistic approach to making WTI and Brent Crude Oil Futures projections for the next six months starting today, October 29, 2018.

As you’ll see, I did like how the model came out in terms of short-term forecasts, so this may be something I revisit at the end of the year to compare the model’s forecast to the real prices in November and December. Then I can project out for another three months (or one quarter) beyond that.

One more thing I wanted to mention: do not make investment decisions solely based on the forecasts presented in this blog/model. There is much more that goes into the price movement of both of these commodities. Just like any other investment, somebody sneezes on the other side of the world, and the price will go up or down. I would highly recommend doing your own research before making ANY investment decisions.

Data

The data for this model was downloaded from Investing.com and represents daily prices off the New York Mercantile Exchange.  This data had the following information:

  1. Date – Daily based on Business Days
  2. Price – Daily Closing Price
  3. Open – Daily Opening Price
  4. High – Intraday Maximum Price
  5. Low – Intraday Minimum Price
  6. Volume – # of futures traded
  7. % Change – Percent change from previous day’s closing price

Code and Analysis

Incidentally, if you’d like to go straight to my code, you can find it on GitHub. I’m only showing code for the WTI forecast here; the Brent Crude code is very similar. The imports for the time-series analysis are rather simple:

# Importing libraries used for the analysis
import pandas as pd
import numpy as np

# Plotly imports
import plotly.offline as pyo
import plotly.graph_objs as go

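# Using latest version of statsmodels 0.9.0 (otherwise get errors during SARIMAX fit)
# Note: statsmodels.tsa.arima_model was removed in statsmodels 0.13; on newer versions,
# ARIMA is imported from statsmodels.tsa.arima.model instead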
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.tsa.api as smt
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot as plt
%matplotlib inline

# Not recommended, but using these to keep a clean notebook. Warnings will most likely occur during SARIMAX
# fit due to frequency errors (i.e. data is not sequential by day since futures are not traded in the United States
# on weekends and holidays)
import warnings
warnings.filterwarnings("ignore")
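
The frequency warnings mentioned in the comments above usually come from a date index that has no regular frequency attached. As an alternative to silencing them (this is not what the rest of this post does), one option is to put the series on an explicit business-day grid. The sketch below uses made-up prices and a hypothetical `toy` dataframe, purely for illustration:

```python
import pandas as pd

# Made-up prices around Christmas 2018; December 25 is missing, as it would be for a market holiday
dates = pd.to_datetime(["2018-12-21", "2018-12-24", "2018-12-26", "2018-12-27"])
toy = pd.DataFrame({"Price": [50.0, 51.0, 52.0, 53.0]}, index=dates)
print(toy.index.freq)  # None

# Reindex onto a business-day grid; the holiday shows up as a NaN row
regular = toy.asfreq("B")
print(regular.index.freqstr)                # B
print(int(regular["Price"].isna().sum()))   # 1

# One possible way to fill the gap: carry the last close forward
filled = regular.ffill()
print(filled.loc["2018-12-25", "Price"])    # 51.0
```

Whether to fill the holiday gaps or leave them empty is a judgment call. Forward-filling changes the data slightly, so treat this as one possible approach rather than a recommendation.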

From here, we can load our data and clean it up for our time-series analysis. I initially wanted to take prices from January 1985 through the present. Investing.com seems to limit the number of rows you can download at once, so I needed two data sets to cover that entire period. After that, I appended the two dataframes, sorted them by date, dropped all the columns except the closing price (denoted as Price in the data set), and set the Date column as a datetime index.

# Read in historical data for visualization
# There are two files because Investing.com seems to have a limit to how many rows you can download.
history = pd.read_csv('./data/Crude Oil WTI Futures Historical Data.csv')
history2 = pd.read_csv('./data/Crude Oil WTI Futures Historical Data2.csv')

# Concatenate the two dataframes above
# (pd.concat also works on pandas 2.x, where DataFrame.append no longer exists)
historical_data = pd.concat([history2, history])

# Set Date column to datetime for time series
historical_data['Date'] = pd.to_datetime(historical_data['Date'], format = '%b %d, %Y')

# Data file is sorted from most recent date to the past so sorting by date to go the other way around
historical_data = historical_data.sort_values(by='Date', ascending=True)

# Resetting the index due to the sort_values change
historical_data = historical_data.reset_index(drop = True)

# Dropping all columns except for the Closing Price
historical_data = historical_data.drop(columns=['Open', 'High', 'Low', 'Vol.', 'Change %'])

# Setting the Date as the index
historical_data = historical_data.set_index('Date')

Plotting our dataframe, we get the following graph. As you can see, WTI oil prices have been fairly volatile over the last 20 years or so. I think many of us can remember how astronomical the price of oil seemed in the summer of 2008. In July of that year, I was visiting relatives in Los Angeles, CA, driving a rented Jeep Liberty from LA to San Diego and back. My fuel bill was not very pretty!! Of course, then the Great Recession occurred and oil prices plummeted just a year later. Oil prices never did quite get back up to those levels, although they did reach above $100 for a few years before coming back down in the last few months of 2014.

I originally wanted to run my model on the entirety of this data. However, the SARIMAX algorithm apparently can be “computationally expensive” (I really like that expression, by the way). I couldn’t run the model using the parameters I wanted without encountering memory errors. I tried on Google Colab as well; the model fit would work there, but then I would get memory errors with the forecast method. Unfortunately, when you get memory errors on Colab, the entire kernel resets, and you lose all your variables. That was not the greatest learning experience, but something to consider for the future.

I ended up cutting the data down, taking WTI oil prices from 2009, starting where prices hit their lowest point that year. However, once again, I got memory errors, and the model’s forecast didn’t make much intuitive sense. I went back and just took the data from January 2016, where the lowest price occurred before prices climbed back up to the levels they are at today. Finally, I could get a model and predictions that make good, intuitive sense. To slice the dataframe so it starts at January 1st, 2016, we run the following line of code:

# Grabbing data from January 1st, 2016
df = historical_data.loc['2016-01-01':]

As you can see both graphs are similar although the price of Brent Crude has been slightly higher than WTI. Now we are essentially ready to model.  As with the Avocado Price blog post, I used a grid search of ARIMA Model Hyperparameters from the blog post, How to Grid Search ARIMA Model Hyperparameters with Python.  This took over three hours on my desktop to run.  It runs through different p, d, q parameters for the ARIMA model based on a training set of 67% of the data.  Then it takes the rest of the data to make predictions and compare them against the real values from the testing set (33% of the data).  It stores the parameters with the lowest mean squared error (MSE), and spits those out for use.  The optimal parameters found were 0, 1, 0.  From here, we instantiate the model.

# Instantiating the model using SARIMAX, the optimal p, d, q values, and a seasonal order based on 365 days/year.
model = sm.tsa.statespace.SARIMAX(df['Price'],
    order=(0, 1, 0),
    seasonal_order=(0, 1, 0, 365),
    enforce_stationarity=True,
    enforce_invertibility=False)

# Fitting the model
SARIMAX_results = model.fit()
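
It helps to spell out what these orders imply. With no AR or MA terms, the model only differences the series (once at lag 1 and once at the seasonal lag s) and treats what is left as noise: (1 - B)(1 - B^s) y_t = e_t, where B is the backshift operator. The point forecast for the next value should therefore be y_{t-1} + y_{t-s} - y_{t-s-1}, that is, the last value plus the change observed one season earlier. Here is a small numpy sketch of that recursion, using a hypothetical helper name, made-up numbers, and a season of 4 instead of 365 to keep it readable:

```python
import numpy as np


def seasonal_random_walk_forecast(history, steps, season):
    """Iterate y_t = y_{t-1} + y_{t-s} - y_{t-s-1} forward from the end of history."""
    values = list(history)
    for _ in range(steps):
        values.append(values[-1] + values[-season] - values[-season - 1])
    return np.array(values[len(history):])


# Made-up series; needs at least season + 1 observations
toy_history = [10.0, 12.0, 11.0, 13.0, 14.0]
forecast = seasonal_random_walk_forecast(toy_history, steps=4, season=4)
print(forecast)  # [16. 15. 17. 18.]
```

The forecast simply replays the previous season’s step-to-step changes (+2, -1, +2, +1 here) starting from the most recent value. Note that the seasonal lag counts rows, not calendar days, so with business-day data a lag of 365 reaches back somewhat more than one calendar year.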

We can now get the in-sample predictions from the model and build a dataframe that puts the real values from January 2016 onward next to those predictions for comparison:

# Getting model's predictions of the in-sample data, rounding to two decimal places for price.
SARIMAX_predictions = round(SARIMAX_results.predict(), 2)

# Creating a dataframe of the date index and predictions
SARIMAX_preds = pd.DataFrame(list(zip(list(SARIMAX_predictions.index),list(SARIMAX_predictions))),
columns=['Date','PredictedPrice']).set_index('Date')

# Merging the original dataframe with predictions for comparison
SARIMAX_predicted_df = pd.merge(df[1:], SARIMAX_preds, left_index=True, right_index=True)

We can calculate the MSE and root mean squared error (RMSE) to get 1.91 and 1.34 respectively:

print("\tMean Squared Error:", mean_squared_error(SARIMAX_predicted_df['Price'], 
    SARIMAX_predicted_df['PredictedPrice']))

print("\tRoot Mean Squared Error:", np.sqrt(mean_squared_error(SARIMAX_predicted_df['Price'], 
    SARIMAX_predicted_df['PredictedPrice'])))
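
For reference, MSE is the average of the squared differences between actual and predicted prices, and RMSE is its square root, which puts the error back in dollars. A tiny worked example with made-up numbers (the `actual` and `predicted` arrays are illustrative, not values from the model):

```python
import numpy as np

# Made-up values for illustration only
actual = np.array([50.0, 52.0, 51.0])
predicted = np.array([49.0, 52.0, 53.0])

errors = actual - predicted   # 1, 0, -2
mse = np.mean(errors ** 2)    # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)

print(round(mse, 4), round(rmse, 4))  # 1.6667 1.291
```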

Now we can create our forecast from the model.  I looked at the next 120 trading days to take the forecast through April 14, 2019.

# Getting 120 business days (24 weeks, a little under 6 months) for forecasts
SARIMAX_forecast = round(SARIMAX_results.forecast(steps = 120), 2)

# Creating an index from 10/29/2018 to about six months out; freq='B' means business days, which skips weekends
# but not US holidays. Then putting it all together into a SARIMAX_forecast dataframe
idx = pd.date_range('2018-10-29', '2019-04-14', freq='B')

SARIMAX_forecast = pd.DataFrame(list(zip(list(idx),list(SARIMAX_forecast))),
columns=['Date','ForecastPrice']).set_index('Date')
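
A quick sanity check on that index can be sketched as follows. It confirms the date range holds exactly 120 business days, shows an equivalent way to build it from a count, and illustrates that `freq='B'` keeps holidays such as Christmas. The federal holiday calendar used here is only a rough stand-in and an assumption on my part; the exchange’s actual trading calendar is different.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Plain business days: weekends are skipped, holidays are not
idx = pd.date_range("2018-10-29", "2019-04-14", freq="B")
print(len(idx))  # 120

# Same index built from a start date and a count, so it always matches the forecast length
idx_by_count = pd.date_range("2018-10-29", periods=120, freq="B")
print(idx.equals(idx_by_count))  # True

# Assumption: US federal holidays as a rough stand-in for the exchange calendar
us_bday = CustomBusinessDay(calendar=USFederalHolidayCalendar())
idx_no_holidays = pd.date_range("2018-10-29", "2019-04-14", freq=us_bday)
print(pd.Timestamp("2018-12-25") in idx)              # True
print(pd.Timestamp("2018-12-25") in idx_no_holidays)  # False
```

Because `zip` stops at the shorter of its inputs, an index with fewer dates than forecast values would silently drop the tail of the forecast, so it is worth checking that the two lengths agree.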

Plotting those forecasts for both WTI and Brent Crude, we get the following:

If we zoom in to each from July 1st, 2018, we can see that, for the WTI Futures, the model forecasts prices to go down for the next 10 days or so before slowly climbing back up.

Doing the same for the Brent Crude model, we see the forecast going down for the next three weeks before coming back up.  You’ll notice that the forecast pattern is very similar for both models beyond those three weeks.

Conclusion

I do like how the models turned out using SARIMAX, which is quickly becoming one of my favorite time-series tools. I know there are several other options out there, including using a neural network to analyze time series. I’ve not yet looked into that method, but it is something that I would like to consider for the future.

Of course, it’s important that a model makes sense and produces results that help us understand the “real” world. Most of the other models I ran using data from 2009 or even from 1985 produced forecasts that were extremely bullish. By that I mean they didn’t predict any kind of drop in price; they just wanted to keep going up and up and up.

As mentioned earlier, this is a rather simplistic approach to the question of predicting oil prices. I certainly wouldn’t expect anybody to make investment decisions based solely on this forecast (I would highly recommend against it for sure!).

As always, if anybody has any suggestions or improvements, feel free to contact me and let me know.