Recently, I’ve been looking at datasets to help study up on Time-Series Analysis. Time-Series was something that was taught late in the Data Science Bootcamp I attended earlier this year, and since we were knee deep in our own capstones, it wasn’t something I spent much time concentrating on at the time. Last month, I did a simple ARIMA model to forecast scores during the Drum Corps International Championship Weekend that seemed to do pretty well. I was rather proud of that little achievement.
I wanted to find a dataset that would fulfill my goal of going from data to model quick and dirty without much EDA and visualization of the dataset itself. There is all sorts of analysis that can be done with time-series, but I wanted to see what kind of model would work best with data that was readily available and fairly clean.
One of the data sets I found on Kaggle had to do with Avocado Prices in the United States originally obtained from the Hass Avocado Board. I mean, who doesn’t like avocados? They seem to go great with just about everything…as a spread on your sandwich, as a dip with tortilla chips, or just something you want to eat straight up on a nice summer day. As for me, I usually like a good dip every now and again; just don’t ask me to make the dip itself. I’m not all that good at it as simple as it really is to do.
I went ahead and stayed with the Kaggle dataset which has weekly average avocado prices for both conventional and organic avocados from January 4, 2015 through March 25, 2018. As of this writing, the Hass Avocado Board has prices through July 15, 2018. People are more than welcome to either obtain and clean the data themselves or append the data from April 1, 2018 to July 15, 2018 (and beyond).
The Avocado Price Dataset has the following data:
- Date – Weekly observation dates from 2015-2018
- Average Price – Average price of a single avocado
- Total Volume – Total # of Avocados sold during that week
- 4046/4225/4770 – Total volume of avocados sold by 3 different PLUs
- Total Bags – Total bags sold
- Small/Large/XLarge Bags – Total bags sold by size
- Type – Organic or Conventional
- Year – Year of observation
- Region – City or Region of the observation
Incidentally, if you’d like to go straight to my code, you can find it on Github.
The imports for the time-series analysis I wanted are rather simple:
import pandas as pd import numpy as np from statsmodels.tsa.arima_model import ARIMA import statsmodels.api as sm from sklearn.metrics import mean_squared_error
From here we can import our dataset (setting the index_col = 0 lets Pandas know which column to use as the row labels which in this case is Date or the first column in the original csv file), convert the Date column to datetime using Pandas, and then set the index of our DataFrame by date. The last line of code below is used to sort the index from oldest date to newest date. The original dataset provided on Kaggle did not have dates arranged this way. They started from the end of 2015 and went backward for all of 2015, then 2016, and so on. This was something I didn’t discover until I was trying to model later in my Jupyter Notebook. It’s best to add it here:
df = pd.read_csv('./data/avocado.csv', index_col=0) df['Date'] = pd.to_datetime(df['Date']) df.set_index('Date', inplace=True) df= df.sort_index()
As mentioned above there are two types of avocados in the dataset as well as several different regions represented. This allows you to do all sorts of analysis for different areas of the United States, specific cities, or just the overall United States on either type of avocado. I chose to do my analysis on conventional avocados with average prices for the entire United States. Below I show how I did that as well as drop any columns really don’t really need for the time-series analysis to give us a clean DataFrame:
conventional = df[(df.region == 'TotalUS')&(df.type == 'conventional')] conventional = conventional.drop(columns = ['Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'year', 'region'])
Plotting the average prices by week from 2015 to 2018 yields the following:
As you can see from the plot, avocado prices were relatively flat for 2015 and much of 2016. It wasn’t until the middle of 2016 that there were “large” swings in prices that were duplicated again in 2017. This should prove to be an excellent study for our time-series.
To get ready to do an ARIMA model, I found some code from a blog post by Jason Brownlee. He has some of the more interesting blog posts on Data Science around, and I would highly encourage anyone to follow him and read his posts. The following code comes directly from How to Grid Search ARIMA Model Hyperparameters with Python:
# evaluate an ARIMA model for a given order (p,d,q) def evaluate_arima_model(X, arima_order): # prepare training dataset train_size = int(len(X) * 0.66) train, test = X[0:train_size], X[train_size:] history = [x for x in train] # make predictions predictions = list() for t in range(len(test)): model = ARIMA(history, order=arima_order) model_fit = model.fit(disp=0) yhat = model_fit.forecast()[0] predictions.append(yhat) history.append(test[t]) # calculate out of sample error error = mean_squared_error(test, predictions) return error # evaluate combinations of p, d and q values for an ARIMA model def evaluate_models(dataset, p_values, d_values, q_values): dataset = dataset.astype('float32') best_score, best_cfg = float("inf"), None for p in p_values: for d in d_values: for q in q_values: order = (p,d,q) try: mse = evaluate_arima_model(dataset, order) if mse < best_score: best_score, best_cfg = mse, order print('ARIMA%s MSE=%.3f' % (order,mse)) except: continue print('Best ARIMA%s MSE=%.3f' % (best_cfg, best_score)) # evaluate parameters p_values = range(0, 4) d_values = range(0, 4) q_values = range(0, 4) warnings.filterwarnings("ignore") evaluate_models(conventional.values, p_values, d_values, q_values)
You can read more about the above code at the link provided, but it essentially iterates over several different ARIMA models using 66% of your data as a train set, and the other 34% as a test set. It finds the model with the best p, d, q-values that yield the lowest MSE. I’ve also seen code before that iterates ARIMA models trying to find the lowest Akaike Information Criterion (AIC) metric.
Using these functions inputting the range of values shown in the code above, the best model has a best ARIMA order of (1, 0, 0) with an MSE of 0.007. We can then pass that into a conventional ARIMA model with our entire dataset to compare the model’s predictions with the actual values. The new DataFrame created below, predicted_df, now has the Date, the actual values, and the predicted values of the model.
# instantiate the ARIMA model model = ARIMA(conventional['AveragePrice'], order = (1, 0, 0)) # fit the model results_ARIMA = model.fit() # collect the predicted results, rounding to two to indicate dollars and cents predictions = round(results_ARIMA.predict(), 2) # put the predictions into a DataFrame with Date and Predicted Price columns preds = pd.DataFrame(list(zip(list(predictions.index),list(predictions))),columns=['Date', 'PredictedPrice']).set_index('Date') # combine the original data set with the predicted data predicted_df = pd.merge(conventional[1:], preds, left_index=True, right_index=True)
From here we can now score the model…
print("\tMean Squared Error:", mean_squared_error(predicted_df['AveragePrice'], predicted_df['PredictedPrice'])) print("\tRoot Mean Squared Error:", np.sqrt(mean_squared_error(predicted_df['AveragePrice'], predicted_df['PredictedPrice'])))
…and we get an MSE of 0.004 and respective RMSE of 0.066 which would indicate that model matches fairly well with the actual values.
ARIMA has a .plot_predict() method that will automatically make a plot that shows the actual values, the forecast price going out as far as you want it to go as well as the confidence level. That bit of code is easy and is the following using the start data from our original data and forecasting out to the end of 2018:
results_ARIMA.plot_predict(start='2015-01-11', end = '2018-12-30')
ARIMA also has a .forecast() method that can be used to get future predictions. However, it takes a little bit of extra code to get them into a DataFrame that I would like to use:
# grab the forecast from the model out 40 steps to 2018-12-30, and create a Series out of the data ARIMA_forecast = pd.Series(results_ARIMA.forecast(steps = 40)[0]) # create an index from the end of the data out to the length of the forecast on a weekly basis idx = pd.date_range('2018-04-01', '2018-12-30', freq='W') # create a DataFrame combining the index above and the forecast prices ARIMA_forecast = pd.DataFrame(list(zip(list(idx),list(ARIMA_forecast))),columns=['Date','ForecastPrice']).set_index('Date')
From here, I can create a my own plot similar to what can be generated using the .plot_predict() method. Doing that, we notice something odd:
The forecast price is flat and doesn’t follow any specific pattern similar to what we saw in 2016 and 2017. Is this really correct? Intuitively I would say definitely not. Here is where one would start doing tests for stationarity by doing any number of things like calculating the log of the price, moving average, or exponential weight moving average to list a few, assuming you wanted to stick with an ARIMA model.
Another model that does take seasonality into account is called SARIMAX. It has an additional argument for seasonal order.
# instantiate the model using the ARIMA order we had earlier mod = sm.tsa.statespace.SARIMAX(conventional['AveragePrice'], order=(1, 0, 0), seasonal_order=(1, 0, 0, 52), enforce_stationarity=False, enforce_invertibility=False) # fit the model SARIMAX_results = mod.fit() # store the predictions from the model rounding to two for dollars and cents SARIMAX_predictions = round(SARIMAX_results.predict(), 2) # create a DataFrame with Date and Predicted Price SARIMAX_preds = pd.DataFrame(list(zip(list(SARIMAX_predictions.index),list(SARIMAX_predictions))), columns=['Date','PredictedPrice']).set_index('Date') # merge the original DataFrame with the predictions SARIMAX_predicted_df = pd.merge(conventional[1:], SARIMAX_preds, left_index=True, right_index=True)
Once again we calculate the MSE and RMSE:
print("\tMean Squared Error:", mean_squared_error(SARIMAX_predicted_df['AveragePrice'], SARIMAX_predicted_df['PredictedPrice'])) print("\tRoot Mean Squared Error:", np.sqrt(mean_squared_error(SARIMAX_predicted_df['AveragePrice'], SARIMAX_predicted_df['PredictedPrice'])))
We get an MSE of 0.006 and RMSE of 0.078.
The SARIMAX .forecast() method works a little differently in that it gives you the date index as well as the forecasted values so creating a nice DataFrame takes just one line of code:
SARIMAX_forecast = pd.DataFrame(round(SARIMAX_results.forecast(steps = 40), 2), columns = ['Forecasted Price'])
Unfortunately, SARIMAX doesn’t have a .plot_predict() method, but now that we have our DataFrames we can do it ourselves and get the following:
Here we get a plot that makes more sense intuitively. The forecasted price matches the rise and fall of avocado prices seen in 2016 and 2017. The fact that the range of the min and max of that swing I feel can be interpreted by 2015 being so flat weighing down what the predicted forecast would be. If there had been a swing in prices in 2015 as seen in 2016 and 2017, then we would more than likely get a larger swing in prices for the forecast through the end of 2018. I would like to think that the one pop there around January 2016 is the model trying to match what happened later in 2016 and 2017.
Obviously I realize that this is a very simplistic approach to this particular problem. We didn’t do all the normal stationarity checks that one would do with time series analysis as well as plotting autocorrelation and partial autocorrelation graphs. There is so much more you can do with this dataset. However, my goal here was simply to take the data given and find a quick and dirty model that intuitively made some sense. The SARIMAX model helped me do that.
I would encourage anybody reading this to try different variations on this data:
- Organic vs. conventional – Create a classification model and determine if you can predict if an avocado is organic or conventional given all the other variables.
- Region – Predict region with classification model
- Try out the different regions and map out the various differences in price and/or forecasts in each one.
- As of this posting, the Hass Avocado Board has prices through 7/15/2018. One could either append the data from 4/1/2018 to further their forecast, or do a .fit with the original dataset through 3/25/2018, and run a .forecast through 7/15/2018 then do their error calculations based on that information now that we have it.
There is much that can be done with this particular dataset. As always, any feedback would be greatly appreciated! Good luck!