Forecasting Beer Sales with ARIMA in Python

ARIMA (Autoregressive Integrated Moving Average) is a useful statistical technique for time series forecasting. Compared to machine learning approaches, ARIMA is a classical modeling technique that is particularly strong when the time series to be analyzed follows a clear pattern, for example due to seasonality. For this reason, ARIMA is still widely used in economic and scientific applications, such as forecasting weather data, order quantities, and sales figures.

In this post, you’ll learn how ARIMA works and how you can use ARIMA in Python to forecast the volume of beer sales in the US.

Cheers 🙂

The rest of this article is divided into two parts. The first part covers the theory. You will learn the basics of how ARIMA works, what stationarity means, and when it is appropriate to use ARIMA. The second part is the practical implementation of an auto-ARIMA model in Python. We load a set of sample time series data, then check if the data is stationary and train an ARIMA prediction model.

Part A) Introduction to ARIMA

Working with ARIMA requires a good understanding of the concept of stationarity as well as knowledge of transformations that can be used to make time series data stationary. The concept of stationarity is therefore first on our agenda.

The Concept of Stationarity

Stationarity is an important concept of stochastic processes that describes the nature of a time-series. A time series is considered strictly stationary if its statistical properties do not change over time. In this case, summary statistics such as the mean and the variance do not change over time. However, the time series data that we encounter in the real world is often characterized by a trend or major infrequent fluctuations, which makes them non-stationary or only weakly stationary.

So why is stationarity so important? When a time-series is stationary, we can assume that its past values are highly informative about its future development. In other words, a stationary time series shows consistent behavior that makes it predictable. A non-stationary time-series, on the other hand, is characterized by some sort of random behavior, which is difficult to capture in the modelling process. After all, if the past was characterized by random movements, there is a high probability that the future will be no different.

Fortunately, in many cases it is possible to transform a non-stationary time series into a more stationary form and, in this way, build better prediction models.
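To make this concrete, here is a minimal sketch (synthetic data, NumPy only) of how first-order differencing turns a trending, non-stationary series into one that fluctuates around a stable mean:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic trending series: linear growth plus noise -> non-stationary,
# because the mean keeps rising over time
t = np.arange(200)
series = 0.5 * t + rng.normal(0, 1.0, size=200)

# First-order differencing removes the linear trend
diffed = np.diff(series)

# The mean of the raw series shifts strongly between its two halves ...
print(series[:100].mean(), series[100:].mean())
# ... while the differenced series fluctuates around the constant trend slope
print(diffed.mean())
```

The differenced series is centered around the slope of the removed trend (0.5 here). This transformation is exactly what the d parameter of an ARIMA model automates.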

Testing for Stationarity

There are several ways to determine whether a time-series is stationary.

  • Plotting: We can plot the time-series and check visually if it shows consistent behavior or is changing over a longer period of time.
  • Summary statistics: We can split the time series into different periods and calculate summary statistics such as the variance. If these metrics are subject to major changes, this is an indication that the time series is non-stationary. However, the results will also depend on the chosen periods, which can lead to false conclusions.
  • Statistical tests: There are various tests to determine the stationarity of a time series, such as Kwiatkowski–Phillips–Schmidt–Shin (KPSS), Augmented Dickey-Fuller (ADF), or Phillips–Perron. These tests systematically check a time series and measure the results against a null hypothesis, in this way providing an indicator of the trustworthiness of the results.

What is an (S)ARIMA Model?


As the name implies, ARIMA uses autoregression (AR), integration (differencing, I), and moving averages (MA) in order to fit a linear model to a time series. The ARIMA(p, d, q) modelling function has three parameters:

d (differencing): In the case of a non-stationary time series, there is a chance to remove a trend from the data by differencing once or several times, thus bringing the data to a more stationary state. The number of differencing steps is determined by the model parameter d. However, integration is only needed if the time series is not stationary. If no differencing is applied (d = 0), the ARIMA model simplifies to an ARMA model, because it lacks the integration aspect.

p (order of the AR terms): The autoregressive process describes the dependent relationship between an observation and a number of lagged observations (lags). Predictions are then based on past data from the same time series using linear functions. A value of p = 1 means that the model uses values that lag by one period.

q (order of the MA terms): In contrast to the AR process, the MA process assumes that values at a future point in time depend on the errors made by predictions at current and past points in time. This means that it is not previous events that determine the predictions, but rather the previous estimation or prediction errors that are used in the calculation of the next value of a time series. The parameter q determines the number of lagged forecast errors in the prediction equation.


In the real world, many time series are characterized by seasonal effects. Examples include monthly retail sales figures, temperature reports, and weekly airline passenger data. To take this into account, we can specify a seasonal period (e.g. m=12 for monthly data) and additional seasonal AR or MA components for our model that deal with the seasonality. Such a model is called a SARIMA model and is written as SARIMA(p, d, q)(P, D, Q)[m].


When working with ARIMA, we can set the model parameters manually, or we can use auto-ARIMA and let it search for the optimal parameters itself. Auto-ARIMA works by conducting differencing tests to determine the order of differencing, d, and then fitting candidate models with parameters in defined ranges, e.g. from start_p to max_p as well as from start_q to max_q, comparing the candidates by an information criterion. With the seasonal option enabled, the process also tries to identify the optimal hyperparameters for the seasonal components of the model, for which we can define separate parameter ranges.

Part B) Time-Series Forecasting with ARIMA in Python

Now that we went through the theoretical context, it’s time to get our hands dirty and implement an ARIMA model in Python.


If you don’t have an environment set up yet, you can follow this tutorial to set up an Anaconda environment. In this tutorial, we will be working with the following packages: pandas, scikit-learn, numpy, matplotlib, and pmdarima.

You can install the packages using the console command:

  • pip install <package name>
  • or, if you are using the Anaconda package manager: conda install <package name>

1 Load the Data

In the following, we will work with a dataset of monthly beer sales in the United States (in million US $) between 1992 and 2018. We will start by loading the data into our project by running the code below, which downloads the data from the relataly Github repository, so there is no need to download it manually.

# Setting up packages for data manipulation and machine learning
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import matplotlib as mpl 
import pmdarima as pm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm

# Link to the dataset: 

path = ""
df = pd.read_csv(path)

The sales figures in this dataset are always taken at the first of each month.
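Because the observations always fall on the first of the month, it can be convenient to give the series an explicit month-start frequency. A small sketch with hypothetical sales values, assuming the columns are named date and beer as in the snippet above:

```python
import pandas as pd

# Hypothetical miniature of the dataset
df_demo = pd.DataFrame({
    'date': ['1992-01-01', '1992-02-01', '1992-03-01'],
    'beer': [1509, 1541, 1597],
})

# Parse the dates and use them as a month-start ('MS') frequency index
df_demo['date'] = pd.to_datetime(df_demo['date'])
series = df_demo.set_index('date')['beer'].asfreq('MS')
print(series.index.freqstr)  # MS
```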

2 Visualize the Time-Series and check it for Stationarity

Before we start modeling, we visualize our time series and test it for stationarity. This is an important step, because understanding the time-series will help us with the choice of the parameters for our ARIMA model.

To test for stationarity, we use the Augmented Dickey-Fuller (ADF) test. Usually, this test is performed again at a later point in the project, so we encapsulate the code in a reusable function.

def check_stationarity(dates, sales, title_string):
    # Visualize the data
    fig, ax = plt.subplots(figsize=(16, 8))
    plt.title(title_string, fontsize=14)
    plt.plot(dates, sales, label='monthly sales', color="black", linewidth=1.0)
    ax.fill_between(dates, 1400, sales, color='#b9e1fa')
    ma = sales.rolling(window=12).mean()
    plt.plot(dates, ma, label='12-month ma', color="red", linewidth=2.0)
    ma = sales.rolling(window=25).mean()
    plt.plot(dates, ma, label='25-month ma', color="orange", linewidth=2.0)
    sales = sales.dropna()
    # Perform an Ad Fuller Test
    # the default alpha = .05 stands for a 95% confidence interval
    adf_test = pm.arima.ADFTest(alpha = 0.05)
    # should_diff returns the p-value and whether differencing is still needed
    print(adf_test.should_diff(sales))
    plt.legend()
    plt.show()

df_sales = df['beer']
df_dates = pd.to_datetime(df['date'])
title = "Beer sales in the US between 1992 and 2018 in million US$/month"
check_stationarity(df_dates, df_sales, title)

We can see that our time-series is steadily climbing and shows clear seasonality on a yearly basis. This makes sense: beer consumption has been growing in recent decades, and people drink more beer in summer than in winter, which explains the seasonality. Not surprisingly, the data does not appear to be stationary.

3 Exemplary Differencing and Autocorrelation

We can easily see from the chart in the previous section that our time-series is non-stationary, because it shows a clear upward trend. We also know that it has seasonality, so we will need to define additional parameters and construct a SARIMA model.

Before we use autocorrelation to determine the optimal parameters, we will try manual differencing to make the time series stationary. There is no guarantee that this works, and it is important to keep in mind that differencing can sometimes also worsen prediction performance. So be careful not to overdifference. We could also trust auto-ARIMA to pick the best parameters for us, but I like to validate things and ensure that the chosen parameters make sense.

The ideal differencing parameter is the smallest number of differencing steps that achieves a stationary time-series. To check whether differencing was successful, we will inspect the results in autocorrelation plots.

We plot the autocorrelation for the original time series and after first- and second-order differencing.

# 3.1 Non-seasonal part
def auto_correlation(df, prefix, lags):
    plt.rcParams.update({'figure.figsize':(7,7), 'figure.dpi':120})
    fig, axes = plt.subplots(3,2, sharex=False)

    # Original series
    axes[0, 0].plot(df)
    axes[0, 0].set_title('Original' + prefix)
    plot_acf(df, lags=lags, ax=axes[0, 1])

    # First Difference
    df_first_diff = df.diff().dropna()
    axes[1, 0].plot(df_first_diff)
    axes[1, 0].set_title('First Order Difference' + prefix)
    plot_acf(df_first_diff, lags=lags - 1, ax=axes[1, 1])

    # Second Difference
    df_second_diff = df.diff().diff().dropna()
    axes[2, 0].plot(df_second_diff)
    axes[2, 0].set_title('Second Order Difference' + prefix)
    plot_acf(df_second_diff, lags=lags - 2, ax=axes[2, 1])
    plt.show()

auto_correlation(df_sales, '', 50)

(0.019143247561160443, False)

Looking at the charts above, we can see that the time-series becomes more stationary after first-order differencing. However, the autocorrelation drops into negative territory very quickly, which indicates that the data may already be overdifferenced.

Next, we perform the same procedure for the seasonal part of our time series.

# 3.2 Seasonal part

# Reduce the timeframe to a single seasonal period
df_sales_s = df_sales[0:12]

# Autocorrelation for the seasonal part
auto_correlation(df_sales_s, '', 11)

# Compute the first difference of the seasonal slice
df_sales_s_first_diff = df_sales_s.diff().dropna()

# Check if the first difference is stationary
df_dates_s = pd.date_range(df['date'].iloc[1], periods=11, freq='MS')

title = "First Difference (Seasonal)"
check_stationarity(df_dates_s, df_sales_s_first_diff, title)
(0.99, True)

The seasonal part of the time-series is stationary after first-order differencing. In the autocorrelation plot, we can see that the values go negative but remain within acceptable bounds. Second-order differencing does not seem to improve these values. This suggests that first-order differencing is a good choice for the D parameter.

4 Use Auto-ARIMA to find an Optimal Model

Next, we fit an ARIMA model to our time series using auto-ARIMA. In preparation, we split our dataset into train and test sets. This ensures that we can later objectively measure the performance of our model against a fresh set of data that the model has never seen before. Model validation could also be done using cross-validation, but explaining this would have made this post even longer, so I decided to keep it simple.

Once we have created the train and test data sets, we can configure the parameters for the auto_arima stepwise optimization. We can set max_d = 1, so the model will test no differencing and first order differencing on the whole time-series. We set max_p and max_q to 3, so the model will test every combination, including 0.

To deal with the seasonality in our time-series, we set the “seasonal” parameter to True and the “m” parameter to 12 datapoints. This turns our model into a SARIMA model and provides us with additional D, P, and Q parameters. We define a max value for Q and P of 3. I set D to 1, because previously we have already seen that further differencing does not improve the stationarity.

After we have configured the parameters, we fit the model, which tries the candidate parameter combinations and chooses the model with the lowest AIC (Akaike Information Criterion).

# split into train and test
pred_periods = 30
split_number = df_sales.count() - pred_periods # corresponds to a prediction horizon of 2.5 years
df_train = df_sales[:split_number]
df_test = df_sales[split_number:]

df_train_dates = df_dates[:split_number]
df_test_dates = df_dates[split_number:]

# auto_arima
model_fit = pm.auto_arima(df_train, test='adf',
                         max_p=3, max_d=1, max_q=3,
                         seasonal=True, m=12,
                         max_P=3, max_D=2, max_Q=3,
                         trace=True,
                         error_action='ignore',
                         suppress_warnings=True,
                         stepwise=True)

# summarize the model characteristics
print(model_fit.summary())
Auto-ARIMA has determined that the best model is (3,0,0)(3,1,1). These results match well with the results from section 3, in which we manually performed differencing steps.

5 Simulate the Time-Series with in-sample Forecasting

Now that we have a trained model, we can use it to simulate the whole time series by calling the predict_in_sample function. The prediction covers the same time period as the original time-series on which the model was trained. Because the model only predicts one step ahead, the prediction results will naturally be close to the true values.

# Generate in-sample Predictions
# The parameter dynamic=False means that the model makes predictions upon the lagged values.
# This means that the model is trained until a point in the time-series and then tries to predict the next value.
pred = model_fit.predict_in_sample(dynamic=False) # works only with auto-arima

# Calculate the percentage difference
diff_percent = abs((df_train - pred) / df_train)* 100

# Print the predicted time-series
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title("In Sample Sales Prediction", fontsize=14)
plt.plot(df_train_dates, df_train, label='train', color="black", linewidth=1.0)
plt.plot(df_train_dates, pred, label='prediction', color="green", linewidth=1.0)
ax1.fill_between(df_train_dates, 0, df_train, color='#b9e1fa')
ax1.set_ylim([1000, 6500])
ax1.set_xlim([df_train_dates.min(), df_train_dates.max()])

# Print percentage prediction errors on a separate axis (ax2)
ax2 = ax1.twinx()
ax2.set_ylabel('Prediction Errors in %', color='tab:purple', fontsize=14)
ax2.set_yticks(np.arange(0, 50, 5.0))
ax2.set_ylim([0, 50])
ax2.bar(df_train_dates, diff_percent, width=20, color='purple', label='percentage errors')
ax2.set_xlim([df_train_dates[20], df_train_dates.max()])
plt.show()

Next we take a look at the prediction errors.

6 Make predictions and visualize them

Now that we have trained an optimal model, we are ready to make predictions about the future. To do this, we specify the number of periods that our model should predict. The forecast continues directly from the end of the training data, so we can plot it against the test dates.

# Generate predictions for n periods,
# starting from the last date of the training data
pred = model_fit.predict(n_periods=pred_periods)

# Print the predicted time-series
fig, ax = plt.subplots(figsize=(16, 8))
plt.title("Test/Pred Comparison", fontsize=14)
plt.plot(df_train_dates, df_train, label='train', color="black", linewidth=1.0)
plt.plot(df_test_dates, df_test, label='test', color="blue", linewidth=1.0)
plt.plot(df_test_dates, pred, label='forecast', color="green", linewidth=1.0)
ax.fill_between(df_train_dates, 1400, df_train, color='#b9e1fa')
ax.fill_between(df_test_dates, 1400, pred, color='lightgreen')

The forecasts of our model plausibly continue the seasonal pattern of our time series. On the one hand, this indicates that US beer sales will continue to rise and, on the other hand, that our model works just fine 🙂

7 Measure Prediction Errors

The chart with the model simulation from the previous section shows that there are few outliers among the prediction errors. Therefore, we focus our analysis on percentage errors by calculating the mean absolute percentage error (MAPE) and the median absolute percentage error (MDAPE).

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(df_test, pred)/ df_test))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(df_test, pred)/ df_test)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')
Mean Absolute Percentage Error (MAPE): 3.94 % 
Median Absolute Percentage Error (MDAPE): 3.49 %

The percent errors show that our ARIMA model has decent predictive performance.


Congratulations, in this post you learned how to analyze a seasonal time series using (S)ARIMA to create a forecast for beer sales. You created an in-sample forecast and used Auto-ARIMA to find the optimal parameters for your forecast model. I hope that the knowledge provided in this post will be useful to you when solving your own use cases.

If you liked this post, smash the like button! Also, if you have any questions or suggestions, please let me know in the comments.

Hi, my name is Florian! I am a Zurich-based Data Scientist with a passion for Artificial Intelligence and Machine Learning. After completing my PhD in Business Informatics at the University of Bremen, I started working as a Machine Learning Consultant for the Swiss consulting firm ipt. When I'm not working on use cases for our clients, I work on my own analytics projects and report on them in this blog.