Evaluate Time Series Forecasting Models with Python

Time series forecasting models

Evaluating the performance of forecasting models is a crucial step in their development. This is especially true for time series forecasting models. Unlike classification results, time series predictions cannot simply be divided into right and wrong. Instead, the deviation between each prediction and the actual value is measured numerically. Prediction errors can therefore be distributed unevenly over the course of a time series, which makes them difficult to summarize in a single number. This blog post demonstrates the use of different error metrics that are commonly used to evaluate time series forecasting models with Python.

To realistically determine the range of possible prediction errors, data scientists and researchers should be familiar with different error metrics. One of my earlier blog posts, Time Series Forecasting – Error Metrics Cheat Sheet, presents a cheat sheet with the most frequently used error metrics in time series analysis, along with the formulas and specifics of these metrics. If you are not yet familiar with this topic, I recommend starting with that article.

Sample time series project

The sample evaluation of a time series forecasting model will use data generated as the product of three sine curves. The data will be used to train a neural network that predicts the further course of this time series. Using this example, the post covers the steps to calculate different error metrics and use them to evaluate the performance of the forecasting model.

This post covers the following steps:

  1. Creating a sample time series
  2. Preparing the data
  3. Training a time series forecasting model
  4. Making predictions
  5. Calculating performance metrics
  6. Evaluating model performance
  7. Summary

Python Environment

This tutorial assumes that you have set up your Python environment. I recommend using the Anaconda environment. If you have not yet set the environment up, you can follow this tutorial. It is also assumed that you have the following packages installed: keras (2.0 or higher) with TensorFlow backend, numpy, pandas, matplotlib, and scikit-learn (imported as sklearn). The packages can be installed using the console command:

pip install <packagename>

1) Creating a sample time series

We start by creating some artificial sample data based on three multiplied sine curves.

# Setting up packages for data manipulation and machine learning
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl 
from keras.models import Sequential
from sklearn.preprocessing import MinMaxScaler
from keras.layers import LSTM, Dense, TimeDistributed, Dropout, Activation

# Creating the sample sine curve dataset
steps = 1000
list_a = []
for i in range(0, steps, 1):
    y = 100 * round(math.sin(math.pi * i * 0.02 + 0.01), 4) * round(math.sin(math.pi * i * 0.005 + 0.01), 4) * round(math.sin(math.pi * i * 0.005 + 0.01), 4)
    list_a.append(y)  # collect each value so the DataFrame is not empty
df = pd.DataFrame({"valid": list_a}, columns=["valid"])

# Visualizing the data
fig, ax1 = plt.subplots(figsize=(16, 4))
plt.title("Sine Curve Data", fontsize=14)
plt.plot(df[["valid"]], color="black", linewidth=2.0)
Figure: Sine Curve Data
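The same series can also be generated without an explicit Python loop. The following is a minimal sketch, assuming NumPy's rounding agrees with Python's built-in round to within a small tolerance; the names y_vec and y_loop are illustrative and not part of the original code.

```python
import math
import numpy as np

steps = 1000
i = np.arange(steps)

# Vectorized product of three rounded sine terms (same formula as the loop version)
y_vec = (
    100
    * np.round(np.sin(np.pi * i * 0.02 + 0.01), 4)
    * np.round(np.sin(np.pi * i * 0.005 + 0.01), 4)
    * np.round(np.sin(np.pi * i * 0.005 + 0.01), 4)
)

# Loop-based reference, kept only to compare both approaches
y_loop = np.array([
    100
    * round(math.sin(math.pi * k * 0.02 + 0.01), 4)
    * round(math.sin(math.pi * k * 0.005 + 0.01), 4)
    * round(math.sin(math.pi * k * 0.005 + 0.01), 4)
    for k in range(steps)
])

# The tolerance allows for rounding differences in the fourth decimal place
print(np.allclose(y_vec, y_loop, atol=0.05))
```

Because the last two sine terms are identical, the series is effectively a fast sine multiplied by the square of a slow sine, which is why the values pass through zero repeatedly.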

2) Preparing the data

The following code will prepare the data to train a recurrent neural network model.

# Settings
epochs = 4; batch_size = 1; sequencelength = 15; n_features = 1

# Get the number of rows to train the model on 60% of the data
npdataset = df.values
training_data_length = math.ceil(len(npdataset) * 0.6)

# Transform features by scaling each feature to a range between 0 and 1
mmscaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = mmscaler.fit_transform(npdataset)

# Create a scaled training data set
train_data = scaled_data[0:training_data_length, :]

# Split the data into x_train and y_train data sets
x_train = []; y_train = []
trainingdatasize = len(train_data)
for i in range(sequencelength, trainingdatasize-1):
    x_train.append(train_data[i-sequencelength : i, 0]) 
    y_train.append(train_data[i, 0])  # target: the value that follows each input window

# Convert the x_train and y_train to numpy arrays
x_train = np.array(x_train); y_train = np.array(y_train)

# Reshape the data
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
print("x_train.shape: " + str(x_train.shape) + " -- y_train.shape: " + str(y_train.shape))
Out: x_train.shape: (584, 15, 1) -- y_train.shape: (584,)
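To see what the windowing loop produces, it helps to run the same idea on a tiny series. The sketch below uses a hypothetical helper, make_windows, that is not part of the original code. Unlike the loop above, it runs to the very end of the series.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (inputs, targets) using a sliding window."""
    x, y = [], []
    for i in range(window, len(series)):
        x.append(series[i - window:i])  # the `window` values before position i
        y.append(series[i])             # the value the model should predict
    return np.array(x), np.array(y)

series = np.arange(10, dtype=float)  # toy series 0.0, 1.0, ..., 9.0
x_demo, y_demo = make_windows(series, window=3)

print(x_demo.shape, y_demo.shape)  # (7, 3) (7,)
print(x_demo[0], y_demo[0])        # [0. 1. 2.] 3.0
```

Each row of x_demo holds three consecutive values, and the matching entry of y_demo is the value that comes right after them. The tutorial uses the same layout with a window of 15.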

3) Training a forecasting model

Now we can train a forecasting model. For this, we will use a recurrent neural network. Understanding neural networks in depth is not a prerequisite for this tutorial. If you want to learn more about the architecture and functioning of neural networks, I can recommend this YouTube video.

The following code creates and compiles the model architecture, including the input shape of the neural net. The second code block then trains the model:

# Configure and compile the neural network model
# The number of LSTM units is set to the sequence length multiplied by the number of features
lstm_neuron_number = sequencelength * n_features

# Create the model
model = Sequential()
model.add(LSTM(lstm_neuron_number, return_sequences=False, input_shape=(x_train.shape[1], 1)))
model.add(Dense(1))  # single output value per window (assumed; the original listing omits the output layer)
model.compile(optimizer="adam", loss="mean_squared_error")
# Settings
batch_size = 5

# Train the model
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
Epoch 1/4
584/584 [==============================] - 1s 2ms/step - loss: 0.1047
Epoch 2/4
584/584 [==============================] - 1s 1ms/step - loss: 0.0153
Epoch 3/4
584/584 [==============================] - 1s 1ms/step - loss: 0.0102
Epoch 4/4
584/584 [==============================] - 1s 1ms/step - loss: 0.0064

4) Making test predictions

# Create the data sets x_test and y_test
# The test data starts sequencelength steps early so that the first window has a full history
test_data = scaled_data[training_data_length - sequencelength :, :]
test_data_len = test_data.shape[0]

x_test, y_test = [], []
for i in range(sequencelength, test_data_len):
    x_test.append(test_data[i-sequencelength:i, 0])
    y_test.append(test_data[i, 0])  # single target value following each window

# Convert x_test and y_test to numpy arrays
x_test, y_test = np.array(x_test), np.array(y_test)
print(x_test.shape, y_test.shape)

# Reshape x_test to (samples, timesteps, features), as expected by the LSTM layer
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
Out: (400, 15) (400,)
# Get the predicted values and convert them back to the original scale
y_pred = model.predict(x_test)
y_pred = mmscaler.inverse_transform(y_pred)
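The inverse transform matters for the error metrics that follow. With feature_range=(0, 1), MinMaxScaler maps each value to (x - min) / (max - min), so errors measured on scaled data shrink by the factor (max - min). The sketch below reproduces this with plain NumPy. The helpers scale and unscale and the sample numbers are illustrative assumptions, not part of the original code.

```python
import numpy as np

actual = np.array([-50.0, 0.0, 25.0, 100.0])
predicted = np.array([-40.0, 5.0, 20.0, 90.0])

# Min and max of the data the scaler would be fitted on
lo, hi = actual.min(), actual.max()

def scale(x):
    """Min-max scaling to the range [0, 1], as MinMaxScaler does per feature."""
    return (x - lo) / (hi - lo)

def unscale(x):
    """Inverse of scale(): back to the original units."""
    return x * (hi - lo) + lo

mae_original = np.mean(np.abs(predicted - actual))
mae_scaled = np.mean(np.abs(scale(predicted) - scale(actual)))

print(mae_original)              # 7.5
print(mae_scaled * (hi - lo))    # the same MAE after undoing the scaling factor
print(np.allclose(unscale(scale(predicted)), predicted))  # True
```

Absolute metrics such as MAE or RMSE are therefore only interpretable in the original units once the predictions have been transformed back.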

Next we plot the predictions.

# Visualize the data
train = df[:training_data_length]
valid = df[training_data_length:].copy()  # copy so that we do not modify a slice of df
valid.insert(1, "y_pred", y_pred.ravel(), True)  # flatten the (n, 1) predictions to 1-D

fig, ax1 = plt.subplots(figsize=(16, 8), sharex=True)
xt = train.index; yt = train[["valid"]]
xv = valid.index; yv = valid[["valid", "y_pred"]]
ax1.tick_params(axis="x", rotation=0, labelsize=10, length=0)
plt.title("y_pred vs y_test Truth", fontsize=18)

plt.plot(yv["y_pred"], color="red")
plt.plot(yv["valid"], color="black", linewidth=2)

plt.legend(["y_pred", "y_test"], loc="upper left")

# Fill between plotlines (mpl was imported at the top of the script)
mpl.rc('hatch', color='k', linewidth=2)
ax1.fill_between(xv, yv["valid"], yv["y_pred"], facecolor='white', hatch="||", edgecolor="blue", alpha=.9)

5) Calculating error metrics

Now comes the interesting part. With the following code you’ll calculate six common error metrics:

y_pred = yv["y_pred"]
y_test = yv["valid"]
print(y_test.shape, y_pred.shape)

# Mean Absolute Error (MAE)
MAE = np.mean(abs(y_pred - y_test))
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Median Absolute Error (MedAE)
MEDAE = np.median(abs(y_pred - y_test))
print('Median Absolute Error (MedAE): ' + str(np.round(MEDAE, 2)))

# Mean Squared Error (MSE)
MSE = np.square(np.subtract(y_pred, y_test)).mean()
print('Mean Squared Error (MSE): ' + str(np.round(MSE, 2)))

# Root Mean Squared Error (RMSE)
RMSE = np.sqrt(np.mean(np.square(y_pred - y_test)))
print('Root Mean Squared Error (RMSE): ' + str(np.round(RMSE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test, y_pred)/ y_test))) * 100
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE, 2)) + ' %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test, y_pred)/ y_test))) * 100
print('Median Absolute Percentage Error (MDAPE): ' + str(np.round(MDAPE, 2)) + ' %')
Mean Absolute Error (MAE): 6.95
Median Absolute Error (MedAE): 5.05 
Mean Squared Error (MSE): 78.7
Root Mean Squared Error (RMSE): 8.87 
Mean Absolute Percentage Error (MAPE): 10339.13 %
Median Absolute Percentage Error (MDAPE): 26.8 %

6) Evaluating model performance

First, let’s take a look at the MAE and the MedAE. The MAE is 6.95 and the MedAE is 5.05. These values are fairly close to each other, which indicates that the prediction errors are distributed fairly evenly and that there are few large outliers among the predictions.
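Why does the gap between MAE and MedAE hint at outliers? A small sketch with made-up error values (not taken from the model above) shows how a single large error pulls the mean upward while leaving the median untouched.

```python
import numpy as np

# Hypothetical absolute errors; the second set replaces one value with an outlier
abs_errors = np.array([4.0, 5.0, 5.0, 6.0, 5.0])
abs_errors_outlier = np.array([4.0, 5.0, 5.0, 6.0, 50.0])

mae, medae = abs_errors.mean(), np.median(abs_errors)
mae_out, medae_out = abs_errors_outlier.mean(), np.median(abs_errors_outlier)

print(mae, medae)          # 5.0 5.0
print(mae_out, medae_out)  # 14.0 5.0
```

The further the MAE drifts above the MedAE, the more the average is being driven by a few large errors.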

To get a better picture of possible outliers, we take a look at the MSE. With a value of 78.7, the MSE is somewhat higher than the square of the MAE (6.95² ≈ 48.3). The RMSE (8.87) is also only moderately higher than the MAE, which is another indication that the prediction errors lie in a fairly narrow range.
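The comparison between the MSE and the squared MAE can be made precise: the MSE equals the squared MAE plus the variance of the absolute errors. The sketch below checks this identity on made-up values and then applies it to the rounded metrics reported above. The variable names are illustrative.

```python
import numpy as np

# Hypothetical absolute errors for illustration
abs_errors = np.array([2.0, 4.0, 6.0, 8.0])

mae = abs_errors.mean()               # 5.0
mse = np.mean(abs_errors ** 2)        # 30.0
rmse = np.sqrt(mse)
spread = np.var(abs_errors)           # population variance of the absolute errors

print(mse - mae ** 2, spread)         # 5.0 5.0
print(rmse >= mae)                    # True; equal only if all errors have the same size

# Applied to the rounded values from this post (MSE 78.7, MAE 6.95)
implied_std = np.sqrt(78.7 - 6.95 ** 2)
print(round(float(implied_std), 1))   # 5.5
```

Read this way, the absolute errors in this example have a standard deviation of roughly 5.5 around their mean of 6.95.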

How much do the predictions of our model deviate from the actual values in percentage terms? The MAPE is typically used as a starting point to answer this question. With a value of 10339.13 percent, it is extremely high. So is our model very much mistaken? The answer is no; the MAPE is misleading here. The problem is that several actual values are close to zero, e.g., 0.00001. While the predictions of our model are close to the actual values in absolute terms, the MAPE divides each residual by the corresponding actual value, e.g., 0.00001, and averages the results. A few such near-zero denominators are enough to make the MAPE very large.
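The effect is easy to reproduce. In the following sketch, which uses made-up values rather than the model's output, one actual value of 0.00001 is enough to push the MAPE to roughly 2.5 million percent even though the MAE stays at about 2.

```python
import numpy as np

# Hypothetical actual values and predictions; the third actual value is almost zero
y_true = np.array([50.0, -40.0, 0.00001, 30.0])
y_hat = np.array([48.0, -42.0, 1.0, 33.0])

abs_err = np.abs(y_true - y_hat)
ape = abs_err / np.abs(y_true) * 100   # absolute percentage error per point

print("MAE: ", round(float(abs_err.mean()), 2))
print("APE per point:", np.round(ape, 1))
print("MAPE:", round(float(ape.mean()), 2), "%")
print("Index of the dominating point:", int(np.argmax(ape)))
```

A single point contributes almost the entire MAPE here, which is why the metric says little about the typical error when the series crosses zero.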

The median is more robust in this situation. This becomes evident if we look at the MDAPE, which is 26.8%. This means that 50% of our forecasting errors are higher than 26.8% and 50% are lower. Consequently, we can assume that when our model makes a prediction, there is a 50% probability that the deviation from the actual value is at most 26.8%. That is not as terrible as the MAPE would have us believe. The plotlines of the predictions and actual values reflect these findings.
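The "50% of errors" reading of the MDAPE can be checked directly, and other percentiles can be inspected in the same way. The sketch below uses ten made-up absolute percentage errors, including one extreme value; it is an illustration, not output from the model.

```python
import numpy as np

# Hypothetical absolute percentage errors (in %), with one extreme value
ape = np.array([2.0, 5.0, 8.0, 12.0, 20.0, 27.0, 35.0, 60.0, 150.0, 9000.0])

mape = ape.mean()
mdape = np.median(ape)                 # 23.5, the midpoint of 20 and 27
share_below = np.mean(ape <= mdape)    # fraction of errors at or below the median

print("MAPE: ", mape)
print("MDAPE:", mdape)
print("Share of errors <= MDAPE:", share_below)  # 0.5

# Further percentiles give a fuller picture than any single number
for q in (25, 75, 90):
    print(q, "th percentile:", np.percentile(ape, q))
```

Reporting a few percentiles alongside the median shows how quickly the errors grow in the tail without letting a single extreme value dominate.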


7) Summary

In this post, you have learned to evaluate time series forecasting models using different error metrics. You have also seen that performance metrics in time series forecasting can be misleading. Therefore, they should be used with caution and preferably in combination. If there is one key takeaway from this post, it’s “never trust a single error metric”.
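One way to make a habit of looking at several metrics together is to wrap the calculations from section 5 in a small helper. The function forecast_metrics below is a hypothetical convenience wrapper, sketched under the assumption that no actual value is exactly zero; it is not part of any library.

```python
import numpy as np

def forecast_metrics(y_true, y_hat):
    """Return the six error metrics used in this post as a dictionary."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y_hat - y_true
    abs_err = np.abs(err)
    ape = abs_err / np.abs(y_true) * 100  # assumes y_true contains no exact zeros
    return {
        "MAE": abs_err.mean(),
        "MedAE": np.median(abs_err),
        "MSE": np.mean(err ** 2),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": ape.mean(),
        "MDAPE": np.median(ape),
    }

# Toy example with values that are easy to verify by hand
report = forecast_metrics([10.0, 20.0, 40.0], [12.0, 18.0, 44.0])
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Printing all metrics side by side makes discrepancies such as a large MAPE next to a moderate MDAPE visible at a glance.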

I hope you found this post useful. Please leave a comment if you have any remarks or questions remaining.


  • Hi, my name is Florian! I am a Zurich-based Data Scientist with a passion for Artificial Intelligence and Machine Learning. After completing my PhD in Business Informatics at the University of Bremen, I started working as a Machine Learning Consultant for the Swiss consulting firm ipt. When I'm not working on use cases for our clients, I work on my own analytics projects and report on them in this blog.

Follow Florian Müller:

Data Scientist & Machine Learning Consultant
