Measuring Regression Model Performance

Evaluate the Performance of Time Series Forecasting Models with Python

Evaluating prediction quality is a crucial step in the development of regression models. To evaluate a regression model, we measure the deviation between predictions and actual values in numerical terms. Unlike classification models, whose predictions are simply right or wrong, regression models thus allow for different gradations of error. However, choosing the right metrics is a challenge in itself: prediction errors can, for example, be heterogeneously distributed over a time series or heavily influenced by outliers. This blog post demonstrates various error metrics commonly used to evaluate time series forecasting models with Python.

To realistically determine the range of possible prediction errors, data scientists and researchers should be familiar with different error metrics. One of my earlier blog posts, Time Series Forecasting – Error Metrics Cheat Sheet, presents the most frequently used error metrics in time series analysis along with their formulas and specificities. If you are not yet familiar with this topic, I recommend starting with that article.

Measuring the Performance of a Time Series Forecasting Model in Python

The sample evaluation of a time series forecasting model will use data generated from multiplied sine curves. We will use the data to train a neural network that predicts the further course of the sine curve time series. The article covers the steps to calculate various error metrics and use them to evaluate the performance of the forecasting model.

Prerequisites

Before we start the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow the steps in this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: math, NumPy, pandas, and Matplotlib.

In addition, we will be using Keras (2.0 or higher) with the TensorFlow backend and the machine learning library scikit-learn.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)

Step #1 Generate Sample Time Series Data

We start by creating some artificial sample data based on three multiplied sine curves.

# Setting up packages for data manipulation and machine learning
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl 
from keras.models import Sequential
from sklearn.preprocessing import MinMaxScaler
from keras.layers import LSTM, Dense

# Creating the sample sinus curve dataset
steps = 1000
list_a = []
for i in range(0, steps, 1):
    # Product of three sine factors (two of them identical), scaled by 100
    y = (100 * round(math.sin(math.pi * i * 0.02 + 0.01), 4)
             * round(math.sin(math.pi * i * 0.005 + 0.01), 4)
             * round(math.sin(math.pi * i * 0.005 + 0.01), 4))
    list_a.append(y)
df = pd.DataFrame({"valid": list_a}, columns=["valid"])

# Visualizing the data
fig, ax1 = plt.subplots(figsize=(16, 4))
ax1.xaxis.set_major_locator(plt.MaxNLocator(30))
plt.title("Sine Curve Data", fontsize=14)
plt.plot(df[["valid"]], color="black", linewidth=2.0)
plt.show()
[Figure: Sine Curve Data]

Step #2 Data Preparation

The following code will prepare the data to train a recurrent neural network model.

# Settings
epochs = 4; batch_size = 5; sequencelength = 15; n_features = 1

# Get the number of rows to train the model on 60% of the data
npdataset = df.values
training_data_length = math.ceil(len(npdataset) * 0.6)

# Transform features by scaling each feature to a range between 0 and 1
mmscaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = mmscaler.fit_transform(npdataset)

# Create a scaled training data set
train_data = scaled_data[0:training_data_length, :]

# Split the data into x_train and y_train data sets
x_train = []; y_train = []
trainingdatasize = len(train_data)
for i in range(sequencelength, trainingdatasize-1):
    x_train.append(train_data[i-sequencelength : i, 0]) 
    y_train.append(train_data[i, 0])  # the target: the next value after the sequence

# Convert the x_train and y_train to numpy arrays
x_train = np.array(x_train); y_train = np.array(y_train)

# Reshape the data
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
print("x_tain.shape: " + str(x_train.shape) + " -- y_tain.shape: " + str(y_train.shape))
Out: x_tain.shape: (584, 15, 1) -- y_tain.shape: (584,)
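
To make the windowing logic tangible, it can help to inspect a single training sample. This is a minimal check (not part of the model pipeline): each entry of x_train holds 15 consecutive scaled values, and the corresponding y_train entry is the value that immediately follows them.

# Inspect the first training sample: a sequence of 15 inputs and its target
print(x_train[0].flatten())  # 15 consecutive scaled input values
print(y_train[0])            # the next value, which the model learns to predict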

Step #3 Train Time Series Neural Network Regression Model

Now, we can train a forecasting model. For this, we will use a recurrent neural network. Understanding neural networks in full depth is not a prerequisite for this tutorial. If you want to learn more about the architecture and functioning of neural networks, I can recommend this YouTube video.

The following code will create and compile the model architecture. The second code block will then train the neural network:

# Configure and compile the neural network model
# The number of input neurons is defined by the sequence length multiplied by the number of features
lstm_neuron_number = sequencelength * n_features

# Create the model
model = Sequential()
model.add(
    LSTM(lstm_neuron_number, return_sequences=False, input_shape=(x_train.shape[1], 1))
)
model.add(Dense(1))
model.compile(optimizer="adam", loss="mean_squared_error")

# Train the model
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
Epoch 1/4
584/584 [==============================] - 1s 2ms/step - loss: 0.1047
Epoch 2/4
584/584 [==============================] - 1s 1ms/step - loss: 0.0153
Epoch 3/4
584/584 [==============================] - 1s 1ms/step - loss: 0.0102
Epoch 4/4
584/584 [==============================] - 1s 1ms/step - loss: 0.0064
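
To check whether training has converged, it can help to plot the loss values that Keras records during training. The following snippet is a minimal sketch that reads the loss curve from the history object returned by model.fit:

# Plot the training loss per epoch from the history object
plt.figure(figsize=(8, 3))
plt.plot(history.history["loss"], marker="o")
plt.title("Training Loss per Epoch", fontsize=14)
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.show()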

Step #4 Making Test Predictions

Next, we create the test sequences from the remaining data and let the trained model make predictions on them.

# Create the data sets x_test and y_test
test_data = scaled_data[training_data_length - sequencelength :, :]
test_data_len = test_data.shape[0]

x_test, y_test = [], []
for i in range(sequencelength, test_data_len):
    x_test.append(test_data[i-sequencelength:i, 0])
    y_test.append(test_data[i, 0])

# Convert x_test and y_test to numpy arrays
x_test, y_test = np.array(x_test), np.array(y_test)
print(x_test.shape, y_test.shape)
Out: (400, 15) (400,)

# Reshape x_test so that we get an array of test sequences
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))

# Get the predicted values and scale them back to the original range
y_pred = model.predict(x_test)
y_pred = mmscaler.inverse_transform(y_pred).flatten()

Next, we plot the predictions.

# Visualize the data
train = df[:training_data_length]
valid = df[training_data_length:].copy()  # copy to avoid SettingWithCopyWarning
valid.insert(1, "y_pred", y_pred, True)

fig, ax1 = plt.subplots(figsize=(16, 8))
xv = valid.index
yv = valid[["valid", "y_pred"]]
ax1.tick_params(axis="x", rotation=0, labelsize=10, length=0)
plt.title("y_pred vs. y_test", fontsize=18)

plt.plot(yv["y_pred"], color="red")
plt.plot(yv["valid"], color="black", linewidth=2)

plt.legend(["y_pred", "y_test"], loc="upper left")

# Fill the area between the plot lines
mpl.rc("hatch", color="k", linewidth=2)
ax1.fill_between(xv, yv["valid"], yv["y_pred"], facecolor="white", hatch="||", edgecolor="blue", alpha=0.9)
plt.show()
[Figure: Line plot of the predicted vs. actual data]

Step #5 Calculating Error Metrics

Now comes the exciting part. With the following code, you’ll calculate six standard error metrics:

y_pred = yv["y_pred"]
y_test = yv["valid"]
print(y_test.shape, y_pred.shape)

# Mean Absolute Error (MAE)
MAE = np.mean(abs(y_pred - y_test))
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Median Absolute Error (MedAE)
MEDAE = np.median(abs(y_pred - y_test))
print('Median Absolute Error (MedAE): ' + str(np.round(MEDAE, 2)))

# Mean Squared Error (MSE)
MSE = np.square(np.subtract(y_pred, y_test)).mean()
print('Mean Squared Error (MSE): ' + str(np.round(MSE, 2)))

# Root Mean Squared Error (RMSE)
RMSE = np.sqrt(np.mean(np.square(y_pred - y_test)))
print('Root Mean Squared Error (RMSE): ' + str(np.round(RMSE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test, y_pred)/ y_test))) * 100
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE, 2)) + ' %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test, y_pred)/ y_test))) * 100
print('Median Absolute Percentage Error (MDAPE): ' + str(np.round(MDAPE, 2)) + ' %')
Mean Absolute Error (MAE): 6.95
Median Absolute Error (MedAE): 5.05 
Mean Squared Error (MSE): 78.7
Root Mean Squared Error (RMSE): 8.87 
Mean Absolute Percentage Error (MAPE): 10339.13 %
Median Absolute Percentage Error (MDAPE): 26.8 %
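
Since scikit-learn is already part of our environment, you can optionally cross-check the manually calculated metrics against its built-in functions. A brief sketch follows; note that mean_absolute_percentage_error is only available in scikit-learn 0.24 or higher and returns a fraction rather than a percentage:

# Cross-check with scikit-learn's metric functions
from sklearn.metrics import (mean_absolute_error, median_absolute_error,
                             mean_squared_error, mean_absolute_percentage_error)

print('MAE:   ' + str(np.round(mean_absolute_error(y_test, y_pred), 2)))
print('MedAE: ' + str(np.round(median_absolute_error(y_test, y_pred), 2)))
print('MSE:   ' + str(np.round(mean_squared_error(y_test, y_pred), 2)))
print('RMSE:  ' + str(np.round(np.sqrt(mean_squared_error(y_test, y_pred)), 2)))
print('MAPE:  ' + str(np.round(mean_absolute_percentage_error(y_test, y_pred) * 100, 2)) + ' %')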

Step #6 Evaluating Model Performance

Let’s take a look at the MAE and the MedAE. The MAE is 6.95, and the MedAE is 5.05. These values lie close to each other, which indicates that the prediction errors are relatively evenly distributed and that there are few, if any, significant outliers in the predictions.
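
The following toy example (with made-up error values, unrelated to our model) shows why comparing the mean and the median of the absolute errors reveals outliers: a single extreme error inflates the mean, while the median barely moves.

# Made-up absolute errors: one outlier shifts the mean far more than the median
errors = np.array([5.0, 5.5, 6.0, 6.5, 7.0])
errors_outlier = np.append(errors, 60.0)  # add a single extreme error

print(np.mean(errors), np.median(errors))                  # 6.0 6.0
print(np.mean(errors_outlier), np.median(errors_outlier)) # 15.0 6.25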

To get a better picture of possible outliers, we take a look at the MSE. With a value of 78.7, the MSE is somewhat higher than the square of the MAE (6.95² ≈ 48.3). More tellingly, the RMSE (8.87) is only slightly higher than the MAE (6.95), which is another indication that the prediction errors lie in a narrow range.

How much do the predictions of our model deviate from the actual values in percentage terms? The MAPE is typically used as a starting point to answer this question. With 10339.13 percent, it is incredibly high. So is our model very much mistaken? The answer is no – the MAPE is misleading. The problem is that several actual values are close to zero, e.g., 0.00001. While the predictions of our model are close to the real values in absolute terms, the MAPE divides each residual by the corresponding actual value before averaging, so a handful of near-zero actuals is enough to make the metric explode.
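
A hypothetical example with made-up numbers illustrates the effect: a single actual value close to zero dominates the MAPE even though all absolute errors are small, while the median-based MDAPE stays informative.

# Made-up values: one near-zero actual blows up the MAPE
y_true_toy = np.array([100.0, 50.0, 0.001])
y_hat_toy = np.array([101.0, 49.0, 0.1])  # absolute errors: 1, 1, ~0.1

ape_toy = np.abs((y_true_toy - y_hat_toy) / y_true_toy) * 100
print(ape_toy)            # [   1.    2. 9900.]
print(np.mean(ape_toy))   # ~3301 % (MAPE, dominated by the near-zero actual)
print(np.median(ape_toy)) # 2.0 % (MDAPE, robust)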

The median plays an important role in measuring the quality of a forecast. This becomes evident if we look at the MDAPE, which is 26.8%: 50% of our percentage errors are higher than 26.8%, and 50% are lower. Consequently, when our model makes a prediction, there is a 50% probability that the deviation from the actual value is below 26.8% – not nearly as terrible as the MAPE would have us believe. The plot lines of the predictions and actual values reflect these findings.
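
If you want an even fuller picture of the error distribution than a single median provides, you can inspect several quantiles of the absolute percentage errors. This brief sketch reuses y_test and y_pred from Step #5; the 50th percentile corresponds to the MDAPE:

# Quantiles of the absolute percentage errors (APE)
ape = np.abs((y_test - y_pred) / y_test) * 100
for q in [25, 50, 75, 90]:
    print(str(q) + 'th percentile APE: ' + str(np.round(np.percentile(ape, q), 2)) + ' %')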

Summary

In this post, you have learned to evaluate time series forecasting models using different error metrics. You have also seen that performance metrics in time series forecasting can be misleading. Therefore, they should be used with caution and preferably in combination. If there is a crucial takeaway from this post, then it’s “never trust a single error metric.”

I hope this article was helpful. If you have any remarks or questions remaining, write them in the comments. I try to respond within two days.

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since completing my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal of sharing my experiences and creating a place where you can find key concepts of machine learning and materials that will help you kick-start your own Python projects.
