Synthetic Data Archives - relataly.com

Cluster Analysis with k-Means in Python

Florian Follonier — Sun, 27 Jun 2021 18:21:01 +0000

Embark on a journey into the world of unsupervised machine learning with this beginner-friendly Python tutorial focusing on K-Means clustering, a powerful technique used to group similar data points into distinct clusters. This invaluable tool helps us make sense of complex datasets, finding hidden patterns and associations without the need for a predetermined target variable. This comes in handy, especially when dealing with data whose similarities and differences aren’t immediately apparent.

Divided into two insightful sections, this blog post first delves into the theoretical foundation of the K-Means clustering algorithm. We’ll explore its real-world applications, its strengths, and its weaknesses, providing a comprehensive overview of what makes K-Means an essential tool in any data scientist’s toolkit.

In the second part of the blog, we switch gears and dive into a hands-on Python tutorial. Here, we’ll walk you through a practical example of K-Means clustering, using it to uncover three distinct spherical clusters within a synthetic dataset. To round up, we’ll put our model to the test, using it to predict clusters within a test dataset, and then we’ll visualize the results for easy understanding.

Whether you’re a seasoned data scientist or a beginner looking to dip your toes into the field, this blog post offers a simple and accessible introduction to cluster engineering with K-Means in Python. Follow along, and uncover the hidden potential of your data today!

Patterns are everywhere, and machine learning can help to expose and understand them better. Image created with Midjourney.

Findings Clusters in Multivariate Data with k-Means

K-Means clustering is a prominent unsupervised machine learning algorithm utilized to partition datasets into a predefined number (‘k’) of distinct clusters, each brimming with similar data points. To begin, the algorithm identifies ‘k’ random data points which serve as the initial centroids – the central points representing each cluster. It then proceeds in an iterative fashion, continuously reassigning each data point to the centroid nearest to it, followed by updating each centroid to the average of the data points assigned to its cluster. This process repeats until the centroids no longer change positions or until a set maximum number of iterations is reached.

The underlying objective of the K-Means algorithm is to minimize the within-cluster sum of squares (WCSS), a metric indicating the cumulative distance between each data point within a cluster and its corresponding centroid. By reducing the WCSS, K-Means aims to form the most compact and clearly separable clusters.

The K-Means algorithm has garnered substantial popularity in the field of data science for its speed and simplicity in implementation. However, it isn’t without its limitations. It requires the number of clusters to be specified beforehand and operates under the assumption that all clusters are spherical and uniformly sized. In the following sections, we’ll delve into the intricacies of this powerful yet straightforward algorithm and discuss its use in real-world applications.

Three clusters in a dataset that were separated by the k-Means algorithms

Understanding the Mechanics of K-Means Clustering Algorithm

At its core, K-Means clustering is a dynamic algorithm designed to simplify complex data patterns. This machine learning model is a champion of unsupervised learning, adept at grouping similar data points into a predetermined number (‘k’) of unique clusters.

Starting the process, the K-Means algorithm selects ‘k’ random data points, which act as the initial centroids or the central figures of the clusters. The algorithm then goes through a series of iterative steps, continually allocating each data point to its closest centroid, and recalibrating the centroid to be the mean of all data points that are assigned to its cluster. This cycle repeats until we achieve stable centroids, i.e., they stop moving, or until a pre-set maximum number of iterations is completed.

The fundamental goal of K-Means lies in minimizing the within-cluster sum of squares (WCSS). This term refers to the sum of the distances of each data point in a cluster to its centroid. By lowering WCSS, K-Means strives to construct compact clusters that are distinctly separate from each other.

In the illustration below, we assume a cluster with three cluster centers. K-Means carries out several steps to partition the data.

Initially, the algorithm k chooses random starting positions for the centroids. Alternatively, we can also position the centroids manually.
Then the algorithm calculates the distance between the data points and the three centroids. The data points are assigned to the closest centroid/cluster where the cluster variance increases the least.
Next, the algorithm calculates the Euclidean distance between the centroids and their assigned data points. The result is linear decision boundaries that separate the clusters but are not yet optimal.
From then on, the algorithm optimizes the centroids’ positions to lower the resulting clusters’ variance. Then, the previous steps are repeated: averaging, assigning the data points to clusters, and shifting the centroids.

The process ends when the positions of the centroids do not change anymore.

k-Means iteratively optimizes the decision boundaries between clusters

How many Clusters?

A particular challenge is that k-means requires estimating the number of clusters k. When tackling new clustering problems, we usually don’t know the optimal number of clusters. Unless the data is not too complex, we can often estimate the number of centers by looking at one or more scatter plots. However, this approach only works when the data has a few dimensions. For complex data with many dimensions, it is common to experiment with varying numbers of k to find an appropriate size for the problem. We can automate this process using hyperparameter tuning techniques such as grid search or random search. The idea is to try out different cluster sizes and identify the size that best differentiates between clusters.

Pros and Cons of k-Means Clustering

Although K-Means is esteemed for its swift operation and uncomplicated implementation, it comes with a set of constraints. We need to know the strengths and weaknesses of clustering techniques such as k-Means. In general, clustering can reveal structures and relationships in data supervised machine learning methods like classification likely would not uncover. In particular, when we suspect different subgroups in the data that differ in their behavior, clustering can help discover what makes these groups unique.

K-Means, in particular, possesses a unique ability to detect and segregate spherical-shaped clusters efficiently. However, its performance may falter when faced with clusters embodying more intricate structures like half-moons or circles, often struggling to differentiate them accurately.

Another potential limitation of K-Means lies in its prerequisite of specifying the number of clusters in advance. This requirement can prove challenging when the true number of clusters within a dataset is unknown or ambiguous. In such scenarios, alternative clustering techniques, such as affinity propagation or hi erarchical clustering, might be more suitable, as they possess the ability to automatically determine the optimal number of clusters.

Applications of Clustering

The k-means algorithm is used in a variety of applications, including the following:

A typical use case for clustering is in marketing and market segmentation. Here clustering is used to identify meaningful segments of similar customers. The similarity can be based on demographic data (age, gender, etc.) or customer behavior (for example, the time and amount of a purchase).
Medical research uses clustering to divide patient groups into different subgroups, for example, to assess stroke risk. After clustering, the next step is to develop separate prediction models for the subgroups to estimate the risk more accurately.
An application in the financial sector is outlier detection in fraud detection. Banks and credit card companies use clustering to detect unusual transactions and flag them for verification.
Spam filtering: The input data are attributes of emails (text length, words contained, etc.) and help separate spam from non-spam emails.

Implementing a K-Means Clustering Model in Python

In the following, we run a cluster analysis on synthetic data using Python and scikit-learn. We aim to train a K-Means cluster model in Python that distinguishes three clusters in the data. Given that the data is synthetic, we’re already privy to which cluster each data point pertains to. This foreknowledge enables us to evaluate the performance of our model post-training, gauging how effectively it can differentiate between the three predefined clusters.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Prerequisites

Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, consider the Anaconda Python environment. To set it up, you can follow the steps in this tutorial.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

This article uses the k-means clustering algorithm from the Python library Scikit-learn. We also use Seaborn for visualization.

You can install these libraries using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1: Generate Synthetic Data

We start with the generation of synthetic data. For this purpose, we use the make_blobs function from the scikit-learn library. The function generates random clusters in two dimensions spherically arranged around a center. In addition, the data contains the respective cluster to which the data points belong. We use a scatterplot to visualize the data.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split as train_test_split
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
features, labels = make_blobs(
    n_samples=400,
    centers=3,
    cluster_std=2.75,
    random_state=42
)

# Visualize the data in scatterplots
def scatter_plots(df, palette):
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    

    ax1 = plt.subplot(1,2,1)
    sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'labels', palette=palette)
    plt.title('Clusters')
    
    ax2 = plt.subplot(1,2,2)
    sns.scatterplot(ax = ax2, data=df, x='x', y='y')
    plt.title('How the model sees the data during training')

palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
df = pd.DataFrame(features, columns=['x', 'y'])
df['labels'] = labels
scatter_plots(df, palette)

Step #2: Preprocessing

There are some general things to keep in mind when preparing data for use with K-Means clustering:

Missing data and outliers: if we have missing entries in our data, we need to handle these, for example, by removing the records or filling the missing values with a mean or median. In addition, K-means is sensitive to outliers. Therefore, make sure that you eliminate outliers from the training data.
Normalization: K-Means can only deal with integer values. So, either we map the categorical variables to integer values or use one-hot-encoding to create separate binary variables.
Dimensionality reduction: In general, having too many variables in a dataset can negatively affect the performance of clustering algorithms. A good practice is to keep the number of variables below 30, for example, by using techniques for dimensionality reduction such as Principal-Component-Analysis.
Scaling: Important to note is also that K-means require scaling the data.

Our synthetic dataset is free of outliers or missing values. Therefore, we only need to scale the data. In addition, we separate the class labels of the clusters from the training set and split the data into a train- and a test dataset.

# Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

X = scaled_features #Training data
y = labels #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

Step #3: Train a k-Means Clustering Model

Once we have prepared the data, we can begin the cluster analysis by training a K-means model. Our model uses the k-means algorithm from Python scikit-learn library. We have various options to configure the clustering process:

n_clusters: The number of clusters we expect in the data.
n_init: The number of iterations k-means will run with different initial centroids. The algorithm returns the best model.
max_iter: The max number of iterations for a single run

We expect three clusters and configure the algorithm to run ten iterations.

# Create a k-means model with n clusters
kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)

# fit the model
kmeans.fit(X_train)
print(f'Converged after {kmeans.n_iter_} iterations')

Converged after 4 iterations

Our model has already converged after four iterations. Next, we will look at the results.

Step #4: Make and Visualize Predictions

Next, we’ll be delving into the practical aspect of our Python tutorial by analyzing the performance of our trained K-Means model using synthetic data.

In the following Python code, we:

Extract the cluster centers from the trained K-Means model and unscale them.
Add the predicted and actual labels to our unscaled training data.
Define a function, scatter_plots(), that creates two scatter plots: one for the predicted labels and another for the actual labels.
Call this function, passing the training data, cluster centers, and a color palette as arguments.

Please note:

We use a dictionary as a color palette to differentiate between clusters.
The colors between the two plots may not match due to K-Means assigning numbers to clusters without prior knowledge of initial labels.

# Get the cluster centers from the trained K-means model
cluster_center = scaler.inverse_transform(kmeans.cluster_centers_)
df_cluster_centers = pd.DataFrame(cluster_center, columns=['x', 'y'])

# Unscale the predictions
X_train_unscaled = scaler.inverse_transform(X_train)
df_train = pd.DataFrame(X_train_unscaled, columns=['x', 'y'])
df_train['pred_label'] = kmeans.labels_
df_train['true_label'] = y_train

def scatter_plots(df, cc, palette):
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    
    # Print the predictions    
    ax2 = plt.subplot(1,2,1)
    sns.scatterplot(ax = ax2, data=df, x='x', y='y', hue='pred_label', palette=palette)
    sns.scatterplot(ax = ax2, data=cc, x='x', y='y', color='r', marker="X")
    plt.title('Predicted Labels')

    # Print the actual values
    ax1 = plt.subplot(1,2,2)
    sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'true_label', palette=palette)
    sns.scatterplot(ax = ax1, data=cc, x='x', y='y', color='r', marker="X")
    plt.title('Actual Labels')


# The colors between the two plots may not match.
# This is because K-means does not know the initial labels and assigns numbers to clusters 
palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
scatter_plots(df_train, df_cluster_centers, palette)

The scatterplot above shows that K-Means found the three clusters. As a side note, the colors between the two plots do not match because K-means does not know the initial labels and assigns numbers to clusters.

Step #5: Measuring Model Performance

Next, we measure the performance of our clustering model. K-means is an unsupervised machine learning algorithm, which means that it is used to cluster data points into groups based on their similarity, without the use of labeled training data. However, we can compare the cluster assignments to the labels attached to the data and see if the model can predict them correctly. In this way, we can use traditional classification metrics such as accuracy and f1_score to measure the performance of our clustering model.

First, we unify the cluster labels as A, B, and C.

# The predictive model has the labels 0 and 1 reversed. We will correct that first. 
#df_train['pred_test'] = df_train['pred_labels'].map({2:2, 1:3, 0:1})
df_eval = df_train.copy()
df_eval['true_label'] = df_eval['true_label'].map({0:'A', 1:'B', 2:'C'})
df_eval['pred_label'] = df_eval['pred_label'].map({0:'B', 1:'A', 2:'C'})
df_eval.head(10)

	x			y			pred_label	true_label
0	-9.007547	-10.302910	C			C
1	1.009238	7.009681	A			B
2	-6.565501	-6.466780	C			C
3	2.389772	7.727235	B			B
4	-5.422666	-2.915796	C			C
5	-12.024305	-7.846772	C			C
6	-4.006250	9.319323	A			A
7	-6.297788	6.435267	A			A
8	2.169238	3.325947	B			B
9	-5.140506	-4.205585	C			C

It is a common practice to create scatterplots on the predictions to visually verify the quality of the clusters and their decision boundaries. The following scatter plot shows the correctly assigned values and where our model was wrong.

# Scatterplot on correctly and falsely classified values
df_eval.loc[df_eval['pred_label'] == df_eval['true_label'], 'Correctly classified?'] = 'True' 
df_eval.loc[df_eval['pred_label'] != df_eval['true_label'], 'Correctly classified?'] = 'False' 

plt.rcParams["figure.figsize"] = (10,8)
sns.scatterplot(data=df_eval, x='x', y='y', color='r', hue='Correctly classified?')

The K-Means model correctly assigned most data points (blue) to their actual cluster. The few misclassified points are located at a decision boundary between two clusters (marked in orange).

# Create a confusion matrix
def evaluate_results(model, y_true, y_pred, class_names):
    tick_marks = [0.5, 1.5, 2.5]
    
    # Print the Confusion Matrix
    fig, ax = plt.subplots(figsize=(10, 6))
    results_log = classification_report(y_true, y_pred, target_names=class_names, output_dict=True)
    results_df_log = pd.DataFrame(results_log).transpose()
    matrix = confusion_matrix(y_true,  y_pred)
    model_score = score(y_pred, y_true, average='macro')
    
    sns.heatmap(pd.DataFrame(matrix), annot=True, fmt="d", linewidths=.5, cmap="YlGnBu")
    plt.xlabel('Predicted label'); plt.ylabel('Actual label')
    plt.title('Confusion Matrix on the Test Data')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
    print(results_df_log)


y_true = df_eval['true_label']
y_pred = df_eval['pred_label']
class_names = ['A', 'B', 'C']
evaluate_results(kmeans, y_true, y_pred, class_names)

As we can see, the model has done a good job of grouping the labels into the attached classes.

Summary

algorithm, capable of parsing intricate datasets into discrete, non-overlapping clusters. With this guide, you should have a clear understanding of how this algorithm operates, as well as its unique strengths and potential limitations. Remember, k-Means excels when dealing with spherical clusters but may struggle with clusters of more complex shapes.

We walked through how to implement the k-Means algorithm in Python, using it to identify spherical clusters within synthetic data. Moreover, we explored different methods to evaluate and visualize a clustering model’s performance, providing you with valuable tools to effectively analyze and group your data.

Whether it’s anomaly detection in time-series data or customer segmentation, k-Means can be a powerful tool in your data science arsenal. If you’re interested in anomaly detection, feel free to explore my recent article on the subject, written with Python enthusiasts in mind. Armed with this knowledge, you’re ready to embark on your own data exploration journey using k-Means clustering.

If you are interested in this topic, check out my recent article on anomaly detection with Python. And if you have any questions, please ask them in the comments.

Did you know that customer segmentation is an area where real-world data can be prone to bias and unfairness? If you’re concerned that your models may reflect the same bias, check out our latest article on addressing fairness in machine learning with fairlearn.

Sources and Further Reading

Books on Clustering

“Data Clustering: Algorithms and Applications” by Charu C. Aggarwal: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.
“Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.

Books on Machine Learning

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Measuring Regression Errors with Python

Florian Follonier — Mon, 04 May 2020 09:34:09 +0000

Evaluating performance is a crucial step in developing regression models. Because regression models return continuous outputs, such models allow for different gradations of right or wrong. Therefore, we measure the deviation between predictions and actual values in numerical terms. However, a universal metric to measure the performance of regression models does not exist. Instead, there are several metrics, each with its advantages and disadvantages. None of these metrics is sufficient alone, and it is often necessary to use them in combination. This article presents six regression error metrics and explains how to implement them in Python with Scikit-learn.

The rest of this article proceeds in two parts. The first part is conceptual and introduces six error metrics for measuring regression performance. We look at formulas and discuss their pros and cons. The discussion is summarized in a cheat sheet. The second part is a hands-on Python tutorial in which we generate synthetic time series data and use them for training a prediction model. Then we implement the six regression error metrics and evaluate the performance of our model.

Note that this article deals with regression errors. If you are looking for an overview of classification error metrics, check out this tutorial on classification error metrics.

goal arrow shot archery machine learning error metrics

" data-image-caption="

goal arrow shot archery machine learning error metrics

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/goal-arrow-shot-archery-machine-learning-error-metrics.png" src="https://www.relataly.com/wp-content/uploads/2023/02/goal-arrow-shot-archery-machine-learning-error-metrics.png" alt="goal arrow shot archery machine learning error metrics" class="wp-image-12494" srcset="https://www.relataly.com/wp-content/uploads/2023/02/goal-arrow-shot-archery-machine-learning-error-metrics.png 504w, https://www.relataly.com/wp-content/uploads/2023/02/goal-arrow-shot-archery-machine-learning-error-metrics.png 230w" sizes="(max-width: 504px) 100vw, 504px" />

goal arrow shot archery machine learning error metrics

Measuring Regression Errors

In general, we measure the performance of regression models by calculating the deviations between the predictions (y_pred) and the actual values (y_test). If the prediction value is below the actual value, the prediction error is positive. If the prediction lies above the real value, the prediction error is negative. However, in a sample of prediction values, the errors can vary greatly depending on the data point. Therefore, it is not enough to look at individual error values. Error metrics can inform us about the statistical distribution of errors in a prediction sample and, in this way, help us to measure the performance of regression models objectively.

Various metrics exist to measure regression errors. Each error metric can only cover a part of the overall picture. For instance, imagine you have developed a model to predict the consumption of a power plant. The model predictions are generally accurate, but the projections are wrong in a few cases. In other words, outliers among the prediction errors make it difficult to conclude the model performance. It is insufficient to calculate the average prediction error to understand this situation. Instead, a more robust measuring approach would combine different error metrics to conclude the probability that prediction errors lie within a specific range.

Predictions vs. Actual Values in Time Series Forecasting

Six Error Metrics for Measuring Regression Errors

The following six metrics help measure prediction errors. We can apply them to various regression problems, including time series forecasting.

Mean Absolute Error (MAE)
Mean Absolute Percentage Error (MAPE)
Median Absolute Error (MedAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Median Absolute Percent Error (MdAPE)

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is a metric commonly used to measure the arithmetic average of deviations between predictions and actual values.

An MAE of “5” tells us that, on average, our predictions deviate from the actual values by 5. Whether this error is considered small or large will depend on the application case and the scale of the predictions. For instance, 5 nanometers in the case of a building might be small, but if it’s five nanometers in the case of a biological membrane, it might be significant. So when working with the MAE, mind the scale.

It is scale-dependent
The MAE considers the absolute values to take both positive and negative deviations from the actual.
The MAE is sensitive to outliers, as large values can substantially impact. For this reason, we should use the MAE in combination with additional metrics.
The MAE shares the same unit with the predictions.

\[MAE=\ \frac{\sum_{i=1}^{n}{|y_i-x_i|}}{n} \]

\[x_i\ = actual value \\ y_i = predictions \\ n = sample size \]

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error calculates the mean percentage deviation between predictions and actual values.

The mean absolute percentage error is scale-independent, making it easier to interpret.
We must not use the MAPE whenever a single value is zero
The MAPE puts a heavier penalty on negative errors

\[\mathrm{MAPE=\ }\frac{\mathrm{1}}{\mathrm{n}}\sum_{\mathrm{t=1}}^{\mathrm{n}}\left|\frac{\mathrm{A}_\mathrm{t}\mathrm{\ -\ } \mathrm{F}_\mathrm{t}}{\mathrm{A}_\mathrm{t}}\right|\mathrm{\ \ast100} \]

Mean Squared Error (MSE)

We can calculate the MSE by measuring the average squares of the differences between the estimated and actual values.

Since all values are squared, the MSE is very sensitive to outliers.
An MSE much larger than the MAE indicates strong outliers among the prediction errors. The formula of the MSE is:

\[MSE=\frac{\sum_{i=1}^{n}{|y_i-x_i|}^2}{n} \]

Median Absolute Error (MedAE)

The Median Absolute Error (MedAE) calculates the median deviation between predictions and actual values.

The MedAE has the same unit as the predictions.
A MedAE of value 10 means that 50% of the errors are greater than 10 and 50% are below this value.
The MedAE is resistant to outliers. Therefore, we often use it in combination with the MAE. A substantial deviation between MAE and MedAE is an indication that there are outliers among the errors. In other words, the prediction model deviates more from the actual value than on average.

\[MedAE=\frac{\sum_{i=1}^{n}{|y_i-{\hat{x}}_i|}}{n}\]

Root Mean Squared Error (RMSE)

The root-mean-squared error is another standard way to measure the performance of a forecasting model.

Has the same unit as the predictions
A good measure of how accurately the model predicts the response
Robust to outliers

\[RMSE=\frac{1}{n}\sqrt{\sum_{i=1}^{n}{|y_i-x_i|}^2}\]

Median Absolute Percentage Error (MdAPE)

The median absolute percentage error (MdAPE) measures the accuracy of a prediction model. It is similar to the median absolute percentage error but, as the name implies, calculates the median error for a set of forecasts. As a result, the MdAPE is more resilient to outliers than the MAPE. However, it is also less intuitive. A MdAPE of 5% means that half of the absolute percentage errors are less than 5%, and half are over 5%.

Scale-dependent
Not to be used whenever a single value is zero.
More robust to distortion from outliers than the MAPE

\[MdAPE= \ median(\left|\frac{A_t\ -\ F_t}{A_t}\right|)\ast100\]

Regression Error Cheat Sheet

The cheat sheet provides an overview of the six regression error metrics. It contains the mathematical formula for each regression metric, a short code sequence to implement in Python, and some hints for their interpretation.

Cheat-Sheet.pdf Download

Implementing Regression Error Metrics in Python: Time Series Prediction Example

Now that we have familiarized ourselves with standard regression error metrics, it’s time to see them in action. In the following, we will develop and test a regression model in Python. We begin by generating some synthetic time series data. Subsequently, we use the data to train a simple regression model based on a Keras neural network. The model will try to continue the time series and predicts a continious value for the next time step. We will use this model to predict a test dataset and measure the prediction performance using the error metrics.

The code of this Python example is available on the GitHub repository.

View on GitHub Relataly Github Repo

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment yet, you can follow the steps in this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using Keras (2.0 or higher) with Tensorflow backend and the machine learning library scikit-learn.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Generate Synthetic Time Series Data

We begin by generating synthetic time series data. The script below creates the time series by multiplying different sine curves.

# A tutorial for this file is available at www.relataly.com

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl 
from tensorflow.keras.models import Sequential
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import LSTM, Dense
import seaborn as sns
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# Creating the sample sinus curve dataset
steps = 1000; gradient = 0.002
list_a = []
for i in range(0, steps, 1):
    y = 100 * round(math.sin(math.pi * i * 0.02 + 0.01), 4) * round(math.sin(math.pi * i * 0.005 + 0.01), 4) * round(math.sin(math.pi * i * 0.005 + 0.01), 4)
    list_a.append(y)
df = pd.DataFrame({"valid": list_a}, columns=["valid"])

# Visualizing the data
fig, ax1 = plt.subplots(figsize=(16, 4))
sns.lineplot(data=df)
ax1.xaxis.set_major_locator(plt.MaxNLocator(30))
plt.title("sine curve data")

Step #2 Data Preparation

Now that we have the synthetic data available, we can prepare it as inputs for training our regression model. Running the following code will scale and split the data and bring it into a shape that we can use as input batches to a neural network.

# Feature Selection - Only Close Data
train_df = df.copy()
data_unscaled = df.values

# Transform features by scaling each feature to a range between 0 and 1
mmscaler = MinMaxScaler(feature_range=(0, 1))
np_data = mmscaler.fit_transform(data_unscaled)

# Set the sequence length - this is the timeframe used to make a single prediction
sequence_length = 15

# Prediction Index
index_Close = 0

# Split the training data into train and train data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_length = math.ceil(np_data.shape[0] * 0.8)

# Create the training and test data
train_data = np_data[0:train_data_length, :]
test_data = np_data[train_data_length - sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, sequence_length time steps per sample, and 6 features
def partition_dataset(sequence_length, train_df):
    x, y = [], []
    data_len = train_df.shape[0]
    for i in range(sequence_length, data_len):
        x.append(train_df[i-sequence_length:i,:]) #contains sequence_length values 0-sequence_length * columsn
        y.append(train_df[i, index_Close]) #contains the prediction values for validation (3rd column = Close),  for single-step prediction
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(sequence_length, train_data)
x_test, y_test = partition_dataset(sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
print(x_test[1][sequence_length-1][index_Close])
print(y_test[0])

Out: x_tain.shape: (584, 15, 1) -- y_tain.shape: (584,)

Step #3 Training a Time Series Regression Model

Once we have prepared the data, we can train the regression model. Our model uses a simple neural network architecture. Running the code below defines the model architecture and compiles the model.

# Settings
batch_size = 5
epochs = 4
n_features = 1

# Configure and compile the neural network model
# The number of input neurons is defined by the sequence length multiplied by the number of features
lstm_neuron_number = sequence_length * n_features

# Create the model
model = Sequential()
model.add(LSTM(lstm_neuron_number, return_sequences=False, input_shape=(x_train.shape[1], 1))
)
model.add(Dense(1))
model.compile(optimizer="adam", loss="mean_squared_error")

# Train the model
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

Epoch 1/4 
584/584 [==============================] - 1s 2ms/step - loss: 0.1047 Epoch 2/4 
584/584 [==============================] - 1s 1ms/step - loss: 0.0153 Epoch 3/4 
584/584 [==============================] - 1s 1ms/step - loss: 0.0102 Epoch 4/4 
584/584 [==============================] - 1s 1ms/step - loss: 0.0064

Now that the model architecture is defined, you can run the code below to initiate the training process.

# Settings
batch_size = 5

# Train the model
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

Step #4 Making Test Predictions

Let’s see how good or bad our model performs. We will make predictions on our test dataset by running the code below. We store the results in a new DataFrame called “predictions.”

# Get the predicted values
y_pred_scaled = model.predict(x_test)
y_pred = mmscaler.inverse_transform(y_pred_scaled)
y_test_unscaled = mmscaler.inverse_transform(y_test.reshape(-1, 1))

Next, we want to get an idea of how our model performs. We, therefore, create a line plot that shows the predictions and the actual values of the time series. We colorize the differences between predictions and actual values to highlight the prediction errors.

# Create the line plot
test_df = pd.DataFrame({'y_test': y_test_unscaled.flatten(), 'y_pred': y_pred.flatten()})
fig, ax1 = plt.subplots(figsize=(16, 8), sharex=True)
sns.lineplot(data=test_df)
ax1.tick_params(axis="x", rotation=0, labelsize=10, length=0)
plt.title("y_pred vs y_test Truth")
plt.legend(["y_pred", "y_test"], loc="upper left")

# Fill between plotlines
mpl.rc('hatch', color='k', linewidth=2)
ax1.fill_between(test_df.index, test_df["y_test"], test_df["y_pred"],  facecolor = 'white', alpha=.9) 
plt.show()

The plot shows that the prediction errors vary and are sometimes positive and sometimes negative.

Step #5 Calculating the Regression Error Metrics: Implementation and Evaluation

Now that we have predicted the test set, we calculate the six regression error metrics. In most cases, you won’t have to use all six regression error metrics to understand how well a model performs. In most cases, it is sufficient to use a combination of two or three of them.

# Mean Absolute Error (MAE)
MAE = np.mean(abs(y_pred - y_test_unscaled))
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Median Absolute Error (MedAE)
MEDAE = np.median(abs(y_pred - y_test_unscaled))
print('Median Absolute Error (MedAE): ' + str(np.round(MEDAE, 2)))

# Mean Squared Error (MSE)
MSE = np.square(np.subtract(y_pred, y_test_unscaled)).mean()
print('Mean Squared Error (MSE): ' + str(np.round(MSE, 2)))

# Root Mean Squarred Error (RMSE) 
RMSE = np.sqrt(np.mean(np.square(y_pred - y_test_unscaled)))
print('Root Mean Squared Error (RMSE): ' + str(np.round(RMSE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE, 2)) + ' %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print('Median Absolute Percentage Error (MDAPE): ' + str(np.round(MDAPE, 2)) + ' %')

Mean Absolute Error (MAE): 6.95 
Median Absolute Error (MedAE): 5.05  
Mean Squared Error (MSE): 78.7 
Root Mean Squared Error (RMSE): 8.87  
Mean Absolute Percentage Error (MAPE): 10339.13 % 
Median Absolute Percentage Error (MDAPE): 26.8 %

Step #6 Interpreting the Regression Error Metrics

Let’s look at the regression error metrics, starting with the MAE and the MedAE. The MAE is 6.95, and the MedAE is 5.05. These values are close, indicating that our prediction errors are equally distributed but might include some outliers.

To better understand possible outliers, we look at the MSE. With a value of 78.7, the MAE is a little bit higher than the square of the MAE. The RMSE is slightly higher than the MAE, which is another indication that the prediction errors lie in a narrow range.

How much deviate the predictions of our model from the actual values in percentage terms? The MAPE is typically used as a starting point to answer this question. With 10339.13 percent, it is exceptionally high. So is our model very much mistaken? The answer is no – the MAPE is misleading. The problem is that several actual values are close to zero, e.g., 0.00001. While the predictions of our model are close to the actual values in absolute numbers, the MAPE divides the residual values by the actual values, e.g., 0.000001, and sums them up. Thus the MAPE becomes very large.

The Median of the MDAPE is 26.8%. So, 50% of our forecasting errors are higher than 26.8%, and 50% are lower. Consequently, we can assume that when our model makes a prediction, the probability that the deviation is 26.8% from the actual value is 50% – that is not as terrible as the MAPE would indicate. The plotlines of the predictions and actual values support these findings.

Summary

This article has presented six error metrics for evaluating regression errors. Remember that none of these metrics alone is sufficient to evaluate a model’s performance. Instead, we should use a combination of multiple metrics. We have discussed the advantages and disadvantages of the metrics. In the second part of this tutorial, we implemented a time series regression example. After training an exemplary regression model, we used the six regression metrics to evaluate the model performance.

It’s important to remember that different error metrics are suitable for different types of regression problems. For example, mean absolute error (MAE) is a good choice when you want to know how close the predictions are to the true values, but it is not very sensitive to large errors. Root mean squared error (RMSE) is a more sensitive metric that punishes large errors more heavily, but it can be difficult to interpret because it is in the same units as the original data.

I hope this article was helpful. If you have any remarks or questions remaining, write them in the comments.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Measuring Regression Errors with Python appeared first on relataly.com.

Rolling Time Series Forecasting: Creating a Multi-Step Prediction for a Rising Sine Curve using Neural Networks in Python

Florian Follonier — Sun, 19 Apr 2020 09:59:49 +0000

Many time forecasting problems can be solved by predicting just one step into the future. However, some problems require a forecast for an extended period of time, which calls for a multi-step time series forecasting approach. This approach involves modeling the distribution of future values of a signal over a prediction horizon. In this article, we use the rising sine curve as an example to demonstrate how to apply a multi-step prediction approach using Keras neural networks with LSTM layers in Python. We create a rolling forecast for the sine curve by generating several single-step predictions and iteratively using them as input to predict further steps in the future.

The remainder of this article proceeds as follows: We begin by looking at the sine curve problem and provide a quick intro to recurrent neural networks. After this conceptual introduction, we turn to the hands-on part in Python. We generate synthetic sine curve data and preprocess the data to train univariate models using a Keras neural network. We thereby experiment with different architectures and hyperparameters. Then, we use these model variants to create rolling multi-step forecasts. Finally, we evaluate the model performance.

A multi-step time series forecast for a rising sine curve, as we will create in this article.

If you are just getting started with time-series forecasting, we have covered the single-step forecasting approach in two previous articles:

Predicting Stock Markets with Neural Networks – A Step-by-Step Guide
Stock Market Prediction – Adjusting Time Series Prediction Intervals in Python

The Problem of a Rising Sine Curve

The line plot below illustrates the sample of a rising sine curve. The goal is to use the ground truth values (blue) to forecast several points in this curve (purple).

Traditional mathematical methods can resolve this function by decomposing it into constituent parts. Thus, we could easily foresee the further course of the curve. But in practice, the function might not be precisely periodic and change over a more extended period. Therefore, recurrent neural networks can achieve better results than traditional mathematical approaches, especially when they train on extensive data.

Also: Stock Market Prediction using Univariate Recurrent Neural Networks (RNN) with Python

Application Domains with Similar Time Series Forecasting Problems

A rising sine wave may sound like an abstract problem at first. However, similar issues are widespread. Imagine you are the owner of an online shop. The users who visit your shop and buy something fluctuate depending on the time and the weekday. For instance, there are fewer visitors at night, and on weekends, the number of visitors rises sharply. At the same time, the overall number of users increases over a more extended period as the shop becomes known to a broader audience. To plan the number of goods to be held in stock, you need to see the number of orders at any point in time over several weeks. It is a typical multi-step time series problem, and similar problems exist in various domains:

Healthcare: e.g., forecasting of health signals such as heart, blood, or breathing signals
Network Security: e.g., analysis of network traffic in intrusion detection systems
Sales and Marketing: e.g., forecasting of market demand
Production demand forecasting: e.g., for power consumption and capacity planning
Prediction and filtering of sensor signals, e.g., of audio signals

A time series forecasting problem: predicting sine curve data

Recurrent Neural Networks for Time Series Forecasting

The model used in this article is a recurrent neural network with Long short-term memory (LSTM) layers. Unlike feedforward neural networks, recurrent networks with LSTM layers have loops to pass output values from one training instance to the next. LSTM layers are a powerful and widely-used tool for deep learning, and they work particularly well for time series data. By using LSTM layers, it is possible to train machine learning models that can make accurate predictions based on time series data, which can be useful for a wide range of applications, including finance, weather forecasting, and many others.

Also: Mastering Multivariate Stock Market Prediction with Python: A Guide to Effective Feature Engineering Techniques

The Training Process of a Recurrent Neural Network

When training neural networks, the process typically involves multiple epochs, where an epoch refers to a single pass through the entire training dataset. During each epoch, the neural network receives the entire training data and adjusts its weights accordingly through forward and backward propagation. The batch size, which determines the number of examples passed through the network at once, also affects how the weights are updated between neurons.

It’s important to note that one epoch is usually not enough for the model to learn the underlying patterns and relationships in the data, leading to underfitting and poor performance in prediction tasks. Hence, multiple epochs are often needed to fine-tune the network and improve its predictive capabilities. However, care must be taken not to overfit the model by choosing too many epochs. Overfitting occurs when the model becomes too complex and performs well on the training data but poorly on any other data, resulting in poor generalization.

To avoid overfitting, we can employ various techniques such as early stopping, where the training process is stopped when the validation loss begins to increase, indicating that the model is overfitting. Another method is to use regularization techniques such as L1 or L2 regularization, dropout, or data augmentation to prevent overfitting and improve the model’s generalization capabilities.

Functioning of an LSTM layer

An LSTM (Long Short-Term Memory) layer is a specialized type of recurrent neural network (RNN) layer that is specifically designed for processing and making predictions based on time series data. Unlike traditional RNNs, which suffer from the vanishing gradient problem, LSTM layers can maintain a memory of past events and selectively use this memory to make predictions about future events.

LSTM layers accomplish this by incorporating several unique mechanisms, such as gates and cell states. The gates control the flow of information into and out of the cell, while the cell state represents the memory of the network. These mechanisms work together to enable LSTM layers to learn and make predictions based on long-term dependencies in the data, which is a characteristic of many time series datasets.

One of the key advantages of using LSTM layers for time series forecasting is their ability to generate predictions for multiple timesteps. This is achieved by iteratively reusing the model outputs from the previous training run as input for the next prediction step. This approach, known as a rolling forecast, allows the model to make accurate predictions for multiple time steps into the future.

Functioning of an LSTM layer

LSTM Components

To understand how an LSTM layer works, it is helpful to think about the layer as consisting of several different components, each of which has a specific role in the overall operation of the layer. These components include the following:

Input gate: The input gate controls which information from the input data will be passed on to the cell state. The input gate uses a sigmoid activation function to determine which information should be retained and which should be discarded.
Forget gate: The forget gate controls which information from the cell state will be discarded. The forget gate uses a sigmoid activation function to determine which information should be forgotten and which should be retained.
Cell state: The cell state is a vector that contains the information that is retained by the LSTM layer. This information is updated at each time step based on the input from the input gate and the forget gate.
Output gate: The output gate controls which information from the cell state will be passed on to the output of the LSTM layer. The output gate uses a sigmoid activation function to determine which information should be retained and which should be discarded.

This article uses LSTM layers combined with a rolling forecast approach to predict the course of a sinus curve with a linear slope. The result is a multi-step time series forecast.

Creating a Rolling Multi-Step Time Series Forecast in Python

In this tutorial, we will explore the process of creating a rolling multi-step forecast using Python. Our dataset for this exercise will be a synthetically generated rising sine curve, which we will use to demonstrate the principles of multi-step time series forecasting.

By following the steps outlined in this tutorial, you will gain a comprehensive understanding of the process involved in multi-step time series forecasting. This will include techniques for data preprocessing, feature engineering, and model selection.

To get started, we have made the code available on our GitHub repository. You can easily access and follow along with the tutorial by downloading the code and running it on your local machine. This will enable you to experiment with different parameters and modify the code to suit your specific requirements.

View on GitHub Relataly GitHub Repo

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment yet, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using Keras (2.0 or higher) with Tensorflow backend and the machine learning library Scikit-learn.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Generating Synthetic Data

We begin by loading the required packages and creating a synthetic dataset. The synthetic data contains 300 values of the sinus function combined with a slight linear upward slope of 0.02. The code below creates the data and visualizes it in a line plot.

# A tutorial for this file is available at www.relataly.com

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed, Dropout, Activation
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_absolute_error
import tensorflow as tf
import seaborn as sns
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# check the tensorflow version and the number of available GPUs
print('Tensorflow Version: ' + tf.__version__)
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))

# Creating the synthetic sinus curve dataset
steps = 300
gradient = 0.02
list_a = []
for i in range(0, steps, 1):
    y = round(gradient * i + math.sin(math.pi * 0.125 * i), 5)
    list_a.append(y)
df = pd.DataFrame({"sine_curve": list_a}, columns=["sine_curve"])

# Visualizing the data
fig, ax1 = plt.subplots(figsize=(16, 4))
ax1.xaxis.set_major_locator(plt.MaxNLocator(30))
plt.title("Sinus Data")
sns.lineplot(data=df[["sine_curve"]], color="#039dfc")
plt.legend()
plt.show()

The signal curve oscillates and is steadily moving upward.

Step #2 Preprocessing

Next, we preprocess the data to bring it into the shape required by our neural network model. Running the code below will perform the following tasks:

Scale the data to a standard range.
Partition the data into multiple periods with a predefined sequence length. Each partition series contains 110 data points. The number of steps needs to match the number of neurons in the first layer of the neural network architecture.
Split the data into two separate datasets for training and testing.

Also: Mastering Multivariate Stock Market Prediction with Python: A Guide to Effective Feature Engineering Techniques

data = df.copy()

# Get the number of rows in the data
nrows = df.shape[0]

# Convert the data to numpy values
np_data_unscaled = np.array(df)
np_data_unscaled = np.reshape(np_data_unscaled, (nrows, -1))
print(np_data_unscaled.shape)

# Transform the data by scaling each feature to a range between 0 and 1
scaler = RobustScaler()
np_data_scaled = scaler.fit_transform(np_data_unscaled)

# Set the sequence length - this is the timeframe used to make a single prediction
sequence_length = 110

# Prediction Index
index_Close = 0

# Split the training data into train and train data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_len = math.ceil(np_data_scaled.shape[0] * 0.8)

# Create the training and test data
train_data = np_data_scaled[0:train_data_len, :]
test_data = np_data_scaled[train_data_len - sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, sequence_length time steps per sample, and 6 features
def partition_dataset(sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]
    for i in range(sequence_length, data_len):
        x.append(data[i-sequence_length:i,:]) #contains sequence_length values 0-sequence_length * columsn
        y.append(data[i, index_Close]) #contains the prediction values for validation (3rd column = Close),  for single-step prediction
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(sequence_length, train_data)
x_test, y_test = partition_dataset(sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
print(x_test[1][sequence_length-1][index_Close])
print(y_test[0])

(130, 110, 1) (130,)
(60, 110, 1) (60,)
0.599655121905751
0.599655121905751

Step #3 Preparing Data for the Multi-Step Time Series Forecasting Model

Now that we have prepared the synthetic data, we need to determine the architecture of our neural network. There are no clear guidelines available on how to configure a neural network. Most problems are unique, and the model settings depend on the nature and extent of the data. Thus, configuring LSTM layers can be a real challenge.

3.1 Overview of Model Parameters

Finding the optimal configuration is often a process of trial and error. Below you find a list of model parameters with which you can experiment:

Keras model parameters used in this tutorial
epoch: An epoch is an iteration over the entire x_train and y_train data provided.
Batch_size: Number of samples per gradient update. If unspecified, batch_size is 32.
Activation: Mathematical equations determine whether a neuron in the network should activate (“fired”) or not.
Input_shape: Tensor with shape: (batch_size, …, input_dim).
Input_len: the length of the generated input sequence.
Return_sequences: If set to false, the layer will return the final output in the output sequence. If true, the layer returns the entire series.
Loss: The loss function tells the model how to evaluate its performance so that the weights can be updated to reduce the loss on the following evaluation.
Optimizer: Every time a neural network finishes processing a batch through the network and generates prediction results, the optimizer decides how to adjust weights between the nodes.

Source: keras.io/layers/recurrent – view the Keras documentation for a complete list of parameters.

3.2 Choosing Model Parameters

Finding a suitable architecture for a neural network is not easy and requires a systematic approach. A common practice is to start with an established standard architecture. We can then change the model parameters slightly and record the configurations and outcomes. We can also automate configuring and testing the model (Hyperparameter Tuning). Still, because this is not the focus of this article, we will use a manual approach.

We start with five epochs, a batch_size of 1, and configure our recurrent model with one LSTM layer with 110 neurons, corresponding to a data slice of 110 input values. In addition, we add a dense layer that provides us with a single output value.

# Configure the neural network model
epochs = 12; batch_size = 1;

# Model with n_neurons = inputshape Timestamps, each with x_train.shape[2] variables
n_neurons = x_train.shape[1] * x_train.shape[2]
model = Sequential()
model.add(LSTM(n_neurons, return_sequences=True, input_shape=(x_train.shape[1], 1)))
model.add(LSTM(n_neurons, return_sequences=False))
model.add(Dense(5))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mean_squared_error")

Step #3 Training the Prediction Model

Now that we have defined the architecture of our recurrent neural network, the next step is to train the model. Training the model involves using a training dataset to adjust the weights and biases of the network so that it can make accurate predictions on new data. This process is done using an optimization algorithm, such as gradient descent, to minimize the difference between the predicted values and the true values.

Training can be time-consuming, especially for large datasets or complex models, and the amount of time it takes may vary depending on the performance of your computer. The complexity of our model is relatively low, and we only use five training epochs. It should thus only take a couple of minutes to train the model.

It is important to monitor the training process to ensure that the model is learning effectively and to identify any potential issues or problems. Once the training is complete, we can test the model on a separate test dataset to evaluate its performance.

# Train the model
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

Epoch 1/12
130/130 [==============================] - 7s 28ms/step - loss: 0.0544
Epoch 2/12
130/130 [==============================] - 4s 27ms/step - loss: 0.0041
Epoch 3/12
130/130 [==============================] - 4s 28ms/step - loss: 3.5105e-04
Epoch 4/12
130/130 [==============================] - 4s 28ms/step - loss: 4.3903e-05
Epoch 5/12
130/130 [==============================] - 4s 28ms/step - loss: 5.9422e-05
Epoch 6/12
130/130 [==============================] - 4s 28ms/step - loss: 5.7988e-05
Epoch 7/12
130/130 [==============================] - 4s 28ms/step - loss: 8.3814e-04
Epoch 8/12
130/130 [==============================] - 4s 28ms/step - loss: 1.9122e-04
Epoch 9/12
130/130 [==============================] - 4s 28ms/step - loss: 0.0014
Epoch 10/12
130/130 [==============================] - 4s 29ms/step - loss: 0.0114
Epoch 11/12
130/130 [==============================] - 4s 28ms/step - loss: 0.0013
Epoch 12/12
130/130 [==============================] - 4s 28ms/step - loss: 3.4571e-05

Step #4 Predicting a Single-step Ahead

We continue by making single-step predictions based on the training data. Then we calculate the mean squared error and the median error to measure the performance of our model.

# Reshape the data, so that we get an array with multiple test datasets
x_test_np = np.array(x_test)
x_test_reshape = np.reshape(x_test_np, (x_test_np.shape[0], x_test_np.shape[1], 1))

# Get the predicted values
y_pred = model.predict(x_test_reshape)
y_pred_unscaled = scaler.inverse_transform(y_pred)
y_test_unscaled = scaler.inverse_transform(y_test.reshape(-1, 1))

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred_unscaled)
print(f'Median Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred_unscaled)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred_unscaled)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')

Out: me: 0.0223, rmse: 0.0206

Both the mean error and the squared mean error are pretty small. As a result, the values predicted by our model are close to the actual values of the ground truth. Even though it is unlikely because of the small number of epochs, it could still be that the model is over-fitting.

Step #5 Visualizing Predictions and Loss

Next, we look at the quality of the training predictions by plotting the training predictions and the actual values (i.e., ground truth).

# Visualize the data
#train = df[:train_data_len]
df_valid_pred = df[train_data_len:]
df_valid_pred.insert(1, "y_pred", y_pred_unscaled, True)

# Create the lineplot
fig, ax1 = plt.subplots(figsize=(32, 5), sharex=True)
ax1.tick_params(axis="x", rotation=0, labelsize=10, length=0)
plt.title("Predictions vs Ground Truth")
sns.lineplot(data=df_valid_pred)
plt.show()

The smaller the area between the two lines, the better our model predictions. So we can tell from the plot that the predictions are not entirely wrong. We also check the learning path of the regression model.

# Plot training & validation loss values
fig, ax = plt.subplots(figsize=(10, 5), sharex=True)
sns.lineplot(data=history.history["loss"])
plt.title("Model loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend(["Train"], loc="upper left")
plt.grid()
plt.show()

The loss drops quickly, and after five epochs, the model seems to have converged.

Step #6 Rolling Forecasting: Creating a Multi-step Time Series Forecast

Next, we will generate the rolling multi-step forecast. This approach is different from a single-step approach in that we predict several points of a signal within a prediction window and not just a single value. However, the prediction quality decreases over more extended periods because reusing predictions creates a feedback loop that amplifies potential errors over time.

Also: Measuring Regression Errors with Python

The forecasting process begins with an initial prediction for a single time step. After that, we add the predicted value to the input values for another projection, and so on. In this way, we create the rolling forecast with multiple time steps.

# Settings and Model Labels
rolling_forecast_range = 30
titletext = "Forecast Chart Model A"
ms = [
    ["epochs", epochs],
    ["batch_size", batch_size],
    ["lstm_neuron_number", n_neurons],
    ["rolling_forecast_range", rolling_forecast_range],
    ["layers", "LSTM, DENSE(1)"],
]
settings_text = ""
lms = len(ms)
for i in range(0, lms):
    settings_text += ms[i][0] + ": " + str(ms[i][1])
    
    if i < lms - 1:
        settings_text = settings_text + ",  "

# Making a Multi-Step Prediction
# Create the initial input data
new_df = df.filter(["sine_curve"])
for i in range(0, rolling_forecast_range):
    # Select the last sequence from the dataframe as input for the prediction model
    last_values = new_df[-n_neurons:].values
    
    # Scale the input data and bring it into shape
    last_values_scaled = scaler.transform(last_values)
    X_input = np.array(last_values_scaled).reshape([1, 110, 1])
    X_test = np.reshape(X_input, (X_input.shape[0], X_input.shape[1], 1))
    
    # Predict and unscale the predictions
    pred_value = model.predict(X_input)
    pred_value_unscaled = scaler.inverse_transform(pred_value)
    
    # Add the prediction to the next input dataframe
    new_df = pd.concat([new_df, pd.DataFrame({"sine_curve": pred_value_unscaled[0, 0]}, index=new_df.iloc[[-1]].index.values + 1)])
    new_df_length = new_df.size
forecast = new_df[new_df_length - rolling_forecast_range : new_df_length].rename(
    columns={"sine_curve": "Forecast"}
)

We can plot the forecast together with the ground truth.

#Visualize the results
validxs = df_valid_pred.copy()
dflen = new_df.size - 1
dfs = pd.concat([validxs, forecast], sort=False)
dfs.at[dflen, "y_pred_train"] = dfs.at[dflen, "y_pred"]
dfs

# Zoom in to a closer timeframe
df_zoom = dfs[dfs.index > 200]

# Visualize the data
fig, ax = plt.subplots(figsize=(16, 5), sharex=True)
ax.tick_params(axis="x", rotation=0, labelsize=10, length=0)
ax.xaxis.set_major_locator(plt.MaxNLocator(rolling_forecast_range))
plt.title('Forecast And Predictions (y_pred)')
sns.lineplot(data=df_zoom[["sine_curve", "y_pred"]], ax=ax)
sns.scatterplot(data=df_zoom[["Forecast"]], x=df_zoom.index, y='Forecast', linewidth=1.0, ax=ax)
plt.legend(["x_train", "y_pred", "y_test_rolling"], loc="lower right")
ax.annotate('ModelSettings: ' + settings_text, xy=(0.06, .015),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='bottom', fontsize=10)
plt.show()

The model learned to predict the periodic movement of the curve and thus succeeds in modeling its further course. However, the amplitude of the predicted curve increases over time, which increases the prediction errors.

Step #7 Comparing Results for Different Parameters

There is plenty of room to improve the forecasting model further. So far, our model seems not to consider that the amplitude of the sinus curve is gradually increasing. Over time, errors are amplified, leading to a growing deviation from the ground truth signal.

We can improve the model by changing the model parameters and the model architecture. However, I would not recommend changing them one at a time.

I tested several model configurations with varying epochs and neuron numbers/sample sizes to further optimize the model. Parameters such as Batch_size (=1), the timeframe for the forecast (=30 timesteps), and the model architecture (=1 LSTM Layer, 1 DENSE Layer) were left unchanged. The illustrations below show the results:

Different designs of the rolling time series forecasting model

The model that looks most promising is model #6. This model considers both the periodic movements of the sinus curve and the steadily increasing slope. The configuration of this model uses 15 epochs and 115 neurons/sample values.

Errors are amplified over more extended periods when they enter a feedback loop. For this reason, we can see in the graph below that the prediction accuracy decreases over time.

Long-time multi-step forecast (model #6)

Summary

In this tutorial, we have created a rolling time-series forecast for a rising sine curve. A multi-step forecast helps better understand how a signal will develop over a more extended period. Finally, we have tested and compared different model variants and selected the best-performing model.

There are several ways to improve model accuracy further. For example, the current network architecture was kept simple and only had a single LSTM layer. Consequently, we could try to add additional layers. Another possibility is to experiment with different hyperparameters. For example, we could increase the training epochs and use dropout to prevent overfitting. In addition, we could experiment with other activation functions. Feel free to try it out.

I hope this article was helpful. If you have questions remaining, let me know in the comments.

Another forecasting approach is to train a neural network model that predicts multiple outputs per input batch. Another relataly article demonstrates how this multi-output regression approach works: Stock Market Prediction – Multi-output regression in Python.

Also, consider non-neural network approaches to time series forecasting, such as ARIMA, which can achieve great results too.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Rolling Time Series Forecasting: Creating a Multi-Step Prediction for a Rising Sine Curve using Neural Networks in Python appeared first on relataly.com.

Synthetic Data Archives - relataly.com

Cluster Analysis with k-Means in Python

Findings Clusters in Multivariate Data with k-Means

Understanding the Mechanics of K-Means Clustering Algorithm

How many Clusters?

Pros and Cons of k-Means Clustering

Applications of Clustering

Implementing a K-Means Clustering Model in Python

Prerequisites

Step #1: Generate Synthetic Data

Step #2: Preprocessing

Step #3: Train a k-Means Clustering Model

Step #4: Make and Visualize Predictions

Step #5: Measuring Model Performance

Summary

Sources and Further Reading

Books on Clustering

Books on Machine Learning

Related Articles on Clustering

Measuring Regression Errors with Python

Measuring Regression Errors

Six Error Metrics for Measuring Regression Errors

Mean Absolute Error (MAE)

Mean Absolute Percentage Error (MAPE)

Mean Squared Error (MSE)

Median Absolute Error (MedAE)

Root Mean Squared Error (RMSE)

Median Absolute Percentage Error (MdAPE)

Regression Error Cheat Sheet

Implementing Regression Error Metrics in Python: Time Series Prediction Example

Prerequisites

Step #1 Generate Synthetic Time Series Data

Step #2 Data Preparation

Step #3 Training a Time Series Regression Model

Step #4 Making Test Predictions

Step #5 Calculating the Regression Error Metrics: Implementation and Evaluation

Step #6 Interpreting the Regression Error Metrics

Summary

Sources and Further Reading

Rolling Time Series Forecasting: Creating a Multi-Step Prediction for a Rising Sine Curve using Neural Networks in Python

The Problem of a Rising Sine Curve

Application Domains with Similar Time Series Forecasting Problems

Recurrent Neural Networks for Time Series Forecasting

The Training Process of a Recurrent Neural Network

Functioning of an LSTM layer

LSTM Components

Creating a Rolling Multi-Step Time Series Forecast in Python

Prerequisites

Step #1 Generating Synthetic Data

Step #2 Preprocessing

Step #3 Preparing Data for the Multi-Step Time Series Forecasting Model

3.1 Overview of Model Parameters

3.2 Choosing Model Parameters

Step #3 Training the Prediction Model

Step #4 Predicting a Single-step Ahead

Step #5 Visualizing Predictions and Loss

Step #6 Rolling Forecasting: Creating a Multi-step Time Series Forecast

Step #7 Comparing Results for Different Parameters

Summary

Sources and Further Reading