Time Series Forecasting Archives - relataly.com

Univariate Stock Market Forecasting using Facebook Prophet in Python

Florian Follonier — Thu, 15 Dec 2022 22:54:34 +0000

Have you ever wondered how Facebook predicts the future? Meet Facebook Prophet, the open-source time series forecasting tool developed by Facebook’s Core Data Science team. Built on top of the PyStan library, Facebook Prophet offers a simple and intuitive interface for creating forecasts using historical data. What sets Facebook Prophet apart is its highly modular design, allowing for a range of customizable components that can be combined to create a wide variety of forecasting models. This makes it perfect for modeling data with strong seasonal effects, like daily or weekly patterns, and it can handle missing data and outliers with ease. In this tutorial, we will take a closer look at the capabilities of Facebook Prophet and see how it can be used to make accurate predictions.

We begin with a brief discussion of how the Facebook Prophet decomposes a time series into different components. Then we turn to the hands-on part. you can use its model in Python to generate a stock market forecast. We will train our Facebook Prophet model using the historical price of the Coca-Cola stock. We will also cover different options to customize the model settings.

Disclaimer: This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only illustrate machine learning use cases.

Facebook Prophet – an open-source tool for time series forecasting

What is Facebook Prophet?

Facebook Prophet is a tool that can be used to make predictions about future events based on historical data. It was developed by Taylor and Letham, 2017, who later made it available as an open-source project. The authors developed Facebook Prophet to solve various business forecasting problems without requiring much prior knowledge. In this way, the framework addresses a significant problem many companies face today. They have various prediction problems (e.g., capacity and demand forecasting) but face a skill gap when it comes to generating reliable forecasts with techniques such as ARIMA or neural networks. Compared to that, Facebook Prophet requires minimal fine-tuning and can deal with various challenges, including seasonality, outliers, and changing trend lines. This allows Facebook Prophet to handle a wide range of forecasting problems flexibly. Before we dive into the hands-on part, let’s gain a quick overview of how Facebook Prophet works.

Also: Stock Market Prediction using Multivariate Time Series

time series forecasting with facebook prophet python tutorial

" data-image-caption="

time series forecasting with facebook prophet python tutorial

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png" src="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball-1024x1024.png" alt="time series forecasting with facebook prophet python tutorial" class="wp-image-12356" srcset="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 1024w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 140w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" />

Time-series forecasting with Facebook Prophet. Image generated with Midjourney.

How Facebook Prophet Works

Facebook Prophet uses a technique called additive regression to model time series data. This involves breaking the time series into a series of components:

Trends
Seasonality
Holiday

Traditional time series methods such as (S)ARIMA base their prediction on a model that weights the linear sum of past observations or lags. Facebook’s Prophet is similar in that it uses a decreasing weight for past observations. This means current observations have a higher significance for the model than those that date back a long time. It then models each component separately using a combination of linear and non-linear functions. Finally, Facebook Prophet combines these components to form the complete forecast model. Let’s take a closer look at these components and how Facebook Prophet handles them.

A) Dealing with Trends

Time series often have a trendline. However, even more often, a time series will not follow a single trend, but it has several trend components that are separated by breakpoints. Facebook Prophet tries to handle these trends in several ways. First, the model tries to identify the breakpoints (knots) in a time series that divide different periods. Each breakpoint separates two periods with different trendlines. Facebook Prophet then uses these inflection points between periods to fit the model to the data and create the forecast. In addition, trendlines do not have to be linear but can also be logarithmic. This is all done automatically, but it is also possible to specify breakpoints manually.

B) Seasonality

Facebook Prophet works very well when the data shows a strong seasonal pattern. It uses Fourier transformations (adding different sine and cosine frequencies) to account for daily, weekly and yearly seasonality. The Facebook Prophet model is flexible on the type of data you have by allowing you to adjust the seasonal components of your data. By default, Facebook Prophet assumes daily data with weekly and yearly seasonal effects. If your data differentiates from this standard, for example, you have weekly data with monthly seasonality, then you need to adjust the number of terms accordingly.

C) Holiday

Every year, public holidays can lead to strong deviations in a time series; for example, thinking of computing power, demand more people will visit the Facebook website. The Facebook Prophet model also accounts for such special events by allowing us to specify binary indicators that mark whether a certain day is a public holiday. If you have other non-holiday events that occur yearly, you can use this indicator for the same purpose. Usually, Facebook Prophet will automatically remove outliers from the data. But if an outlier occurs on a day highlighted as a public holiday, Facebook Prophet will adjust its model accordingly.

Hyperparameter Tuning and Customization

Facebook Prophet includes additional optimization techniques, such as Bayesian optimization, to automatically tune the model’s hyperparameters, such as the length of the seasonal period, to improve its accuracy. Once the model is trained, it can be used to predict future values in the time series. However, users with a strong domain knowledge may prefer to tweak these parameters themselves, and Facebook Prophet provides several functions for this purpose. It also includes a range of tools for model evaluation and diagnostics, as well as for visualizing the model and the input data.

Also: Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python

Application Domains

Facebook Prophet is a powerful forecasting tool that has been specifically designed to make forecasting easy. As mentioned, Prophet is easy to use and can flexibly handle various forecasting problems. In addition, it requires very little preprocessing to generate accurate forecasts. As a result of these advantages, Facebook Prophet has been adopted by various application domains. Some possible application domains for Facebook Prophet include:

Sales forecasting: Facebook Prophet can be used to predict future sales of a product or service, based on historical sales data. This can be useful for businesses to plan their inventory and staffing, and to make informed decisions about future investments and growth.
Financial forecasting: Facebook Prophet can be used to predict future stock prices, currency exchange rates, or other financial metrics. This can be useful for investors and financial analysts to make informed decisions about the market.
Traffic forecasting: Facebook Prophet can be used to predict future traffic on a website or mobile app based on historical data. This can be useful for businesses to plan for capacity and optimize their servers and infrastructure.
Energy consumption forecasting: Facebook Prophet can be used to predict future energy consumption based on historical data. This can be useful for utilities and energy companies to plan for demand and optimize their generation and distribution.

When to Use Facebook Prophet?

Although Facebook Prophet is applicable in any domain where time series data is available, it is most effective when certain conditions are met. These include univariate time series data with prominent seasonal effects and an extensive historical record spanning multiple seasons. Facebook Prophet is especially beneficial when dealing with large quantities of historical data that require efficient analysis and quick, accurate predictions of future trends.

Also: Rolling Time Series Forecasting: Creating a Multi-Step Prediction

Using Facebook Prophet to Forecast the Coca-Cola Stock Price in Python

In this hands-on tutorial, we’ll use Facebook Prophet and Python to create a forecast for the Coca-Cola stock price. We have chosen Coca-Cola as an example because the Coca-Cola share is known to be a cyclical stock. As such, its chart reflects a seasonal pattern, different periods, and varying trend lines. We train our model on historical price data and then predict the next data points half-year in advance. In addition, we will discuss how we could finetune our model to improve the accuracy of the predictions further. This involves the following steps:

Collect historical stock data for CocaCola and familiarize ourselves with the data.
Use Facebook Prophet to fit a model to the data.
Use the model to make predictions about the future stock price of Coca-Cola.
Visualize model components and predictions.
Manually adjust the model to improve the model fit.

By following these steps, we will try to gain insights into the future performance of Coca-Cola stock. Let’s get started!

As always, you can find the code of this tutorial on the GitHub repository.

View on GitHub Relataly GitHub Repo

Prerequisites

Before you proceed, ensure that you have set up your Python environment (3.8 or higher) and the required packages. If you don’t have an environment, consider following this tutorial to set up the Anaconda environment.

Also, make sure you install all required Python packages. We will be working with the following standard Python packages:

pandas
seaborn
matplotlib

In addition, we will use the Facebook Prophet library that goes by the library name “prophet.” You can install these packages using the following commands:

pip install 
conda install  (if you are using the anaconda packet manager)

Step #1 Loading Packages and API Key

Let’s begin by loading the required Python packages and historical price quotes for the Coca-Cola stock. We will obtain the data from the yahoo finance API. Note that the API will return several columns of data, including, opening, average, and closing prices. We will only use the closing price.

# Tested with Python 3.8.8, Matplotlib 3.5, Seaborn 0.11.1, numpy 1.19.5, plotly 4.1.1, cufflinks 0.17.3, prophet 1.1.1, CmdStan 2.31.0
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
from math import log, exp 
from datetime import date, timedelta, datetime
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from scipy.stats import norm
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot
import cmdstanpy
cmdstanpy.install_cmdstan()
cmdstanpy.install_cmdstan(compiler=True)
# Setting the timeframe for the data extraction
end_date =  date.today().strftime("%Y-%m-%d")
start_date = '2010-01-01'
# Getting quotes
stockname = 'Coca Cola'
symbol = 'KO'
# You can either use webreader or yfinance to load the data from yahoo finance
# import pandas_datareader as webreader
# df = webreader.DataReader(symbol, start=start_date, end=end_date, data_source="yahoo")
import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=start_date, end=end_date)
# Quick overview of dataset
print(df.head())

[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date                                                                       
2010-01-04  28.580000  28.610001  28.450001  28.520000  19.081614  13870400
2010-01-05  28.424999  28.495001  28.070000  28.174999  18.850786  23172400
2010-01-06  28.174999  28.219999  27.990000  28.165001  18.844103  19264600
2010-01-07  28.165001  28.184999  27.875000  28.094999  18.797268  13234600
2010-01-08  27.730000  27.820000  27.375000  27.575001  18.449350  28712400

Once we have downloaded the data, we create a line plot of the closing price to familiarize ourselves with the time series data. Note that Facebook Prophet works on a single input signal only (univariate data). This input will be the closing price. For illustration purposes, we add a moving average to the chart. However, the moving average makes it easier to spot trends and seasonal patterns, it will not be used to fit the model.

# Visualize the original time series
rolling_window=25
y_a_add_ma = df['Close'].rolling(window=rolling_window).mean() 
fig, ax = plt.subplots(figsize=(20,5))
sns.lineplot(data=df, x=df.index, y='Close', color='skyblue', linewidth=0.5, label='Close')
sns.lineplot(data=df, x=df.index, y=y_a_add_ma, 
    linewidth=1.0, color='royalblue', linestyle='--', label=f'{rolling_window}-Day MA')

The chart shows a long-term upward trend interrupted by phases of downturns. In addition, between 2010 and 2018, we can see some cyclical movements. At some points, we can spot clear breakpoints, for example, in 2019 and mid-2020.

Step #2 Preparing the Data

Next, we prepare our data for model training. Propjet has a strict condition on how the input columns must be named. In order to use Facebook Prophet, your data needs to be in a time series format with the time as the index and the value as the first column. In addition, column names need to adhere to the following naming convention:

ds for the timestamp
y for the metric columns, which in our case is the closing price

So before we proceed, we must rename the columns in our dataframe. In addition, we will remove the index and drop NA values.

df_x = df[['Close']].copy()
df_x['ds'] = df.index.copy()
df_x.rename(columns={'Close': 'y'}, inplace=True)
df_x.reset_index(inplace=True, drop=True)
df_x.dropna(inplace=True)
df_x.tail(9)

		y			ds
3257	63.139999	2022-12-09
3258	63.970001	2022-12-12
3259	63.990002	2022-12-13

Now we have a simple dataframe with ds and y as the only variables.

Step #3 Model Fitting and Forecasting

Next, let’s fit our forecasting model to the time series data. Afterward, we can make predictions about future values in the series. However, before we do this, we need to define our prediction interval.

3.1 Setting the Prediction Interval

The prediction interval is a measure of uncertainty in a forecast made with Facebook Prophet. It indicates the range within which the true value of the forecasted quantity is expected to fall a certain percentage of the time. For example, a 95% prediction interval means that the true value of the forecasted quantity is expected to fall within the given range 95% of the time.

In Facebook Prophet, the prediction interval is controlled by the interval_width parameter, which can be set when calling the predict method. The default value for interval_width is 0.80. This means that the true value of the forecasted quantity is expected to fall within the prediction interval 80% of the time. We can adjust the value of interval_width to change the width of the prediction interval as desired. In the example below, we use a prediction interval of 0.85.

3.2 Fit the Model

Next, let’s fit our model and generate a one-year forecast. First, we need to instantiate our model with by calling Prophet(). Then we use model.fit(df) to fit this model to the historical price quotes of the Coca-Cola stock. Once, we have done that, we use the model instance model.make_future_dataframe() to create an extended dataframe (future_df). This dataframe has been extended with records for a one-year period. The records are empty dummy values ready to be filled with the real forecast. We then pass this dummy dataframe to the model.predict(df) function, Facebook Prophet creates the forecast and fills up the dummy dataframe with the forecast values.

For the sake of reusability, I have encapsulated the entire process into a wrapper function. This will allow us to run quick experiments with different parameter values.

# This function fits the prophet model to the input data and generates a forecast
def fit_and_forecast(df, periods, interval_width, changepoint_range=0.8):
    # set the uncertainty interval
    Prophet(interval_width=interval_width)
    # Instantiate the model
    model = Prophet(changepoint_range=changepoint_range)
    # Fit the model
    model.fit(df)
    # Create a dataframe with a given number of dates
    future_df = model.make_future_dataframe(periods=periods)
    # Generate a forecast for the given dates
    forecast_df = model.predict(future_df)
    #print(forecast_df.head())
    return forecast_df, model, future_df
# Forecast for 365 days with full data
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 0.95)
print(forecast_df.columns)
forecast_df[['yhat_lower', 'yhat_upper', 'yhat']].head(5)

Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'weekly', 'weekly_lower', 'weekly_upper', 'yearly', 'yearly_lower',
       'yearly_upper', 'multiplicative_terms', 'multiplicative_terms_lower',
       'multiplicative_terms_upper', 'yhat'],
      dtype='object')
	yhat_lower	yhat_upper	yhat
0	24.468273	28.944286	26.691615
1	24.496074	29.146425	26.706924
2	24.513424	28.829159	26.682213
3	24.358048	28.767209	26.667476
4	24.487963	28.839966	26.666242

Voila, we have generated a one-year forecast.

Step #4 Analyzing the Forecast

Next, let’s visualize our forecast and discuss what we see. The most simple way is to create the plot with a standard Facebook Prophet function.

Also: Measuring Regression Errors with Python

model.plot(forecast_df, uncertainty=True)

So what do we see? The forecast shows that our model does not simply predict a straight line and instead has generated a more sophisticated forecast that displays an upward cyclical trend with higher highs and higher lows.

The black dots are the data points from the historical data to which we have fit our model.
The dark blue line is the most likely path.
The light blue lines are the upper and lower boundaries of the prediction interval. We have set the prediction interval to 0.85, which means there is a probability of 85% the actual values will fall into this range.
In total, the model seems confident that the price of Coca-Cola stock will rise within the next year (no financial advice). However, as we will see later, the forecast depends on where the model sees the breakpoints.

In case, you want to create a custom plot, you can use the function below.

# Visualize the Forecast
def visualize_the_forecast(df_f, df_o):
    rolling_window = 20
    yhat_mean = df_f['yhat'].rolling(window=rolling_window).mean() 
    # Thin out the ground truth data for illustration purposes
    df_lim = df_o
    # Print the Forecast
    fig, ax = plt.subplots(figsize=[20,7])
    sns.lineplot(data=df_f, x=df_f.ds, y=yhat_mean, ax=ax, label='predicted path', color='blue')
    sns.lineplot(data=df_lim, x=df_lim.ds, y='y', ax=ax, label='ground_truth', color='orange')
    #sns.lineplot(data=df_f, x=df_f.ds, y='yhat_lower', ax=ax, label='yhat_lower', color='skyblue', linewidth=1.0)
    #sns.lineplot(data=df_f, x=df_f.ds, y='yhat_upper', ax=ax, label='yhat_upper', color='coral', linewidth=1.0)
    plt.fill_between(df_f.ds, df_f.yhat_lower, df_f.yhat_upper, color='lightgreen')
    plt.legend(framealpha=0)
    ax.set(ylabel=stockname + " stock price")
    ax.set(xlabel=None)
visualize_the_forecast(forecast_df, df_x)

Step #5 Analyzing Model Components

We can gain a better understanding of different model components by using the plot_components function. This method creates a plot showing the trend, weekly and yearly seasonality, and any additional user-defined seasonalities of the forecast. This can be useful for understanding the underlying patterns in the data and for diagnosing potential issues with the model.

model.plot_components(forecast_df)

The first chart shows the trendlines that the model sees within different periods. The trendlines are separated by breakpoints about, which we will talk in the next section. When we look at the second plot, we can see no price changes during the weekend. This is plausible, considering that the stock markets are closed over the weekend. The third chart is most interesting, as it shows that the model has recognized some yearly seasonality with two peaks in April and August, as well as lows in March and October.

Step #6 Adjusting the Changepoints of our Facebook Prophet Model

Let’s take a closer look at the changepoints in our model. Changepoints are the points in time where the trend of the time series is expected to change, and Facebook Prophet’s algorithm automatically detects these points and adapts the model accordingly. Changepoints are important to Facebook Prophet because they allow the model to capture gradual changes or shifts in the data. By identifying and incorporating changepoints into the forecasting model, Facebook Prophet can make more accurate predictions. Changepoints can also help to identify potential outliers in the data.

6.1 Checking Current Changepoints

We can illustrate the changepoints in our model with the add_changepoints_to_plot method. The method adds vertical lines to a plot to indicate the locations of the changepoints in the data. By plotting the changepoints on a graph, we can visually identify when these changes in trend occur and potentially diagnose any issues with our model.

# Printing the ChangePoints of our Model
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 1.0)
axislist = add_changepoints_to_plot(model.plot(forecast_df).gca(), model, forecast_df)

The chart above shows that our model has identified several changepoints in the historical data. However, it has only searched for changepoints within 80% of the time series. As a result, the algorithm hasn’t identified any change points in the most recent years after 2020. We can adjust the changepoints with the changepoint_range (default = 80%) variable. This is what we will do in the next section.

6.2 Adjusting Changepoints

We can adjust the range within which Facebook Prophet looks for changepoints with the “changepoint_range”. It is specified as a fraction of the total duration of the time series. For example, if changepoint_range is set to 0.8 and the time series spans 10 years, the algorithm will look for changepoints within the last 8 years of the series.

By default, changepoint_range is set to 0.8, which means that the algorithm will look for changepoints within the last 80% of the time series. We can adjust this value depending on the characteristics of our data and our desired level of flexibility in the model.

Increasing the value of changepoint_range will allow the algorithm to identify more changepoints and potentially improve the fit of the model, but it may also increase the risk of overfitting. Conversely, decreasing the value of changepoint_range will reduce the number of changepoints detected and may improve the model’s ability to generalize to new data, but it may also reduce the accuracy of the forecast.

Let’s fit our model again, but this time we let Facebook Prophet search for changepoints within the entire time series (changepoint_range=1.0).

# Adjusting ChangePoints of our Model
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 1.0, 1.0)
axislist = add_changepoints_to_plot(model.plot(forecast_df).gca(), model, forecast_df)

The plot above shows that Facebook Prophet has now identified several additional breakpoints in the time series. As a result, the forecast has become rather pessimistic, as Facebook Prophet gave more weight to recent changes.

Finally, it is worth mentioning that it is possible to add changepoints for specific dates manually. You can try this out using “model.changepoints(series)”. The function takes a series of timestamps as the parameter value.

Summary

Get ready to dive into the world of stock market prediction with Facebook Prophet! In this article, we’ll show you how to leverage the power of this amazing tool to forecast time series data, using Coca-Cola’s stock as an example. We’ll guide you through the process of fitting a curve to univariate time series data and fine-tuning the initial breakpoints and trendlines to enhance model performance. With Facebook Prophet’s automatic trend identification algorithm, you’ll be able to easily adapt to changes in the data over time.

Also: Mastering Multivariate Stock Market Prediction with Python

As a data scientist, you’ll appreciate how easy it is to use Facebook Prophet and how it consistently outperforms other models. With its straightforward interface and impressive accuracy, this tool is a must-have for your forecasting toolkit. And we’re always looking for feedback from our audience, so let us know what you think! We’re committed to improving our content to provide the best learning experience possible.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Other Methods for Time Series Forecasting

The post Univariate Stock Market Forecasting using Facebook Prophet in Python appeared first on relataly.com.

Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python

Florian Follonier — Mon, 02 May 2022 18:34:02 +0000

Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos.

By analyzing historical price data, affinity propagation groups coins into clusters based on their past price fluctuations. Such a cluster analysis enables crypto investors to identify promising entry and exit points, ultimately helping them make smarter investment decisions.

To use this technique effectively, it’s important to understand essential concepts such as covariance, lasso regression, and affinity propagation. Once you understand these concepts, you can apply them to analyze price time series data and identify hidden patterns.

Finally, visualizing the results in two and three dimensions can better understand the relationships between coins and their respective clusters. The resulting crypto market map can be a powerful tool for investors to gain insight into the market’s structure and make informed investment decisions.

Disclaimer

This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.

What is Stock Market Clustering?

Clustering stock markets refers to grouping stocks based on their similarities or common characteristics. This can be done using various clustering algorithms, which analyze the data and assign each stock market to a cluster based on its similarity to other stock markets in the same cluster. In this article, we will run a cluster analysis on historical time series data. This approach involves grouping stocks into clusters based on their historical performance over a certain period of time.

Clustering stock market data can be useful for a variety of purposes, such as identifying patterns or trends in the data, comparing the performance of different stocks or sectors, or generating investment recommendations. However, it’s important to keep in mind that clustering is just one tool among many for analyzing stock market data, and it’s important to consider a range of factors when making investment decisions. It can also be used to compare the performance of different stock markets and identify potential risks or correlations between them.

Also: Color-Coded Cryptocurrency Price Charts in Python

neural network machine learning python affinity propagation midjourney relataly crypto-min

" data-image-caption="

neural network machine learning python affinity propagation midjourney relataly crypto-min

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" alt="neural network machine learning python midjourney relataly crypto market map" class="wp-image-12694" srcset="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 509w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 140w" sizes="(max-width: 509px) 100vw, 509px" />

We can use a crypto market map to illustrate the price correlation between cryptocurrencies.

What’s the Problem with Prototype-based Clustering?

Clustering is an unsupervised learning technique that groups similar objects into clusters and separates them from different ones. One of the most popular clustering techniques is k-means. K-means belongs to the so-called prototype-based clustering techniques, which divide data points into a predefined number of groups (in the case of k-means, the groups are of equal variance).

The prototype-based clustering approach works great if the number of clusters in a dataset is known and the clusters have similar despair. However, when we deal with real-world problems, we often encounter more complex data for which the optimal number of clusters is unknown and difficult or even impossible to guess. In such a case, affinity propagation has a significant advantage because it can automatically estimate the number of clusters.

Affinity Propagation: What it is and How it Works

The idea of affinity propagation is to identify clusters by measuring the similarity of data points relative to one another. The algorithm chooses data points as cluster centers that best represent other data points near them.

We can imagine the process of identifying these representative data points as an election. Each data point (i) is a voter who casts votes and a candidate (k) who can receive votes from other voters. Votes are a measure of the similarity of data points. A voter who gives many votes to a candidate expresses that this data point is similar to him and therefore is suitable for representing him as a cluster center. The voting process continues until the algorithm reaches a consensus and selects a set number of cluster candidates.

Affinity Propagation: Data points cast votes for candidates and receive votes from other data points

The clustering process involves many separate steps (This article provides a detailed description of the steps involved) and works with several matrices:

The similarity matrix assesses the suitability of data points (candidates) to act as cluster centers.
The availability matrix (or responsibility matrix) collects the support of the data points for the candidates (potential cluster centers) and their suitability to represent them.
The criterion matrix sums up the results and defines the clusters. Data points with equal scores in the criterion matrix are considered part of the same cluster.

Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster

Time Series Clustering using Affinity Propagation – Visualizing Cryptocurrency Market Structures in Python

Ready to implement affinity propagation in Python to analyze the crypto market structure and create a visual representation of price similarity? Let’s dive in!

First, we define a portfolio of cryptocurrencies and download their historical price quotes from coinmarketcap. We then visualize the time series on separate line charts to ensure that the data has been loaded successfully. After preparing and cleaning the data, we can move on to clustering the cryptocurrencies into groups with similar price movements using Affinity Propagation.

Unlike other clustering algorithms, we don’t set the number of clusters in advance. Instead, we let affinity propagation determine the optimal number of clusters for our portfolio. Finally, we calculate the covariance matrix between clusters and arrange the cryptocurrencies on a 2D map into clusters. We create a network overlay based on covariance to better understand the relationships between different clusters.

With affinity propagation, we can identify hidden patterns in the crypto market and group coins into clusters based on their past price fluctuations. This process allows us to identify promising entry and exit points, ultimately helping us make smarter investment decisions. Plus, the 2D map and network overlay help us visualize the relationships between different clusters and coins.

We can use affinity propagation to cluster financial assets and visualize them on a map.

The Python code for this tutorial is available in the relataly repository on GitHub.

View on GitHub Relataly GitHub Repo

Prerequisites

Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. Consider Anaconda if you don’t have a Python environment set up yet. To set it up, you can follow the steps in this tutorial. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

Please also make sure you have the Cmcscaper package installed. We will be using it to download past crypto prices from coinmarketcap.

You can install these packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1: Load the Stock Market Data

We start by loading historical crypto price data from Coinmarketcap. To download the data, we use Cmcscraper, a Python library that allows us to collect Coinmarketcap data without signing up for the official API.

The download returns a dataframe with daily price quotes (Close, Open, Avg) for cryptocurrencies between 2016 and today. You can use the dictionary (“symbol_dict”) to control which cryptos you want to include in the data. We limit the data we use in our cluster analysis to the last 50 days. In this way, we let the correlation consider earlier price developments. But it’s up to you to specify a different period. In addition, instead of using absolute price values, we will use daily percentage fluctuations.

Also: Requesting Crypto Price Data from the Gate.io REST API in Python

Loading the data can take several minutes, depending on how many cryptocurrencies we include in the request. So it makes sense not to load the data every time you run the code. Therefore, the code below stores the historical prices in a CSV file.

The script will check if the data already exists if you run the code below. If it does, it will use the data from the CSV file. Otherwise, it will load a fresh copy of the data from coinmarketcap.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5

from cryptocmd import CmcScraper
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn import cluster, covariance, manifold
import requests
import json


#get a dictionary of the top 100 coin symbols and names from an API
def get_symbol_dict():
    url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=50&sortBy=market_cap&sortType=desc&convert=USD&cryptoType=all&tagType=all&audited=false'
    response = requests.get(url)
    data = json.loads(response.text)
    df = pd.DataFrame(data['data']['cryptoCurrencyList'])

    # exclude stable coins
    df = df[~df['symbol'].isin(['USDT', 'USDC', 'BUSD', 'DAI', 'TUSD', 'PAX', 'GUSD', 'HUSD', 'USDK', 'USDS', 'USDP', 'USDN', 'USDSB', 'USDX', 'USD++', 'BIDR', 'IDRT', 'VAI', 'BGBP'])]
    df = df[['symbol', 'name']]
    df = df.set_index('symbol')
    df = df.to_dict()
    df = df['name']
    return df

symbol_dict = get_symbol_dict()


# Download historic crypto prices via CmcScraper
def load_fresh_data_and_save_to_disc(symbol_dict, save_path):
    # Extract symbols and names from the symbol_dict
    symbols, names = np.array(sorted(symbol_dict.items())).T
    
    # Initialize an empty DataFrame for storing the prices
    df_crypto = pd.DataFrame()

    # Download and process the price data for each symbol
    for symbol in symbols:
        print(f"Fetching prices for {symbol}...")
        
        # Download the price data using CmcScraper
        scraper = CmcScraper(symbol)
        df_coin_prices = scraper.get_dataframe()

        # Process the price data and add it to df_crypto
        df = pd.DataFrame({
            f"{symbol}_Open": df_coin_prices["Open"],
            f"{symbol}_Close": df_coin_prices["Close"],
            f"{symbol}_Avg": (df_coin_prices["Close"] + df_coin_prices["Open"]) / 2,
            f"{symbol}_p": (df_coin_prices["Open"] - df_coin_prices["Close"]) / df_coin_prices["Open"]
        })
        df_crypto = pd.concat([df_crypto, df], axis=1)

    # Save the price data to a CSV file
    X_df_filtered = df_crypto.filter(like="_p")
    X_df_filtered.to_csv(save_path + "historical_crypto_prices.csv")

    return names, symbols, X_df_filtered
        

# If set to False the data will only be downloaded when you execute the code
# Set to True, if you want a fresh copy of the data.  
fetch_new_data = True 
save_path = '' # path where the price data will be stored in a csv file

# Fetch fresh data via the scraping package, or use data from the csv file on disk
if fetch_new_data == False:
    try:
        print('loading from disk')
        X_df_filtered = pd.read_csv(save_path + 'historical_crypto_prices.csv')
        if 'Unnamed: 0' in X_df_filtered.columns: 
            X_df_filtered = X_df_filtered.drop(['Unnamed: 0'], axis=1)
            symbols, names = np.array(sorted(symbol_dict.items())).T
        print(list(X_df_filtered.columns))
    except:
        print('no existing price data found - loading fresh data from coinmarketcap and saving them to disk')
        names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
        print(list(symbols))
else:
       print('loading fresh data from coinmarketcap and saving them to disk')
       names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
       print(list(symbols))

# Limit the price data to the last t days
t= 14 # in days
X_df_filtered = X_df_filtered[:t]
X_df_filtered.head()

	ACM_p		ADA_p		ARK_p		ATM_p		ATOM_p		AVAX_p		BAT_p		BCH_p		BLZ_p		BNB_p		...	THETA_p		UNI_p		USDT_p		VET_p		WAVES_p		XLM_p		XMR_p		XRP_p		ZIL_p		ZRX_p
0	0.031987	-0.037645	-0.005702	0.030928	-0.005897	-0.012404	-0.012262	-0.022529	0.008072	-0.007111	...	-0.021994	-0.023758	-0.000103	-0.021024	-0.015416	-0.004096	-0.022988	-0.027397	-0.016659	-0.012255
1	0.028192	0.065034	0.122306	0.010310	0.093558	0.106811	0.082863	0.075567	0.062105	0.054733	...	0.067264	0.081040	0.000136	0.077203	0.092987	0.078562	0.111519	0.071696	0.076484	0.085094
2	0.040771	0.016097	-0.133345	0.018963	0.011304	-0.033328	-0.007616	0.011458	-0.019993	0.005134	...	-0.005104	-0.024190	0.000077	0.002218	0.008920	0.004139	-0.031822	-0.012107	-0.003906	-0.021170
3	-0.027698	0.005129	-0.031516	-0.002639	0.022235	-0.008117	0.003969	0.019119	0.015403	0.005920	...	0.007992	0.027203	0.000003	0.000701	0.010739	0.005324	-0.007914	0.007168	0.004556	-0.003786
4	-0.021129	-0.019053	0.003273	-0.008121	0.002883	-0.004927	0.002548	-0.000599	0.028492	-0.012181	...	0.000198	-0.025817	-0.000047	-0.002800	-0.051515	-0.004861	0.015134	-0.000596	-0.010343	0.004530

The data looks good, so let’s continue.

Step #2 Plotting Crypto Price Charts

Now that the data is available, we can visualize it in various line graphs. The visualization helps us better understand what kind of data we are dealing with and check if the download was successful.

# Create Prices Charts for all Cryptocurrencies
list_length = X_df_filtered.shape[1]
ncols = 10
nrows = int(round(list_length / ncols, 0))
height = list_length/3 if list_length > 30 else 4
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True, figsize=(20, height))
for i, ax in enumerate(fig.axes):
        if i < list_length:
            sns.lineplot(data=X_df_filtered, x=X_df_filtered.index, y=X_df_filtered.iloc[:, i], ax=ax)
            ax.set_title(X_df_filtered.columns[i])
plt.show()

We can see the lineplots for all cryptocurrencies and everything looks as expected.

Step #3 Clustering Cryptocurrencies using Affinity Propagation

Next, we must prepare the data and run the affinity propagation algorithm. For some cryptocurrencies, we may encounter data that contains NaN values. Because clustering is sensitive to missing values, we must ensure good data quality. In addition, the Python code below will convert the DataFrame into a NumPy array and transpose it into a form where we have crypto assets as records and the days as columns.

Running the code below returns a dictionary of clusters with the cryptocurrencies assigned to them by the affinity propagation algorithm.

# Drop NaN values
X_df = pd.DataFrame(np.array(X_df_filtered)).dropna()
# Transpose the data to structure prices along columns
X = X_df.copy()
X /= X.std(axis=0)
X = np.array(X)
# Define an edge model based on covariance
edge_model = covariance.GraphicalLassoCV()
# Standardize the time series
edge_model.fit(X)
# Group cryptos to clusters using affinity propagation
# The number of clusters will be determined by the algorithm
cluster_centers_indices , labels = cluster.affinity_propagation(edge_model.covariance_, random_state=1)
cluster_dict = {}
n_labels = labels.max()
print(f"{n_labels} Clusters")
for i in range(n_labels + 1):
    clusters = ', '.join(names[labels == i])
    print('Cluster %i: %s' % ((i + 1), clusters))
    cluster_dict[i] = (clusters)

9 Clusters
Cluster 1: Binance Coin, Cake Defi
Cluster 2: Bitcoin Cash, Bitcoin, BitTorrent, Decred, EOS, Ethereum Classic, Ethereum, Ampleforth, Komodo, Solana, Sys Coin, DOT
Cluster 3: Celsius
Cluster 4: Doge Coin
Cluster 5: Cardano, ATOM, Avalance, Enjin, Internet Computer, Link, Loopring, Polygon, IOTA, NEO, Synthetix, Theta, Vechain
Cluster 6: Litecoin
Cluster 7: ACM Token, Atletico Madrid Token, Chilliz, Juventus Turin Token, PSG Token
Cluster 8: LRC
Cluster 9: Tether
Cluster 10: ARK, Battoken, BLZ, Digibyte, AS Rom Token, WAVES, Stellar Lumen, Monero, Ripple, Zilliqa, Zer0

We can see that the algorithm has identified 13 different clusters in the data and a couple of clusters with only a single member. You will most likely encounter different results depending on when you run it.

Step #4 Create a 2D Positioning Model based on the Graph Structure

In addition to clusters, we want to show the covariance between cryptocurrencies in our Crypto Market map. We need a graph-like structure that contains the covariance and position data of the cryptocurrencies for each crypto pair.

In addition, we use a node position model that calculates their relative position on a 2D plane from the covariance of the cryptocurrencies. However, the positions are only relative, so the absolute axes have no meaning.

# Create a node_position_model that find the best position of the cryptos on a 2D plane
# The number of components defines the dimensions in which the nodes will be positioned
node_position_model = manifold.LocallyLinearEmbedding(n_components=2, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are x and y coordindates for all cryptocurrencies
pd.DataFrame(embedding)
# Create an edge_model that represents the partial correlations between the nodes
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
# Only consider partial correlations above a specific threshold (0.02)
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)
# Convert the Positioning Model into a DataFrame
data = pd.DataFrame.from_dict({"embedding_x":embedding[0],"embedding_y":embedding[1]})
# Add the labels to the 2D positioning model
data["labels"] = labels
print(data.shape)
data.head()

(48, 3)
	embedding_x	embedding_y	labels
0	0.400590	-0.136473	6
1	-0.081908	-0.086039	4
2	-0.033982	-0.038526	9
3	0.416745	0.076849	6
4	-0.041938	0.031966	4

The next step is to create a graph of the partial correlations.

Step #5 Visualize the Crypto Market Structure

Our goal is to visualize differences in the covariance between crypto pairs by varying the connection strengths. We calculate the line strength by normalizing the covariance of the crypto pairs. In addition, we visualize the distribution of the covariance.

# Create an array with the segments for connecting the data points
start_idx, end_idx = np.where(non_zero) 
segments = [[np.array([embedding[:, start], embedding[:, stop]]).T, start, stop] for start, stop in zip(start_idx, end_idx)]
# Create a normalized representation of partial correlation between crypto currencies
# We can later use covariance to vizualize the strength of the connections
pc = np.abs(partial_correlations[non_zero])
normalized = (pc-min(pc))/(max(pc)-min(pc))
# plot the distribution of covariance between the cryptocurrencies
sns.histplot(pc)

The hist plot shows that the covariance between the crypto pairs is mostly below 0.005.

Finally, it is time to map cryptocurrencies on a 2D plane. To do this, we first define the cryptocurrencies using their relative position data with a scatterplot. We set the color of the points based on their clusters so that points in the same cluster are colored the same. Subsequently, we connect the points to the data from the edge model. The covariance between the crypto pairs determines the strength of their connections.

We also define the color of the connections as follows.

The map only shows connections with a covariance greater than 0.002.
Connections with a covariance greater than 0.05 are colored red.
Otherwise, connections between points within a cluster are shown in the cluster’s color.
We color connections in grey that are between points of different clusters.

Last but not least, we add the labels of the cryptocurrencies.

# Visualization
plt.figure(1, facecolor='w', figsize=(20, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])

# Plot the nodes using the coordinates of our embedding
sc = sns.scatterplot(
    data=data,
    x="embedding_x",
    y="embedding_y",
    zorder=1,
    s=350 * d ** 2,
    c=labels,
    cmap=plt.cm.nipy_spectral,
    alpha=.9,
    #palette="muted",
)

# Plot the covariance edges between the nodes (scatter points)
line_strength = 3.2
    
for index, ((x, y), start, stop) in enumerate(segments):     
    norm_partial_correlation = normalized[index]
    if list(data.iloc[[start]]['labels'])[0] == list(data.iloc[[stop]]['labels'])[0]:
        if norm_partial_correlation > 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = plt.cm.nipy_spectral(list(data.iloc[[start]]['labels'])[0] / float(n_labels)); linestyle='solid'
    else:
        if norm_partial_correlation > 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = 'grey'; linestyle='dashed'
    # Plot the edges
    # if x and y larger than 0
    if x[0] > 0 and y[0] > 0:
        plt.plot(x, y, alpha=.4, zorder=0, linewidth=normalized[index]*line_strength, color=color, linestyle=linestyle)

    
# Labels the nodes and position the labels to avoid overlap with other labels
for name, label, (x, y) in zip(names, labels, embedding.T):
    color = plt.cm.nipy_spectral(label / float(n_labels))
    ax.annotate(
        name,
        xy=(x, y),
        xytext=(5, 2),
        textcoords='offset points',
        ha='right',
        va='bottom',
        fontsize=10,
        color='black',
        bbox=dict(facecolor='w', edgecolor="w", alpha=.0),
     )

Note that you will likely see a different map when you run the code on your machine. Differences result from changes in market prices and covariance that lead to other graph structures.

Let’s see what the crypto market map tells us.

Interpreting the Cryptomarket Map

The 2D crypto market map tells us several things:

Most cryptos fall into the light green and dark green clusters corresponding to different types of crypto (Decentralized Finance Coins, NFT/Metaverse Coins).
There is a significant covariance between large-cap players in the crypto space, such as Cardano and Loopring and Ethereum and Bitcoin, which is plausible considering recent price movements. Some results are surprising, for example, the partial correlation between NEO and Ethereum Classic.
Some clusters are isolated and contain only a single member, for example, Tether, Komodo, AC Milan token, Wave token, and Dogecoin). The reason is that the prices of these coins/tokens have developed independently of the market.
- Tether is a stablecoin that does not change in price. It, therefore, strongly differs from the other cryptocurrencies on our map.
- Komodo has been trading sideways without following the general market trend.
- And the MCM token is a soccer token that has recently outperformed the market.
Soccer tokens are colored in dark blue. These tokens’ prices correlate with how the soccer clubs performed during the current season. It, therefore, makes perfect sense that these tokens are grouped into a cluster. An exception is the AC Milan token, which recently performed better than the other soccer tokens.

Step #6 Creating a 3D Representation

Instead of a 2D representation of the data points, we can also use a 3D node positioning model. For this purpose, the node positioning model distributes the affinity values over three dimensions.

# Find the best position of the cryptos on a 3D plane
node_position_model = manifold.LocallyLinearEmbedding(n_components=3, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are x and y coordindates for all cryptocurrencies
pd.DataFrame(embedding)
# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)
data = pd.DataFrame.from_dict({"embedding_x":embedding[0],"embedding_y":embedding[1],"embedding_z":embedding[1]})
data["labels"] = labels
data["names"] = names
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(projection='3d')
xs = data["embedding_x"]
ys = data["embedding_y"]
zs = data["embedding_z"]
sc = ax.scatter(xs, ys, zs, c=labels, s=100)
    
for i in range(len(data)):
    x = xs[i]
    y = ys[i]
    z = zs[i]
    label = data["names"][i]
    ax.text(x, y, z, label)
    
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 1), loc=2)
plt.show()

Summary

Affinity propagation is a powerful technique for clustering items when the optimal number of clusters is unknown. In this article, we’ve demonstrated how to apply affinity propagation to analyze the cryptocurrency market and identify groups of assets based on similar price fluctuations.

In our example, we identified 13 groups of cryptocurrencies without specifying the number of clusters in advance. We also visualized the market structure on a 2D and 3D map using a node distribution technique. This approach can be extended to analyze and cluster stock markets, highlighting complex price patterns among multiple financial assets.

Once you’ve identified clusters, you can dive deeper into individual groups. Sometimes, outliers that temporarily break out of their usual pattern indicate interesting investment opportunities. These outliers can eventually return to the price pattern of their group, or they may represent forerunners of their group, indicating broader market movements.

By using affinity propagation, we can visualize financial assets in a new and exciting way. If you have any questions or comments about this approach, please let me know.

Sources and Further Reading

This article modifies some of the code from Scikit-learn and adapts it from the stock market to cryptocurrencies.

Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python
Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
David Forsyth (2019) Applied Machine Learning Springer
Andriy Burkov (2020) Machine Learning Engineering
Images are created using Midjourney, an AI that creates images from text.

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python appeared first on relataly.com.

Stock Market Forecasting Neural Networks for Multi-Output Regression in Python

Florian Follonier — Tue, 13 Jul 2021 21:10:23 +0000

Multi-output time series regression can forecast several steps of a time series at once. The number of neurons in the final output layer determines how many steps the model can predict. Models with one output return single-step forecasts. Models with various outputs can return entire series of time steps and thus deliver a more detailed projection of how a time series could develop in the future. This article is a hands-on Python tutorial that shows how to design a neural network architecture with multiple outputs. The goal is to create a multi-output model for stock-price forecasting using Python and Keras. By the end of this tutorial, you will have learned how to design a multi-output model for stock price forecasting using Python and Keras. This knowledge can be applied to other types of time series forecasting tasks, such as weather forecasting or sales forecasting.

This article proceeds as follows: We briefly discuss the architecture of a multi-output neural network. After familiarizing ourselves with the model architecture, we develop a Keras neural network for multi-output regression. For data preparation, we perform various steps, including cleaning, splitting, selecting, and scaling the data. Afterward, we define a model architecture with multiple LSTM layers and ten output neurons in the last layer. This architecture enables the model to generate projections for ten consecutive steps. After configuring the model architecture, we train the model with the historical daily prices of the Apple stock. Finally, we use this model to generate a ten-day forecast.

Disclaimer

Multi-Output Regression vs. Single-Output Regression

In time series regression, we train a statistical model on the past values of a time series to make statements about how the time series develops further. During model training, we feed the model with so-called mini-batches and the corresponding target values. The model then creates forecasts for all input batches and compares these predictions to the actual target values to calculate the residuals (prediction errors). In this way, the model can adjust its parameters iteratively and learn to make better predictions.

Multivariate forecasting models take into account multiple input variables, such as historical time series data and additional features like moving averages or momentum indicators, to improve the accuracy of their predictions. The idea is that these various variables can help the model identify patterns in the data that suggest future price movements.

An exemplary architecture of a neural network with five input neurons (blue) and four output neurons (red)

The Architecture of a Neural Network with Multiple Outputs

Next, we will discuss the architecture of a neural network with multiple outputs. The architecture consists of several layers, including an input layer, several hidden layers, and an output layer. The number of neurons in the first layer must match the input data, and the number of neurons in the output layer determines the period length of the predictions.

Models with a single neuron in the output layer are used to predict a single time step. It is possible to predict multiple price steps with a single-output model. It requires a rolling forecasting approach in which the outputs are iteratively reused to make further-reaching predictions. However, this way is somewhat cumbersome. A more elegant way is to train a multi-output model right away.

The inputs and outputs of a neural network for time series regression with five input neurons and four outputs

Training Neural Networks with Multiple Outputs

A model with multiple neurons in the output layer can predict numerous steps once per batch. Multi-output regression models train on many sequences of subsequent values, followed by the consecutive output sequence. The model architecture thus contains multiple neurons in the initial layer and various neurons in the output layer (as illustrated).

In a multi-output regression model, each neuron in the output layer is responsible for predicting a different time step in the future. To train such a model, you need to provide a sequence of input data followed by the corresponding sequence of output data. For example, if you want to predict the stock price for the next ten days, you would provide a sequence of input data containing the historical stock prices for the past 50 days, followed by a sequence of output data containing the stock prices for the next 10 days.

The model will then learn to map the input sequence to the output sequence so that it can make predictions for multiple time steps in the future based on the input data.

In the next part of this tutorial, we will walk through the process of developing a multi-output regression model in more detail.

Implementing a Neural Network Model for Multi-Output Multi-Step Regression in Python

Let’s get started with the hands-on Python part. In the following, we will develop a neural network with Keras and Tensorflow that forecasts the Apple stock price. To prepare the data for a neural network with multiple outputs in time series forecasting, we will spend the most time preparing it and bringing it into the right shape. Broadly this involves the following steps:

Load the time series data that we want to use as input and output for your model. We use historical price data that is available via the yahoo finance API.
Then we split our data into training and testing sets. We will use the training set to fit the model and the testing set to evaluate the model’s performance.
Preprocess the data: This includes scaling the data and selecting relevant features.
Reshape the data and bring them into a format that can be input into the neural network. This involves converting the data into a 3D array for time series data.
Finally, we will train our model and generate the forecasting.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Neural network architectures with multiple outputs allow for more potent solutions but are more complex to train. Image created with Midjourney.

Prerequisites

Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. If you don’t have a Python environment, consider Anaconda. To set it up, you can follow the steps in this tutorial.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using the machine learning libraries Keras, Scikit-learn, and Tensorflow. For visualization, we will be using the Seaborn package.

Please also have either the pandas_datareader or the yfinance package installed. You will use one of these packages to retrieve the historical stock quotes.

You can install these packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1: Load the Data

The Pandas DataReader library is our first choice for interacting with the yahoo finance API. If the library causes a problem (it sometimes does), you can also use the yfinance package, which should return the same data. We begin by loading historical price quotes of the Apple stock from the public yahoo finance API. Running the code below will load the data into a Pandas DataFrame.

# import pandas_datareader as webreader # Remote data access for pandas
import math # Mathematical functions 
import numpy as np # Fundamental package for scientific computing with Python
import pandas as pd # Additional functions for analysing and manipulating data
from datetime import date, timedelta, datetime # Date Functions
from pandas.plotting import register_matplotlib_converters # This function adds plotting functions for calender dates
import matplotlib.pyplot as plt # Important package for visualization - we use this to plot the market data
import matplotlib.dates as mdates # Formatting dates
from sklearn.metrics import mean_absolute_error, mean_squared_error # Packages for measuring model performance / errors
from keras.models import Sequential # Deep learning library, used for neural networks
from keras.layers import LSTM, Dense, Dropout # Deep learning classes for recurrent and regular densely-connected layers
from keras.callbacks import EarlyStopping # EarlyStopping during model training
from sklearn.preprocessing import RobustScaler, MinMaxScaler # This Scaler removes the median and scales the data according to the quantile range to normalize the price data 
import seaborn as sns

# from pandas_datareader.nasdaq_trader import get_nasdaq_symbols
# symbols = get_nasdaq_symbols()

# Setting the timeframe for the data extraction
today = date.today()
date_today = today.strftime("%Y-%m-%d")
date_start = '2010-01-01'

# Getting NASDAQ quotes
stockname = 'Apple'
symbol = 'AAPL'
# df = webreader.DataReader(
#     symbol, start=date_start, end=date_today, data_source="yahoo"
# )

import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=date_start, end=date_today)

# # Create a quick overview of the dataset
df.head()

Tensorflow Version: 2.6.0
Num GPUs: 1
[*********************100%***********************]  1 of 1 completed
			Open		High		Low			Close		Adj Close	Volume
Date						
2010-01-04	7.622500	7.660714	7.585000	7.643214	6.515213	493729600
2010-01-05	7.664286	7.699643	7.616071	7.656429	6.526477	601904800
2010-01-06	7.656429	7.686786	7.526786	7.534643	6.422666	552160000
2010-01-07	7.562500	7.571429	7.466071	7.520714	6.410791	477131200
2010-01-08	7.510714	7.571429	7.466429	7.570714	6.453413	447610800

The data should comprise the following columns:

Close
Open
High
Low
Adj Close
Volume

The target variable that we are trying to predict is the Closing price (Close).

Step #2: Explore the Data

Once we have loaded the data, we print a quick overview of the time-series data using different line graphs. The following code will plot a line chart for each column in df_plot using the seaborn library. The charts will be organized in a grid with nrows number of rows and ncols number of columns. The sharex parameter is set to True, which means that the x-axes of the subplots will be shared. The figsize parameter determines the size of the plot in inches.

# Plot line charts
df_plot = df.copy()

ncols = 2
nrows = int(round(df_plot.shape[1] / ncols, 0))

fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(14, 7))
for i, ax in enumerate(fig.axes):
        sns.lineplot(data = df_plot.iloc[:, i], ax=ax)
        ax.tick_params(axis="x", rotation=30, labelsize=10, length=0)
        ax.xaxis.set_major_locator(mdates.AutoDateLocator())
fig.tight_layout()
plt.show()

The line plots look as expected and reflect the Apple stock price history. Because we are fetching daily data from an API, please note that the lineplots will look different depending on when you run the code.

Step #3: Preprocess the Data

Next, we prepare the data for the training process of our multi-output forecasting model. Preparing the data for multivariate forecasting involves several steps:

Selecting features for model training
Scaling and splitting the data into separate sets for training and testing
Slicing the time series into several shifted training batches

Remember that the steps are specific to our data and the use case. The steps required to prepare the data for a neural network with multiple outputs in time series forecasting will depend on the characteristics of your data and the requirements of your model. It is essential to consider these factors and tailor your data preparation accordingly and carefully.

3.1 Basic Preparations

We begin by creating a copy of the initial data and resetting the index.

# Indexing Batches
df_train = df.sort_values(by=['Date']).copy()

# We safe a copy of the dates index, before we need to reset it to numbers
date_index = df_train.index

# We reset the index, so we can convert the date-index to a number-index
df_train = df_train.reset_index(drop=True).copy()
df_train.head(5)

			Open		High		Low			Close		Adj Close	Volume
Date						
2022-11-29	144.289993	144.809998	140.350006	141.169998	141.169998	83763800
2022-11-30	141.399994	148.720001	140.550003	148.029999	148.029999	111224400
2022-12-01	148.210007	149.130005	146.610001	148.309998	148.309998	71250400
2022-12-02	145.960007	148.000000	145.649994	147.809998	147.809998	65421400
2022-12-05	147.770004	150.919998	145.770004	146.630005	146.630005	68732400

3.2 Feature Selection and Scaling

We proceed with feature selection. To keep things simple, we will use the features from the input data without any modifications. After selecting the features, we scale them to a range between 0 and 1. To ease unscaling the predictions after training, we create two different scalers: One for the training data, which takes five columns, and one for the output data that scales a single column (the Close Price). I have covered feature engineering in a separate article if you want to learn more about this topic.

def prepare_data(df):

    # List of considered Features
    FEATURES = ['Open', 'High', 'Low', 'Close', 'Volume']

    print('FEATURE LIST')
    print([f for f in FEATURES])

    # Create the dataset with features and filter the data to the list of FEATURES
    df_filter = df[FEATURES]
    
    # Convert the data to numpy values
    np_filter_unscaled = np.array(df_filter)
    #np_filter_unscaled = np.reshape(np_unscaled, (df_filter.shape[0], -1))
    print(np_filter_unscaled.shape)

    np_c_unscaled = np.array(df['Close']).reshape(-1, 1)
    
    return np_filter_unscaled, np_c_unscaled, df_filter
    
np_filter_unscaled, np_c_unscaled, df_filter = prepare_data(df_train)
                                          
# Creating a separate scaler that works on a single column for scaling predictions
# Scale each feature to a range between 0 and 1
scaler_train = MinMaxScaler()
np_scaled = scaler_train.fit_transform(np_filter_unscaled)
    
# Create a separate scaler for a single column
scaler_pred = MinMaxScaler()
np_scaled_c = scaler_pred.fit_transform(np_c_unscaled)

FEATURE LIST
['Open', 'High', 'Low', 'Close', 'Volume']
(3254, 5)

The final step of the data preparation is to create the structure for the input data. This structure needs to match the input layer of the model architecture.

3.3 Slicing the Data for a Model with Multiple In- and Outputs

The code below starts a sliding window process that cuts the initial time series data into multiple slices, i.e., mini-batches. Each batch is a smaller fraction of the initial time series shifted by a single step. Because we will feed our model with multivariate input data, the time series consists of five input columns/features. Each batch comprises a period of 50 steps from the time series and an output sequence of ten consecutive values. To validate that the batches have the right shape, we visualize mini-batches in a line graph with their consecutive target values.

# Set the input_sequence_length length - this is the timeframe used to make a single prediction
input_sequence_length = 50
# The output sequence length is the number of steps that the neural network predicts
output_sequence_length = 10 #

# Prediction Index
index_Close = df_train.columns.get_loc("Close")

# Split the training data into train and train data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_length = math.ceil(np_scaled.shape[0] * 0.8)

# Create the training and test data
train_data = np_scaled[:train_data_length, :]
test_data = np_scaled[train_data_length - input_sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, input_sequence_length time steps per sample, and f features
def partition_dataset(input_sequence_length, output_sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]
    for i in range(input_sequence_length, data_len - output_sequence_length):
        x.append(data[i-input_sequence_length:i,:]) #contains input_sequence_length values 0-input_sequence_length * columns
        y.append(data[i:i + output_sequence_length, index_Close]) #contains the prediction values for validation (3rd column = Close),  for single-step prediction
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(input_sequence_length, output_sequence_length, train_data)
x_test, y_test = partition_dataset(input_sequence_length, output_sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
nrows = 3 # number of shifted plots
fig, ax = plt.subplots(nrows=nrows, ncols=1, figsize=(16, 8))
for i, ax in enumerate(fig.axes):
    xtrain = pd.DataFrame(x_train[i][:,index_Close], columns={f'x_train_{i}'})
    ytrain = pd.DataFrame(y_train[i][:output_sequence_length-1], columns={f'y_train_{i}'})
    ytrain.index = np.arange(input_sequence_length, input_sequence_length + output_sequence_length-1)
    xtrain_ = pd.concat([xtrain, ytrain[:1].rename(columns={ytrain.columns[0]:xtrain.columns[0]})])
    df_merge = pd.concat([xtrain_, ytrain])
    sns.lineplot(data = df_merge, ax=ax)
plt.show

(2544, 50, 5) (2544, 10)
(640, 50, 5) (640, 10)

Step #4: Prepare the Neural Network Architecture and Train the Multi-Output Regression Model

Now that we have the training data prepared and ready, the next step is to configure the architecture of the multi-out neural network. Because we will be using multiple input series, our model is, in fact, a multivariate architecture so that it corresponds to the input training batches.

4.1 Configuring and Training the Model

We choose a comparably simple architecture with only two LSTM layers and two additional dense layers. The first dense layer has 20 neurons, and the second layer is the output layer, which has ten output neurons. If you wonder how I got to the number of neurons in the third layer, I conducted several experiments and found that this number leads to solid results.

To ensure that the architecture matches our input data’s structure, we reuse the variables for the previous code section (n_input_neurons, n_output_neurons. The input sequence length is 50, and the output sequence (the steps for the period we want to predict) is ten.

# Configure the neural network model
model = Sequential()
n_output_neurons = output_sequence_length

# Model with n_neurons = inputshape Timestamps, each with x_train.shape[2] variables
n_input_neurons = x_train.shape[1] * x_train.shape[2]
print(n_input_neurons, x_train.shape[1], x_train.shape[2])
model.add(LSTM(n_input_neurons, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))) 
model.add(LSTM(n_input_neurons, return_sequences=False))
model.add(Dense(20))
model.add(Dense(n_output_neurons))

# Compile the model
model.compile(optimizer='adam', loss='mse')

After configuring the model architecture, we can initiate the training process and illustrate how the loss develops over the training epochs.

# Training the model
epochs = 10
batch_size = 16
early_stop = EarlyStopping(monitor='loss', patience=5, verbose=1)
history = model.fit(x_train, y_train, 
                    batch_size=batch_size, 
                    epochs=epochs,
                    validation_data=(x_test, y_test)
                   )
                    
                    #callbacks=[early_stop])

Epoch 1/5
159/159 [==============================] - 7s 14ms/step - loss: 0.0047 - val_loss: 0.0262
Epoch 2/5
159/159 [==============================] - 2s 11ms/step - loss: 3.6759e-04 - val_loss: 0.0097
Epoch 3/5
159/159 [==============================] - 2s 11ms/step - loss: 1.5222e-04 - val_loss: 0.0056
Epoch 4/5
159/159 [==============================] - 2s 11ms/step - loss: 1.0327e-04 - val_loss: 0.0031
Epoch 5/5
159/159 [==============================] - 2s 11ms/step - loss: 1.1690e-04 - val_loss: 0.0026

4.2 Loss Curve

Next, we plot the loss curve, which represents the amount of error between the model’s predicted values and the actual values in the training data. A lower loss value indicates that the model makes more accurate predictions on the training data.

# Plot training & validation loss values
fig, ax = plt.subplots(figsize=(10, 5), sharex=True)
plt.plot(history.history["loss"])
plt.title("Model loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend(["Train", "Test"], loc="upper left")
plt.grid()
plt.show()

As we can see, the loss curve drops quickly during training, which typically means that the model is quickly learning to make accurate predictions.

Step #5 Evaluate Model Performance

Now that we have trained the model, we can make forecasts on the test data and use traditional regression metrics such as the MAE, MAPE, or MDAPE to measure the performance of our model.

# Get the predicted values
y_pred_scaled = model.predict(x_test)

# Unscale the predicted values
y_pred = scaler_pred.inverse_transform(y_pred_scaled)
y_test_unscaled = scaler_pred.inverse_transform(y_test).reshape(-1, output_sequence_length)

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred)
print(f'Median Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')


def prepare_df(i, x, y, y_pred_unscaled):
    # Undo the scaling on x, reshape the testset into a one-dimensional array, so that it fits to the pred scaler
    x_test_unscaled_df = pd.DataFrame(scaler_pred.inverse_transform((x[i]))[:,index_Close]).rename(columns={0:'x_test'})
    
    y_test_unscaled_df = []
    # Undo the scaling on y
    if type(y) == np.ndarray:
        y_test_unscaled_df = pd.DataFrame(scaler_pred.inverse_transform(y)[i]).rename(columns={0:'y_test'})

    # Create a dataframe for the y_pred at position i, y_pred is already unscaled
    y_pred_df = pd.DataFrame(y_pred_unscaled[i]).rename(columns={0:'y_pred'})
    return x_test_unscaled_df, y_pred_df, y_test_unscaled_df


def plot_multi_test_forecast(x_test_unscaled_df, y_test_unscaled_df, y_pred_df, title): 
    # Package y_pred_unscaled and y_test_unscaled into a dataframe with columns pred and true   
    if type(y_test_unscaled_df) == pd.core.frame.DataFrame:
        df_merge = y_pred_df.join(y_test_unscaled_df, how='left')
    else:
        df_merge = y_pred_df.copy()
    
    # Merge the dataframes 
    df_merge_ = pd.concat([x_test_unscaled_df, df_merge]).reset_index(drop=True)
    
    # Plot the linecharts
    fig, ax = plt.subplots(figsize=(20, 8))
    plt.title(title, fontsize=12)
    ax.set(ylabel = stockname + "_stock_price_quotes")
    sns.lineplot(data = df_merge_, linewidth=2.0, ax=ax)

# Creates a linechart for a specific test batch_number and corresponding test predictions
batch_number = 50
x_test_unscaled_df, y_pred_df, y_test_unscaled_df = prepare_df(i, x_test, y_test, y_pred)
title = f"Predictions vs y_test - test batch number {batch_number}"
plot_multi_test_forecast(x_test_unscaled_df, y_test_unscaled_df, y_pred_df, title)

The quality of the predictions is acceptable, considering that this tutorial aimed not to achieve excellent predictions but to demonstrate the process and architecture of training a multi-output regression. So, there is certainly room for improvement. Feel free to experiment with different features or try other hyperparameters and neural network layers.

Step #6 Create a New Forecast

Finally, let’s create a forecast on a new dataset. We take the scaled dataset from section 2 (np_scaled) and extract a series with the latest 50 values. The data is reshaped into a 3D array with shape (1, 50, 5) to match the expected input shape of the model. We use these values to generate a new prediction for the next ten days using the predict method. We store the result in the y_pred_scaled variable. In addition, we need to transform the predictions back to the original scale. We do this by using the inverse_transform method of the scaler_pred object, which was fit on the training data. Finally, we visualize the multi-step forecast in another line chart.

# Get the latest input batch from the test dataset, which is contains the price values for the last ten trading days
x_test_latest_batch = np_scaled[-50:,:].reshape(1,50,5)

# Predict on the batch
y_pred_scaled = model.predict(x_test_latest_batch)
y_pred_unscaled = scaler_pred.inverse_transform(y_pred_scaled)

# Prepare the data and plot the input data and the predictions
x_test_unscaled_df, y_test_unscaled_df, _ = prepare_df(0, x_test_latest_batch, '', y_pred_unscaled)
plot_multi_test_forecast(x_test_unscaled_df, '', y_test_unscaled_df, "x_new Vs. y_new_pred")

Summary

In this tutorial, we demonstrated how to use multiple output neural networks to make predictions at different time steps. We first discussed the architecture of a recurrent neural network and how it can be used to process sequential data. We then showed how to properly preprocess the data and split it into training and test sets for training a multi-output regression model.

Next, we trained a model to predict the stock price of Apple ten steps into the future using historical data. We also discussed how to use the trained model to make multi-step predictions on new data and how to visualize the results.

To further improve the performance of the model, you can experiment with different hyperparameters and adjust the model architecture. For example, adding more neurons to the output layers will increase the prediction horizon, but remember that prediction error will also increase as the horizon lengthens. You can also try using different activation functions or adding more layers to the model to see how it affects the performance.

I hope this article was helpful in understanding multi-output neural networks better. If you have any questions or comments, please let me know.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

If you want to learn about an alternative approach to univariate stock market forecasting, consider taking a look at Facebook Prophet or ARIMA models

The post Stock Market Forecasting Neural Networks for Multi-Output Regression in Python appeared first on relataly.com.

Automate Crypto Trading with a Python-Powered Twitter Bot and Gate.io Signals

Florian Follonier — Wed, 19 May 2021 04:57:00 +0000

This tutorial develops a Twitter bot in Python that will generate automated trading signals. The bot will pull real-time price data on various cryptocurrencies (Bitcoin, Ethereum, Doge, etc.) from the crypto exchange Gate.io and analyze it using predefined rules. Whenever the bot detects a relevant price change, it automatically posts a tweet via Twitter. Simple Twitter bots can proactively inform their audiences about relevant events in the market. Such an event can be a sharp rise or fall in price or a sudden spike in the trading volume. If we examine data for specific price movements, we can also store these events and use them later to train a predictive model.

More advanced signal bots use predictive models to signal when it is appropriate to enter or exit the market. Or the bot executes the buy- and sell-orders directly itself. A well-defined signaling logic can therefore constitute the first step toward algorithmic trading. But one thing at a time. So in this article, we will begin by developing a simple signal bot.

The rest of this article is structured as follows. First, we take a look at the different code modules of the Twitter bot. After that, we’ll implement the other code modules in Python. Finally, we will integrate the modules and run some tests. We will also quickly introduce the APIs used to build the bot.

Bots can do a lot of cool things but should be used with caution. Image created with Midjourney

Different Modules of the Signal Bot

This section briefly describes the conceptual architecture of the Crypto Twitter bot. Its architecture adheres to a modular design pattern and separates into four loosely coupled modules. Each module has a clear function.

The Data Collection Module retrieves price data from the crypto exchange Gate.io. The module sends requests at regular intervals against the gate.io API. The module adds the data to separate data stores – one for each cryptocurrency. It then forwards the data to the preprocessing module.
The Data Preprocessing Module calculates the statistical indicators, such as moving averages or means, which become the basis for the signaling logic.
The Signaling Module searches for relevant events based on the indicator values provided. If a relevant event is detected, it is reported to the communication module.
The Communication Module connects to the Twitter API. As soon as it is informed about a new event, it tweets about this event on Twitter.

Now that you are familiar with the modules of our Crypto Twitter Bot, we can take a look at its underlying APIs.

Components of the Relataly Crypto Signal Bot

About the APIs Used in this Tutorial

In this tutorial, we will be using two APIs:

The Gate.io API to fetch price data.
Twitter to post Tweets about Trading Signals

The Gate.io API

Firstly, we will be using the Gate.io API to obtain prices for various cryptocurrencies. Gate.io is one of the smaller crypto exchanges in the crypto-verse. However, it offers a wide range of smaller cryptocurrencies, especially those you cannot trade anywhere else. As of now, the gate.io market endpoint does not require authentication to use its essential functions.

Check out our recent relataly gate.io tutorial to learn how to pull data via the gate.io API in Python.

The Twitter API

The second API that our bot will use is the Twitter API. We will use this API via the Python package Tweepy to post crypto price signals. Check out this article if you are looking for a simple code example of submitting tweets via the Twitter API. If you don’t want to use Twitter, you can disable its use in the code.

Posting tweets via the API requires authentication with a valid developer account. You can apply for a developer account for free on the Twitter developer website. Just be aware that the confirmation can sometimes take several days.

Storing the Twitter API Key

Storing API keys in your code can compromise the security of your application. If the code is made public, for example, by publishing it on a code-sharing website like GitHub, anyone who has access to the code can use the API key to make requests to the API and potentially access sensitive information or cause harm to your account or application. A better practice is to import and access the API key from a separate YAML file, from where you can import it into your project. To store the Twitter API Key, create a YAML file with the name “api_config_twitter.yml” and insert your API key into this file as follows:

api_key: “your api key”

Implementing a Twitter Signal Bot using Python

In this article, we will walk through the process of creating a Twitter bot that automatically tweets updates about cryptocurrency prices. The bot will be designed to pull real-time data on cryptocurrency prices from an external API, and then automatically generate and post tweets on a regular basis. By the end of the article, you will have a fully functional Twitter bot that can keep your followers informed about the latest cryptocurrency prices.

Note: You require a Twitter developer account if you want to use the Twitter functionality. Without an account, you can still print out trading signals to yourself, but you will not be able to post them via the Twitter API.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Python Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy

In addition, we will use the following two packages:

Firstly, the gate.io package (package name gate-API) pulls crypto price data from gate.io.
Secondly, we will use the Twitter API library Tweepy to post trading signals via the Twitter API.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1: Regular Retrieval of Price Data

First, we will define a “prices” class to handle the incoming data flow. The prices class contains a “get_latest_prices” attribute that retrieves price information from gate.io. The function regularly calls the gate.io list_ticker market endpoint.

The list_ticker endpoint returns a list of data fields for cryptocurrency pairs. Examples of price pairs are BTC_USD, BTC_ETH, BTC_ADA, etc. We can limit the response to a single price pair by passing a single pair as a variable in the API call. However, it is not possible to restrict the response to multiple pairs. We either get data for a single pair or all pairs. The response contains a list of the following data fields:

Overview of the data fields in the response

The following code maintains a separate dictionary for each cryptocurrency pair. The dictionary contains the name of the cryptocurrency pair and a data frame that includes the price data history. Each time the crypto bot receives a new response from the API, it goes through the response, extracts the price data(Price, Volume, etc.), and appends this data to the Data Frame of the respective cryptocurrency pair. Then the information is passed to the preprocessing module.

import pandas as pd
import numpy as np
import json
import requests
import datetime as dt
import logging
import threading
import time
from __future__ import print_function

import tweepy 
import gate_api
from gate_api.exceptions import ApiException, GateApiException
from twitter_secrets import twitter_secrets as ts # place the twitter_secrets file under /anaconda3/Lib

class Prices:
    """Class that uses the gate api to retrieve currency data."""

    def __init__(self, config):
        self._config = config
        self._logger = logging.getLogger(__name__)
        configuration = gate_api.Configuration(host="https://api.gateio.ws/api/v4")
        api_client = gate_api.ApiClient(configuration)
        self._api_instance = gate_api.SpotApi(api_client)
        self._price_history = {}
        self._cont_update_thread = None
        self._stop_cont_update_thread = None
        self._price_history_lock = threading.Lock()

    def get_price_history(self):
        """Returns a dictionary with the price histories for the currencies."""
        return self._price_history, self._price_history_lock

    def get_latest_prices(self):
        """Gets new price data and adds the values to a DataFrame.

        Returns the DataFrame in a dictionary with the currencies as keys."""
        timestamp = dt.datetime.now()
        try:
            api_response = self._api_instance.list_tickers()
        except GateApiException as e:
            logging.warning(
                "Gate api exception, label: %s, message: %s\n" % (e.label, e.message)
            )
            return {}
        except ApiException as e:
            logging.warning("Exception when calling SpotApi->list_tickers: %s\n" % e)
            return {}
        latest_prices = {}
        for response in api_response:
            currency = response.currency_pair
            if "USDT" not in currency or "BEAR" in currency:
                continue
            value_dict = {
                "base_volume": pd.to_numeric(response.base_volume),
                "change_percentage": pd.to_numeric(response.change_percentage),
                "etf_leverage": pd.to_numeric(response.etf_leverage),
                "etf_net_value": pd.to_numeric(response.etf_net_value),
                "etf_pre_net_value": pd.to_numeric(response.etf_pre_net_value),
                "etf_pre_timestamp": response.etf_pre_timestamp,
                "high_24h": pd.to_numeric(response.high_24h),
                "highest_bid": pd.to_numeric(response.highest_bid),
                "high_bid": pd.to_numeric(response.highest_bid),
                "last": pd.to_numeric(response.last),
                "low_24h": pd.to_numeric(response.low_24h),
                "lowest_ask": pd.to_numeric(response.lowest_ask),
                "quote_volume": pd.to_numeric(response.quote_volume),
                "timestamp": timestamp,
            }
            latest_prices[currency] = pd.DataFrame(value_dict, index=[1])
        return latest_prices

    def start_cont_update(self):
        self._stop_cont_update_thread = threading.Event()
        self._stop_cont_update_thread.clear()
        self._cont_update_thread = threading.Thread(
            target=self._cont_update,
            args=(
                self._stop_cont_update_thread,
                self._price_history_lock,
            ),
        )
        self._cont_update_thread.start()
        self._logger.info("Started continuous price logging")

    def _cont_update(self, stop_event, lock):
        """Continuously adds new prices to the price history."""
        while not stop_event.is_set():
            start_time = time.time()
            lock.acquire()
            for currency, df in self.get_latest_prices().items():
                if currency in self._price_history.keys():
                    self._price_history[currency] = self._price_history[
                        currency
                    ].append(df, ignore_index=True)
                else:
                    self._price_history[currency] = df
            lock.release()
            self._logger.debug("Currency_dfs updated")
            self._wait_before_update(start_time)

    def _wait_before_update(self, start_time):
        elapsed_time = time.time() - start_time
        self._logger.debug(f"Elapsed time: {elapsed_time}")
        if elapsed_time > self._config["price_update_delay"]:
            delay = 0
            self._logger.warning(
                #"It took longer to retrieve the price data than the update_delay!"
            )
        else:
            delay = self._config["price_update_delay"] - elapsed_time
        self._logger.debug(f"Waiting {delay}s until next update")
        time.sleep(delay)

Step #2: Calculate Indicator Values

Next, we will define a few functions that process the regular data inflow from gate.io and calculate indicator values for the different cryptocurrencies.

Absolute price values signal the bot that the price moves up or down. However, our signaling logic will primarily work with thresholds on percentage values. These indicators have a p at the end of the name in the code below.

In addition, we will avoid misleading signals by incorporating moving averages into the signaling logic. Moving averages work on historical data, so we have to hand over the price history when we call the “calc_indicators” function. Furthermore, we take over other indicators from the data frame, including the 24h_low and the 24h_high. These indicators give us additional information about the indicators of the preceding price points. We can use them to build more robust trading signals.

All indicators are calculated separately for each crypto pair, passed to a dictionary, and then passed to the signaling logic. In the next step, we can use these indicator values in our signaling rules.

def calc_indicators(price_history):
    indicators = {}
    indicators_over_all = calc_indicators_over_all(price_history)
    for currency, df in price_history.items():
        if len(df) <= 2:
            logging.getLogger().debug(
                f"Skipped '{currency} when calculating indicators due to a lack of information"
            )
            continue
        volume = df["base_volume"].iloc[-1]
        last_price = df["last"].iloc[-1]
        moving_avg_price = df["last"].mean()
        moving_average_volume = df["base_volume"].mean()
        moving_average_deviation_percent = np.round(
            div(last_price, moving_avg_price) - 1, 2
        )

        price_before = df["last"].iloc[-2]
        price_delta = last_price - price_before
        price_delta_p = div(price_delta, last_price)
        price_delta_before = price_before - df["last"].iloc[-3]
        price_delta_p_before = div((price_before - df["last"].iloc[-3]), price_before)
        low_24h = df["low_24h"].iloc[-1]
        high_24h = df["high_24h"].iloc[-1]
        low_high_diff_p = div(high_24h - low_24h, low_24h)
        change_percentage = df["change_percentage"].iloc[-1]

        indicator_values = {
            "last_price": last_price,
            "price_before": price_before,
            "volume": volume,
            "moving_avg_price": moving_avg_price,
            "moving_average_volume": moving_average_volume,
            "moving_average_deviation_percent": moving_average_deviation_percent,
            "price_delta_p": price_delta_p,
            "price_delta": price_delta,
            "price_delta_before": price_delta_before,
            "price_delta_p_before": price_delta_p_before,
            "high_24h": high_24h,
            "low_24h": low_24h,
            "low_high_diff_p": low_high_diff_p,
            "change_percentage": change_percentage,
        }
        indicator_values.update(indicators_over_all)
        indicators[currency] = indicator_values
    return indicators


def calc_indicators_over_all(price_history):
    avg_change_p = 0
    for currency, df in price_history.items():
        avg_change_p += df["change_percentage"].iloc[-1]
    nr_of_currencies = len(price_history)
    avg_change_p = div(avg_change_p, nr_of_currencies)
    values = {
        "avg_change_p": avg_change_p,
    }
    return values


def div(dividend, divisor, alt_value=0.0):
    return dividend / divisor if divisor != 0 else alt_value

Step #3: Define the Signaling Logic of the Twitter Bot

Our bot will use a signaling logic that differentiates between the following price signals:

A simple uptick: Price_delta_p must be higher than the threshold (10%) to trigger.
A simple downtick: Price_delta_p must be lower than the threshold (10%) to trigger.
The bot does also report on new 24-hour lows and highs
Another event on which the bot reports is when an up or down price trend begins to accelerate or slows down.
The bot reports that when a price performs a trend reversal (pullback and recovery)

Overview of the different trading signals generated by the signaling logic

Be aware that the price_delta_p measures the percentage deviation from the previous price point. Thus, the signaling logic that our bot has in place is very dependent on the interval in which the bots request new price data. Shorter time intervals will have a lower chance of triggering because more considerable changes typically occur over a longer time. For more details regarding the signaling logic, please view the code below.

def check_signal(currency, indicators, cs_config):
    ind = indicators[currency]
    signal = ''
    if (ind['moving_avg_price'] > 0
            and ind['last_price'] > 0.0
            and abs(ind['price_delta']) > 0.0
            and abs(ind['price_delta_p']) >= cs_config["delta_threshold_p"]
            and ind['volume'] > 0
    ):
        # up
        if ind['price_delta'] > 0:
            movement_type = 'up +'
            if abs(ind['price_delta_p_before']) > cs_config["delta_threshold_p"]:
                if ind['price_delta_before'] <= 0:
                    movement_type = 'recovery from ' + str(ind['price_before']) + ' to ' + str(ind['last_price'])
                else:
                    if ind['price_delta_p'] * (1-cs_config["delta_threshold_p"]) > ind['price_delta_p_before']:
                        movement_type = 'upward trend accelerates +'
                    elif ind['price_delta_p'] < ind['price_delta_p_before'] * (1-cs_config["delta_threshold_p"]):
                        movement_type = 'upward trend slows down +'
                    elif ind['price_delta_p'] * (1+cs_config["delta_threshold_p"]) >= ind['price_delta_p_before'] >= ind['price_delta_p'] * (1-cs_config["delta_threshold_p"]):
                        movement_type = 'upward trend continues +'
        # down
        elif ind['price_delta'] < 0:
            movement_type = 'down '
            if abs(ind['price_delta_p_before']) > cs_config["delta_threshold_p"]:
                if ind['price_delta_before'] > 0:
                    movement_type = 'pullback from ' + str(ind['price_before']) + ' to ' + str(ind['last_price'])
                else:
                    if ind['price_delta_p'] * (1-cs_config["delta_threshold_p"]) > ind['price_delta_p_before']:
                        movement_type = 'down trend accelerates '
                    elif ind['price_delta_p'] * (1+cs_config["delta_threshold_p"]) >= ind['price_delta_p_before'] >= ind['price_delta_p'] * (1-cs_config["delta_threshold_p"]):
                        movement_type = 'down trend continues '
                    elif ind['price_delta_p'] < ind['price_delta_p_before'] * (1+cs_config["delta_threshold_p"]):
                        movement_type = 'downward trend slows down '

        signal = get_signal_log(movement_type, currency, ind['price_delta_p'], ind['last_price'],
                                ind['moving_avg_price'], ind['volume'], ind['price_delta'], ind['change_percentage'],
                                ind['high_24h'], ind['low_24h'], ind['low_high_diff_p'])

        check_24h_peak(currency, ind['last_price'], ind['low_24h'], ind['high_24h'])

    return signal
    # trade_signal


def check_24h_peak(currency, last_price, low_24h, high_24h):
    if last_price < low_24h:
        print(currency + ' new 24h low $' + str(last_price))
    elif last_price > high_24h:
        print(currency + ' new 24h high $' + str(last_price))


def get_signal_log(movement_type, currency, price_delta_p, last_price, moving_avg_price, volume, price_delta,
                   daily_up_p, high_24h, low_24h, low_high_diff_p):
    signal = f'{currency} {movement_type} ' \
             f'{np.round(price_delta_p * 100, 5)}% ' \
             f'MA:${np.round(moving_avg_price, 6)} ' \
             f'last_price:${np.round(last_price, 6)} ' \
             f'price delta:{np.round(price_delta, 6)} ' \
             f'volume:${np.round(volume, 1)} ' \
             f'daily_change:{np.round(daily_up_p, 2)}% ' \
             f'high_24h:${high_24h} ' \
             f'low_24h:${low_24h} ' \
             f'low_high_diff_p:{np.round(low_high_diff_p * 100, 2)}%'
    return signal

Step #4: Send Tweets via Twitter

Next, we define a simple function that calls the Twitter API and tweets our price signal. Because the Twitter API requires authentication, you must provide the API authentication credentials from a valid Twitter developer account.

It’s best not to store the API credentials directly in code. Still not perfect, but slightly better is to keep the data in a separate python file (for example, called “twitter_secrets”) that you put into your package folder (for example, under /anaconda3/Lib), from where you can import it directly into your code.

# Twitter Consumer API keys
CONSUMER_KEY    = "api123"
CONSUMER_SECRET = "api123"

# Twitter Access token & access token secret
ACCESS_TOKEN    = "api123"
ACCESS_SECRET   = "api123"

BEARER_TOKEN = "api123"

class TwitterSecrets:
    """Class that holds Twitter Secrets"""

    def __init__(self):
        self.CONSUMER_KEY    = CONSUMER_KEY
        self.CONSUMER_SECRET = CONSUMER_SECRET
        self.ACCESS_TOKEN    = ACCESS_TOKEN
        self.ACCESS_SECRET   = ACCESS_SECRET
        self.BEARER_TOKEN   = BEARER_TOKEN
        
        # Tests if keys are present
        for key, secret in self.__dict__.items():
            assert secret != "", f"Please provide a valid secret for: {key}"

twitter_secrets = TwitterSecrets()

Once you have imported the file, you can then load the API credentials from the file in the following way:

consumer_key = ts.CONSUMER_KEY
consumer_secret = ts.CONSUMER_SECRET
access_token = ts.ACCESS_TOKEN
access_secret = ts.ACCESS_SECRET

### Print API Auth Data (leave disabled for security reasons)
# print(f'consumer_key: {consumer_key}')
# print(f'consumer_secret: {consumer_secret}')
# print(f'access_token: {access_token}')
# print(f'access_secret: {access_token}')

#authenticating to access the twitter API
auth=tweepy.OAuthHandler(consumer_key,consumer_secret)
auth.set_access_token(access_token,access_secret)
api=tweepy.API(auth)

def send_pricechange_tweet(signal):
    api.update_status(f"{signal} \n {relataly_url}")

Step #5 Starting the Crypto Signal Bot

Finally, we can hit the start button of our crypto signal bot. But before we do this, take a look at some configuration options of the bot.

CYCLE_DELAY is the standard interval in seconds in which the bot will call the gate.io API.
CURRENCY_PAIR is another API parameter limiting the cryptocurrency pairs to specific currency pairs. The bot will scan the entire market with all currency pairs in the standard setting, including all USDT pairs.
TWITTER_ACTIVE defines whether the bot posts signals on Twitter. Be aware that your bot may instantly report any signal on your Twitter account if you enable it.
RUNS defines the max number of prices that the bot will retrieve before the bot stops.

Now, let’s test the bot:

RUNS = 50 # the bot will stop after 50 price points
CYCLE_DELAY = 20 # the interval for checking the data and retrieving another price point
EVAL_PRICES_DELAY = 10
CURRENCY_PAIR = "" # the bot will retrieve data for all currency pairs listed on gate.io
PRICES_CONFIG = {"price_update_delay": 20}
TWITTER_ACTIVE = False

CHECK_SIGNAL_CONFIG = {
    "moving_avg_threshold_down_p": 0.10,
    "moving_avg_threshold_up_p": 0.10,
    "delta_threshold_p": 0.07,
    'enable_twitter': TWITTER_ACTIVE,
}

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO, format="\033[02m%(asctime)s %(levelname)s: %(message)s"
    )
    logger = logging.getLogger(__name__)
    prices = Prices(PRICES_CONFIG)
    prices.start_cont_update()
    currency_dfs = {}
    logging.info(f"Crypto bot is starting - please wait")
    logger.info(f"Collecting crypto data from gate.io for {EVAL_PRICES_DELAY}s")
    time.sleep(EVAL_PRICES_DELAY)
    logger.info(f"\n<< Crypto signal bot started :-) >>")
    logger.info(f"<< Checking prices every {CYCLE_DELAY} seconds >>")
    logger.info(f"Now checking for signals - please wait\n")
    for i in range(RUNS):
        price_history, lock = prices.get_price_history()
        lock.acquire()
        indicators = calc_indicators(price_history)
        lock.release()
        for currency in indicators.keys():
            if not indicators[currency]:
                continue
            signal = check_signal(
                currency,
                indicators,
                CHECK_SIGNAL_CONFIG,
            )
            if signal:
                logger.info(signal)
                if CHECK_SIGNAL_CONFIG['enable_twitter']:
                    send_pricechange_tweet(signal)
                    print('send via twitter')
        time.sleep(CYCLE_DELAY)

2022-03-09 11:40:38,939 INFO: Started continuous price logging
2022-03-09 11:40:38,940 INFO: Crypto bot is starting - please wait
2022-03-09 11:40:38,940 INFO: Collecting crypto data from gate.io for 10s
2022-03-09 11:40:48,941 INFO: 
<< Crypto signal bot started :-) >>
2022-03-09 11:40:48,942 INFO: << Checking prices every 20 seconds >>
2022-03-09 11:40:48,942 INFO: Now checking for signals - please wait

2022-03-09 11:52:06,800 INFO: EOSBULL_USDT up + 19.42446% MA:$1.1e-05 last_price:$1.4e-05 price delta:3e-06 volume:$1272326905.1 daily_change:33.65% high_24h:$1.16e-05 low_24h:$9.8e-06low_high_diff_p:18.37%
EOSBULL_USDT new 24h high $1.39e-05
send via twitter

And this is what the tweets will look like on Twitter:

Summary

Congratulations on completing this tutorial! In this article, you learned how to build a Python-based Twitter crypto signal bot. When run, the bot will regularly retrieve cryptocurrency quotes from the Gate.io exchange and tweet about any price movements based on a simple signaling logic.

While the signaling logic in this tutorial is kept simple, this basic framework provides a foundation for you to further develop and enhance the signaling rules. For example, you could consider using volume or price volatility changes as the basis for defining signals. Have fun experimenting and expanding upon this project!

If you found this article helpful, please show your appreciation by leaving a comment. Cheers

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Automate Crypto Trading with a Python-Powered Twitter Bot and Gate.io Signals appeared first on relataly.com.

Forecasting Beer Sales with ARIMA in Python

Florian Follonier — Wed, 03 Feb 2021 22:23:08 +0000

Time series analysis and forecasting is a tough nut to crack, but the ARIMA model has been cracking it for decades. ARIMA, short for “Auto-Regressive Integrated Moving Average,” is a powerful statistical modeling technique for time series analysis. It’s particularly effective when the time series you’re analyzing follows a clear pattern, like seasonal changes in weather or sales. ARIMA has been used to forecast everything from beer sales to order quantities, and this tutorial will show you how to build your own ARIMA model in Python. You’ll be making predictions like a pro in no time!

This tutorial proceeds in two parts: The first part covers the concepts behind ARIMA. You will learn how ARIMA works, what Stationarity means, and when it is appropriate to use ARIMA. The second part is a Python hands-on tutorial that applies auto-ARIMA to the Sales Forecasting domain. We’ll be working with a time series of beer sales, and our goal is to predict how the beer sales quantities will evolve in the coming years. First, we check if the time series is stationary. Then we train an ARIMA forecasting model. Finally, we use the model to produce a sales forecast and measure the model’s performance.

About Sales Forecasting

Sales forecasting is a crucial business strategy that involves predicting future sales volumes for a product (for example, beer) or service. It leverages sophisticated statistical and analytical techniques, such as time series analysis or machine learning algorithms, to scrutinize historical sales data. By identifying trends and patterns within this data, businesses can make informed predictions about their future sales performance.

This strategic forecasting plays a pivotal role in business operations. It is instrumental in guiding key decisions surrounding production, inventory management, staffing, and various other operational elements. By honing in on accurate sales forecasting, businesses can strike the perfect balance – maintaining enough inventory to meet customer demand without overproducing or overstocking. This equilibrium ensures a smooth flow in the supply chain and avoids unnecessary costs tied to excess production or storage.

Furthermore, sales forecasting serves as a roadmap for business growth. It aids in identifying potential market opportunities and predicting future sales revenue. This valuable foresight enables businesses to strategically plan their expansion, ensuring resources are optimally utilized and future goals are met. With this in-depth understanding of sales forecasting, businesses can stay ahead of market trends, navigate through business challenges, and ultimately steer towards success.

Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects. Image created with Midjourney

" data-image-caption="

Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects. Image created with Midjourney

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png" alt="Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects. Image created with Midjourney" class="wp-image-12602" srcset="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png 506w, https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png 140w" sizes="(max-width: 506px) 100vw, 506px" />

Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects. Image created with Midjourney

Introduction to ARIMA Time Series Modelling

ARIMA models provide an alternative approach to time series forecasting that differs significantly from machine learning methods. Working with ARIMA requires a good understanding of Stationarity and knowledge of the transformations used to make time-series data stationary. The concept of Stationarity is, therefore, first on our schedule.

The Concept of Stationarity

Stationarity is an essential concept in stochastic processes that describes the nature of a time series. We consider a time series strictly stationary if its statistical properties do not change over time. In this case, summary statistics, such as the mean and variance, do not change over time. However, the time-series data we encounter in the real world often show a trend or significant irregular fluctuations, making them non-stationary or weakly stationary.

So why is Stationarity such an essential concept for ARIMA? If a time series is stationary, we can assume that the past values of the time series are predictive of future development. In other words, a stationary time series exhibits consistent behavior that makes it predictable. On the other hand, a non-stationary time series is characterized by a kind of random behavior that will be difficult to capture in modeling. Namely, if random movements characterized the past, there is a high probability that the future will be no different.

Fortunately, in many cases, it is possible to transform a time series that is non-stationary into a stationary form and, in this way, build better prediction models.

A stationary Vs. a non-stationary time series

How to Test Whether a Time Series is Stationary

The first step in the ARIMA modeling approach is determining whether a time series is stationary. There are different ways to determine whether a time series is stationary:

Plotting: We can plot the time series and visually check if it shows consistent behavior or changes over a more extended period.
Summary statistics: We can split the time series into different periods and calculate the summary statistics, such as the variance. If these metrics are subject to significant changes, the time series is non-stationary. However, the results will also depend on the respective periods, leading to false conclusions.
Statistic tests: There are various tests to determine the stationary of a time series, such as Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller, or Phillips–Perron. These tests systematically check a time series and measure the results against the null hypothesis, providing an indicator of the trustworthiness of the results.

What is an (S)ARIMA Model?

As the name implies, ARIMA uses autoregression (AR), integration (differencing), and moving averages (MA) to fit a linear regression model to a time series.

ARIMA Parameters

The default notation for ARIMA is a model with parameters p, d, and q, whereby each parameter takes an integer value:

d (differencing): In the case of a non-stationary time series, there is a chance to remove a trend from the data by differencing once or several times, thus bringing the data to a stationary state. The model parameter d determines the order of the differentiation. A value of d = 0 simplifies the ARIMA model to an ARMA model, lacking the integration aspect. If this is the case, we do not need to integrate the function because the time series is already stationary.
p (order of the AR terms): The autoregressive process describes the dependent relationship between an observation and several lagged observations (lags). Predictions are then based on past data from the same time series using linear functions. p = 1 means the model uses values that lag by one period.
q (order of the MA terms): The parameter q determines the number of lagged forecast errors in the prediction equation. In contrast to the AR process, the MA process assumes that values at a future point in time depend on the errors made by predictions at current and past points in time. This means that it is not previous events that determine the predictions but rather the previous estimation or prediction errors used to calculate the following time series value.

SARIMA

In the real world, many time series have seasonal effects. Examples are monthly retail sales figures, temperature reports, weekly airline passenger data, etc. To consider this, we can specify a seasonal range (e.g., m=12 for monthly data) and additional seasonal AR or MA components for our model that deal with seasonality. Such a model is also called a SARIMA model, and we can define it as a model(p, d, q)(P, D, Q)[m].

Auto-(S)ARIMA

When working with ARIMA, we can set the model parameters manually or use auto-ARIMA and let the model search for the optimal parameters. We do this by varying the parameters and then testing against Stationarity. With the seasonal option enabled, the process tries to identify the optimal hyperparameters for the seasonal components of the model. Auto-ARIMA works by conducting differencing tests to determine the order of differencing, d and then fitting models with parameters in defined ranges, e.g., start_p, max_p as well as start_q, max_q. If our model has a seasonal component, we can also define parameter ranges for the seasonal part of the model.

Creating a Sales Forecast with ARIMA in Python

Having grasped the fundamental concepts behind ARIMA (AutoRegressive Integrated Moving Average), we’re now ready to dive into the practical aspect of crafting a sales forecasting model in Python. Utilizing ARIMA for forecasting sales data is an esteemed practice owing to the algorithm’s adeptness in modeling seasonal changes combined with long-term trends – a characteristic commonly exhibited by sales data.

In this tutorial, we’ll be employing a dataset representing the monthly beer sales across the United States from 1992 through 2018, recorded in millions of US dollars. Our objective is to construct a robust time series model using ARIMA to accurately predict future sales trends.

When it comes to the technological aspect, we’ll be using the Python-based ‘statsmodels’ and ‘pmdarima’ libraries to build our ARIMA sales forecasting model. So, if you’re ready to harness the power of Python and ARIMA for sales prediction, let’s get started!

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

A fluffy cat drinking beer after creating an ARIMA sales forecast. Image created with Midjourney

Prerequisites

Before we start coding, ensure you have set up your Python 3 environment and required packages. If you don’t have an environment, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using the statsmodels library and pmdarima.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Load the Sales Data to Our Python Project

In the initial step of this tutorial, we commence by setting up the necessary Python environment. We import several packages that we’ll be using for data manipulation, visualization, and implementing machine learning models. We then fetch the dataset we’ll be working with – the monthly beer sales in the United States from 1992 through 2018. This data is sourced from a publicly accessible URL and loaded into a pandas DataFrame.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.88

# Setting up packages for data manipulation and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pmdarima as pm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose
import seaborn as sns
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# Link to the dataset: 
# https://www.kaggle.com/bulentsiyah/for-simple-exercises-time-series-forecasting

path = "https://raw.githubusercontent.com/flo7up/relataly_data/main/alcohol_sales/BeerWineLiquor.csv"
df = pd.read_csv(path)
df.head()

		date	beer
0	1/1/1992	1509
1	2/1/1992	1541
2	3/1/1992	1597
3	4/1/1992	1675
4	5/1/1992	1822

As shown above, the sales figures in this dataset stem from the first day of each month.

Step #2 Visualize the Time Series and Check it for Stationarity

Before modeling the sales data, we visualize the time series and test it for Stationarity. Visualization helps us choose the parameters for our ARIMA model, thus making it an essential step.

First, we will look at the different components of the time series. We do this by using the seasonal_decompose function of the statsmodels library.

# Decompose the time series
plt.rcParams["figure.figsize"] = (10,6)
result = seasonal_decompose(df['beer'], model='multiplicative', period = 12)
result.plot()
plt.show()

To test for Stationarity, we use the ADFuller test. It is common to run this test multiple times throughout a data science project. Therefore, we create a function that we can then reuse later.

def check_stationarity(df_sales, title_string, labels):
    # Visualize the data
    fig, ax = plt.subplots(figsize=(16, 8))
    plt.title(title_string, fontsize=14)
    if df_sales.index.size > 12:
        df_sales['ma_12_month'] = df_sales['beer'].rolling(window=12).mean()
        df_sales['ma_25_month'] = df_sales['beer'].rolling(window=25).mean()
        sns.lineplot(data=df_sales[['beer', 'ma_25_month', 'ma_12_month']], palette=sns.color_palette("mako_r", 3))
        plt.legend(title='Smoker', loc='upper left', labels=labels)
    else:
        sns.lineplot(data=df_sales[['beer']])
    
    plt.show()
    
    sales = df_sales['beer'].dropna()
    # Perform an Ad Fuller Test
    # the default alpha = .05 stands for a 95% confidence interval
    adf_test = pm.arima.ADFTest(alpha = 0.05) 
    print(adf_test.should_diff(sales))
    
df_sales = pd.DataFrame(df['beer'], columns=['beer'])
df_sales.index = pd.to_datetime(df['date']) 
title = "Beer sales in the US between 1992 and 2018 in million US$/month"
labels = ['beer', 'ma_12_month', 'ma_25_month']
check_stationarity(df_sales, title, labels)

The data does not appear to be stationary. We can see that our time series is steadily increasing and shows annual seasonality. The steady increase indicates a continuous growth in beer consumption over the last decades. The seasonality in the sales data likely results from people drinking more beer in summer than in other seasons.

Step #3 Exemplary Differencing and Autocorrelation

The chart from the previous section shows that our time series is non-stationary. The reason is that it follows a clear upward trend. We also know that the time series has a seasonal component. Therefore, we need to define additional parameters and construct a SARIMA model.

Before we use auto-correlation to determine the optimal parameters, we will try manual differencing to make the time series stationary. There is no guarantee that differencing works. It is essential to remember that differencing can sometimes also worsen prediction performance. So be careful, not to overdifference! We could also trust that the auto-ARIMA model chooses the best parameters for us. However, we should always validate the selected parameters.

The ideal differencing parameter is the least number of differencing steps to achieve a stationary time series. We will monitor the results with autocorrelation plots to check whether differencing was successful.

We print the autocorrelation for the original time series and after the first and second-order differencing.

# 3.1 Non-seasonal part
def auto_correlation(df, prefix, lags):
    plt.rcParams.update({'figure.figsize':(7,7), 'figure.dpi':120})
    
    # Define the plot grid
    fig, axes = plt.subplots(3,2, sharex=False)

    # First Difference
    axes[0, 0].plot(df)
    axes[0, 0].set_title('Original' + prefix)
    plot_acf(df, lags=lags, ax=axes[0, 1])

    # First Difference
    df_first_diff = df.diff().dropna()
    axes[1, 0].plot(df_first_diff)
    axes[1, 0].set_title('First Order Difference' + prefix)
    plot_acf(df_first_diff, lags=lags - 1, ax=axes[1, 1])

    # Second Difference
    df_second_diff = df.diff().diff().dropna()
    axes[2, 0].plot(df_second_diff)
    axes[2, 0].set_title('Second Order Difference' + prefix)
    plot_acf(df_second_diff, lags=lags - 2, ax=axes[2, 1])
    plt.tight_layout()
    plt.show()
    
auto_correlation(df_sales['beer'], '', 10)

(0.019143247561160443, False)

The charts above show that the time series becomes stationary after one order differencing. However, we can see that the lag goes into the negative very quickly, which indicates overdifferencing.

Next, we perform the same procedure for the seasonal part of our time series.

# 3.2 Seasonal part

# Reduce the timeframe to a single seasonal period
df_sales_s = df_sales['beer'][0:12]

# Autocorrelation for the seasonal part
auto_correlation(df_sales_s, '', 10)

# Check if the first difference of the seasonal period is stationary
df_diff = pd.DataFrame(df_sales_s.diff())
df_diff.index = pd.date_range(df_sales_s.diff().iloc[1], periods=12, freq='MS') 
check_stationarity(df_diff, "First Difference (Seasonal)", ['difference'])

(0.99, True)

After first order differencing, the seasonal part of the time series is stationary. The autocorrelation plot shows that the values go into the negative but remain within acceptable boundaries. Second-order differencing does not seem to improve these values. Consequently, we conclude that first-order differencing is a good choice for the D parameter.

Step #4 Finding an Optimal Model with Auto-ARIMA

Next, we auto-fit an ARIMA model to our time series. In this way, we ensure that we can later measure the performance of our model against a fresh set of data that the model has not seen so far. We will split our dataset into train and test in preparation for this.

Once we have created the train and test data sets, we can configure the parameters for the auto_arima stepwise optimization. By setting max_d = 1, we tell the model to test no-differencing and first-order differencing. Also, we set max_p and max_q to 3.

To deal with the seasonality in our time series, we set the “seasonal” parameter to True and the “m” parameter to 12 data points. We turn our model into a SARIMA model that allows us to configure additional D, P, and Q parameters. We define a max value for Q and P of 3. Previously we have already seen that further differencing does not improve the Stationarity. Therefore, we can set the value of D to 1.

After configuring the parameters, we next fit the model to the time series. The model will try to find the optimal parameters and choose the model with the least AIC.

# split into train and test
pred_periods = 30
split_number = df_sales['beer'].count() - pred_periods # corresponds to a prediction horizion  of 2,5 years
df_train = pd.DataFrame(df_sales['beer'][:split_number]).rename(columns={'beer':'y_train'})
df_test = pd.DataFrame(df_sales['beer'][split_number:]).rename(columns={'beer':'y_test'})

# auto_arima
model_fit = pm.auto_arima(df_train, test='adf', 
                         max_p=3, max_d=3, max_q=3, 
                         seasonal=True, m=12,
                         max_P=3, max_D=2, max_Q=3,
                         trace=True,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

# summarize the model characteristics
print(model_fit.summary())

Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=inf, Time=3.89 sec
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=3383.210, Time=0.02 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=3351.655, Time=0.38 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=3364.350, Time=1.09 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=3604.145, Time=0.02 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=3349.908, Time=0.11 sec
 ARIMA(1,0,0)(0,1,1)[12] intercept   : AIC=3351.532, Time=0.29 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=3353.520, Time=1.24 sec
 ARIMA(2,0,0)(0,1,0)[12] intercept   : AIC=3312.656, Time=0.10 sec
 ARIMA(2,0,0)(1,1,0)[12] intercept   : AIC=3314.483, Time=0.57 sec
 ARIMA(2,0,0)(0,1,1)[12] intercept   : AIC=3314.378, Time=0.30 sec
 ARIMA(2,0,0)(1,1,1)[12] intercept   : AIC=3305.552, Time=3.02 sec
 ARIMA(2,0,0)(2,1,1)[12] intercept   : AIC=3291.425, Time=4.19 sec
 ARIMA(2,0,0)(2,1,0)[12] intercept   : AIC=3306.914, Time=3.06 sec
 ARIMA(2,0,0)(3,1,1)[12] intercept   : AIC=3276.501, Time=4.67 sec
 ARIMA(2,0,0)(3,1,0)[12] intercept   : AIC=3282.240, Time=5.24 sec
 ARIMA(2,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=7.39 sec
 ARIMA(2,0,0)(2,1,2)[12] intercept   : AIC=inf, Time=4.74 sec
 ARIMA(1,0,0)(3,1,1)[12] intercept   : AIC=3313.877, Time=5.17 sec
 ARIMA(3,0,0)(3,1,1)[12] intercept   : AIC=3246.820, Time=5.72 sec
 ARIMA(3,0,0)(2,1,1)[12] intercept   : AIC=3255.313, Time=5.33 sec
 ARIMA(3,0,0)(3,1,0)[12] intercept   : AIC=3249.998, Time=6.77 sec
 ARIMA(3,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=8.39 sec
 ARIMA(3,0,0)(2,1,0)[12] intercept   : AIC=3259.938, Time=3.55 sec
...
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Auto-ARIMA has determined that the best model is (3,0,0)(3,1,1). These results match the results from section 3, in which we manually performed differencing.

Step #5 Simulate the Time Series using in-sample Forecasting

Now that we have trained our model, we want to use it to simulate the entire time series. We will do this by calling the predict method in the sample function. The prediction will match the same period as the original time series with which we trained the model. Because the model predicts one step, the prediction results will naturally be close to the actual values.

# Generate in-sample Predictions
# The parameter dynamic=False means that the model makes predictions upon the lagged values.
# This means that the model is trained until a point in the time-series and then tries to predict the next value.
pred = model_fit.predict_in_sample(dynamic=False) # works only with auto-arima
df_train['y_train_pred'] = pred

# Calculate the percentage difference
df_train['diff_percent'] = abs((df_train['x_train'] - pred) / df_train['x_train'])* 100

# Print the predicted time-series
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title("In Sample Sales Prediction", fontsize=14)
sns.lineplot(data=df_train[['x_train', 'y_train_pred']], linewidth=1.0)

# Print percentage prediction errors on a separate axis (ax2)
ax2 = ax1.twinx() 
ax2.set_ylabel('Prediction Errors in %', color='purple', fontsize=14)  
ax2.set_ylim([0, 50])
ax2.bar(height=df_train['diff_percent'][20:], x=df_train.index[20:], width=20, color='purple', label='absolute errors')
plt.legend()
plt.show()

Next, we take a look at the prediction errors.

Step #6 Generate and Visualize a Sales Forecast

Now that we have trained an optimal model, we are ready to generate a sales forecast. First, we specify the number of periods that we want to predict. In addition, we create an index from the number of predictions adjacent to the original time series and continue it (prediction_index).

# Generate prediction for n periods, 
# Predictions start from the last date of the training data
test_pred = model_fit.predict(n_periods=pred_periods, dynamic=False)
df_test['y_test_pred'] = test_pred
df_union = pd.concat([df_train, df_test])
df_union.rename(columns={'beer':'y_test'}, inplace=True)

# Print the predicted time-series
fig, ax = plt.subplots(figsize=(16, 8))
plt.title("Test/Pred Comparison", fontsize=14)
sns.despine();
sns.lineplot(data=df_union[['y_train', 'y_train_pred', 'y_test', 'y_test_pred']], linewidth=1.0, dashes=False, palette='muted')
ax.set_xlim([df_union.index[150],df_union.index.max()])
plt.legend()
plt.show()

As shown above, our model’s forecast continues the seasonal pattern of the beer sales time series. On the one hand, this indicates that US beer sales will continue to rise and, on the other hand, that our model works just fine 🙂

Step #7 Measure the Performance of the Sales Forecasting Model

In this section, we will measure the performance of our ARIMA model. To learn more about this topic, check out this relataly article measuring regression performance.

The previous section’s simulation chart shows a few outliers among the prediction errors. Therefore, we focus our analysis on the percentage errors. Two helpful metrics are the mean absolute error (MAPE) and the mean absolute percentage error (MDAPE).

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(df_test['y_test'], df_test['y_test_pred'])/ df_test['y_test']))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(df_test['y_test'], df_test['y_test_pred'])/ df_test['y_test'])) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')

Mean Absolute Percentage Error (MAPE): 3.94 %  Median Absolute Percentage Error (MDAPE): 3.49 %

The percent errors show that our ARIMA model achieves a decent predictive performance.

Summary

This Python tutorial has shown how to use SARIMA for sales forecasting. Sales forecasting is important for businesses because it can help them to make informed decisions about production, inventory management, and staffing, among other things. By accurately forecasting sales, businesses can ensure that they have the right amount of product available to meet customer sales, avoid overproduction and excess inventory, and plan for future growth. The use cases presented were forecasting beer sales, and we have used arima to analyze seasonal sales data.

In the first part, we have learned how ARIMA works, what Stationarity is and how to check if a time series is stationary. In the second part, we developed an ARIMA model in Python to create a forecast for US beer sales. For this purpose, we created an in-sample forecast and used Auto-tARIMA to find the optimal parameters for our sales forecasting model.

If you have any questions or suggestions, please let me know in the comments, and I will do my best to answer.

Now that we have learned to use ARIMA to forecast beer sales, you really deserved yourself a beer. Cheers! Image created with Midjourney

" data-image-caption="

Now that we have learned to use ARIMA to forecast beer sales, you really deserved yourself a beer. Cheers! Image created with Midjourney

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png" alt="Now that we have learned to use ARIMA to forecast beer sales, you really deserved yourself a beer. Cheers! Image created with Midjourney" class="wp-image-12603" srcset="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png 506w, https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png 300w" sizes="(max-width: 506px) 100vw, 506px" />

Now that you have learned to use ARIMA to forecast beer sales, you really earned yourself a beer. Cheers! Image created with Midjourney

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Want to learn more about time series analysis and prediction?
Check out these recent relataly tutorials:

The post Forecasting Beer Sales with ARIMA in Python appeared first on relataly.com.

Mastering Multivariate Stock Market Prediction with Python: A Guide to Effective Feature Engineering Techniques

Florian Follonier — Mon, 29 Jun 2020 21:47:28 +0000

Are you interested in learning how multivariate forecasting models can enhance the accuracy of stock market predictions? Look no further! While traditional time series data provides valuable insights into historical trends, multivariate forecasting models utilize additional features to identify patterns and predict future price movements. This process, known as “feature engineering,” is a crucial step in creating accurate stock market forecasts.

In this article, we dive into the world of feature engineering and demonstrate how it can improve stock market predictions. We explore popular financial analysis metrics, including Bollinger bands, RSI, and Moving Averages, and show how they can be used to create powerful forecasting models.

But we don’t just stop at theory. We provide a hands-on tutorial using Python to prepare and analyze time-series data for stock market forecasting. We leverage the power of recurrent neural networks with LSTM layers, based on the Keras library, to train and test different model variations with various feature combinations.

By the end of this article, you’ll have a thorough understanding of feature engineering and how it can improve the accuracy of stock market predictions. So, buckle up and get ready to discover how multivariate forecasting models can take your stock market analysis to the next level!

New to time series modeling?
Consider starting with the following tutorial on univariate time series models: Stock-market forecasting using Keras Recurrent Neural Networks and Python.

Squirrels mastered the art of multivariate feature engineering for time series analysis a long time ago. You can do it too! Image generated with Midjourney

Feature Engineering for Stock Market Forecasting – Borrowing Features from Chart Analysis

The idea behind multivariate time series models is to feed the model with additional features that improve prediction quality. An example of such an additional feature is a “moving average.” Adding more features does not automatically improve predictive performance but increases the time needed to train the models. The challenge is to find the right combination of features and to create an input form that allows the model to recognize meaningful patterns. There is no way around conducting experiments and trying out feature combinations. This process of trial and error can be time-consuming. It is, therefore, helping to build upon established indicators.

In stock market forecasting, we can use indicators from chart analysis. This domain forecasts future prices by studying historical prices and trading volume. The underlying idea is that specific patterns or chart formations in the data can signal the timing of beneficial buying or selling decisions. We can borrow indicators from this discipline and use them as input features.

When we develop predictive machine learning models, the difference from chart analysis is that we do not aim to analyze the chart ourselves manually, but try to create a machine learning model, for example, a recurrent neural network, that does the job for us.

A multivariate time-series forecast, as we will create it in this article. Exemplary chart with technical indicators (Bollinger bands, RSI, and Double-EMA)

Stock Market Forecasting – Does this really Work?

It is essential to point out that the effectiveness of chart analysis and algorithmic trading is controversial. There is at least as much controversy about whether it is possible to predict the price of stock markets with neural networks. Various studies and researchers have examined the effectiveness of chart analysis with different results. One of the most significant points of criticism is that it cannot take external events into account. Nevertheless, many financial analysts consider financial indicators when making investment decisions, so a lot of money is moved simply because many people believe in statistical indicators.

So without knowing how well this will work, it is worth an attempt to feed a neural network with different financial indicators. But first and foremost, I see this as an excellent way to show how feature engineering works. Just make sure not to rely on the predictions of these models blindly.

Also: Stock Market Prediction using Univariate Recurrent Neural Networks

Selected Statistical Indicators

The following indicators are commonly used in chart analysis and may be helpful when creating forecasting models:

Relative Strength Index
Simple Moving Averages
Exponential Moving Averages
Bolliger Bands

Relative Strength Index (RSI)

The Relative Strength Index (RSI) is one of the most commonly used oscillating indicators. In 1978, Welles Wilder developed it to determine the momentum of price movements and compare the strength of price losses in a period with price gains. It can take percentage values between 0 and 100.

Reference lines determine how long an existing trend will last before expecting a trend reversal. In other words, when the price is heavily oversold or overbought, one should expect a trend reversal.

The reference line is at 40% (oversold) and 80% (overbought) with an upward trend.
The reference line is at 20% (oversold) and 60% (overbought) with a downtrend.

The formula for the RSI is as follows:

Calculate the sum of all positive and negative price changes in a period (e.g., 30 days):
We then calculate the mean value of the sums with the following formula:
Finally, we calculate the RSI with the following formula:

Simple Moving Averages (SMA)

Simple Moving Averages (SMA) is another technical indicator that financial analysts use to determine if a price trend will continue or reverse. The SMA is the average sum of all values within a certain period. Financial analysts pay close attention to the 200-day SMA (SMA-200). When the price crosses the SMA, this may signal a trend reversal. Furthermore, we often use SMAs for 50 (SMA-50) and 100 days (SMA-100) periods. In this regard, two popular trading patterns include the death cross and a golden cross.

A death cross occurs when the trend line of the SMA-50/100 crosses below the 200-day SMA. This suggests that a falling trend will likely accelerate downwards.
A golden cross occurs when the trend line of the SMA-50/100 crosses over the 200-day SMA, suggesting a rising trend will likely accelerate upwards.

We can use the SMA in the input shape of our model simply by measuring the distance between two trendlines.

Exponential Moving Averages (EMA)

The exponential moving average (EMA) is another lagging trend indicator. Like the SMA, the EMA measures the strength of a price trend. The difference between SMA and EMA is that the SMA assigns equal values to all price points, while the EMA uses a multiplier that weights recent prices higher.

The formula for the EMA is as follows: Calculating the EMA for a given data point requires past price values. For example, to calculate the SMA for today, based on 30 past values, we calculate the average price values for the past 30 days. We then multiply the result by a weighting factor that weighs the EMA. The formula for this multiplier is as follows: Smoothing factor / (1+ days)

It is common to use different smoothing factors. For a 30-day moving average, the multiplier would be [2/(30+1)]= 0.064.

As soon as we have calculated the EMA for the first data point, we can use the following formula to calculate the ema for all subsequent data points: EMA = Closing price x multiplier + EMA (previous day) x (1-multiplier)

Bollinger Bands

Bollinger Bands are a popular technical analysis tool used to identify market volatility and potential price movements in financial markets. They are named after their creator, John Bollinger.

Bollinger Bands consist of three lines that are plotted on a price chart. The middle line is a simple moving average (SMA) of the asset price over a specified period (typically 20 days). The upper and lower lines are calculated by adding and subtracting a multiple (usually two) of the standard deviation of the asset price from the middle line.

The upper band is calculated as: Middle band + (2 x Standard deviation) The lower band is calculated as: Middle band – (2 x Standard deviation)

The standard deviation is a measure of how much the asset price deviates from the average. When the asset price is more volatile, the bands widen, and when the price is less volatile, the bands narrow.

Traders use Bollinger Bands to identify potential buy or sell signals. When the price touches or crosses the upper band, it may be a sell signal, indicating that the asset is overbought. Conversely, when the price touches or crosses the lower band, it may be a buy signal, indicating that the asset is oversold.

Feature Engineering for Time Series Prediction Models in Python

In the following, this tutorial will guide you through the process of implementing a multivariate time series prediction model for the NASDAQ stock market index. Our aim is to equip you with the knowledge and practical skills required to create a powerful predictive model that can effectively forecast stock prices.

Throughout this tutorial, we will take you through a step-by-step approach to building a multivariate time series prediction model. You will learn how to implement and utilize different features to train and measure the performance of your model. Our goal is to ensure that you are not only able to understand the underlying concepts of multivariate time series prediction, but that you are also capable of applying these concepts in a practical setting.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Let’s do some feature engineering for machine learning!

" data-image-caption="

Let’s do some feature engineering for machine learning!

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/robot-artificial-intelligence-colorful-midjourney-relataly-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/robot-artificial-intelligence-colorful-midjourney-relataly-min.png" alt="Let's do some feature engineering for machine learning!" class="wp-image-12671" srcset="https://www.relataly.com/wp-content/uploads/2023/03/robot-artificial-intelligence-colorful-midjourney-relataly-min.png 496w, https://www.relataly.com/wp-content/uploads/2023/03/robot-artificial-intelligence-colorful-midjourney-relataly-min.png 298w, https://www.relataly.com/wp-content/uploads/2023/03/robot-artificial-intelligence-colorful-midjourney-relataly-min.png 140w" sizes="(max-width: 496px) 100vw, 496px" />

Let’s do some feature engineering for machine learning!

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using Keras (2.0 or higher) with Tensorflow backend to train the neural network, the machine learning library scikit-learn, and the pandas-DataReader. You can install these packages using the following console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Load the Data

Let’s start by setting up the imports and loading the data. Our Python project will use price data from the NASDAQ composite index (symbol: ^IXIC) from yahoo.finance.com.

# Time Series Forecasting - Feature Engineering For Multivariate Models (Stock Market Prediction Example)
# A tutorial for this file is available at www.relataly.com

import math # Mathematical functions  
import numpy as np # Fundamental package for scientific computing with Python 
import pandas as pd # Additional functions for analysing and manipulating data 
from datetime import date # Date Functions 
import matplotlib.pyplot as plt # Important package for visualization - we use this to plot the market data 
import matplotlib.dates as mdates # Formatting dates 
from sklearn.metrics import mean_absolute_error, mean_squared_error # Packages for measuring model performance / errors 
import tensorflow as tf
from tensorflow.keras.models import Sequential # Deep learning library, used for neural networks 
from tensorflow.keras.layers import LSTM, Dense, Dropout # Deep learning classes for recurrent and regular densely-connected layers 
from tensorflow.keras.callbacks import EarlyStopping # EarlyStopping during model training 
from sklearn.preprocessing import RobustScaler # This Scaler removes the median and scales the data according to the quantile range to normalize the price data  
#from keras.optimizers import Adam # For detailed configuration of the optimizer 
import seaborn as sns # Visualization
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})


# check the tensorflow version and the number of available GPUs
print('Tensorflow Version: ' + tf.__version__)
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))

# Setting the timeframe for the data extraction
end_date =  date.today().strftime("%Y-%m-%d")
start_date = '2010-01-01'

# Getting NASDAQ quotes
stockname = 'NASDAQ'
symbol = '^IXIC'

# You can either use webreader or yfinance to load the data from yahoo finance
# import pandas_datareader as webreader
# df = webreader.DataReader(symbol, start=start_date, end=end_date, data_source="yahoo")

import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=start_date, end=end_date)

# Quick overview of dataset
df.head()

Tensorflow Version: 2.5.0
Num GPUs: 1
[*********************100%***********************]  1 of 1 completed
			Open		High		Low	Close	Adj 		Close		Volume
Date						
2009-12-31	2292.919922	2293.590088	2269.110107	2269.149902	2269.149902	1237820000
2010-01-04	2294.409912	2311.149902	2294.409912	2308.419922	2308.419922	1931380000
2010-01-05	2307.270020	2313.729980	2295.620117	2308.709961	2308.709961	2367860000
2010-01-06	2307.709961	2314.070068	2295.679932	2301.090088	2301.090088	2253340000
2010-01-07	2298.090088	2301.300049	2285.219971	2300.050049	2300.050049	2270050000

Step #2 Explore the Data

Let’s take a quick look at the data by creating line charts for the columns of our data set.

# Plot line charts
df_plot = df.copy()

ncols = 2
nrows = int(round(df_plot.shape[1] / ncols, 0))

fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(14, 7))
for i, ax in enumerate(fig.axes):
        sns.lineplot(data = df_plot.iloc[:, i], ax=ax)
        ax.tick_params(axis="x", rotation=30, labelsize=10, length=0)
        ax.xaxis.set_major_locator(mdates.AutoDateLocator())
fig.tight_layout()
plt.show()

Our initial dataset includes six features: High, Low, Open, Close, Volumen, and Adj Close.

Step #3 Feature Engineering

Now comes the exciting part – we will implement additional features. We use various indicators from chart analysis, such as averages for different periods and stochastic oscillators to measure price momentum.

# Indexing Batches
train_df = df.sort_values(by=['Date']).copy()

# Adding Month and Year in separate columns
d = pd.to_datetime(train_df.index)
train_df['Day'] = d.strftime("%d") 
train_df['Month'] = d.strftime("%m") 
train_df['Year'] = d.strftime("%Y") 
train_df

			Open		High		Low			Close		Adj Close	Volume		Day	Month	Year
Date									
2009-12-31	2292.919922	2293.590088	2269.110107	2269.149902	2269.149902	1237820000	31	12		2009
2010-01-04	2294.409912	2311.149902	2294.409912	2308.419922	2308.419922	1931380000	04	01		2010
2010-01-05	2307.270020	2313.729980	2295.620117	2308.709961	2308.709961	2367860000	05	01		2010
2010-01-06	2307.709961	2314.070068	2295.679932	2301.090088	2301.090088	2253340000	06	01		2010
2010-01-07	2298.090088	2301.300049	2285.219971	2300.050049	2300.050049	2270050000	07	01		2010

We create a set of indicators for the training data with the following code. However, we will make one more restriction in the next step since a model with all these indicators does not achieve good results and would take far too long to train on a local computer.

# Feature Engineering
def createFeatures(df):
    df = pd.DataFrame(df)

    
    df['Close_Diff'] = df['Adj Close'].diff()
        
    # Moving averages - different periods
    df['MA200'] = df['Close'].rolling(window=200).mean() 
    df['MA100'] = df['Close'].rolling(window=100).mean() 
    df['MA50'] = df['Close'].rolling(window=50).mean() 
    df['MA26'] = df['Close'].rolling(window=26).mean() 
    df['MA20'] = df['Close'].rolling(window=20).mean() 
    df['MA12'] = df['Close'].rolling(window=12).mean() 
    
    # SMA Differences - different periods
    df['DIFF-MA200-MA50'] = df['MA200'] - df['MA50']
    df['DIFF-MA200-MA100'] = df['MA200'] - df['MA100']
    df['DIFF-MA200-CLOSE'] = df['MA200'] - df['Close']
    df['DIFF-MA100-CLOSE'] = df['MA100'] - df['Close']
    df['DIFF-MA50-CLOSE'] = df['MA50'] - df['Close']
    
    # Moving Averages on high, lows, and std - different periods
    df['MA200_low'] = df['Low'].rolling(window=200).min()
    df['MA14_low'] = df['Low'].rolling(window=14).min()
    df['MA200_high'] = df['High'].rolling(window=200).max()
    df['MA14_high'] = df['High'].rolling(window=14).max()
    df['MA20dSTD'] = df['Close'].rolling(window=20).std() 
    
    # Exponential Moving Averages (EMAS) - different periods
    df['EMA12'] = df['Close'].ewm(span=12, adjust=False).mean()
    df['EMA20'] = df['Close'].ewm(span=20, adjust=False).mean()
    df['EMA26'] = df['Close'].ewm(span=26, adjust=False).mean()
    df['EMA100'] = df['Close'].ewm(span=100, adjust=False).mean()
    df['EMA200'] = df['Close'].ewm(span=200, adjust=False).mean()

    # Shifts (one day before and two days before)
    df['close_shift-1'] = df.shift(-1)['Close']
    df['close_shift-2'] = df.shift(-2)['Close']

    # Bollinger Bands
    df['Bollinger_Upper'] = df['MA20'] + (df['MA20dSTD'] * 2)
    df['Bollinger_Lower'] = df['MA20'] - (df['MA20dSTD'] * 2)
    
    # Relative Strength Index (RSI)
    df['K-ratio'] = 100*((df['Close'] - df['MA14_low']) / (df['MA14_high'] - df['MA14_low']) )
    df['RSI'] = df['K-ratio'].rolling(window=3).mean() 

    # Moving Average Convergence/Divergence (MACD)
    df['MACD'] = df['EMA12'] - df['EMA26']
    
    # Replace nas 
    nareplace = df.at[df.index.max(), 'Close']    
    df.fillna((nareplace), inplace=True)
    
    return df

Now that we have created several features, we will limit them. We can now choose from these features and test how different feature combinations affect model performance.

# List of considered Features
FEATURES = [
#             'High',
#             'Low',
#             'Open',
              'Close',
#             'Volume',
#             'Day',
#             'Month',
#             'Year',
#             'Adj Close',
#              'close_shift-1',
#              'close_shift-2',
#             'MACD',
#             'RSI',
#             'MA200',
#             'MA200_high',
#             'MA200_low',
            'Bollinger_Upper',
            'Bollinger_Lower',
#             'MA100',            
#             'MA50',
#             'MA26',
#             'MA14_low',
#             'MA14_high',
#             'MA12',
#             'EMA20',
#             'EMA100',
#             'EMA200',
#               'DIFF-MA200-MA50',
#               'DIFF-MA200-MA100',
#             'DIFF-MA200-CLOSE',
#             'DIFF-MA100-CLOSE',
#             'DIFF-MA50-CLOSE'
           ]

# Create the dataset with features
df_features = createFeatures(train_df)

# Shift the timeframe by 10 month
use_start_date = pd.to_datetime("2010-11-01" )
df_features = df_features[df_features.index > use_start_date].copy()

# Filter the data to the list of FEATURES
data_filtered_ext = df_features[FEATURES].copy()

# We add a prediction column and set dummy values to prepare the data for scaling
#data_filtered_ext['Prediction'] = data_filtered_ext['Close'] 
print(data_filtered_ext.tail().to_string())

# remove Date column before training
dfs = data_filtered_ext.copy()

# Create a list with the relevant columns
assetname_list = [dfs.columns[i-1] for i in range(dfs.shape[1])]

# Create the lineplot
fig, ax = plt.subplots(figsize=(16, 8))
sns.lineplot(data=data_filtered_ext[assetname_list], linewidth=1.0, dashes=False, palette='muted')

# Configure and show the plot    
ax.set_title(stockname + ' price chart')
ax.legend()
plt.show

 			Close  			Bollinger_Upper  Bollinger_Lower
Date                                                      
2022-05-18  11418.150391     13404.779247     11065.040772
2022-05-19  11388.500000     13285.741255     11005.463725
2022-05-20  11354.620117     13214.664450     10928.073538
2022-05-23  11535.269531     13075.594634     10920.185347
2022-05-24  11264.450195     13035.543222     10837.607755

Step #4 Scaling and Transforming the Data

Before training our model, we need to transform the data. This step includes scaling the data (to a range between 0 and 1) and dividing it into separate sets for training and testing the prediction model. Most of the code used in this section stems from the previous article on multivariate time-series prediction, which covers the steps to transform the data. So we don’t go into too much detail here.

# Calculate the number of rows in the data
nrows = dfs.shape[0]
np_data_unscaled = np.reshape(np.array(dfs), (nrows, -1))
print(np_data_unscaled.shape)

# Transform the data by scaling each feature to a range between 0 and 1
scaler = RobustScaler()
np_data = scaler.fit_transform(np_data_unscaled)

# Creating a separate scaler that works on a single column for scaling predictions
scaler_pred = RobustScaler()
df_Close = pd.DataFrame(data_filtered_ext['Close'])
np_Close_scaled = scaler_pred.fit_transform(df_Close)

Out: (2619, 6)

Once we have scaled the data, we will split the data into a train and test set. This step creates four datasets x_train and x_test, and y_train and y_test. x_train and x_test contain the data with our selected features. The two sets, y_train and y_test, have the actual values, which our model will try to predict.

# Set the sequence length - this is the timeframe used to make a single prediction
sequence_length = 50 # = number of neurons in the first layer of the neural network

# Split the training data into train and train data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_len = math.ceil(np_data.shape[0] * 0.8)

# Create the training and test data
train_data = np_data[:train_data_len, :]
test_data = np_data[train_data_len - sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, sequence_length time steps per sample, and 6 features
def partition_dataset(sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]

    for i in range(sequence_length, data_len):
        x.append(data[i-sequence_length:i,:]) #contains sequence_length values 0-sequence_length * columsn
        y.append(data[i, 0]) #contains the prediction values for validation,  for single-step prediction
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(sequence_length, train_data)
x_test, y_test = partition_dataset(sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
print(x_train[1][sequence_length-1][0])
print(y_train[0])

Out:
(1914, 30, 3) (1914,) 
(486, 30, 3) (486,)

Step #5 Train the Time Series Forecasting Model

Now that we have prepared the data, we can train our forecasting model. For this purpose, we will use a recurrent neural network from the Keras library. A recurrent neural network (RNN) is a type of artificial neural network that can process sequential data, such as text, audio, or time series data. Unlike traditional feedforward neural networks, in which data flows through the network in only one direction, RNNs have connections that form a directed cycle, allowing information to flow in multiple directions and be processed in a temporal manner.

The model architecture of our RNN looks as follows:

LSTM layer that receives a mini-batch as input.
LSTM layer that has the same number of neurons as the mini-batch
Another LSTM layer that does not return the sequence
Dense layer with 32 neurons
Dense layer with one neuron that outputs the forecast

The architecture is not too complex and is suitable for experimenting with different features. I arrived at this architecture by trying out different layers and configurations. However, I did not spend too much time fine-tuning the architecture since this tutorial focuses on feature engineering.

During model training, the neural network processes several mini-batches. The shape of the mini-batch is defined by the number of features and the period chosen. Multiplying these two dimensions (number of features x number of time steps) gives the input shape of our model.

The following code defines the model architecture, trains the model, and then prints the training loss curve:

# Configure the neural network model
model = Sequential()

# Configure the Neural Network Model with n Neurons - inputshape = t Timestamps x f Features
n_neurons = x_train.shape[1] * x_train.shape[2]
print('timesteps: ' + str(x_train.shape[1]) + ',' + ' features:' + str(x_train.shape[2]))
model.add(LSTM(n_neurons, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))) 
#model.add(Dropout(0.1))
model.add(LSTM(n_neurons, return_sequences=True))
#model.add(Dropout(0.1))
model.add(LSTM(n_neurons, return_sequences=False))
model.add(Dense(32))
model.add(Dense(1, activation='relu'))


# Configure the Model   
optimizer='adam'; loss='mean_squared_error'; epochs = 100; batch_size = 32; patience = 8; 

# uncomment to customize the learning rate
learn_rate = "standard" # 0.05
# adam = Adam(learn_rate=learn_rate) 

parameter_list = ['epochs ' + str(epochs), 'batch_size ' + str(batch_size), 'patience ' + str(patience), 'optimizer ' + str(optimizer) + ' with learn rate ' + str(learn_rate), 'loss ' + str(loss)]
print('Parameters: ' + str(parameter_list))

# Compile and Training the model
model.compile(optimizer=optimizer, loss=loss)
early_stop = EarlyStopping(monitor='loss', patience=patience, verbose=1)
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, callbacks=[early_stop], shuffle = True,
                  validation_data=(x_test, y_test))

# Plot training & validation loss values
fig, ax = plt.subplots(figsize=(12, 6), sharex=True)
plt.plot(history.history["loss"])
plt.title("Model loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend(["Train", "Test"], loc="upper left")
plt.show()

timesteps: 50, features:1
Parameters: ['epochs 100', 'batch_size 32', 'patience 8', 'optimizer adam with learn rate standard', 'loss mean_squared_error']
Epoch 1/100
72/72 [==============================] - 9s 55ms/step - loss: 0.0990 - val_loss: 0.2985
Epoch 2/100
72/72 [==============================] - 3s 37ms/step - loss: 0.0932 - val_loss: 0.1768
Epoch 3/100
72/72 [==============================] - 3s 39ms/step - loss: 0.0931 - val_loss: 0.1246
Epoch 4/100
72/72 [==============================] - 3s 37ms/step - loss: 0.0931 - val_loss: 0.0902
Epoch 5/100
72/72 [==============================] - 3s 38ms/step - loss: 0.0929 - val_loss: 0.0846
Epoch 6/100
72/72 [==============================] - 3s 38ms/step - loss: 0.0930 - val_loss: 0.0611
Epoch 7/100
72/72 [==============================] - 3s 38ms/step - loss: 0.0929 - val_loss: 0.0498
Epoch 8/100
72/72 [==============================] - 3s 37ms/step - loss: 0.0928 - val_loss: 0.0208
Epoch 9/100
72/72 [==============================] - 3s 38ms/step - loss: 0.0929 - val_loss: 0.0588
Epoch 10/100
72/72 [==============================] - 3s 37ms/step - loss: 0.0928 - val_loss: 0.0437
Epoch 11/100
72/72 [==============================] - 3s 36ms/step - loss: 0.0928 - val_loss: 0.0192
Epoch 12/100
...
72/72 [==============================] - 3s 38ms/step - loss: 0.0925 - val_loss: 0.0094
Epoch 46/100
72/72 [==============================] - 3s 37ms/step - loss: 0.0925 - val_loss: 0.0113
Epoch 00046: early stopping

The loss drops quickly, and the training process looks promising.

Step #6 Evaluate Model Performance

If we test a feature, we also want to know how it impacts the performance of our model. Feature Engineering is therefore closely related to evaluating model performance. So, let’s check the prediction performance. For this purpose, we score the model with the test data set (x_test). Then we can compare the predictions with the actual values (y_test) in a lineplot.

# Get the predicted values
y_pred_scaled = model.predict(x_test)

# Unscale the predicted values
y_pred = scaler_pred.inverse_transform(y_pred_scaled)
y_test_unscaled = scaler_pred.inverse_transform(y_test.reshape(-1, 1))
y_test_unscaled.shape

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred)
print(f'Median Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')

# The date from which on the date is displayed
display_start_date = "2019-01-01" 

# Add the difference between the valid and predicted prices
train = pd.DataFrame(dfs['Close'][:train_data_len + 1]).rename(columns={'Close': 'y_train'})
valid = pd.DataFrame(dfs['Close'][train_data_len:]).rename(columns={'Close': 'y_test'})
valid.insert(1, "y_pred", y_pred, True)
valid.insert(1, "residuals", valid["y_pred"] - valid["y_test"], True)
df_union = pd.concat([train, valid])

# Zoom in to a closer timeframe
df_union_zoom = df_union[df_union.index > display_start_date]

# Create the lineplot
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title("y_pred vs y_test")
plt.ylabel(stockname, fontsize=18)
sns.set_palette(["#090364", "#1960EF", "#EF5919"])
sns.lineplot(data=df_union_zoom[['y_pred', 'y_train', 'y_test']], linewidth=1.0, dashes=False, ax=ax1)

# Create the barplot for the absolute errors
df_sub = ["#2BC97A" if x > 0 else "#C92B2B" for x in df_union_zoom["residuals"].dropna()]
ax1.bar(height=df_union_zoom['residuals'].dropna(), x=df_union_zoom['residuals'].dropna().index, width=3, label='absolute errors', color=df_sub)
plt.legend()
plt.show()

Median Absolute Error (MAE): 547.23 
Mean Absolute Percentage Error (MAPE): 4.04 % 
Median Absolute Percentage Error (MDAPE): 3.73 %

On average, the predictions of our model deviate from the actual values by about one percent. Although one percent may not sound like a lot, the prediction errors can quickly accumulate to larger values.

Step #7 Overview of Selected Models

In writing this article, I tested various models based on different features. The neural network architecture remained unchanged. Likewise, I kept the hyperparameters the same except for the learning rate. Below are the results of these model variants:

Step #8 Conclusions

Estimating which indicators will lead to good results in advance is difficult. More indicators do not necessarily lead to better results because they increase the model complexity and add data without predictive power. This so-called noise makes it harder for the model to separate important influencing factors from less important ones. Also, each additional indicator increases the time needed to train the model. So there is no way around testing different variants.

Besides the feature, various hyperparameters such as the learning rate, optimizer, batch size, and the selected time frame of the data (sequence_length) impact the model’s performance. Tuning these hyperparameters can further improve model performance.

A learning rate of 0.05 achieves the best results from the tested configurations.
Of all features, only the Bollinger bands positively affected the model’s performance.
As expected, the performance tends to decrease with the number of features.
In our case, the hyperparameters seem to affect the performance of the models more than the choice of features.

Finally, we have optimized only a single parameter. We searched for optimal learning rates while leaving all other parameters unchanged, such as the optimizer, the neural network architecture, or the sequence length. Based on the results, we can draw several conclusions:

There is plenty of room for improvement and experimentation. With more time for experiments and computational power, it will undoubtedly be possible to identify better features and model configurations. So, have fun experimenting! 🙂

Summary

In this tutorial, we have delved into the fascinating world of feature engineering for stock market forecasting using Python. By exploring various features from chart analysis, such as RSI, moving averages, and Bollinger bands, we have developed multiple variants of a recurrent neural network that produce distinct prediction models.

Our experiments have shown that the choice of features can have a significant impact on the performance of the prediction model. Therefore, it’s essential to carefully select features and consider their potential impact on the model. Additionally, keep in mind that the most effective features for recognizing patterns in historical data will vary depending on the specific time series data being analyzed.

By following the crucial steps outlined in this tutorial, you now have the knowledge and tools to apply feature engineering techniques to any multivariate time series forecasting problem. With further experimentation and testing, you can fine-tune your models to achieve the best possible results for your specific use case.

We hope you found this tutorial both informative and helpful. If you have any questions or comments, don’t hesitate to reach out and let us know.

And if you want to learn more about feature preparation and exploration, check out my recent article on Exploratory Feature Preparation for Regression Models.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Books on Applied Machine Learning

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Mastering Multivariate Stock Market Prediction with Python: A Guide to Effective Feature Engineering Techniques appeared first on relataly.com.

Stock Market Prediction using Multivariate Time Series and Recurrent Neural Networks in Python

Florian Follonier — Mon, 01 Jun 2020 17:20:46 +0000

Regression models based on recurrent neural networks (RNN) can recognize patterns in time series data, making them an exciting technology for stock market forecasting. What distinguishes these RNNs from traditional neural networks is their architecture. It consists of multiple layers of long-term, short-term memory (LSTM). These LSTM layers allow the model to learn patterns in a time series that occur over different periods and are often difficult for human analysts to detect. We can train such models with one feature (univariate forecasting models) or multiple features (multivariate models). Multivariate Models can take more data into account, and if we provide them with relevant features, they can make better predictions. This tutorial uses Python and Keras to implement a multivariate RNN for stock price prediction. We define the architecture of our regression model and then train this model to predict the NASDAQ index.

The remainder of this tutorial proceeds in two parts: We start with a brief intro in which we compare modeling univariate and multivariate time series data. Then we turn to the hands-on part, in which we prepare the multivariate time series data and use it to train a neural network in Python. The model is a recurrent neural network with LSTM layers that forecasts the NASDAQ stock market index. Finally, we evaluate the performance of our model and make a forecast for the next day.

Disclaimer

Stock market forecasting has become an exciting application for recurrent neural networks.

Univariate vs. Multivariate Time Series Models

Multivariate models and univariate models differ in the number of their input features. While univariate models consider only a single feature, multivariate models use several input variables (features). In stock market forecasting, we can create additional features from price history. Examples are performance indicators such as moving averages, the RSI, or the Sales Volume. We can also include features from other sources, for example, social media sentiment, weather forecasts, etc. Multivariate models that have additional relevant information available have a chance to outperform univariate models. However, this is only true if the features are relevant and are indicative of future price movements.

Preparing data for training univariate models is more straightforward than for multivariate models. If you are new to time series prediction, you might want to look at my earlier articles. These explain how to develop and evaluate univariate time series models:

Univariate Prediction Models

In time series regression, the standard approach is to train a model using past values from the time series that need to be predicted. The assumption is that the value of a time series at time t is closely related to the previous time steps t-1, t-2, t-3, and so on. This approach is similar to chart analysis, which involves identifying patterns in a price chart that can indicate future movements. Both approaches rely on the ability to identify recurring patterns in the data and make accurate predictions based on them. The performance of the model or analysis depends on the ability to identify these patterns and draw the right conclusions from them.

Several techniques can be used to improve the performance of time series regression models, including feature engineering, hyperparameter optimization, and ensemble methods. In addition to these techniques, it is also important to carefully evaluate the performance of the model using appropriate metrics, such as mean squared error or mean absolute error, and to continuously monitor the model’s performance to ensure it remains accurate over time.

Univariate Time Series Prediction

Multivariate Prediction Models

Predicting the price of a financial asset is a challenging task due to the numerous variables that can influence it, including economic cycles, political events, unforeseen occurrences, psychological factors, market sentiment, and even the weather. These variables are often interdependent, which makes statistical modeling even more complex. While multivariate models can take into account several factors, they are still a simplification of reality and may not fully capture the complexity of the market. On the other hand, univariate models only consider a single dependent variable, ignoring the other dimensions.

Even with good features, predicting financial prices can be difficult because patterns and market rules may change frequently. As a result, models may make mistakes. However, as Georg Box famously said, “All models are wrong, but some are useful.” Despite their limitations, multivariate models can provide a more detailed representation of reality compared to univariate models, and can still be useful in forecasting financial prices.

Multivariate Time Series Prediction

Implementing a Multivariate Time Series Prediction Model in Python

Now that we have a solid understanding of multivariate time series forecasting, it’s time to put our knowledge into practice by building a model using Python and TensorFlow. Specifically, we will create a multivariate recurrent neural network (RNN) to predict the NASDAQ stock market index. RNNs are well-suited for time series forecasting because they can process sequential data, considering the dependencies between past and future events.

To build our RNN model, we will need to go through several essential steps.

Creating features and scaling: This involves loading and preparing the time series data for modeling, including selecting the time period and relevant features and scaling the data.
Splitting the data: We split the data into train and test sets.
Sliding window approach: The time series data is sliced into mini-batches using the sliding window approach.
Model design and training: The appropriate architecture for the RNN model is chosen, and an optimization algorithm is used to adjust the model’s weights and biases to minimize prediction error.
Model validation and predictions: We evaluate the model’s performance by comparing the predicted values to the actual values of the NASDAQ index. In addition, we will use the model to make predictions about future events.
Unscaling the predictions: The predictions are unscaled to bring them back to their original scale.

The code is available on the GitHub repository.

View on GitHub Relataly Github Repo

Six Essential Steps for Developing a Multivariate Time Series Model

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have a Python environment, follow the steps in this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using Keras (2.0 or higher) with Tensorflow backend, the machine learning library sci-kit-learn, and the pandas-DataReader.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Load the Time Series Data

Let’s start by loading price data on the NASDAQ composite index (symbol: ^IXIC) from yahoo.finance.com into our Python project. To download the data, we use Pandas DataReader – a popular Python library that provides functions to extract data from various sources on the web. Alternatively, you can also use the “yfinance” library.

We provide the technical symbol for the NASDAQ index, “^IXIC.” Alternatively, you could use other asset symbols, for example, BTC-USD, to get price quotes for Bitcoin. In addition, we limit the data in the API request to the timeframe between 2010-01-01 and the current date.

Running the code below will load the data into a new DataFrame object. Be aware that input data and predictions will vary depending on when you execute the code.

# Time Series Forecasting - Multivariate Time Series Models for Stock Market Prediction

import math # Mathematical functions 
import numpy as np # Fundamental package for scientific computing with Python
import pandas as pd # Additional functions for analysing and manipulating data
from datetime import date, timedelta, datetime # Date Functions
from pandas.plotting import register_matplotlib_converters # This function adds plotting functions for calender dates
import matplotlib.pyplot as plt # Important package for visualization - we use this to plot the market data
import matplotlib.dates as mdates # Formatting dates
import tensorflow as tf
from sklearn.metrics import mean_absolute_error, mean_squared_error # Packages for measuring model performance / errors
from tensorflow.keras import Sequential # Deep learning library, used for neural networks
from tensorflow.keras.layers import LSTM, Dense, Dropout # Deep learning classes for recurrent and regular densely-connected layers
from tensorflow.keras.callbacks import EarlyStopping # EarlyStopping during model training
from sklearn.preprocessing import RobustScaler, MinMaxScaler # This Scaler removes the median and scales the data according to the quantile range to normalize the price data 
import seaborn as sns # Visualization
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# check the tensorflow version and the number of available GPUs
print('Tensorflow Version: ' + tf.__version__)
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))

# Setting the timeframe for the data extraction
end_date =  date.today().strftime("%Y-%m-%d")
start_date = '2010-01-01'

# Getting NASDAQ quotes
stockname = 'NASDAQ'
symbol = '^IXIC'

# You can either use webreader or yfinance to load the data from yahoo finance
# import pandas_datareader as webreader
# df = webreader.DataReader(symbol, start=start_date, end=end_date, data_source="yahoo")

import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=start_date, end=end_date)

# Create a quick overview of the dataset
df.head()

Tensorflow Version: 2.5.0
Num GPUs: 1
[*********************100%***********************]  1 of 1 completed
			Open		High		Low			Close		Adj Close	Volume
Date						
2009-12-31	2292.919922	2293.590088	2269.110107	2269.149902	2269.149902	1237820000
2010-01-04	2294.409912	2311.149902	2294.409912	2308.419922	2308.419922	1931380000
2010-01-05	2307.270020	2313.729980	2295.620117	2308.709961	2308.709961	2367860000
2010-01-06	2307.709961	2314.070068	2295.679932	2301.090088	2301.090088	2253340000
2010-01-07	2298.090088	2301.300049	2285.219971	2300.050049	2300.050049	2270050000

The data looks as expected and has the following columns:

High – the daily high
Low – the daily low
Open – the opening price
Close – the closing price
Volume – the daily trading volume
Adj Close – the adjacent closing price

Step #2 Explore the Data

Let’s first familiarize ourselves with the data before processing them further. Line plots are an excellent choice to gain a quick overview of time series data. By running the code below, we loop over the columns to plot a line chart for each column of the dataframe.

# Plot line charts
df_plot = df.copy()

ncols = 2
nrows = int(round(df_plot.shape[1] / ncols, 0))

fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(14, 7))
for i, ax in enumerate(fig.axes):
        sns.lineplot(data = df_plot.iloc[:, i], ax=ax)
        ax.tick_params(axis="x", rotation=30, labelsize=10, length=0)
        ax.xaxis.set_major_locator(mdates.AutoDateLocator())
fig.tight_layout()
plt.show()

The line plots look as expected. We continue with preprocessing and feature engineering.

Step #3 Feature Selection and Scaling

Before we can train the neural network, we need to transform the data into a processable shape. In this section, we perform the following tasks:

Selecting features
Scaling the data to a standard value range

3.1 Selecting Features

First, we will select the features upon which we want to train our neural network. The selection and engineering of relevant feature variables is a complex topic. We could also create additional features such as moving averages, but I want to keep things simple. Therefore, we select features that are already present in our data. To learn more about feature engineering for stock market prediction, check out the relataly feature engineering tutorial.

Feature Selection of Multivariate Time Series Models

Running the code below selects the features. We add a dummy column to our record called “Predictions,” which will help us later when we need to reverse the scaling of our data.

# Indexing Batches
train_df = df.sort_values(by=['Date']).copy()

# List of considered Features
FEATURES = ['High', 'Low', 'Open', 'Close', 'Volume'
            #, 'Month', 'Year', 'Adj Close'
           ]

print('FEATURE LIST')
print([f for f in FEATURES])

# Create the dataset with features and filter the data to the list of FEATURES
data = pd.DataFrame(train_df)
data_filtered = data[FEATURES]

# We add a prediction column and set dummy values to prepare the data for scaling
data_filtered_ext = data_filtered.copy()
data_filtered_ext['Prediction'] = data_filtered_ext['Close']

# Print the tail of the dataframe
data_filtered_ext.tail()

FEATURE LIST
['High', 'Low', 'Open', 'Close', 'Volume']
			High			Low				Open			Close			Volume		Prediction
Date						
2022-05-09	11990.610352	11574.940430	11923.030273	11623.250000	5911380000	11623.250000
2022-05-10	11944.940430	11566.280273	11900.339844	11737.669922	6199090000	11737.669922
2022-05-11	11844.509766	11339.179688	11645.570312	11364.240234	6120860000	11364.240234
2022-05-12	11547.330078	11108.759766	11199.250000	11370.959961	6647400000	11370.959961
2022-05-13	11856.709961	11510.259766	11555.969727	11805.000000	5868610000	11805.000000

3.2 Scaling the Multivariate Input Data

Another necessary step in data preparation for neural networks is scaling the input data. Scaling will increase training times and improve model accuracy. The scikit-learn package offers different scaling approaches. We use the MinMaxScaler to scale the input data to a range between 0 and 1.

A model that is trained on scaled data will also produce scaled predictions. Therefore, when we make predictions later with our model, we must not forget to scale the predictions back. The scaler_model will adapt to the shape of the data (6-dimensional). However, our predictions will be one-dimensional. Because the scaler has a fixed input shape, we cannot simply reuse it for unscaling our model predictions. To unscale the predictions later, we create an additional scaler that works on a single feature column (scaler_pred).

# Get the number of rows in the data
nrows = data_filtered.shape[0]

# Convert the data to numpy values
np_data_unscaled = np.array(data_filtered)
np_data = np.reshape(np_data_unscaled, (nrows, -1))
print(np_data.shape)

# Transform the data by scaling each feature to a range between 0 and 1
scaler = MinMaxScaler()
np_data_scaled = scaler.fit_transform(np_data_unscaled)

# Creating a separate scaler that works on a single column for scaling predictions
scaler_pred = MinMaxScaler()
df_Close = pd.DataFrame(data_filtered_ext['Close'])
np_Close_scaled = scaler_pred.fit_transform(df_Close)

Out: (2619, 6)

Step #4 Transforming the Multivariate Data

Next, we train our multivariate regression model based on a three-dimensional data structure. The first dimension is the sequences, the second dimension is the time steps (mini-batches), and the third dimension is the features. The illustration below shows the steps to bring the multivariate data into a shape our neural model can process during training. We must keep this form and perform the same steps when using the model to create a forecast.

An essential step in the preparation process is slicing the data into multiple input data sequences with associated target values. We write a simple Python script that uses a “sliding window.” This approach moves a window through the time series data, adding a sequence of multiple data points to the input data with each step. The target value (e.g., Closing Price) follows this sequence, and we store it in a separate target dataset. Then we push the window one step further and repeat these activities. This process results in a data set with many input sequences (mini-batches), each with a corresponding target value in the target record. This process applies both to the training and the test data.

Sliding Window

We will apply the sliding window approach to our data. The result is a training set (x_train) containing 2258 input sequences, each with 50 steps and six features. The related target dataset (y_train) has 2258 target values.

# Set the sequence length - this is the timeframe used to make a single prediction
sequence_length = 50

# Prediction Index
index_Close = data.columns.get_loc("Close")

# Split the training data into train and train data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_len = math.ceil(np_data_scaled.shape[0] * 0.8)

# Create the training and test data
train_data = np_data_scaled[0:train_data_len, :]
test_data = np_data_scaled[train_data_len - sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, sequence_length time steps per sample, and 6 features
def partition_dataset(sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]
    for i in range(sequence_length, data_len):
        x.append(data[i-sequence_length:i,:]) #contains sequence_length values 0-sequence_length * columsn
        y.append(data[i, index_Close]) #contains the prediction values for validation,  for single-step prediction
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(sequence_length, train_data)
x_test, y_test = partition_dataset(sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
print(x_train[1][sequence_length-1][index_Close])
print(y_train[0])

(2474, 50, 5) (2474,)
(630, 50, 5) (630,)
0.02049456793614579
0.02049456793614579

Step #5 Train the Multivariate Prediction Model

Once we have the data prepared and ready, we can train our model. The architecture of our neural network consists of the following four layers:

An LSTM layer, which takes our mini-batches as input and returns the whole sequence
Another LSTM layer that takes the sequence from the previous layer but only returns five values
Dense layer with five neurons
A final dense layer that outputs the predicted value

The number of neurons in the first layer must equal the size of a minibatch of the input data. Each minibatch in our dataset consists of a matrix with 50 steps and six features. Thus, the input layer of our recurrent neural network consists of 300 neurons. Keeping this architecture in mind is essential because, later, we need to bring the data into the same shape when we want to predict a new dataset. Running the code below creates the model architecture and compiles the model.

# Configure the neural network model
model = Sequential()

# Model with n_neurons = inputshape Timestamps, each with x_train.shape[2] variables
n_neurons = x_train.shape[1] * x_train.shape[2]
print(n_neurons, x_train.shape[1], x_train.shape[2])
model.add(LSTM(n_neurons, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))) 
model.add(LSTM(n_neurons, return_sequences=False))
model.add(Dense(5))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mse')

Running the code below starts the training process.

# Training the model
epochs = 50
batch_size = 16
early_stop = EarlyStopping(monitor='loss', patience=5, verbose=1)
history = model.fit(x_train, y_train, 
                    batch_size=batch_size, 
                    epochs=epochs,
                    validation_data=(x_test, y_test)
                   )
                    
                    #callbacks=[early_stop])

Output exceeds the size limit. Open the full output data in a text editor
Epoch 1/50
155/155 [==============================] - 6s 19ms/step - loss: 6.7374e-04 - val_loss: 0.0011
Epoch 2/50
155/155 [==============================] - 2s 13ms/step - loss: 6.6207e-05 - val_loss: 8.4700e-04
Epoch 3/50
155/155 [==============================] - 2s 14ms/step - loss: 5.0667e-05 - val_loss: 6.6467e-04
Epoch 4/50
155/155 [==============================] - 2s 14ms/step - loss: 5.8446e-05 - val_loss: 6.8575e-04
Epoch 5/50
155/155 [==============================] - 2s 14ms/step - loss: 4.8430e-05 - val_loss: 8.4892e-04
Epoch 6/50
155/155 [==============================] - 2s 14ms/step - loss: 7.1283e-05 - val_loss: 8.2255e-04
Epoch 7/50
155/155 [==============================] - 2s 15ms/step - loss: 6.0554e-05 - val_loss: 8.0583e-04
Epoch 8/50
155/155 [==============================] - 2s 15ms/step - loss: 5.5977e-05 - val_loss: 4.5830e-04
Epoch 9/50
155/155 [==============================] - 2s 15ms/step - loss: 4.2453e-05 - val_loss: 6.1866e-04
Epoch 10/50
155/155 [==============================] - 2s 14ms/step - loss: 3.5722e-05 - val_loss: 4.5288e-04
Epoch 11/50
155/155 [==============================] - 2s 14ms/step - loss: 4.1409e-05 - val_loss: 8.5975e-04
Epoch 12/50
155/155 [==============================] - 3s 18ms/step - loss: 7.0007e-05 - val_loss: 5.0300e-04
Epoch 13/50
...
Epoch 49/50
155/155 [==============================] - 2s 13ms/step - loss: 2.7064e-05 - val_loss: 2.8202e-04
Epoch 50/50
155/155 [==============================] - 2s 13ms/step - loss: 2.9009e-05 - val_loss: 2.5486e-04

Let’s take a quick look at the loss curve.

# Plot training & validation loss values
fig, ax = plt.subplots(figsize=(16, 5), sharex=True)
sns.lineplot(data=history.history["loss"])
plt.title("Model loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend(["Train", "Test"], loc="upper left")
plt.grid()
plt.show()

The loss drops quickly to a lower plateau, which signals that the model has improved throughout the training process.

Step #6 Evaluate Model Performance

Once we have trained the neural network regression model, we want to measure its performance. As mentioned in section 3, we first have to reverse the scaling of the predictions. Afterward, we calculate different error metrics, MAE, MAPE, and MDAPE. Then we will compare the predictions in a line plot with the actual values. For more information on measuring the performance of regression models, see this relataly article.

# Get the predicted values
y_pred_scaled = model.predict(x_test)

# Unscale the predicted values
y_pred = scaler_pred.inverse_transform(y_pred_scaled)
y_test_unscaled = scaler_pred.inverse_transform(y_test.reshape(-1, 1))

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred)
print(f'Median Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')

Median Absolute Error (MAE): 175.28
Mean Absolute Percentage Error (MAPE): 1.48 %
Median Absolute Percentage Error (MDAPE): 1.16 %

The MAPE is 22.15, which means that the mean of our predictions deviates from the actual values by 3.12%. The MDAPE is 2.88 % and a bit lower than the mean, thus indicating there are some outliers among the prediction errors. 50% of the predictions deviate by more than 2.88%, and 50% differ by less than 2.88% from the actual values.

Next, we create a line plot showing the forecast and compare it to the actual values. Adding a bar plot to the chart helps highlight the deviations of the predictions from the actual values. Running the code below creates the line plot.

# The date from which on the date is displayed
display_start_date = "2019-01-01" 

# Add the difference between the valid and predicted prices
train = pd.DataFrame(data_filtered_ext['Close'][:train_data_len + 1]).rename(columns={'Close': 'y_train'})
valid = pd.DataFrame(data_filtered_ext['Close'][train_data_len:]).rename(columns={'Close': 'y_test'})
valid.insert(1, "y_pred", y_pred, True)
valid.insert(1, "residuals", valid["y_pred"] - valid["y_test"], True)
df_union = pd.concat([train, valid])

# Zoom in to a closer timeframe
df_union_zoom = df_union[df_union.index > display_start_date]

# Create the lineplot
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title("y_pred vs y_test")
plt.ylabel(stockname, fontsize=18)
sns.set_palette(["#090364", "#1960EF", "#EF5919"])
sns.lineplot(data=df_union_zoom[['y_pred', 'y_train', 'y_test']], linewidth=1.0, dashes=False, ax=ax1)

# Create the bar plot with the differences
df_sub = ["#2BC97A" if x > 0 else "#C92B2B" for x in df_union_zoom["residuals"].dropna()]
ax1.bar(height=df_union_zoom['residuals'].dropna(), x=df_union_zoom['residuals'].dropna().index, width=3, label='residuals', color=df_sub)
plt.legend()
plt.show()

The line plot shows that the forecast is close to the actual values but partially deviates from it. The deviations between actual values and predictions are called residuals. For our mode, they seem to be most significant during periods of increased market volatility and least during periods of steady market movement, which makes sense because sudden movements are generally more difficult to predict.

Step #7 Predict the Next Day’s Price

After training the neural network, we want to forecast the stock market for the next day. For this purpose, we extract a new dataset from the Yahoo-Finance API and preprocess it as we did for model training.

We trained our model with mini-batches of 50-steps and six features. Thus, we must also provide the model with 50-steps when making the forecast. As before, we transform the data into the shape of 1 x 50 x 6, whereby the last figure is the number of feature columns. After generating the forecast, we unscale the stock market predictions back to the original range of values.

df_temp = df[-sequence_length:]
new_df = df_temp.filter(FEATURES)

N = sequence_length

# Get the last N day closing price values and scale the data to be values between 0 and 1
last_N_days = new_df[-sequence_length:].values
last_N_days_scaled = scaler.transform(last_N_days)

# Create an empty list and Append past N days
X_test_new = []
X_test_new.append(last_N_days_scaled)

# Convert the X_test data set to a numpy array and reshape the data
pred_price_scaled = model.predict(np.array(X_test_new))
pred_price_unscaled = scaler_pred.inverse_transform(pred_price_scaled.reshape(-1, 1))

# Print last price and predicted price for the next day
price_today = np.round(new_df['Close'][-1], 2)
predicted_price = np.round(pred_price_unscaled.ravel()[0], 2)
change_percent = np.round(100 - (price_today * 100)/predicted_price, 2)

plus = '+'; minus = ''
print(f'The close price for {stockname} at {end_date} was {price_today}')
print(f'The predicted close price is {predicted_price} ({plus if change_percent > 0 else minus}{change_percent}%)')

The close price for NASDAQ on 2021-06-27 was 14360.39. The predicted closing price is 14232.8095703125 (-0.9%)

Summary

This tutorial has shown multivariate time series modeling for stock market prediction in Python. We trained a neural network regression model for predicting the NASDAQ index. Before training our model, we performed several steps to prepare the data. The steps included splitting the data and scaling them. In addition, we created and tested various new features from the original time series data to account for the multivariate modeling approach. You now have the knowledge and code to conduct further experiments with the features of your choice.

Multivariate time series forecasting is a complex topic. You might want to take the time to retrace the different steps. Especially the transformation of the data can be challenging. The best way to learn is to practice. Therefore I encourage you to develop more time series models and experiment with other data sources.

Another interesting approach to stock market prediction uses candlestick images and convolutional neural networks. If this topic interests you, check out the following article: Deep reinforcement learning stock market trading, utilizing a CNN with candlestick images

I am always trying to learn and improve. If you want to give feedback or have remarks, feel free to share them in the comments.

Stockmarket forecasting with a neural network is about identifying meaningful patterns. Be aware that there is no guarantee that these patterns are present in the data. Image created with Midjourney.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Stock Market Prediction using Multivariate Time Series and Recurrent Neural Networks in Python appeared first on relataly.com.