Building Multivariate Time Series Models for Stock Market Prediction with Python

Time series prediction has become a major domain for the application of machine learning, and more specifically recurrent neural networks. Well-designed multivariate prediction models can recognize patterns in large amounts of data, often allowing them to make more accurate predictions than human analysts. This has opened up new possibilities for generating signals for automated buy and sell decisions on financial assets. In recent years, such algorithmic trading has led to growing interest in machine learning among financial institutions and fintechs. This blog post continues the learning journey towards time series analysis and introduces the multivariate modeling of stock market data.

This article starts with a short introduction to modeling univariate and multivariate time series data before showing how to implement a multivariate model in Python for stock market forecasting. The code example uses a recurrent neural network. We train this model with a decade of price data from the NASDAQ stock market index and make a forecast for the next day.

Warning: Stock markets can be highly volatile and are generally difficult to predict. The prediction model developed in this post only serves to illustrate a use case for time series models. It should not be assumed that a simple neural network as described in this blog post can fully capture the complexity of price developments.

Univariate vs Multivariate Time Series

In general, time series models can be distinguished by whether they are one-dimensional (univariate) or multidimensional (multivariate).

Univariate Time Series

Univariate time-series data, as the name suggests, focuses on a single dependent variable. The basic assumption behind the univariate prediction approach is that the value of a time-series at time-step t is closely related to the values at the previous time-steps t-1, t-2, t-3, and so on.
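To make this concrete, here is a minimal sketch (with made-up numbers and a hypothetical window size of 3) of how a univariate series is typically turned into supervised training pairs:

import numpy as np

# A univariate series: each target y is preceded by a window of lagged values
series = np.array([10, 12, 13, 15, 14, 16, 18])
window = 3  # hypothetical window size

X, y = [], []
for t in range(window, len(series)):
    X.append(series[t - window:t])  # the values at t-3, t-2, t-1
    y.append(series[t])             # the value to predict at time step t

X, y = np.array(X), np.array(y)
print(X.shape, y.shape)  # (4, 3) (4,)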

Univariate models are easier to develop than multivariate models. If you are not familiar with time series prediction, you might want to take a look at my earlier articles first, in which I explain the steps to develop and evaluate univariate time series models.

The dependent variable in stock market forecasting is usually the closing or opening price of a financial asset. A forecasting model that is trained solely on the basis of price developments attempts, similar to chart analysis, to identify patterns and formations in the price chart that have provided indications of future price developments in the past. However, patterns and the rules of the market are constantly changing, so models inevitably make mistakes. But so do people…

Multivariate Time Series

Forecasting the price of a financial asset is a complex task. In general, the price is determined by a myriad of variables. Economic cycles, political developments, unforeseen events, psychological factors, market sentiment, and even the weather – all of these exert a greater or lesser influence on the price. In addition, many of these variables are interdependent, which makes statistical modelling even more complex.

A univariate forecasting model reduces this complexity to a minimum: a single factor. All other dimensions are ignored. A multivariate model can still only account for a fraction of the influencing factors, but it can consider several of them simultaneously. For example, a multivariate stock market forecasting model can consider not only the closing price, but also the opening price, moving averages, daily highs, the prices of other stocks, and so on. Multivariate models cannot fully capture the complexity of the market either. However, they offer a more detailed abstraction of reality than univariate models, which is why they tend to provide more accurate predictions.
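As a small illustration (not part of the model below), such additional feature columns can be derived from the raw price data with a few lines of pandas; the column names here are hypothetical:

import pandas as pd

# A hypothetical price frame; in practice this would be the downloaded quotes
df_demo = pd.DataFrame({'Open':  [100, 102, 101, 105, 107],
                        'Close': [101, 103, 104, 106, 108]})

# Derive extra feature columns from the existing price columns
df_demo['SMA_3'] = df_demo['Close'].rolling(window=3).mean()  # 3-day moving average
df_demo['Spread'] = df_demo['Close'] - df_demo['Open']        # intraday change
print(df_demo)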

Note: The selection and engineering of relevant feature variables is an art in itself, and I plan to address this topic in a separate blog post in the near future.

Python Implementation

In the following, this post will guide you through the steps to create a multivariate recurrent neural network that predicts the price of the NASDAQ stock market index.

The example covers the following steps:

  1. Load the data
  2. Explore the data
  3. Feature selection and scaling
  4. Transforming the data
  5. Train a multivariate time-series prediction model
  6. Evaluate model performance
  7. Predict the next day's price

Python Environment

This tutorial assumes that you have set up your Python environment. I recommend using the Anaconda environment. If you have not yet set it up, you can follow this tutorial.

It is also assumed that you have the following packages installed: keras (2.0 or higher) with the TensorFlow backend, numpy, pandas, matplotlib, and scikit-learn. Each package can be installed using the console command:

pip install <packagename>
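For example, the following single command installs all packages used in this tutorial (assuming the standalone keras package with a TensorFlow backend, as in the imports below):

pip install pandas-datareader numpy pandas matplotlib scikit-learn tensorflow keras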

1) Load the data

Let’s start by setting up the imports.

# Remote data access for pandas
import pandas_datareader as webreader
# Mathematical functions 
import math 
# Fundamental package for scientific computing with Python
import numpy as np 
# Additional functions for analysing and manipulating data
import pandas as pd 
# Date Functions
from datetime import date, timedelta, datetime
# This function adds plotting functions for calendar dates
from pandas.plotting import register_matplotlib_converters
# Important package for visualization - we use this to plot the market data
import matplotlib.pyplot as plt 
# Formatting dates
import matplotlib.dates as mdates
# Packages for measuring model performance / errors
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Deep learning library, used for neural networks
from keras.models import Sequential 
# Deep learning classes for recurrent and regular densely-connected layers
from keras.layers import LSTM, Dense, Dropout
# EarlyStopping during model training
from keras.callbacks import EarlyStopping
# This Scaler removes the median and scales the data according to the quantile range to normalize the price data 
from sklearn.preprocessing import RobustScaler

In the next step, we load price data on the NASDAQ Composite index (symbol: ^IXIC) from finance.yahoo.com into our Python project. “^IXIC” is the ticker symbol for the NASDAQ Composite. Alternatively, you could use any other asset symbol, such as BTC-USD to get price quotes for Bitcoin.

To download the data, we’ll use pandas-datareader – a popular library that provides functions to extract data from various Internet sources into a pandas DataFrame. The following code extracts the price data.

# Setting the timeframe for the data extraction
today = date.today()
date_today = today.strftime("%Y-%m-%d")
date_start = '2010-01-01'

# Getting NASDAQ quotes
stockname = 'NASDAQ'
symbol = '^IXIC'
df = webreader.DataReader(
    symbol, start=date_start, end=date_today, data_source="yahoo"
)

# Create a quick overview of the dataset
train_dfs = df.copy()
train_dfs
Our dataset: historical price data on the NASDAQ

We have limited the data in the API request to the timeframe between 2010-01-01 and the current date. So when you execute the code, the results will show a larger period than in this tutorial.

2) Explore the data

Whenever you are working with a new dataset, it is a good idea to familiarize yourself with the data before you start transforming it. Line plots are particularly well suited for this purpose with time series data. The following code creates a line plot for each column of our NASDAQ dataset:

# Plot each column
register_matplotlib_converters()
nrows = 3
ncols = int(round(train_dfs.shape[1] / nrows, 0))
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(16, 7))
fig.subplots_adjust(hspace=0.3, wspace=0.3)
x = train_dfs.index
columns = train_dfs.columns
f = 0
for i in range(nrows):
    for j in range(ncols):
        ax[i, j].xaxis.set_major_locator(mdates.YearLocator())
        assetname = columns[f]
        y = train_dfs[assetname]
        f += 1
        ax[i, j].plot(x, y, color='#039dfc', label=stockname, linewidth=1.0)
        ax[i, j].set_title(assetname)
        ax[i, j].tick_params(axis="x", rotation=90, labelsize=10, length=0)   
plt.show()
Line plots of our feature columns

The plots should look as expected, so we can continue with pre-processing and feature engineering.

3) Feature selection and scaling

In this section, we will transform the data and bring it into the shape needed by our recurrent neural network. We begin by removing the date index and adding a number index instead. This makes it easier to reference specific columns later. In addition, we add columns for month and year.

# Indexing Batches
train_df = train_dfs.sort_values(by=['Date']).copy()

# We save a copy of the date index before we reset it to numbers
date_index = train_df.index

# Adding Month and Year in separate columns
d = pd.to_datetime(train_df.index)
train_df['Month'] = d.strftime("%m") 
train_df['Year'] = d.strftime("%Y") 

# We reset the index, so we can convert the date-index to a number-index
train_df = train_df.reset_index(drop=True).copy()
train_df.head(5)

Next, we will choose the features on which we want to train our neural network. We could also create additional features such as moving averages and add them to the dataset. But this form of feature engineering is a topic in itself, and I will cover it in a separate blog post. So here we will concentrate on selecting a number of continuous numeric features that already exist in the original dataset.

Feature selection during preprocessing

The following code selects the features. We also add an additional column called “Prediction”, filled with dummy values (copies of the closing price), to prepare the data for the scaling step that follows.

# List of considered Features
FEATURES = ['High', 'Low', 'Open', 'Close', 'Volume', 'Month', 
            #'Year',
            #'Adj Close'
           ]

print('FEATURE LIST')
print([f for f in FEATURES])

# Create the dataset with features and filter the data to the list of FEATURES
data = pd.DataFrame(train_df)
data_filtered = data[FEATURES]

# We add a prediction column and set dummy values to prepare the data for scaling
data_filtered_ext = data_filtered.copy()
data_filtered_ext['Prediction'] = data_filtered_ext['Close'] 

# Print the tail of the dataframe
data_filtered_ext.tail()

When working with neural networks, a standard approach is to scale the data. Data scaling brings the model data into a format that makes training the neural network faster and improves prediction accuracy; it is closely related to data normalization. In the following, we use the RobustScaler from the scikit-learn package. This scaler is robust against outliers: it removes the median and scales the data according to the quantile range. If a neural network is trained with scaled data, the model’s predictions will also be scaled. Therefore, we need to unscale the predictions later. For this purpose, we create an additional scaler that works on a single feature column (scaler_pred).
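As a quick illustration of what the RobustScaler does, consider a toy example that is independent of our dataset: each value becomes (x - median) / IQR, where IQR is the interquartile range of the column.

import numpy as np
from sklearn.preprocessing import RobustScaler

# A toy column with one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler_demo = RobustScaler()
x_scaled = scaler_demo.fit_transform(x)  # (x - median) / IQR, per column

print(scaler_demo.center_)  # [3.] -> the column median
print(x_scaled.ravel())     # [-1.  -0.5  0.   0.5  48.5] -> the bulk is centered, the outlier is not squashed

Now let's scale our actual dataset: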

# Calculate the number of rows in the data
nrows = data_filtered.shape[0]
np_data_unscaled = np.array(data_filtered)
np_data_unscaled = np.reshape(np_data_unscaled, (nrows, -1))
print(np_data_unscaled.shape)

# Scale each feature with the RobustScaler (centering on the median, scaling by the quantile range)
scaler = RobustScaler()
np_data = scaler.fit_transform(np_data_unscaled)

# Creating a separate scaler that works on a single column for scaling predictions
scaler_pred = RobustScaler()
df_Close = pd.DataFrame(data_filtered_ext['Close'])
np_Close_scaled = scaler_pred.fit_transform(df_Close)
Out: (2619, 6)

4) Transforming the data

Before we can train the neural network, the data must be brought into the right shape. The transformation process is illustrated below:

Data preprocessing

In multivariate time series prediction, the model is trained on many sequences to predict the value of the time steps following the sequences. Accordingly, the data structure required for this is three-dimensional. The first dimension is the sequences, the second dimension is the time steps in a sequence and the third dimension is the features.
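To make the target structure tangible, here is a miniature example (with toy values) of such a three-dimensional array:

import numpy as np

# 2 sequences, 3 time steps per sequence, 2 features per time step
x_mini = np.array([[[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
                   [[2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]])
print(x_mini.shape)  # (2, 3, 2) -> [samples, time steps, features]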

To bring our data into this structure, we walk through the training set step by step. In this process, we create a separate input sequence for each time step, slicing the training dataset into many overlapping mini-batches.

The sequences are taken from the data and added to a three-dimensional array x_train. In our case, x_train contains 1996 sequences, each containing 100 time steps and 6 features.

At the same time, we create another dataset, y_train, with the target variable: the value of the time step immediately following each sequence. Consequently, x_train and y_train have the same length. For the test dataset we proceed in the same way, resulting in the datasets x_test and y_test.

#Settings
sequence_length = 100

# Split the training data into x_train and y_train data sets
# Get the number of rows to train the model on (80% of the data)
train_data_len = math.ceil(np_data.shape[0] * 0.8)  # 2096

# Create the training data
train_data = np_data[0:train_data_len, :]
x_train, y_train = [], []
# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, 100 time steps per sample, and 6 features
for i in range(sequence_length, train_data_len):
    x_train.append(train_data[i-sequence_length:i, :])  # each sequence contains 100 rows of all feature columns
    y_train.append(train_data[i, 3])  # column 3 holds the 'Close' price - our prediction target
    
# Convert the x_train and y_train to numpy arrays
x_train, y_train = np.array(x_train), np.array(y_train)

# Create the test data
test_data = np_data[train_data_len - sequence_length:, :]

# Split the test data into x_test and y_test
x_test, y_test = [], []
test_data_len = test_data.shape[0]
for i in range(sequence_length, test_data_len):
    x_test.append(test_data[i-sequence_length:i, :])  # each sequence contains 100 rows of all feature columns
    y_test.append(test_data[i, 3])  # column 3 holds the 'Close' price - our prediction target

# Convert the x_test and y_test to numpy arrays
x_test, y_test = np.array(x_test), np.array(y_test)
    
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(1996, 100, 6) (1996,) 
(523, 100, 6) (523,)
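As a side note, on NumPy 1.20 or newer the same slicing can be done without an explicit Python loop via sliding_window_view. This is just an alternative sketch, assuming (as above) that the 'Close' target sits in column 3 of the scaled data:

from numpy.lib.stride_tricks import sliding_window_view

# All windows of sequence_length consecutive rows, reordered to [samples, time steps, features]
windows = sliding_window_view(train_data, sequence_length, axis=0).transpose(0, 2, 1)

# Drop the last window (it has no following target row) and take 'Close' as the target
x_train_alt = windows[:-1]
y_train_alt = train_data[sequence_length:, 3]
print(x_train_alt.shape, y_train_alt.shape)  # (1996, 100, 6) (1996,) - same as the loop above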

5) Train the prediction model

Now we are ready to create and train our recurrent neural network prediction model. To ensure that the model can handle the shape of the input data, we need to define the input layer carefully. We also size the LSTM layer so that the number of neurons matches the number of values in one input sequence. In our case, each input sequence is a matrix with 100 time steps and 6 features.

# Configure the neural network model
model = Sequential()

# LSTM layer sized to the input: 100 time steps * 6 features = 600 neurons
# input_shape = 100 time steps, each with x_train.shape[2] features
n_neurons = x_train.shape[1] * x_train.shape[2]
print(n_neurons, x_train.shape[1], x_train.shape[2])
model.add(LSTM(n_neurons, return_sequences=False, 
               input_shape=(x_train.shape[1], x_train.shape[2]))) 
model.add(Dense(1, activation='relu'))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
# Training the model
epochs = 5
early_stop = EarlyStopping(monitor='loss', patience=2, verbose=1)
history = model.fit(x_train, y_train, batch_size=16, 
                    epochs=epochs, callbacks=[early_stop])
Epoch 1/5
1996/1996 [==============================] - 81s 41ms/step - loss: 0.1228
Epoch 2/5
1996/1996 [==============================] - 83s 42ms/step - loss: 0.1219
Epoch 3/5
1996/1996 [==============================] - 82s 41ms/step - loss: 0.1219
Epoch 4/5
1996/1996 [==============================] - 92s 46ms/step - loss: 0.1219
Epoch 5/5
1996/1996 [==============================] - 89s 45ms/step - loss: 0.1219

Taking a look at the training loss, we see that it drops slightly in the first epoch and then settles on a plateau, which suggests that the model converged quickly over the course of the training process.

# Plot the training loss values
fig, ax = plt.subplots(figsize=(5, 5), sharex=True)
plt.plot(history.history["loss"])
plt.title("Model loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend(["Train"], loc="upper left")
plt.grid()
plt.show()
Loss Curve

6) Evaluate model performance

Now that we have trained our model, it is time to take a closer look at its performance. To do this, we will first calculate two error metrics, MAPE and MDAPE. Then we will compare the predictions with the actual values in a chart.

# Get the predicted values
predictions = model.predict(x_test)

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean(np.abs(np.subtract(y_test, predictions.ravel()) / y_test)) * 100  # ravel() aligns the (n, 1) predictions with the (n,) targets
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE, 2)) + ' %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median(np.abs(np.subtract(y_test, predictions.ravel()) / y_test)) * 100
print('Median Absolute Percentage Error (MDAPE): ' + str(np.round(MDAPE, 2)) + ' %')
Mean Absolute Percentage Error (MAPE): 22.14 %
Median Absolute Percentage Error (MDAPE): 17.9 %

The MAPE is 22.14%, which means that the mean of our predictions deviates from the actual values by 22.14%. The MDAPE of 17.9% is a bit lower than the mean, which indicates that there are some outliers among the prediction errors: half of the predictions deviate from the actual values by more than 17.9%, and half by less.
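The difference between the two metrics becomes clear with a small toy example (hypothetical error values): a single large outlier pulls the mean up, while the median stays put.

import numpy as np

# Hypothetical absolute percentage errors: four small errors and one outlier (in %)
abs_pct_errors = np.array([5.0, 6.0, 7.0, 8.0, 60.0])

print(np.mean(abs_pct_errors))    # 17.2 -> the mean (MAPE) is dragged up by the outlier
print(np.median(abs_pct_errors))  # 7.0  -> the median (MDAPE) is robust against it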

Next, we can plot the predictions and compare them to the actual values. To do this, we undo the scaling of the predictions and then add the unscaled predictions to the original dataset.

# Get the predicted values
pred_unscaled = scaler_pred.inverse_transform(predictions)

# The date from which on the data is displayed
display_start_date = pd.Timestamp('today') - timedelta(days=500)

# Add the date column
data_filtered_sub = data_filtered.copy()
data_filtered_sub['Date'] = date_index

# Add the difference between the predicted and actual prices
train = data_filtered_sub[:train_data_len + 1]
valid = data_filtered_sub[train_data_len:]
valid.insert(1, "Prediction", pred_unscaled.ravel(), True)
valid.insert(1, "Difference", valid["Prediction"] - valid["Close"], True)

# Zoom in to a closer timeframe

valid = valid[valid['Date'] > display_start_date]
train = train[train['Date'] > display_start_date]

# Visualize the data
fig, ax1 = plt.subplots(figsize=(22, 10), sharex=True)
xt = train['Date']; yt = train[["Close"]]
xv = valid['Date']; yv = valid[["Close", "Prediction"]]
plt.title("Predictions vs Actual Values", fontsize=20)
plt.ylabel(stockname, fontsize=18)
plt.plot(xt, yt, color="#039dfc", linewidth=2.0)
plt.plot(xv, yv["Prediction"], color="#E91D9E", linewidth=2.0)
plt.plot(xv, yv["Close"], color="black", linewidth=2.0)
plt.legend(["Train", "Test Predictions", "Actual Values"], loc="upper left")

# Create the bar plot with the differences
x = valid['Date']
y = valid["Difference"]

# Create custom color range for positive and negative differences
valid.loc[y >= 0, 'diff_color'] = "#2BC97A"
valid.loc[y < 0, 'diff_color'] = "#C92B2B"

plt.bar(x, y, width=0.8, color=valid['diff_color'])
plt.grid()
plt.show()

I have added a bar plot that shows the deviations of the predictions from the actual values in color. The bars are green where the predictions are above the actual values and red where they are below.

The bar plot shows that the deviations between the actual values and the predictions vary greatly. The deviations are greatest during periods of increased market volatility and smallest during periods of steady market movement.

7) Predict next day’s price

Now we can use our multivariate time series model to make a forecast for the next day. For this purpose, we will extract a new dataset from the Yahoo Finance API. Our model was trained with mini-batches of 100 time steps, so for the model to make a prediction for tomorrow, we need to feed it the data of the last 100 time steps. To do this, we send a new API request to Yahoo Finance. Then we transform the data into the shape 1 x 100 x 6, where the last figure is the number of feature columns.

Since the stock exchange is only open on weekdays, we have to set the period in our API request to a little more than 100 days. To make sure we get enough data, I set the period to 200. Then we can limit the period to exactly 100 time steps. The following code will do this for us and then run the prediction model. Finally, we scale the prediction back to the original range of values before we output the result.

# Get fresh data until today and create a new dataframe with only the price data
date_start = date.today() - timedelta(days=200)
print(date_start)
new_df = webreader.DataReader(symbol, data_source='yahoo', start=date_start, end=date_today)
d = pd.to_datetime(new_df.index)
new_df['Month'] = d.strftime("%m") 
new_df['Year'] = d.strftime("%Y") 
new_df = new_df.filter(FEATURES)

# Get the last 100 days of price data and scale it with the same scaler used for training
last_100_days = new_df[-100:].values
last_100_days_scaled = scaler.transform(last_100_days)

# Create an empty list and Append past 100 days
X_test_new = []
X_test_new.append(last_100_days_scaled)

# Convert the X_test data set to a numpy array and reshape the data
pred_price_scaled = model.predict(np.array(X_test_new))
pred_price_unscaled = scaler_pred.inverse_transform(pred_price_scaled)

# Print last price and predicted price for the next day
price_today = round(new_df['Close'].iloc[-1], 2)
predicted_price = round(pred_price_unscaled.ravel()[0], 2)
percent = round((predicted_price - price_today) / price_today * 100, 2)

# Prepend a plus sign for positive changes; negative values already carry their sign
plus = '+' if percent > 0 else ''

print('The close price for ' + stockname + ' at ' + str(today) + ' was: ' + str(price_today))
print('The predicted close price is: ' + str(predicted_price) + ' (' + plus + str(percent) + '%)')
The close price for NASDAQ at 2020-06-01 was: 9490.0
The predicted close price is: 9357.5 (-1.39%)

Summary

Congratulations! In this tutorial, you have learned how to create a multivariate time series model and use it to make a prediction for the NASDAQ index. Before I started this article, the idea of working with multivariate time series seemed relatively simple. Now that I have written the article, I have to say that it is quite a complex topic, so please take the time to retrace the different steps. Especially the transformation of the data can be tricky. To deepen your understanding, try applying the described approach to a dataset of your choice. The best way to learn is to practice and create your own models, so I hope that the above Python implementation will be useful for you.

If you have suggestions or questions, please share them in the comments section.

