# Stock Market Prediction using Multivariate Time Series and Recurrent Neural Networks in Python

Time series forecasting has become a popular domain for applying deep learning technologies and recurrent neural networks in recent years. Regression models based on recurrent neural networks can recognize patterns in large data sets and thus make more accurate predictions than humans. What makes them so good at this are their layers of long-term, short-term memory (LSTM). These enable them to learn patterns in time-series data that occur over different periods. This article shows how we can use such recurrent neural networks for modeling multivariate time series data in Python. The example given is stock market prediction which is currently seeing a lot of attention.

The remainder of this article is structured as follows: We start with a brief intro to modeling univariate and multivariate time series data. Then we turn to the hands-on part. In this part, we prepare the multivariate time series data and use it to train a neural network in Python. Our model is a recurrent neural network with LSTM layers that forecasts the NASDAQ stock market index. Finally, we evaluate the performance of our model and make a forecast for the next day.

## Univariate vs Multivariate Time Series Models

While univariate models consider a single feature in their predictions, multivariate models take several features into account. Depending on the application area, additional features can be calculated from the time series (e.g., moving averages) or stem from different sources (e.g., prices of other stocks). Multivariate models that have additional relevant information available achieve superior performance.

#### Univariate Prediction Models

Univariate time series models focus on a single dependent variable. The value of a time series at time t is assumed to be closely related to the values at the previous time steps t-1, t-2, t-3, etc.

Preparing data for training univariate models is more straightforward than for multivariate models. If you are new to time series prediction, you might want to check out my earlier articles. These explain how to develop and evaluate univariate time series models:

The dependent variable in stock market prediction is usually the closing or opening price of a financial asset (for example, a stock or an index.) The training of univariate forecasting models solely relies on past price movements. The idea is that the model can identify past price movements that indicate how prices will develop in the future. The approach is similar to chart analysis, where one tries to identify recurring formations in a price chart. However, patterns and the rules of the market are constantly changing. Models thus inevitably make mistakes. Nevertheless, to quote Georg Box, “All models are wrong, but some are useful.”

#### Multivariate Prediction Models

Forecasting the price of a financial asset is a complex challenge. In general, the price is determined by an endless number of variables. Economic cycles, political developments, unforeseen events, psychological factors, market sentiment, and even the weather, all these variables will more or less exert an influence on the price. In addition, many of these variables are interdependent, which makes statistical modeling even more complex.

A univariable forecast model reduces this complexity to a minimum – a single factor and ignores the other dimensions. A multivariate model is a simplification as well, but it can take several factors into account. For example, a multivariate stock market prediction model can consider the relationship between the closing price and the opening price, moving averages, daily highs, the price of other stocks, and so on. Multivariate models are not able to fully cover the complexity of the market. However, they offer a more detailed abstraction of reality than univariate models. Multivariate models thus tend to provide more accurate predictions than univariate models.

The selection and engineering of relevant feature variables is a complex topic in itself. I have covered feature engineering for time series models in a separate article.

## Implementing a Multivariate Time Series Prediction Model in Python

In the following, we will develop a multivariate recurrent neuronal network in Python for time series prediction. The goal is to predict the price of the NASDAQ stock market index, but please do not expect to succeed in this task. Your results will vary from this article, depending on the time when you execute the code.

As always, you can download the Python code from the relataly GitHub repository:

### Prerequisites

Before we start the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow the steps in this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using Keras (2.0 or higher) with Tensorflow backend, the machine learning library sci-kit-learn, and the pandas-DataReader.

You can install packages using console commands:

• pip install <package name>
• conda install <package name> (if you are using the anaconda packet manager)

### Step #1 Load the Time Series Data

Let’s start by loading price data on the NASDAQ composite index (symbol: ^IXIC) from yahoo.finance.com into our Python project. “^IXIC” is the technical symbol for NASDAQ. Alternatively, you could use any other asset symbol such as BTC-USD to get price quotes for Bitcoin.

To download the data, we use Pandas DataReader – a popular Python library that provides functions to extract data from various sources on the web. Alternatively, you can also use the “yfinance” library.

```import math # Mathematical functions
import numpy as np # Fundamental package for scientific computing with Python
import pandas as pd # Additional functions for analysing and manipulating data
from datetime import date, timedelta, datetime # Date Functions
from pandas.plotting import register_matplotlib_converters # This function adds plotting functions for calender dates
import matplotlib.pyplot as plt # Important package for visualization - we use this to plot the market data
import matplotlib.dates as mdates # Formatting dates
from sklearn.metrics import mean_absolute_error, mean_squared_error # Packages for measuring model performance / errors
from keras.models import Sequential # Deep learning library, used for neural networks
from keras.layers import LSTM, Dense, Dropout # Deep learning classes for recurrent and regular densely-connected layers
from keras.callbacks import EarlyStopping # EarlyStopping during model training
from sklearn.preprocessing import RobustScaler, MinMaxScaler # This Scaler removes the median and scales the data according to the quantile range to normalize the price data
import seaborn as sns

# Setting the timeframe for the data extraction
today = date.today()
date_today = today.strftime("%Y-%m-%d")
date_start = '2010-01-01'

# Getting NASDAQ quotes
stockname = 'NASDAQ'
symbol = '^IXIC'

# You can either use webreader or yfinance to load the data from yahoo finance

import yfinance as yf #Alternative package if webreader does not work: pip install yfinance

# Create a quick overview of the dataset
df```

We have limited the data in the API request to the timeframe between 2010-01-01 and the current date.

### Step #2 Explore the Data

Let’s first familiarize ourselves with the data before taking further steps. When dealing with time-series data, line plots are the best choice to gain an initial understanding of our data.

```# Plot line charts
df_plot = df.copy()

list_length = df_plot.shape
ncols = 2
nrows = int(round(list_length / ncols, 0))

fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(14, 7))
for i in range(0, list_length):
ax = plt.subplot(nrows,ncols,i+1)
sns.lineplot(data = df_plot.iloc[:, i], ax=ax)
ax.set_title(df_plot.columns[i])
ax.tick_params(axis="x", rotation=30, labelsize=10, length=0)
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
fig.tight_layout()
plt.show()```

The line plots look as expected. We continue with pre-processing and feature engineering.

### Step #3 Scaling and Feature Selection

Next, we will transform the data into the shape required by our recurrent neural network. We begin by converting the date index into an integer index. This makes it easier to reference specific columns later.

```# Indexing Batches
train_df = df.sort_values(by=['Date']).copy()

# We safe a copy of the dates index, before we need to reset it to numbers
date_index = train_df.index

# Adding Month and Year in separate columns
# d = pd.to_datetime(train_df.index)
# train_df['Month'] = d.strftime("%m")
# train_df['Year'] = d.strftime("%Y")

# We reset the index, so we can convert the date-index to a number-index
train_df = train_df.reset_index(drop=True).copy()

Next, we will select the features upon which we want to train our neural network. We could also create additional features such as moving averages, but I want to keep things simple. Therefore, we select features that are already present in our data.

The following code selects the features. However, when we train a prediction model on the scaled data, this will also result in scaled predictions. Therefore, we need to unscale the projections later. We add a dummy column to our record called “Predictions,” which will help us later when we need to reverse the scaling of our data.

```# List of considered Features
FEATURES = ['High', 'Low', 'Open', 'Close', 'Volume'
]

print('FEATURE LIST')
print([f for f in FEATURES])

# Create the dataset with features and filter the data to the list of FEATURES
data = pd.DataFrame(train_df)
data_filtered = data[FEATURES]

# We add a prediction column and set dummy values to prepare the data for scaling
data_filtered_ext = data_filtered.copy()
data_filtered_ext['Prediction'] = data_filtered_ext['Close']

# Print the tail of the dataframe
data_filtered_ext.tail()```

When working with neural networks, a best practice is to scale the data. By scaling the data, it is often possible to increase training times and improve model accuracy.

In the following, we use the MinMaxScaler from the sci-kit learn library to scale the input data to a range between 0 and 1. Because the scaler_model will adapt to the shape of the data (6 dimensional), we cannot simply reuse the scaler later when we want to unscale our model predictions (1 dimensional). For this reason, we create an additional scaler that works on a single feature column (scaler_pred).

```# Get the number of rows in the data
nrows = data_filtered.shape

# Convert the data to numpy values
np_data_unscaled = np.array(data_filtered)
np_data = np.reshape(np_data_unscaled, (nrows, -1))
print(np_data.shape)

# Transform the data by scaling each feature to a range between 0 and 1
scaler = MinMaxScaler()
np_data_scaled = scaler.fit_transform(np_data_unscaled)

# Creating a separate scaler that works on a single column for scaling predictions
scaler_pred = MinMaxScaler()
df_Close = pd.DataFrame(data_filtered_ext['Close'])
np_Close_scaled = scaler_pred.fit_transform(df_Close)```
`Out: (2619, 6)`

### Step #4 Transforming the Data

In multivariate regression models are trained on a three-dimensional data structure. The first dimension is the sequences, the second dimension is the time steps (mini-batches), and the third dimension is the features.

The illustration below shows the steps to bring the multivariate data into a shape that our neural model can process during training. We must keep this form and perform the same steps when using the model to create a forecast.

An essential step in time series prediction is to slice the data into multiple input data sequences with associated target values. For this process, we use a sliding windows algorithm. This algorithm moves a window step by step through the time series data, adding a sequence of multiple data points to the input data with each step. In addition, the algorithm stores the target value (e.g., Closing Price) following this sequence in a separate target data set. Then the algorithm pushes the window one step further and repeats these activities. In this way, the algorithm creates a data set with many input sequences (mini-batches), each of which has a corresponding target value in the target record. This process applies both to the creation of the training and the test data.

We will apply the sliding window approach to our data. The result is a training set (x_train) that contains 2258 input sequences, and each has 50 time-steps and six features. The corresponding target dataset (y_train) contains 2258 target values.

```# Set the sequence length - this is the timeframe used to make a single prediction
sequence_length = 50

# Prediction Index
index_Close = data.columns.get_loc("Close")

# Split the training data into train and train data sets
# As a first step, we get the number of rows to train the model on 80% of the data
train_data_len = math.ceil(np_data_scaled.shape * 0.8)

# Create the training and test data
train_data = np_data_scaled[0:train_data_len, :]
test_data = np_data_scaled[train_data_len - sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, sequence_length time steps per sample, and 6 features
def partition_dataset(sequence_length, data):
x, y = [], []
data_len = data.shape
for i in range(sequence_length, data_len):
x.append(data[i-sequence_length:i,:]) #contains sequence_length values 0-sequence_length * columsn
y.append(data[i, index_Close]) #contains the prediction values for validation,  for single-step prediction

# Convert the x and y to numpy arrays
x = np.array(x)
y = np.array(y)
return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(sequence_length, train_data)
x_test, y_test = partition_dataset(sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
print(x_train[sequence_length-1][index_Close])
print(y_train)```

### Step #5 Train the Multivariate Prediction Model

After we have prepared the data, we can train the recurrent neural network for stock market prediction. The architecture of our neural network consists of the following four layers:

• LSTM layer, which takes our mini-batches as input and returns the whole sequence
• LSTM layer that takes the sequence from the previous layer, but only return 5 values
• Dense layer with 5 neurons
• final dense layer that outputs the predicted value

The number of neurons in the first layer must be equal to the size of a minibatch of the input data. Each minibatch in our dataset consists of a matrix with 50 steps and six features. Thus, the input layer of our recurrent neural network consists of 300 neurons.

```# Configure the neural network model
model = Sequential()

# Model with n_neurons = inputshape Timestamps, each with x_train.shape variables
n_neurons = x_train.shape * x_train.shape
print(n_neurons, x_train.shape, x_train.shape)

# Compile the model

We start the training process by running the following code:

```# Training the model
epochs = 50
batch_size = 16
early_stop = EarlyStopping(monitor='loss', patience=5, verbose=1)
history = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test)
)

#callbacks=[early_stop])```
```Epoch 1/5
1996/1996 [==============================] - 81s 41ms/step - loss: 0.1228
Epoch 2/5
1996/1996 [==============================] - 83s 42ms/step - loss: 0.1219
Epoch 3/5
1996/1996 [==============================] - 82s 41ms/step - loss: 0.1219
Epoch 4/5
1996/1996 [==============================] - 92s 46ms/step - loss: 0.1219
Epoch 5/5
1996/1996 [==============================] - 89s 45ms/step - loss: 0.1219```

Let’s take a quick look at the loss curve.

```# Plot training & validation loss values
fig, ax = plt.subplots(figsize=(20, 10), sharex=True)
plt.plot(history.history["loss"])
plt.title("Model loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend(["Train", "Test"], loc="upper left")
plt.grid()
plt.show()```

The loss drops quickly to a lower plateau, which signals that the model has improved throughout the training process.

### Step #6 Evaluate Model Performance

Now that we have trained our model, it is time to look at its performance. But, first, we have to reverse the scaling for the predictions. We will calculate three error metrics, MAE, MAPE, and MDAPE. Then we will compare the predictions in a line plot with the actual values.

```# Get the predicted values
y_pred_scaled = model.predict(x_test)

# Unscale the predicted values
y_pred = scaler_pred.inverse_transform(y_pred_scaled)
y_test_unscaled = scaler_pred.inverse_transform(y_test.reshape(-1, 1))

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred)
print(f'Median Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')```

The MAPE is 22.15, which means that the mean of our predictions deviates from the actual values by 3.12%. The MDAPE is 2.88 % and a bit lower than the mean, thus indicating there are some outliers among the prediction errors. 50% of the predictions deviate by more than 2.88%, and 50% of deviate by less than 2.88% from the actual values.

Next, we create a line plot that shows the forecast and compare it to the actual values. Adding a bar plot to the chart helps highlight the deviations of the predictions from the actual values.

```# The date from which on the date is displayed
display_start_date = pd.Timestamp('today') - timedelta(days=500)

data_filtered_sub = data_filtered.copy()
data_filtered_sub['Date'] = date_index

# Add the difference between the valid and predicted prices
train = data_filtered_sub[:train_data_len + 1]
valid = data_filtered_sub[train_data_len:]
valid.insert(1, "Prediction", y_pred.ravel(), True)
valid.insert(1, "Difference", valid["Prediction"] - valid["Close"], True)

# Zoom in to a closer timeframe
valid = valid[valid['Date'] > display_start_date]
train = train[train['Date'] > display_start_date]

# Visualize the data
fig, ax1 = plt.subplots(figsize=(22, 10), sharex=True)
xt = train['Date']; yt = train[["Close"]]
xv = valid['Date']; yv = valid[["Close", "Prediction"]]
plt.title("Predictions vs Actual Values", fontsize=20)
plt.ylabel(stockname, fontsize=18)
plt.plot(xt, yt, color="#039dfc", linewidth=2.0)
plt.plot(xv, yv["Prediction"], color="#E91D9E", linewidth=2.0)
plt.plot(xv, yv["Close"], color="black", linewidth=2.0)
plt.legend(["Train", "Test Predictions", "Actual Values"], loc="upper left")

# # Create the bar plot with the differences
x = valid['Date']
y = valid["Difference"]

# Create custom color range for positive and negative differences
valid.loc[y >= 0, 'diff_color'] = "#2BC97A"
valid.loc[y < 0, 'diff_color'] = "#C92B2B"

plt.bar(x, y, width=0.8, color=valid['diff_color'])
plt.grid()
plt.show()```

The line plot shows that the forecast is close to the actual values but also shows some deviations. The deviations are most significant during periods of increased market volatility and least during periods of steady market movement.

### Step #7 Predict Next Day’s Price

After we have trained the neural network, we can forecast the stock market for the next day. For this purpose, we extract a new dataset from the Yahoo-Finance API and preprocess it in the same way as we did for model training.

We trined our model with mini-batches of 50 time-steps and six features. Thus, we also need to provide the model with 50-time steps when making the forecast. As before, we transform the data into the shape 1 x 50 x 6, whereby the last figure is the number of feature columns. After generating the forecast, we unscale the stock market predictions back to the original range of values.

```df_temp = df[-sequence_length:]
new_df = df_temp.filter(FEATURES)

N = sequence_length

# Get the last N day closing price values and scale the data to be values between 0 and 1
last_N_days = new_df[-sequence_length:].values
last_N_days_scaled = scaler.transform(last_N_days)

# Create an empty list and Append past N days
X_test_new = []
X_test_new.append(last_N_days_scaled)

# Convert the X_test data set to a numpy array and reshape the data
pred_price_scaled = model.predict(np.array(X_test_new))
pred_price_unscaled = scaler_pred.inverse_transform(pred_price_scaled.reshape(-1, 1))

# Print last price and predicted price for the next day
price_today = np.round(new_df['Close'][-1], 2)
predicted_price = np.round(pred_price_unscaled.ravel(), 2)
change_percent = np.round(100 - (price_today * 100)/predicted_price, 2)

plus = '+'; minus = ''
print(f'The close price for {stockname} at {today} was {price_today}')
print(f'The predicted close price is {predicted_price} ({plus if change_percent > 0 else minus}{change_percent}%)')```
```The close price for NASDAQ at 2021-06-27 was 14360.39
The predicted close price is 14232.8095703125 (-0.9%)
```

## Summary

Congratulations, you made it to the end of this article! This article prepared multivariate time series data to train a regression model that predicts the NASDAQ index. You learned to prepare multivariate time series data to train a recurrent neural network with LSTM layers in Python. Finally, we evaluated the performance of our model and visualized it in a line chart.

Multivariate time series forecasting is a complex topic, so you might want to take the time to retrace the different steps. Especially the transformation of the data can be challenging. I believe the best way to learn is to practice and gather your own experiences. Therefore I encourage you to create your models and experiment with other sources of data.

Feel free to share your remarks or feedback in the comments section.

## Author

• Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.

### 5 Responses

1. ##### Dominic Johnson

Hi, Florian, great post! This has really helped me.

I have 2 questions regarding the train_data and test_data, and inclusion of a validation dataset.

1. With separating the data into train_data and test_data with the code:

train_data = np_data_scaled[0:train_data_len, :]
test_data = np_data_scaled[train_data_len – sequence_length:, :]

Does this create data leakage as the test_data will include data from the train set, or is this prevented due to the use of a sliding window?

2. With respect to question 1, how can I include a validation set for hyperparameter tuning?

Many thanks!

2. 3. ##### pranab

how it is predicting for next ,you are taking data till same day and predicting for the same day.

4. ##### David Campbell

how i can predict daywise for next 15 days?