Building a Simple Univariate Model for Stock Market Prediction using Keras Recurrent Neural Networks and Python

Stock market prediction: a time series forecasting problem

Forecasting the price of financial assets has fascinated researchers and analysts for many decades. While the traditional prediction methods of technical and fundamental analysis are still widely used, interest is increasingly shifting towards automated predictions with machine learning. A major reason is the advent of comprehensive machine learning libraries such as Keras and scikit-learn, which have made it much easier to develop and use machine learning models. In fact, today anyone with some programming knowledge can develop a neural network. This blog post covers the essential steps to build a predictive model for stock market prediction using Python and the machine learning library Keras. The model will be based on a Neural Network (NN) and generate predictions for the S&P500 index.

The bulls and the bears

The prediction approach described in this tutorial is known as single-step univariate time series forecasting. Like technical chart analysis, it assumes that predicting the price of an asset is fundamentally a time series problem: the goal is to identify patterns in a time series that indicate how the series will develop in the future.

In this tutorial, we predict the value for a single time-step (1 day) ahead. In addition, we consider only a single time series of input data (univariate). However, it would also be possible to predict multiple steps at once or to increase the length of the time-step; in both cases, the predictions would range further into the future. I have covered this topic in a separate post on time series forecasting.

Warning: Stock markets can be highly volatile and are generally difficult to predict. The prediction model developed in this post serves only to illustrate a use case for time series models. It should not be assumed that a simple neural network as described in this blog post is capable of fully capturing the complexity of price development.

About Neural Networks

Recurrent Neural Networks (RNN) are particularly useful for analyzing time series. An RNN is a specific form of a neural network. In contrast to a feed-forward neural network, where all the information flows strictly from input to output, an RNN contains loops that feed intermediate results back into the network. A popular variant uses long short-term memory (LSTM) layers, which maintain an internal state across time steps. In the field of time series analysis, this is particularly useful, as it enables the network to learn patterns that occur over different time periods, e.g., days and months, and potentially overlap, thus often resulting in more accurate predictions.

Exemplary model of a four-layered Neural Network

The model that we will develop in this post uses an RNN architecture with LSTM layers to predict the closing price of the S&P500 index. To build such a NN, we need Python, the Anaconda environment, and some Python packages for data manipulation and analytics.

Understanding neural networks in full depth is not a prerequisite for this tutorial. But if you want to learn more about their architecture and inner workings, I can recommend this YouTube video.

Overview

The steps that will be covered in this tutorial are as follows:

  1. Loading the data
  2. Plotting the price chart
  3. Splitting the data
  4. Creating the input shape
  5. Building the Keras NN model
  6. Training the model
  7. Evaluating model performance
  8. Plotting test predictions
  9. Stock market prediction – looking one day ahead

Python Environment

This tutorial assumes that you have set up your Python environment. I personally use the Anaconda environment. If you have not yet set up the environment, you can follow this tutorial.

It is also assumed that you have the following packages installed: keras (2.0 or higher) with the TensorFlow backend, numpy, pandas, pandas-datareader, matplotlib, and scikit-learn. Each package can be installed using the console command:

pip install <packagename>
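For example, the packages used in this tutorial can be installed in one go. Note that some names on PyPI differ from the import names (scikit-learn is imported as sklearn, pandas-datareader as pandas_datareader):

pip install tensorflow keras numpy pandas matplotlib scikit-learn pandas-datareader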

1) Loading the time series data

Let’s start by setting up the imports and loading the price data from finance.yahoo.com via an API.

# Remote data access for pandas
import pandas_datareader as webreader
# Mathematical functions 
import math 
# Fundamental package for scientific computing with Python
import numpy as np 
# Additional functions for analysing and manipulating data
import pandas as pd 
# Date Functions
from datetime import date, timedelta
# This function adds plotting functions for calendar dates
from pandas.plotting import register_matplotlib_converters
# Important package for visualization - we use this to plot the market data
import matplotlib.pyplot as plt 
# Formatting dates
import matplotlib.dates as mdates
# Packages for measuring model performance / errors
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Tools for predictive data analysis. We will use the MinMaxScaler to normalize the price data 
from sklearn.preprocessing import MinMaxScaler 
# Deep learning library, used for neural networks
from keras.models import Sequential 
# Deep learning classes for recurrent and regular densely-connected layers
from keras.layers import LSTM, Dense

To extract the data, we’ll use pandas-datareader – a popular library that provides functions to extract data from various Internet sources into pandas DataFrames.

The following code extracts the price data for the S&P500 index from Yahoo Finance. If you wonder what “^GSPC” means, this is the ticker symbol for the S&P500, a stock market index that tracks 500 of the largest companies listed on US stock exchanges. You could also change the symbol to get data for other assets, e.g., BTC-USD for Bitcoin. The data is limited to the timeframe between 2010-01-01 and the current date, so when you execute the code, the results will cover a longer period than in this tutorial.

# Setting the timeframe for the data extraction
today = date.today()
date_today = today.strftime("%Y-%m-%d")
date_start = '2010-01-01'

# Getting S&P500 quotes
stockname = 'S&P500'
symbol = '^GSPC'
df = webreader.DataReader(
    symbol, start=date_start, end=date_today, data_source="yahoo"
)

# Taking a look at the shape of the dataset
print(df.shape)
df.head(5)
(2596, 6)
The data upon which our model will be trained and validated
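Before moving on, a quick sanity check can be worthwhile, for example confirming that the download contains no missing values and covers the expected date range (a minimal check, not strictly required for the tutorial):

# Check for missing values and the covered date range
print(df.isna().sum())                 # number of missing values per column
print(df.index.min(), df.index.max())  # first and last trading day in the data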

2) Plotting the price chart

When you load a new data set into your project, it is often a good idea to familiarize yourself with the data before taking any further steps. For time series data, the primary way to do this is to view the data visually in a line chart. Use the following code to create a line chart for the S&P500 data.

# Plotting the data
register_matplotlib_converters()
years = mdates.YearLocator() 
fig, ax1 = plt.subplots(figsize=(16, 6))
ax1.xaxis.set_major_locator(years)
x = df.index
y = df['Close']
ax1.fill_between(x, 0, y, color='#b9e1fa')
ax1.legend([stockname], fontsize=12)
plt.title(stockname + ' from '+ date_start + ' to ' + date_today, fontsize=16)
plt.plot(y, color='#039dfc', label='S&P500', linewidth=1.0)
plt.ylabel('S&P500 Points', fontsize=12)
plt.show()
Historic data on the price of S&P500

If you follow the course of the stock markets a little, the chart above will certainly look familiar to you.

3) Splitting the data

The NN will be trained on a decade of market price data. The prediction for the next day is then based on the last 100 days of market prices. Before we can begin training the NN, we need to split the data into separate sets for training and testing and bring it into the right shape. As illustrated below, we will use 80% of the data for training and keep the remaining 20% as test data to later evaluate the performance of our model.

Splitting data into train and test

Because scaling the input values is a best practice that helps the training process converge, we will also use the MinMaxScaler to normalize the price values in our data to a range between 0 and 1.

# Create a new dataframe with only the Close column and convert to numpy array
data = df.filter(['Close'])
npdataset = data.values

# Get the number of rows to train the model on 80% of the data 
training_data_length = math.ceil(len(npdataset) * 0.8)

# Transform features by scaling each feature to a range between 0 and 1
mmscaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = mmscaler.fit_transform(npdataset)
scaled_data
scaled data
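If you are wondering what the scaler does internally: each value x is mapped to (x - min) / (max - min), so the smallest value in the data becomes 0 and the largest becomes 1. A quick toy illustration:

# MinMax scaling on a toy array: (x - min) / (max - min)
toy = np.array([[2000.0], [2500.0], [3000.0]])
toy_scaler = MinMaxScaler(feature_range=(0, 1))
print(toy_scaler.fit_transform(toy).ravel())  # [0.  0.5 1. ]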

4) Creating the input shape

Like any neural network, ours has an input layer and an output layer. The shape of the input data needs to correspond to the number of neurons in the input layer of the neural network. This means we also have to decide on the architecture of the neural network before we can bring our data into the right shape.

Next, we create the training data on which we will train our neural network. This is a bit tricky, as we cannot train our model on the whole series of data at once. Instead, we create multiple overlapping slices of the training data (x_train), so-called training samples, each covering 100 consecutive time steps. During training, the neural network processes these samples one by one and creates a separate forecast for each of them. The shape of the data is illustrated below:

Sample dataset for time series forecasting divided into several train batches

A neural network learns by adjusting the strength of the connections between neurons (the weights) to reduce its prediction error. To evaluate the forecast quality, the model needs a second list (y_train), which contains the valid price values from our ground truth. During training, the model compares its predictions with the values from this list to calculate the training error and minimize it over time.
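To make the slicing logic concrete, here is the same windowing idea applied to a toy series with a window length of three instead of 100 (an illustration only, not part of the pipeline):

# Toy example: window length 3 instead of 100
series = [10, 11, 12, 13, 14, 15]
window = 3
x, y = [], []
for i in range(window, len(series)):
    x.append(series[i - window:i])  # the 3 values before position i ...
    y.append(series[i])             # ... are used to predict the value at i
# x = [[10, 11, 12], [11, 12, 13], [12, 13, 14]]
# y = [13, 14, 15]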

The code below will carry out the steps to prepare the data:

# Create a scaled training data set
train_data = scaled_data[0:training_data_length, :]

# Split the data into x_train and y_train data sets
x_train = []
y_train = []
trainingdatasize = len(train_data) 
for i in range(100, trainingdatasize):
    x_train.append(train_data[i-100: i, 0])  # the 100 values before position i
    y_train.append(train_data[i, 0])  # the value at position i, the prediction target

# Convert the x_train and y_train to numpy arrays
x_train = np.array(x_train)
y_train = np.array(y_train)

# Reshape the data
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
print(x_train.shape)
print(y_train.shape)
(1966, 100, 1)
(1966,)

x_train contains 1966 training samples. Each contains a series of quotes for 100 dates. In y_train we have 1966 target values, one for each sample. Be aware that these numbers depend on the timeframe and will vary depending on when you execute the code.

5) Building the model

Our neural network model will be defined as a sequence of multiple layers. Before we can train the model, we first need to decide on the architecture of the model. Above all, the architecture comprises the type and number of layers, as well as the number of neurons in each layer.

How can we determine the number of layers? Selecting the right number of layers from the start is difficult or even impossible. It is common to simply try different architectures and find out what works best through trial and error. The architecture and the performance of the model are then tested and refined in multiple iterations.

We will use a network structure with four layers. The architecture combines two layers of the LSTM class with two layers of the Dense class. I have chosen this architecture because it is comparably simple and a good starting point for tackling time series problems. To define this structure, we use the Dense and LSTM classes from the Keras deep learning library.

Architecture of our recurrent Neural Network

And how do we determine the number of neurons? A common heuristic is to orient the size of the first layer on the size of the input. Our input comprises values for 100 time steps, so we give the first LSTM layer 100 neurons, one for each value. In the last layer we have only a single neuron, which means that our prediction will contain a single price point for one time-step.

# Configure the neural network model
model = Sequential()

# First LSTM layer with 100 neurons - input shape = (100 time steps, 1 feature)
model.add(LSTM(100, return_sequences=True, input_shape=(x_train.shape[1], 1))) 
model.add(LSTM(100, return_sequences=False))
model.add(Dense(25, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
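To verify the architecture before training, you can print an overview of the layers and their parameter counts:

# Print the layer structure and the number of trainable parameters
model.summary()

The output should list the two LSTM layers and the two Dense layers defined above.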

6) Training the model

Now it’s time to fit the model to the data. The training time may vary between seconds and minutes depending on the computing power of your system. For instance, on my local notebook processor (Intel Core i7), training usually takes a couple of minutes.

# Training the model
model.fit(x_train, y_train, batch_size=16, epochs=25)

Great! Our model is now fitted to the training data.
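If you want more control over the training process, one common extension (not used in this tutorial) is to hold out part of the training data for validation and stop training once the validation loss no longer improves. A minimal sketch using the Keras EarlyStopping callback, assuming Keras 2.2 or later for restore_best_weights:

from keras.callbacks import EarlyStopping

# Stop training when the validation loss has not improved for 3 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(
    x_train, y_train,
    batch_size=16, epochs=25,
    validation_split=0.1,  # hold out 10% of the training data for validation
    callbacks=[early_stop],
)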

7) Evaluating model performance

So how does our stock market prediction model perform? To evaluate the model performance, we need to feed the model with the test data. Remember, to ensure that we can evaluate the model performance without bias, we have only used 80% of the data for training and kept 20% of the data for testing.

These 20% of the data can now be used to create two test datasets: x_test and y_test. x_test contains a number of series, each with 100 price points; y_test contains the validation value for each series. The two test sets are created following the same procedure as for the training data.

# Create a new array containing scaled test values
test_data = scaled_data[training_data_length - 100:, :]

# Create the data sets x_test and y_test
x_test = []
y_test = npdataset[training_data_length:, :]
for i in range(100, len(test_data)):
    x_test.append(test_data[i-100:i, 0])

# Convert the data to a numpy array
x_test = np.array(x_test)

# Reshape the data, so that we get an array with multiple test datasets
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
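As with the training data, it is worth confirming the shapes before prediction:

# x_test should have the shape (number of samples, 100, 1)
print(x_test.shape)
print(y_test.shape)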

Next, we feed the test data (x_test) to the model to get predictions. Because the input data was scaled to a range between 0 and 1, we need to invert the MinMax scaling on the predictions before we can interpret the results.

#Get the predicted values and inverse the scaling
predictions = model.predict(x_test)
predictions = mmscaler.inverse_transform(predictions)

There are different indicators that can help us to evaluate the performance of our model. The forecast error is calculated as the difference between valid test data (y_test) and predictions.

# Calculate the mean absolute error (MAE)
mae = mean_absolute_error(y_test, predictions)
print('MAE: ' + str(round(mae, 1)))

# Calculate the root mean squared error (RMSE)
rmse = np.sqrt(np.mean((predictions - y_test) ** 2))
print('RMSE: ' + str(round(rmse, 1)))
MAE: 32.0

The mean absolute error (MAE) measures the average magnitude of the forecast errors regardless of their direction, so it is always non-negative. For our model, the calculated MAE is 32.0, which means that the predictions deviate from the actual index values by about 32 points on average.

The root mean squared error (RMSE) is likewise always non-negative. Because the errors are squared before they are averaged, larger errors have a stronger impact on the RMSE than on the MAE, and the RMSE is always at least as large as the MAE. A small gap between the two metrics indicates that the prediction errors are relatively uniform; in other words, the predictions are mostly not totally wrong.
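As a cross-check for the RMSE, you can also compute it with the mean_squared_error function imported at the beginning (it returns the mean of the squared errors, so we take the square root ourselves):

# Cross-check the RMSE with scikit-learn
rmse_check = np.sqrt(mean_squared_error(y_test, predictions))
print('RMSE: ' + str(round(rmse_check, 1)))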

8) Plotting test predictions

Visualizing the test predictions helps in evaluating the model, so we will plot the predicted values against the valid values.

# The date from which on the data is displayed
display_start_date = "2018-01-01" 

# Add the difference between the valid and predicted prices
train = data[:training_data_length + 1]
valid = data[training_data_length:]
valid.insert(1, "Predictions", predictions, True)
valid.insert(1, "Difference", valid["Predictions"] - valid["Close"], True)

# Zoom in to a closer timeframe
valid = valid[valid.index > display_start_date]
train = train[train.index > display_start_date]

# Visualize the data
fig, ax1 = plt.subplots(figsize=(22, 8), sharex=True)
xt = train.index; yt = train[["Close"]]
xv = valid.index; yv = valid[["Close", "Predictions"]]
plt.title("Predictions vs Ground Truth", fontsize=20)
plt.ylabel(stockname, fontsize=18)
plt.plot(yt, color="#039dfc", linewidth=2.0)
plt.plot(yv["Predictions"], color="#E91D9E", linewidth=2.0)
plt.plot(yv["Close"], color="black", linewidth=2.0)
plt.legend(["Train", "Train Predictions", "Ground Truth"], loc="upper left")

# Fill between plotlines
ax1.fill_between(xt, 0, yt["Close"], color="#b9e1fa")
ax1.fill_between(xv, 0, yv["Predictions"], color="#F0845C")
ax1.fill_between(xv, yv["Close"], yv["Predictions"], color="grey") 

# Create the bar plot with the differences
x = valid.index
y = valid["Difference"]
plt.bar(x, y, width=5, color="black")
plt.grid()
plt.show()

Let’s take a closer look at the plot. The orange zone contains the test predictions, and the difference between the test predictions and the ground truth is colored in grey. As already indicated by the performance measures, the predictions are typically close to the ground truth.

Prediction vs Ground Truth

We have also added the signed differences (predicted minus actual price) as a bar plot at the bottom. Where the difference is positive, the predicted value was too optimistic; where the difference is negative, the predicted value was too pessimistic.

9) Stock market prediction – looking one day ahead

Now the time has come to make some real predictions. Running the code below will create a new test set based on the series of the most recent prices. We will use this test set as the input for our prediction model and receive a prediction for a single time-step, which in our case is the next day.

# Get fresh data until today and create a new dataframe with only the price data
price_quote = webreader.DataReader(symbol, data_source='yahoo', start=date_start, end=date_today)
new_df = price_quote.filter(['Close'])

# Get the last 100 day closing price values and scale the data to be values between 0 and 1
last_100_days = new_df[-100:].values
last_100_days_scaled = mmscaler.transform(last_100_days)

# Create an empty list and Append past 100 days
X_test = []
X_test.append(last_100_days_scaled)

# Convert the X_test data set to a numpy array and reshape the data
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Get the predicted scaled price, undo the scaling and output the predictions
pred_price = model.predict(X_test)
pred_price = mmscaler.inverse_transform(pred_price)
date_tomorrow = date.today() + timedelta(days=1)
print('The price for ' + stockname + ' at ' + date_today + ' was: ' + str(round(df.at[df.index.max(), 'Close'])))
print('The predicted ' + stockname + ' price at date ' + str(date_tomorrow) + ' is: ' + str(round(pred_price[0, 0], 0)))
The price for S&P500 at 2020-04-04 was: 2578.0
The predicted S&P500 price at date 2020-04-05 is: 2600.0

So, the model predicts a value of 2600.0 for the S&P500 at 2020-04-05.
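If you plan to repeat this step regularly, it can be convenient to wrap it in a small helper. The function name predict_next_close is hypothetical; the sketch assumes the fitted model and the mmscaler from above are available:

# Hypothetical helper around the steps above
def predict_next_close(model, scaler, close_df, window=100):
    """Predict the next closing price from the last `window` closing prices."""
    last_values = close_df[-window:].values  # shape (window, 1)
    scaled = scaler.transform(last_values)   # reuse the already fitted scaler
    x = np.reshape(scaled, (1, window, 1))   # one sample, `window` time steps
    pred_scaled = model.predict(x)           # scaled prediction, shape (1, 1)
    return scaler.inverse_transform(pred_scaled)[0, 0]

print(predict_next_close(model, mmscaler, new_df))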

Summary

In this tutorial, you have learned to create, train, and test a four-layered recurrent neural network for stock market prediction using Python and Keras. Finally, we used this model to make a prediction for the S&P500 stock market index. You can easily create models for other assets by replacing the stock symbol with another ticker code. A list of common symbols for stocks and index funds is available on finance.yahoo.com. Just don’t forget to retrain the model with a fresh copy of the price data.

The model created in this post makes predictions for single time steps. If you want to learn how to make time series predictions that range further into the future, you might want to check out part II of this tutorial series:

The knowledge you have gained in this blog post can be applied to time series problems in many other application areas, such as the prediction of shop orders, network or health signals, website traffic, and many more.

If you have enjoyed this tutorial, let me know in the comments; I am also happy to receive your feedback if not 🙂
