Stock Market Prediction using Python

Stock Market Prediction using Univariate Time Series Models based on Recurrent Neural Networks with Python

Predicting the price of financial assets has fascinated researchers and analysts for many decades. While the traditional prediction methods of technical analysis and fundamental analysis are still widely used, interest is increasingly turning to machine-generated predictions based on deep learning. A contributing factor is that machine learning libraries, such as Keras or scikit-learn, provide easy access to powerful prediction algorithms. With these libraries, anyone with some programming skills can develop a neural network today. This article shows how that works by creating a univariate model for stock market forecasting in Python. Our model will be a Keras neural network with LSTM layers that produces single-step forecasts for the S&P500 stock market index.

This article is structured as follows: We begin with a brief introduction to univariate modeling and neural networks. Then we start with the coding part and go through all the steps to train a neural network, including data ingestion, data preprocessing, and the design, training, testing, and usage of a predictive neural network model.

The bulls and the bears

Single-Step Univariate Stock Market Prediction

The prediction approach described in this article is known as single-step single-variate time series forecasting. This approach is similar to technical chart analysis in that it assumes that predicting the price of an asset is fundamentally a time series problem. The goal is to identify patterns in a time series that indicate how the series will develop in the future.

In this tutorial, we predict the value for a single time-step (1 day). In addition, we consider only a single time series of data (univariate). However, it would also be possible to predict multiple steps or increase the length of the time-step. In both cases, the predictions will range further into the future. I have covered this topic in a separate post on time series forecasting.

We will develop a univariate prediction model that predicts a single feature, the closing price, based on its own historical values over a specific period. More complex models are multivariate and use additional features such as moving averages, momentum indicators, or market sentiment. I have covered multivariate stock market prediction in a separate tutorial.

Basics of Neural Networks

Recurrent Neural Networks (RNNs) are well suited to analyzing time series. An RNN is a specific form of neural network. In contrast to a feed-forward neural network, where all information flows in one direction from input to output, an RNN contains recurrent layers, such as Long Short-Term Memory (LSTM) layers, that recirculate their output back into the network from one time step to the next. In time series analysis, this is particularly useful because it enables an RNN to learn patterns that occur over different periods, e.g., days and months, and potentially overlap, which often results in more accurate predictions.
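
To make the idea of recirculating information more concrete, here is a minimal NumPy sketch of a plain recurrent step. It is not an LSTM (which adds gating on top of this idea) and not how Keras implements its layers. The weights are random, and the names rnn_forward, w_x, and w_h are made up for illustration.

```python
import numpy as np

def rnn_forward(sequence, w_x, w_h, b):
    """Run a plain recurrent cell over a 1-D sequence and return all hidden states."""
    hidden = np.zeros(w_h.shape[0])  # initial hidden state
    states = []
    for value in sequence:
        # The new state depends on the current input AND on the previous state
        hidden = np.tanh(w_x * value + w_h @ hidden + b)
        states.append(hidden)
    return np.array(states)

rng = np.random.default_rng(seed=0)
units = 4
w_x = rng.normal(size=units)           # input weights (one input feature)
w_h = rng.normal(size=(units, units))  # recurrent weights
b = np.zeros(units)

prices = np.array([0.1, 0.2, 0.15, 0.3])  # toy values, already scaled
states = rnn_forward(prices, w_x, w_h, b)
print(states.shape)  # (4, 4): one hidden state of size 4 per time step
```

Because each hidden state feeds into the next step, the last state carries information about the whole sequence, which is what a forecasting layer builds on.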

An exemplary model of a four-layered Neural Network

This article develops a univariate model that uses an RNN architecture with LSTM layers to predict the closing price of the S&P500 index. To build such a neural network, we need Python, the Anaconda environment, and Python packages for data manipulation and analytics.

Understanding Neural Networks in all depth is not a prerequisite for this tutorial. But if you want to learn more about their architecture and functioning, I can recommend this YouTube video.

Implementing a Univariate Regression Model using Keras Recurrent Neural Networks

In the following, we develop a single-variate neural network model that forecasts the S&P500 stock market index.

Prerequisites

Before we start the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow the steps in this article to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, and Matplotlib.

In addition, we will be using Keras (2.0 or higher) with TensorFlow backend and the machine learning library scikit-learn.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)
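
If you are unsure which packages are already installed, a small check like the following can help. This is only a sketch: the helper installed_versions is a made-up name, and the list of distribution names is my assumption about a typical setup, so adjust it to your environment.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names as used by pip (assumed list, adjust as needed)
required = ["numpy", "pandas", "matplotlib", "scikit-learn", "tensorflow", "yfinance"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver if ver else 'not installed'}")
```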

Step #1 Load the Data

Let’s start by setting up the imports and loading the price data from Yahoo Finance (finance.yahoo.com) via an API. To extract the data, we can use the pandas DataReader package, a popular library that provides a function to extract data from various Internet sources into pandas DataFrames. Note that if pandas DataReader does not work, you can use the yfinance package instead, as the code below does.

The following code extracts the price data for the S&P500 index from Yahoo Finance. If you wonder what “^GSPC” means, this is the symbol for the S&P500, a stock market index of 500 of the largest companies listed on US stock exchanges. You can use the symbols of other assets, e.g., BTC-USD for Bitcoin. The data covers the timeframe between 2010-01-01 and the current date. So when you execute the code, the results will cover a longer period than in this tutorial.

import math # Mathematical functions 
import numpy as np # Fundamental package for scientific computing with Python
import pandas as pd # For analysing and manipulating data
from datetime import date, timedelta # Date Functions
from pandas.plotting import register_matplotlib_converters # Registers pandas date converters so matplotlib can plot calendar dates
import matplotlib.pyplot as plt # For visualization
import matplotlib.dates as mdates # Formatting dates
from sklearn.metrics import mean_absolute_error, mean_squared_error # For measuring model performance / errors
from sklearn.preprocessing import MinMaxScaler #to normalize the price data 
from keras.models import Sequential # Deep learning library, used for neural networks
from keras.layers import LSTM, Dense # Deep learning classes for recurrent and regular densely-connected layers

# Setting the timeframe for the data extraction
today = date.today()
date_today = today.strftime("%Y-%m-%d")
date_start = '2010-01-01'

# Getting S&P500 quotes
stockname = 'S&P500'
symbol = '^GSPC'

# Remote data access
# import pandas_datareader as webreader
# df = webreader.DataReader(symbol, start=date_start, end=date_today, data_source="yahoo")
import yfinance as yf # Used if webreader does not work: pip install yfinance
df = yf.download(symbol, start=date_start, end=date_today)

# Taking a look at the shape of the dataset
print(df.shape)
df.head(5)
(2596, 6)
The data upon which our stock market prediction model will be trained and validated

Step #2 Plot the Price Chart

When you load a new data set into your project, it is often a good idea to familiarize yourself with the data before taking any further steps. When you work with time-series data, visually viewing the data in a line plot is the primary way to do this. Use the following code to create the line plot for the S&P500 data.

# Plotting the data
register_matplotlib_converters()
years = mdates.YearLocator() 
fig, ax1 = plt.subplots(figsize=(16, 6))
ax1.xaxis.set_major_locator(years)
x = df.index
y = df['Close']
ax1.fill_between(x, 0, y, color='#b9e1fa')
# Plot the line before creating the legend so that the legend refers to the line
ax1.plot(x, y, color='#039dfc', label=stockname, linewidth=1.0)
ax1.legend(fontsize=12)
plt.title(stockname + ' from '+ date_start + ' to ' + date_today, fontsize=16)
plt.ylabel('S&P500 Points', fontsize=12)
plt.show()
Historical data on the price of S&P500

If you follow the course of the stock markets a little, the chart above might look familiar to you.

Step #3 Splitting the Data

We will train the neural network (NN) on a decade of market price data. Then we predict the price of the next day based on the last 50 days of market prices. Before we can begin training the NN, we need to split the data into separate sets for training and testing and ensure that it is in the right shape. As illustrated below, we will use 80% of the data as training data and keep 20% as test data to later evaluate the performance of our univariate model.

Splitting data into train and test

Because it’s a best practice, we will also use the MinMaxScaler to normalize the price values in our data to a range between 0 and 1.

# Feature Selection - Only Close Data
data = df.filter(['Close'])
data_unscaled = data.values

# Get the number of rows to train the model on 80% of the data 
training_data_length = math.ceil(len(data_unscaled) * 0.8)

# Transform features by scaling each feature to a range between 0 and 1
mmscaler = MinMaxScaler(feature_range=(0, 1))
np_data = mmscaler.fit_transform(data_unscaled)
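
To see what this scaling does, here is a small NumPy sketch of the same min-max formula, (x - min) / (max - min), applied to three toy prices, together with its inverse. The helper names minmax_scale and minmax_inverse are made up for this illustration; in the tutorial code, the MinMaxScaler does this work for us.

```python
import numpy as np

def minmax_scale(values):
    """Scale values to the range [0, 1] and return the bounds needed to undo it."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), lo, hi

def minmax_inverse(scaled, lo, hi):
    """Map scaled values back to the original price range."""
    return scaled * (hi - lo) + lo

prices = np.array([2000.0, 2500.0, 3000.0])  # toy closing prices
scaled, lo, hi = minmax_scale(prices)
restored = minmax_inverse(scaled, lo, hi)

print(scaled.tolist())    # [0.0, 0.5, 1.0]
print(restored.tolist())  # [2000.0, 2500.0, 3000.0]
```

The inverse step matters later on, because the model's predictions come out on the scaled range and have to be mapped back to index points.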

Step #4 Creating the Input Shape

Our neural network starts with an input layer and ends with an output layer. The shape of the input data needs to match the shape that the input layer expects. Therefore, we also have to decide on the neural network architecture before bringing our data into the right shape.

4.1 Designing the Input Shape

Next, we create the training data based on which we will train our neural network. For this, we make multiple overlapping slices of the training data (x_train), which this article calls mini-batches. They are the input sequences of the model and should not be confused with the batches that the batch_size parameter defines during training. During the training process, the neural network processes the mini-batches one by one and creates a separate forecast for each of them. The illustration below shows the shape of the data:

Sample dataset for time series forecasting divided into several train batches.

Neural networks learn in an iterative process. In this process, the algorithm reduces the prediction errors by adjusting the connection strength between the neurons (weights). To evaluate the forecast quality, the model needs a second list (y_train) containing the actual price values from our ground truth. During training, the model compares its predictions with the ground truth and calculates the training error, which it minimizes over time.
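
The following toy example illustrates this iterative process with a single weight. It is a simplified sketch using plain gradient descent on made-up data, not the Adam optimizer that we use later, and all variable names are chosen just for this illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_true = 2.0 * x  # ground truth, generated with a "correct" weight of 2.0

weight = 0.0          # start with an uninformed guess
learning_rate = 0.01
errors = []
for step in range(200):
    y_pred = weight * x
    # Training error: mean squared difference between prediction and ground truth
    errors.append(np.mean((y_pred - y_true) ** 2))
    # Gradient of the error with respect to the weight
    gradient = np.mean(2 * (y_pred - y_true) * x)
    # Adjust the weight a little in the direction that reduces the error
    weight -= learning_rate * gradient

print(round(weight, 3))  # 2.0
print(errors[0] > errors[-1])  # True: the error shrinks over the iterations
```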

4.2 Data Preprocessing

The code below will carry out the steps to prepare the data:

# Set the sequence length - this is the timeframe used to make a single prediction
sequence_length = 50

# Prediction Index
index_Close = data.columns.get_loc("Close")
print(index_Close)
# Split the data into train and test data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_len = math.ceil(np_data.shape[0] * 0.8)

# Create the training and test data
train_data = np_data[0:train_data_len, :]
test_data = np_data[train_data_len - sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, sequence_length time steps per sample, and 1 feature (Close)
def partition_dataset(sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]
    for i in range(sequence_length, data_len):
        x.append(data[i-sequence_length:i,:]) # contains the sequence_length values preceding position i, for all columns
        y.append(data[i, index_Close]) # contains the target value (Close at position i) for single-step prediction
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(sequence_length, train_data)
x_test, y_test = partition_dataset(sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
print(x_test[1][sequence_length-1][index_Close])
print(y_test[0])
(1966, 50, 1)
(1966,)

x_train contains 1966 mini-batches. Each contains a series of quotes for 50 dates. In y_train, we have 1966 target values, one for each mini-batch. Be aware that the numbers depend on the timeframe and will vary depending on when you execute the code.
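
If the shapes are hard to picture, it can help to run the same slicing logic on a tiny series. The sketch below uses a hypothetical helper make_windows, which mirrors partition_dataset, on the numbers 0 to 9 with a sequence length of 3 instead of 50.

```python
import numpy as np

def make_windows(series, window):
    """Slice a (rows, 1) array into overlapping input windows and next-step targets."""
    x, y = [], []
    for i in range(window, len(series)):
        x.append(series[i - window:i])  # the `window` values before position i
        y.append(series[i, 0])          # the value at position i is the target
    return np.array(x), np.array(y)

toy = np.arange(10, dtype=float).reshape(-1, 1)  # values 0..9, one feature
x_toy, y_toy = make_windows(toy, window=3)

print(x_toy.shape, y_toy.shape)       # (7, 3, 1) (7,)
print(x_toy[0].ravel(), y_toy[0])     # [0. 1. 2.] 3.0
print(x_toy[-1].ravel(), y_toy[-1])   # [6. 7. 8.] 9.0
```

Ten values yield seven samples because the first three values are needed as history before the first target can be formed.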

Step #5 Designing the Model Architecture

Before we can train the model, we first need to decide on the architecture of the model. Above all, the architecture comprises the type and number of layers and the number of neurons in each layer.

5.1 Choosing Layers

How can we determine the number of layers? Selecting the correct number of layers from the start is difficult or even impossible. A common approach is to try different architectures and find out what works best by trial and error. Then the architecture and the performance of the univariate model are tested and refined in multiple iterations.

We will use a network structure with four layers: two layers of the LSTM class followed by two fully connected layers of the Dense class. I have chosen this architecture because it is comparatively simple and a good starting point for tackling time series problems. To define this structure, use the Dense class and LSTM class from the Keras deep-learning library.

The architecture of our recurrent Neural Network

5.2 Choosing the Number of Neurons

And how do we determine the number of neurons? There is no fixed rule. Keras only requires the declared input shape to match the data, which here means 50 time steps with one feature each. As a simple starting point, we give each LSTM layer as many neurons as there are time steps in the input, i.e., 50. In the last layer, we will have only one neuron, which means that our prediction will contain a single price point for a single time step.

# Configure the neural network model
model = Sequential()

neurons = sequence_length

# Model with one neuron per time step (neurons = sequence_length)
# Input shape: sequence_length time steps, 1 feature
model.add(LSTM(neurons, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))) 
model.add(LSTM(neurons, return_sequences=False))
model.add(Dense(25, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
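
To get a feeling for the size of this architecture, we can count its trainable parameters by hand. The sketch below assumes the standard LSTM layout with four gates and a bias per gate. The helper names lstm_params and dense_params are made up for this illustration; calling model.summary() in Keras should report the same kind of figures.

```python
def lstm_params(units, input_dim):
    """Parameters of an LSTM layer: 4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """Parameters of a Dense layer: one weight per input and unit, plus one bias per unit."""
    return units * input_dim + units

layer_params = {
    "LSTM 1 (50 units, 1 feature)": lstm_params(50, 1),
    "LSTM 2 (50 units, 50 inputs)": lstm_params(50, 50),
    "Dense (25 units)": dense_params(25, 50),
    "Dense (1 unit)": dense_params(1, 25),
}
for name, count in layer_params.items():
    print(f"{name}: {count}")

total = sum(layer_params.values())
print("Total:", total)  # Total: 31901
```

Most of the parameters sit in the second LSTM layer, because it receives the 50 outputs of the first layer at every time step.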

Step #6 Train the Univariate Model

Now it’s time to fit the model to the data. The training time may vary between seconds and minutes, depending on the computing power of your system. For instance, on my local notebook processor (Intel Core i7), the training time is usually a couple of minutes.

# Training the model
model.fit(x_train, y_train, batch_size=16, epochs=25)

We have fitted our model to the training data.
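
As a back-of-the-envelope check of what the batch_size and epochs settings mean, the following sketch counts the weight updates during training. It assumes the 1966 training samples from the shapes printed earlier, so your numbers will differ if your timeframe does.

```python
import math

samples = 1966    # number of training samples (depends on your timeframe)
batch_size = 16
epochs = 25

# In each epoch, the samples are processed in batches; the last batch may be smaller
steps_per_epoch = math.ceil(samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 123
print(total_updates)    # 3075
```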

Step #7 Making Test Predictions

So how does our stock market prediction model perform? To evaluate the model performance, we need to feed the model with the test data. For this purpose, we pass the test data (x_test) that we generated in a previous step to the model to get some predictions. Keep in mind that we initially scaled the input data to a range between 0 and 1. Therefore, we need to invert the MinMax scaling of the predictions before interpreting the results.

#Get the predicted values and inverse the scaling
predictions = model.predict(x_test)
predictions = mmscaler.inverse_transform(predictions)

Step #8 Evaluate Model Performance

Different indicators can help us evaluate the performance of our model. Because the predictions are now back on the original price scale, we first invert the scaling of the test targets (y_test) as well and then compare both series.

# Bring the test targets back to the original price scale
y_test_unscaled = mmscaler.inverse_transform(y_test.reshape(-1, 1))

# Calculate the mean absolute error (MAE)
mae = mean_absolute_error(y_test_unscaled, predictions)
print('MAE: ' + str(round(mae, 1)))

# Calculate the root mean squared error (RMSE)
rmse = np.sqrt(mean_squared_error(y_test_unscaled, predictions))
print('RMSE: ' + str(round(rmse, 1)))

The mean absolute error (MAE) is the average absolute difference between the predictions and the actual values, measured in index points. It is never negative, so it tells us how far off the predictions are on average, but not whether they tend to lie above or below the actual values.

The root mean squared error (RMSE) is also always non-negative and never smaller than the MAE. Larger errors have a stronger impact on the RMSE because they are squared. An RMSE close to the MAE therefore indicates that the prediction errors are of similar size throughout, while an RMSE much larger than the MAE points to a few large misses. The exact values depend on the timeframe of your data and on the training run.
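
To see how these error measures behave, here is a small worked example on made-up numbers. It also includes the mean error (ME), which keeps the sign of the deviations and therefore shows whether predictions tend to be too high or too low. The variable names are chosen just for this sketch.

```python
import numpy as np

actual = np.array([100.0, 100.0, 100.0, 100.0])
predicted = np.array([102.0, 98.0, 105.0, 100.0])

errors = predicted - actual             # [ 2. -2.  5.  0.]
me = errors.mean()                      # signed: positive means predictions are too high on average
mae = np.abs(errors).mean()             # average size of the error, never negative
rmse = np.sqrt((errors ** 2).mean())    # penalizes large errors more strongly

print(f"ME: {me:.2f}")      # ME: 1.25
print(f"MAE: {mae:.2f}")    # MAE: 2.25
print(f"RMSE: {rmse:.2f}")  # RMSE: 2.87
```

The single large miss of 5 points pulls the RMSE above the MAE, while the positive and negative errors partly cancel out in the ME.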

Visualizing test predictions helps in the process of evaluating the model. Therefore we will plot predicted and valid values.

# The date from which the data is displayed
display_start_date = "2018-01-01" 

# Add the difference between the valid and predicted prices
train = data[:training_data_length + 1]
valid = data[training_data_length:].copy() # copy, so that we do not modify a slice of data
valid.insert(1, "Predictions", predictions.ravel(), True) # flatten (n, 1) predictions to a 1-D column
valid.insert(1, "Difference", valid["Predictions"] - valid["Close"], True)

# Zoom in to a closer timeframe
valid = valid[valid.index > display_start_date]
train = train[train.index > display_start_date]

# Visualize the data
fig, ax1 = plt.subplots(figsize=(22, 10), sharex=True)
xt = train.index; yt = train[["Close"]]
xv = valid.index; yv = valid[["Close", "Predictions"]]
plt.title("Predictions vs Ground Truth", fontsize=20)
plt.ylabel(stockname, fontsize=18)
plt.plot(yt, color="#039dfc", linewidth=2.0)
plt.plot(yv["Predictions"], color="#E91D9E", linewidth=2.0)
plt.plot(yv["Close"], color="black", linewidth=2.0)
plt.legend(["Train", "Test Predictions", "Ground Truth"], loc="upper left")

# Fill between plotlines
ax1.fill_between(xt, 0, yt["Close"], color="#b9e1fa")
ax1.fill_between(xv, 0, yv["Predictions"], color="#F0845C")
ax1.fill_between(xv, yv["Close"], yv["Predictions"], color="grey") 

# Create the bar plot with the differences
x = valid.index
y = valid["Difference"]
plt.bar(x, y, width=5, color="black")
plt.grid()
plt.show()

We can see that the orange zone contains the test predictions. The grey area marks the difference between test predictions and ground truth. As already indicated by the different performance measures, we can see that the predictions are typically near the ground truth.

Prediction vs. Ground Truth

We have also added the prediction errors as bars at the bottom. They show the difference between the predicted and the actual price. Where the difference is positive, the predicted value was too high (too optimistic). Where the difference is negative, the predicted value was too low (too pessimistic).

Step #9 Stock Market Prediction – Predicting a Single Day Ahead

Now that we have tested our model, we can use it to make a prediction. For this, we use a new data set as the input for our prediction model. The model returns a forecast for a single time-step, which in our case is the next day.

# Get fresh data until today
df_new = df.filter(['Close'])

# Get the last N day closing price values and scale the data to be values between 0 and 1
last_days_scaled = mmscaler.transform(df_new[-sequence_length:].values)

# Create an empty list and Append past n days
X_test = []
X_test.append(last_days_scaled)

# Convert the X_test data set to a numpy array and reshape the data
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Get the predicted scaled price, undo the scaling and output the predictions
pred_price = model.predict(X_test)
pred_price_unscaled = mmscaler.inverse_transform(pred_price)

# Print last price and predicted price for the next day
price_today = round(df_new['Close'].iloc[-1], 2)
predicted_price = round(pred_price_unscaled.ravel()[0], 2)
# Relative change of the predicted price compared to the last close, in percent
change_percent = round((predicted_price - price_today) * 100 / price_today, 2)

print(f'The close price for {stockname} at {today} was {price_today}')
print(f'The predicted close price is {predicted_price} ({change_percent:+}%)')
The price for S&P500 at 2020-04-04 was: 2578.0
The predicted S&P500 price at date 2020-04-05 is: 2600.0

So, the model predicts a closing value of 2600.0 for the S&P500 on the following day.
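
To put such a forecast into perspective, we can express it as a percentage change relative to the last closing price. The helper percent_change below is a made-up name for this sketch, and the two prices are the ones from the output above.

```python
def percent_change(last_price, predicted_price):
    """Relative change from the last price to the predicted price, in percent."""
    return (predicted_price - last_price) / last_price * 100

change = percent_change(2578.0, 2600.0)
print(f"{change:+.2f}%")  # +0.85%
```

A positive sign means that the model expects the index to close higher than on the last day, a negative sign that it expects a lower close.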

Summary

In this tutorial, you have learned to create, train and test a four-layered recurrent neural network for stock market prediction using Python and Keras. Finally, we have used this model to predict the S&P500 stock market index. You can easily create models for other assets by replacing the stock symbol with another stock code. A list of common symbols for stocks or stock indexes is available on finance.yahoo.com. Just don’t forget to retrain the model with a fresh copy of the price data.

The model created in this post makes predictions for a single time step. If you want to learn how to make time-series predictions that range further into the future, you might want to check out Part II of this tutorial series: Creating a Multistep Forecast in Python.

I hope you enjoyed this article. Let me know in the comments if you have questions!

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.
