
Stock Market Prediction – Adjusting Time Series Prediction Intervals in Python

In a previous article on stock market forecasting, we created a forecast for the S&P500 stock market index using a neural network and Python. The prediction interval in that article was a single day. However, many time series prediction problems require forecasts that reach further ahead, say several days, weeks, or months. This article shows how to adjust the prediction interval to create a single-step forecast for a longer time frame.

The remainder of this article is structured as follows. We begin with a brief overview of different methods that we can use to adjust the time series prediction interval. Once we are familiar with these concepts, we turn to the practical part in Python: we prepare the stock market data so that it contains weekly quotes, train a simple neural network on it, and validate the model's performance. Finally, we use the model to forecast a single but more extended step into the future.

Different Ways to Adjust Prediction Intervals

When forecasting a single step, the prediction interval is the period covered by the next value that the model predicts. There are three different ways to change this interval:

  • Single-Step Forecasting with Bigger Time Steps: In a single-step forecasting approach, the length of a time step is defined by the input data. For example, a model that uses daily prices as input will also provide daily forecasts. Changing the length of the input steps changes the output steps to the same extent. This is the approach we cover in this article.
  • Multi-Step Rolling Forecasting: Another way is to train the model on its own output. We do this by retaining the predictions from one run and reusing them as input in the subsequent forecasting iteration. In this way, the predictions reach one time step further ahead with each iteration. Based on daily input time steps, the model will have produced a weekly forecast after seven iterations. I have covered this topic in a separate post.
  • Deep Multi-Output Forecasting: A third option is to create a neural network that does not predict a single value for one time step but instead outputs a whole series of predictions for multiple time steps at once. A minimal sketch of such a model follows below.
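To give a rough idea of the third option, here is a minimal sketch of a multi-output network in Keras. It is not the model used in this article, and the layer sizes are illustrative assumptions: instead of a single output neuron, the final Dense layer has seven units, so a model fed with 50 daily input steps returns forecasts for the next seven days at once.

# Illustrative sketch of a multi-output model (approach 3) - not the model used in this article
from keras.models import Sequential
from keras.layers import LSTM, Dense

multi_output_model = Sequential()
multi_output_model.add(LSTM(50, input_shape=(50, 1)))  # 50 daily input steps, 1 feature (Close)
multi_output_model.add(Dense(7))                       # 7 output neurons = forecasts for the next 7 days
multi_output_model.compile(optimizer="adam", loss="mean_squared_error")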

Predicting the Price of the S&P500 One Week Ahead

In this article, we cover the first approach: single-step forecasting with longer time steps. For this purpose, we reuse most of the code from the previous article on univariate single-step daily forecasting.

As always, you can find the Python code on GitHub.

Prerequisites

Before we start the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment yet, you can follow these steps to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: math, NumPy, pandas, datetime, and Matplotlib.

In addition, we will be using Keras (2.0 or higher) with the TensorFlow backend, the machine learning library scikit-learn, and pandas-datareader (or, alternatively, yfinance) to retrieve the price data from the Yahoo Finance API.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)

Step #1 Load the Data

In the following, we will modify the prediction interval of the neural network model that we have developed in a previous post. As a result, the model will generate predictions for the market price of the S&P500 Index that range one week ahead.

As before, we start loading the stock market data via an API.

# Mathematical functions 
import math 
# Fundamental package for scientific computing with Python
import numpy as np 
# Additional functions for analysing and manipulating data
import pandas as pd 
# Date Functions
from datetime import date, timedelta
# This function adds plotting functions for calendar dates
from pandas.plotting import register_matplotlib_converters
# Important package for visualization - we use this to plot the market data
import matplotlib.pyplot as plt 
# Formatting dates
import matplotlib.dates as mdates
# Packages for measuring model performance / errors
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Tools for predictive data analysis. We will use the MinMaxScaler to normalize the price data 
from sklearn.preprocessing import MinMaxScaler 
# Deep learning library, used for neural networks
from keras.models import Sequential 
# Deep learning classes for recurrent and regular densely-connected layers
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping

# Setting the timeframe for the data extraction
today = date.today()
date_today = today.strftime("%Y-%m-%d")
date_start = '2010-01-01'

# Getting S&P500 quotes
stockname = 'S&P500'
symbol = '^GSPC'

# You can either use webreader or yfinance to load the data from yahoo finance
# import pandas_datareader as webreader
# df = webreader.DataReader(symbol, start=date_start, end=date_today, data_source="yahoo")

import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=date_start, end=date_today)

df.head(5)

Step #2 Adjusting the Shape of the Input Data and Exploration

Now we have a DataFrame that contains the daily price quotes for the S&P500. If we want our model to provide weekly price predictions, we need to change the data so that the input contains weekly price quotes. A simple way to achieve this is to iterate through the rows and keep only every 7th row.

# Changing the data structure to a dataframe with weekly price quotes
df["index1"] = range(1, len(df) + 1)
rownumber = df.shape[0]
lst = list(range(rownumber))
list_of_relevant_numbers = lst[0::7]
df_weekly = df[df["index1"].isin(list_of_relevant_numbers)]
df_weekly.head(5)
Initial dataset
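Note that keeping every 7th trading day actually spans roughly a week and a half of calendar time, since markets trade about five days per week. An alternative, shown here only as a sketch, is to resample the daily quotes to calendar weeks with pandas, for example by keeping the last available close of each week.

# Alternative (sketch): resample the daily quotes to calendar weeks
# 'W-FRI' ends each weekly bucket on Friday; .last() keeps the final available quote of that week
df_weekly_resampled = df.resample('W-FRI').last().dropna()
df_weekly_resampled.head(5)

In the remainder of this article, we continue with df_weekly as created above.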

After this, we quickly create a line plot to validate that everything looks as expected.

# Visualizing the data
register_matplotlib_converters()
years = mdates.YearLocator()
fig, ax1 = plt.subplots(figsize=(16, 6))
ax1.xaxis.set_major_locator(years)
x = df_weekly.index
y = df_weekly["Close"]
#ax1.fill_between(x, 0, y, color="#b9e1fa")
plt.title(stockname + " Price between " + date_start + " and " + date_today, fontsize=20)
plt.plot(y, color="#039dfc", label="S&P500", linewidth=1.0)
ax1.legend([stockname], fontsize=16)
plt.ylabel("S&P500 Points", fontsize=20)
plt.show()
weekly data on the S&P500 index

Step #3 Preprocess the Data

Before we can train the neural network, we first need to define the shape of the training data. We use weekly price quotes and define a sequence_length of 50 weeks, so each prediction is based on the prices of the preceding 50 weeks.

# Feature Selection - only the Close price of the weekly data
train_df = df_weekly.filter(['Close'])
data_unscaled = train_df.values

# Transform features by scaling each feature to a range between 0 and 1
mmscaler = MinMaxScaler(feature_range=(0, 1))
np_data = mmscaler.fit_transform(data_unscaled)

# Creating a separate scaler that works on a single column for scaling the predictions
scaler_pred = MinMaxScaler()
df_Close = pd.DataFrame(df_weekly['Close'])
np_Close_scaled = scaler_pred.fit_transform(df_Close)

# Set the sequence length - this is the timeframe used to make a single prediction
sequence_length = 50

# Prediction Index
index_Close = train_df.columns.get_loc("Close")

# Split the data into training and test sets
# As a first step, we get the number of rows to train the model on 80% of the data
train_data_length = math.ceil(np_data.shape[0] * 0.8)

# Create the training and test data
train_data = np_data[0:train_data_length, :]
test_data = np_data[train_data_length - sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, sequence_length time steps per sample, and one feature (the Close price)
def partition_dataset(sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]
    for i in range(sequence_length, data_len):
        x.append(data[i-sequence_length:i, :])  # each sample contains the previous sequence_length values
        y.append(data[i, index_Close])  # the value to predict (Close) for single-step prediction

    # Convert x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(sequence_length, train_data)
x_test, y_test = partition_dataset(sequence_length, test_data)

# Print the shapes: (samples, sequence_length, features) for x and (samples, ) for y
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
print(x_test[1][sequence_length-1][index_Close])
print(y_test[0])

Out: (245, 50, 1)

Step #4 Building a Time Series Prediction Model

The first layer of our neural network needs to fit the shape of the input data: 50 time steps with one feature each. We therefore use 50 LSTM units in the first layer, one for each weekly price quote in the input sequence.

The model architecture of the recurrent neural network

We use the following input arguments for the model fit:

  • x_train: Vector, matrix, or array of training data. It can also be a list if the model has multiple inputs (not the case here).
  • y_train: Vector, matrix, or array of target data. This is the labeled data that the model tries to predict; in other words, these are the target values for x_train.
  • epochs: Integer value that defines how many times the model goes through the training set.
  • batch_size: Integer value that defines the number of samples that are propagated through the network before the weights are updated.

# Configure the neural network model
model = Sequential()

# Model with n_neurons Neurons
n_neurons = x_train.shape[1] * x_train.shape[2]
print(n_neurons, x_train.shape[1], x_train.shape[2])
model.add(LSTM(n_neurons, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(LSTM(n_neurons, return_sequences=False))
model.add(Dense(25, activation="relu"))
model.add(Dense(1))

# Compile the model
model.compile(optimizer="adam", loss="mean_squared_error")
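The listing above configures and compiles the network but does not show the training call itself. A minimal sketch of the fit could look as follows; the number of epochs and the batch size are illustrative assumptions, and the EarlyStopping callback imported earlier stops training once the loss no longer improves:

# Train the model (epochs and batch_size are illustrative assumptions)
early_stop = EarlyStopping(monitor="loss", patience=5, restore_best_weights=True)
history = model.fit(x_train, y_train, epochs=50, batch_size=16, callbacks=[early_stop])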
Multiple epochs of model training

Step #5 Evaluate Model Performance

Next, we validate the model by calculating several error metrics for our predictions: the mean absolute error (MAE), the mean and median absolute percentage errors (MAPE, MDAPE), and the root mean squared error (RMSE). In addition, we plot the data to see how well our model has performed over the test timeframe.

# Get the predicted values
y_pred_scaled = model.predict(x_test)
y_pred = scaler_pred.inverse_transform(y_pred_scaled)
y_test_unscaled = scaler_pred.inverse_transform(y_test.reshape(-1, 1))

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred)
print(f'Mean Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')
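
# Root Mean Squared Error (RMSE) - a small addition to the original listing (a sketch),
# since the error values reported below also include an RMSE figure
RMSE = np.sqrt(mean_squared_error(y_test_unscaled, y_pred))
print(f'Root Mean Squared Error (RMSE): {np.round(RMSE, 2)}')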

# The date from which on the data is displayed
display_start_date = "2018-01-01" 

# Add the difference between the actual and predicted prices
train = train_df[:train_data_length + 1]
valid = train_df[train_data_length:].copy()
valid.insert(1, "Predictions", y_pred, True)
valid.insert(1, "Difference", valid["Predictions"] - valid["Close"], True)

# Zoom in to a closer timeframe
valid = valid[valid.index > display_start_date]
train = train[train.index > display_start_date]

# Visualize the data
fig, ax = plt.subplots(figsize=(16, 8), sharex=True)

plt.title("Predictions vs Ground Truth", fontsize=20)
plt.ylabel(stockname, fontsize=18)
plt.plot(train["Close"], color="#039dfc", linewidth=1.0)
plt.plot(valid["Predictions"], color="#E91D9E", linewidth=1.0)
plt.plot(valid["Close"], color="black", linewidth=1.0)
plt.legend(["Train", "Test Predictions", "Ground Truth"], loc="upper left")

# Create the bar plot with the differences
valid.loc[valid["Difference"] >= 0, 'diff_color'] = "#2BC97A"
valid.loc[valid["Difference"] < 0, 'diff_color'] = "#C92B2B"
plt.bar(valid.index, valid["Difference"], width=0.8, color=valid['diff_color'])

plt.show()
Out: 114.9 (RMSE) / Out: 155.9 (ME)
Time Series Prediction: S&P500 Predictions vs Actual Values

At the bottom, we can see the differences between the predictions and the actual values. Positive values signal that the predictions were too optimistic; negative values mean that the predictions were too pessimistic and that the actual value turned out to be higher than predicted.

Step #6 Predicting for the Next Week

Now we can use the model to predict next week’s price for the S&P500.

# Create a new dataframe with the most recent weekly price data (Close column only)
new_df = train_df[-sequence_length:]

N = sequence_length

# Get the last N weekly closing price values and scale the data to be values between 0 and 1
last_N_days = new_df[-N:].values
last_N_days_scaled = mmscaler.transform(last_N_days)

# Create an empty list and append the past N values
X_test_new = []
X_test_new.append(last_N_days_scaled)

# Convert the X_test data set to a numpy array, let the model predict, and rescale the prediction
pred_price_scaled = model.predict(np.array(X_test_new))
pred_price_unscaled = scaler_pred.inverse_transform(pred_price_scaled.reshape(-1, 1))

# Print the last price and the predicted price for the next week
price_today = np.round(new_df['Close'].iloc[-1], 2)
predicted_price = np.round(pred_price_unscaled.ravel()[0], 2)
change_percent = np.round(100 - (price_today * 100) / predicted_price, 2)

plus = '+'; minus = ''
print(f'The close price for {stockname} at {today} was {price_today}')
print(f'The predicted close price is {predicted_price} ({plus if change_percent > 0 else minus}{change_percent}%)')

Out: 2652.3

So for the 9th of April 2020, the model predicts that the S&P500 will close at:

2652.3

Considering that today’s (2nd of April 2020) price is 2528 points, our model expects the S&P to gain roughly 124 points in the coming seven days. Of course, this is by no means financial advice. As we have seen before, our model is often wrong.
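Because the model was trained on (roughly) weekly steps, the forecast refers to about one week after the last quote in the input window. As a small add-on, and assuming this weekly spacing, we can derive the approximate target date of the forecast with the timedelta import from above:

# Approximate date the forecast refers to (assumption: about 7 calendar days after the last input quote)
pred_date = new_df.index[-1] + timedelta(days=7)
print(f'The prediction refers to approximately {pred_date.strftime("%Y-%m-%d")}')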

Summary

This article has shown how to adjust the prediction interval of a time series forecasting model. We created a neural network that predicts the price of the S&P500 one week ahead, trained and validated it, and finally used it to make a forecast for the coming week.

Varying the input shape is a quick approach to changing the forecasting time steps. However, increasing the length of the time steps also reduces the amount of data we can use for training and testing. In our case, we still have enough data available. But in other cases, where less data is available, this can become a problem. In such a case, the preferred method is to use a rolling forecast approach or create a multi-output forecast.

I hope this article was helpful. Should you have questions or remarks, let me know in the comments.

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.
