<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Supervised Learning Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/tag/supervised-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/tag/supervised-learning/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Thu, 13 Jul 2023 12:10:12 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Supervised Learning Archives - relataly.com</title>
	<link>https://www.relataly.com/tag/supervised-learning/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Univariate Stock Market Forecasting using Facebook Prophet in Python</title>
		<link>https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/</link>
					<comments>https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Thu, 15 Dec 2022 22:54:34 +0000</pubDate>
				<category><![CDATA[CryptoCompare API]]></category>
		<category><![CDATA[Facebook Prophet]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[REST APIs]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Stock Market Forecasting]]></category>
		<category><![CDATA[Time Series Forecasting]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Yahoo Finance API]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=10351</guid>

					<description><![CDATA[<p>Have you ever wondered how Facebook predicts the future? Meet Facebook Prophet, the open-source time series forecasting tool developed by Facebook&#8217;s Core Data Science team. Built on top of the PyStan library, Facebook Prophet offers a simple and intuitive interface for creating forecasts using historical data. What sets Facebook Prophet apart is its highly modular ... <a title="Univariate Stock Market Forecasting using Facebook Prophet in Python" class="read-more" href="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/" aria-label="Read more about Univariate Stock Market Forecasting using Facebook Prophet in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/">Univariate Stock Market Forecasting using Facebook Prophet in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Have you ever wondered how Facebook predicts the future? Meet Facebook Prophet, the open-source time series forecasting tool developed by Facebook&#8217;s Core Data Science team. Built on top of the PyStan library, Facebook Prophet offers a simple and intuitive interface for creating forecasts using historical data. What sets Facebook Prophet apart is its highly modular design, allowing for a range of customizable components that can be combined to create a wide variety of forecasting models. This makes it perfect for modeling data with strong seasonal effects, like daily or weekly patterns, and it can handle missing data and outliers with ease. In this tutorial, we will take a closer look at the capabilities of Facebook Prophet and see how it can be used to make accurate predictions.</p>



<p>We begin with a brief discussion of how Facebook Prophet decomposes a time series into different components. Then we turn to the hands-on part, in which we use the model in Python to generate a stock market forecast. We will train our Facebook Prophet model on the historical price of the Coca-Cola stock. We will also cover different options to customize the model settings.</p>



<p class="has-accent-color has-text-color has-background" style="background:linear-gradient(135deg,rgb(255,206,236) 68%,rgba(150,149,240,0.4) 100%)"><strong>Disclaimer</strong>: This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only illustrate machine learning use cases.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="10371" data-permalink="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/image-31-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-31.png" data-orig-size="474,143" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-31" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-31.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-31.png" alt="Facebook Prophet - an open-source tool for univariate time series forecasting" class="wp-image-10371" width="380" height="113"/><figcaption class="wp-element-caption">Facebook Prophet &#8211; an open-source tool for time series forecasting</figcaption></figure>
</div>
</div>



<div style="height:34px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">What is Facebook Prophet?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Facebook Prophet is a tool that can be used to make predictions about future events based on historical data. It was developed by <a href="https://peerj.com/preprints/3190/" target="_blank" rel="noreferrer noopener">Taylor and Letham, 2017</a>, who later made it available as an open-source project. The authors developed Facebook Prophet to solve various business forecasting problems without requiring much prior knowledge. In this way, the framework addresses a significant problem many companies face today. They have various prediction problems (e.g., capacity and demand forecasting) but face a skill gap when it comes to generating reliable forecasts with techniques such as ARIMA or neural networks. Compared to that, Facebook Prophet requires minimal fine-tuning and can deal with various challenges, including seasonality, outliers, and changing trend lines. This allows Facebook Prophet to handle a wide range of forecasting problems flexibly. Before we dive into the hands-on part, let&#8217;s gain a quick overview of how Facebook Prophet works.</p>



<p>Also: <a href="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/" target="_blank" rel="noreferrer noopener">Stock Market Prediction using Multivariate Time Series</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="1024" data-attachment-id="12356" data-permalink="https://www.relataly.com/an_ancient_prophet_looking_into_a_crystal_ball/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="an_ancient_prophet_looking_into_a_crystal_ball" data-image-description="&lt;p&gt;time series forecasting with facebook prophet python tutorial&lt;/p&gt;
" data-image-caption="&lt;p&gt;time series forecasting with facebook prophet python tutorial&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png" src="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball-1024x1024.png" alt="time series forecasting with facebook prophet python tutorial" class="wp-image-12356" srcset="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 1024w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 140w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Time-series forecasting with Facebook Prophet. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">How Facebook Prophet Works</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Facebook Prophet uses a technique called additive regression to model time series data. This involves breaking the time series into a series of components:</p>



<ul class="wp-block-list">
<li>Trends</li>



<li>Seasonality</li>



<li>Holiday</li>
</ul>



<p>Traditional time series <a href="https://www.relataly.com/category/machine-learning-algorithms/arima-models/" target="_blank" rel="noreferrer noopener">methods such as (S)ARIMA</a> base their prediction on a model that weights the linear sum of past observations or lags. Facebook&#8217;s Prophet is similar in that it uses a decreasing weight for past observations. This means current observations have a higher significance for the model than those that date back a long time. It then models each component separately using a combination of linear and non-linear functions. Finally, Facebook Prophet combines these components to form the complete forecast model. Let&#8217;s take a closer look at these components and how Facebook Prophet handles them.</p>
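To make the additive idea concrete, here is a minimal, self-contained sketch (plain numpy, not Prophet itself; all values are invented for illustration) that builds a toy series as trend + seasonality + noise, the same structural form Prophet fits to real data:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(365)                        # one year of daily observations

trend = 50 + 0.05 * t                     # g(t): slowly rising linear trend
weekly = 2.0 * np.sin(2 * np.pi * t / 7)  # s(t): weekly seasonal component
noise = rng.normal(0, 0.5, size=t.size)   # irregular remainder

y = trend + weekly + noise                # additive model: y(t) = g(t) + s(t) + noise

# Because the model is additive, subtracting one component recovers the others
detrended = y - trend
print(detrended[:7].round(1))
```

Prophet estimates each of these components from the data; the point here is only that the final forecast is their sum.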



<h4 class="wp-block-heading">A) Dealing with Trends</h4>



<p>Time series often have a trendline. More often than not, however, a series does not follow a single trend but consists of several trend segments separated by breakpoints. Facebook Prophet handles such trends in several ways. First, the model tries to identify the breakpoints (changepoints) in the time series that separate periods with different trendlines. Facebook Prophet then uses these inflection points between periods to fit the model to the data and create the forecast. In addition, trends do not have to be linear: Prophet also supports saturating (logistic) growth curves. All of this happens automatically, but it is also possible to specify changepoints manually.</p>
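The breakpoint idea can be illustrated without Prophet: a piecewise-linear trend is simply a linear trend whose slope changes at known changepoints. A minimal numpy sketch (the changepoint location, slope, and offset are made-up values for illustration):

```python
import numpy as np

t = np.arange(100, dtype=float)
changepoint = 60                 # hypothetical breakpoint separating two trend regimes

k, m = 0.5, 10.0                 # base slope and offset
delta = -0.8                     # slope adjustment that takes effect after the changepoint

# Piecewise-linear trend: slope k before the changepoint, k + delta after it
trend = m + k * t + delta * np.maximum(t - changepoint, 0)

slope_before = trend[50] - trend[49]   # 0.5  (= k)
slope_after = trend[90] - trend[89]    # -0.3 (= k + delta)
print(slope_before, slope_after)
```

Prophet fits many such slope adjustments at candidate changepoints and regularizes them, so most end up near zero and only genuine trend breaks remain.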



<h4 class="wp-block-heading">B) Seasonality</h4>



<p>Facebook Prophet works very well when the data shows a strong seasonal pattern. It uses Fourier series (sums of sine and cosine terms at different frequencies) to account for daily, weekly, and yearly seasonality. Facebook Prophet is flexible about the type of data you have and allows you to adjust the seasonal components. By default, Facebook Prophet assumes daily data with weekly and yearly seasonal effects. If your data differs from this standard, for example, weekly data with monthly seasonality, you need to adjust the seasonal terms accordingly.</p>
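Under the hood, a seasonal component is a weighted sum of sine and cosine harmonics of the seasonal period. The sketch below (plain numpy, mirroring the idea rather than calling Prophet's internals) builds such Fourier features for a yearly seasonality with three harmonics:

```python
import numpy as np

def fourier_features(t, period, order):
    """Return one sin and one cos column for each of `order` harmonics of `period`."""
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

t = np.arange(730)                        # two years of daily timestamps
X = fourier_features(t, period=365.25, order=3)
print(X.shape)                            # 2 columns (sin, cos) per harmonic
```

A higher order lets the seasonal curve take more complex shapes at the risk of overfitting; a regression on these columns yields the smooth seasonal pattern.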



<h4 class="wp-block-heading">C) Holiday</h4>



<p>Every year, public holidays can lead to strong deviations in a time series; think, for example, of the extra load when more people visit the Facebook website on certain days. The Facebook Prophet model accounts for such special events by letting us specify binary indicators that mark whether a given day is a public holiday. If you have other non-holiday events that recur yearly, you can use this indicator for the same purpose. Usually, Facebook Prophet will automatically discount outliers in the data. But if an outlier occurs on a day flagged as a public holiday, Facebook Prophet will adjust its model accordingly. </p>
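Such a binary holiday indicator is easy to construct with pandas. In the sketch below, the dates are arbitrary examples and the column name is our own choice, not a Prophet requirement:

```python
import pandas as pd

# Daily timestamps spanning the turn of the year
df = pd.DataFrame({"ds": pd.date_range("2022-12-20", periods=14, freq="D")})

# Hypothetical list of public holidays falling inside the date range
holidays = pd.to_datetime(["2022-12-25", "2022-12-26", "2023-01-01"])

# Binary indicator: 1 if the day is a holiday, 0 otherwise
df["is_holiday"] = df["ds"].isin(holidays).astype(int)
print(df[df["is_holiday"] == 1])
```

Prophet accepts this kind of information as a dataframe of event dates, so deviations on those days are attributed to the event rather than treated as noise.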
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h3 class="wp-block-heading">Hyperparameter Tuning and Customization</h3>



<p>Facebook Prophet fits its model parameters automatically and ships with sensible defaults for hyperparameters such as the flexibility of the trend and the strength of the seasonal components, so it often produces usable forecasts out of the box. Once the model is trained, it can be used to predict future values in the time series. However, users with strong domain knowledge may prefer to tweak these parameters themselves, and Facebook Prophet provides several functions for this purpose. It also includes a range of tools for model evaluation and diagnostics, as well as for visualizing the model and the input data.</p>



<p>Also: <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" target="_blank" rel="noreferrer noopener">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> </p>



<h3 class="wp-block-heading">Application Domains</h3>



<p>Facebook Prophet is a powerful forecasting tool that has been specifically designed to make forecasting easy. As mentioned, Prophet is easy to use and can flexibly handle various forecasting problems. In addition, it requires very little preprocessing to generate accurate forecasts. As a result of these advantages, Facebook Prophet has been adopted by various application domains. Some possible application domains for Facebook Prophet include:</p>



<ul class="wp-block-list">
<li>Sales forecasting: Facebook Prophet can be used to predict future sales of a product or service, based on historical sales data. This can be useful for businesses to plan their inventory and staffing, and to make informed decisions about future investments and growth.</li>



<li>Financial forecasting: Facebook Prophet can be used to predict future stock prices, currency exchange rates, or other financial metrics. This can be useful for investors and financial analysts to make informed decisions about the market.</li>



<li>Traffic forecasting: Facebook Prophet can be used to predict future traffic on a website or mobile app based on historical data. This can be useful for businesses to plan for capacity and optimize their servers and infrastructure.</li>



<li>Energy consumption forecasting: Facebook Prophet can be used to predict future energy consumption based on historical data. This can be useful for utilities and energy companies to plan for demand and optimize their generation and distribution.</li>
</ul>



<h2 class="wp-block-heading">When to Use Facebook Prophet?</h2>



<p>Although Facebook Prophet is applicable in any domain where time series data is available, it is most effective when certain conditions are met. These include univariate time series data with prominent seasonal effects and an extensive historical record spanning multiple seasons. Facebook Prophet is especially beneficial when dealing with large quantities of historical data that require efficient analysis and quick, accurate predictions of future trends.</p>



<p>Also: <a href="https://www.relataly.com/multi-step-time-series-forecasting-a-step-by-step-guide/275/" target="_blank" rel="noreferrer noopener">Rolling Time Series Forecasting: Creating a Multi-Step Prediction</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:29px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h2 class="wp-block-heading" id="h-using-facebook-prophet-to-forecast-the-coca-cola-stock-price-in-python">Using Facebook Prophet to Forecast the Coca-Cola Stock Price in Python</h2>



<p>In this hands-on tutorial, we&#8217;ll use Facebook Prophet and Python to create a forecast for the Coca-Cola stock price. We have chosen Coca-Cola as an example because the Coca-Cola share is known to be a cyclical stock. As such, its chart reflects a seasonal pattern, different periods, and varying trend lines. We train our model on historical price data and then predict the price half a year in advance. In addition, we will discuss how we could fine-tune our model to improve the accuracy of the predictions further. This involves the following steps:</p>



<ol class="wp-block-list">
<li>Collect historical stock data for Coca-Cola and familiarize ourselves with the data.</li>



<li>Use Facebook Prophet to fit a model to the data.</li>



<li>Use the model to make predictions about the future stock price of Coca-Cola.</li>



<li>Visualize model components and predictions.</li>



<li>Manually adjust the model to improve the model fit.</li>
</ol>



<p>By following these steps, we will try to gain insights into the future performance of Coca-Cola stock. Let&#8217;s get started!</p>



<p>As always, you can find the code of this tutorial on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_883036-a5"><a class="kb-button kt-button button kb-btn_c85f7c-32 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/01%20Time%20Series%20Forecasting%20%26%20Regression/011%20Time%20Series%20Forecasting%20using%20Facebooks&#039;%20Prophet.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_db3037-b2 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before you proceed, ensure that you have set up your&nbsp;Python&nbsp;environment (3.8 or higher) and the required packages. If you don’t have an environment, consider following&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>. </p>



<p>Also, make sure you install all required Python packages. We will be working with the following standard Python packages:&nbsp;</p>



<ul class="wp-block-list">
<li>pandas</li>



<li>seaborn</li>



<li>matplotlib</li>
</ul>



<p>In addition, we will use the Facebook Prophet library that goes by the library name &#8220;prophet.&#8221; You can install these packages using the following commands:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">pip install &lt;package name&gt;
conda install &lt;package name&gt; (if you are using the anaconda packet manager)</pre></div>



<h3 class="wp-block-heading">Step #1 Loading Packages and API Key</h3>



<p>Let&#8217;s begin by loading the required Python packages and historical price quotes for the Coca-Cola stock. We will obtain the data from the Yahoo Finance API. Note that the API returns several columns of data, including opening, high, low, and closing prices. We will only use the closing price. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Tested with Python 3.8.8, Matplotlib 3.5, Seaborn 0.11.1, numpy 1.19.5, plotly 4.1.1, cufflinks 0.17.3, prophet 1.1.1, CmdStan 2.31.0
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
from math import log, exp 
from datetime import date, timedelta, datetime
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from scipy.stats import norm
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot
import cmdstanpy
cmdstanpy.install_cmdstan() # one-time CmdStan setup; add compiler=True on Windows to also install a C++ toolchain
# Setting the timeframe for the data extraction
end_date =  date.today().strftime(&quot;%Y-%m-%d&quot;)
start_date = '2010-01-01'
# Getting quotes
stockname = 'Coca Cola'
symbol = 'KO'
# You can either use webreader or yfinance to load the data from yahoo finance
# import pandas_datareader as webreader
# df = webreader.DataReader(symbol, start=start_date, end=end_date, data_source=&quot;yahoo&quot;)
import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=start_date, end=end_date)
# Quick overview of dataset
print(df.head())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date                                                                       
2010-01-04  28.580000  28.610001  28.450001  28.520000  19.081614  13870400
2010-01-05  28.424999  28.495001  28.070000  28.174999  18.850786  23172400
2010-01-06  28.174999  28.219999  27.990000  28.165001  18.844103  19264600
2010-01-07  28.165001  28.184999  27.875000  28.094999  18.797268  13234600
2010-01-08  27.730000  27.820000  27.375000  27.575001  18.449350  28712400</pre></div>



<p>Once we have downloaded the data, we create a line plot of the closing price to familiarize ourselves with the time series data. Note that Facebook Prophet works on a single input signal only (univariate data). This input will be the closing price. For illustration purposes, we add a moving average to the chart. While the moving average makes it easier to spot trends and seasonal patterns, it will not be used to fit the model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualize the original time series
rolling_window=25
y_a_add_ma = df['Close'].rolling(window=rolling_window).mean() 
fig, ax = plt.subplots(figsize=(20,5))
sns.lineplot(data=df, x=df.index, y='Close', color='skyblue', linewidth=0.5, label='Close')
sns.lineplot(data=df, x=df.index, y=y_a_add_ma, 
    linewidth=1.0, color='royalblue', linestyle='--', label=f'{rolling_window}-Day MA')</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="284" data-attachment-id="10876" data-permalink="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/image-10-15/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-10.png" data-orig-size="1614,448" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-10" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-10.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-10-1024x284.png" alt="lineplot with historical price quotes of the Coca-cola stock since 2010" class="wp-image-10876" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-10.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-10.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-10.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-10.png 1536w, https://www.relataly.com/wp-content/uploads/2022/12/image-10.png 1614w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The chart shows a long-term upward trend interrupted by phases of downturns. In addition, between 2010 and 2018, we can see some cyclical movements. At some points, we can spot clear breakpoints, for example, in 2019 and mid-2020. </p>



<h3 class="wp-block-heading"><strong><strong><strong>Step #2 Preparing the Data</strong></strong></strong></h3>



<p>Next, we prepare our data for model training. Prophet has a strict requirement on how the input columns must be named. To use Facebook Prophet, your data needs to be a dataframe with two columns, one holding the timestamps and one the values. In addition, the column names need to adhere to the following naming convention:</p>



<ul class="wp-block-list">
<li><strong>ds </strong>for the timestamp</li>



<li><strong>y </strong>for the metric column, which in our case is the closing price</li>
</ul>



<p>So before we proceed, we must rename the columns in our dataframe. In addition, we will remove the index and drop NA values. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">df_x = df[['Close']].copy()
df_x['ds'] = df.index.copy()
df_x.rename(columns={'Close': 'y'}, inplace=True)
df_x.reset_index(inplace=True, drop=True)
df_x.dropna(inplace=True)
df_x.tail(9)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		y			ds
3257	63.139999	2022-12-09
3258	63.970001	2022-12-12
3259	63.990002	2022-12-13</pre></div>



<p>Now we have a simple dataframe with ds and y as the only variables.</p>



<h3 class="wp-block-heading" id="h-step-3-model-fitting-and-forecasting"><strong>Step #3 Model Fitting and Forecasting</strong></h3>



<p>Next, let&#8217;s fit our forecasting model to the time series data. Afterward, we can make predictions about future values in the series. However, before we do this, we need to define our prediction interval. </p>



<h4 class="wp-block-heading">3.1 Setting the Prediction Interval</h4>



<p>The prediction interval is a measure of uncertainty in a forecast made with Facebook Prophet. It indicates the range within which the true value of the forecasted quantity is expected to fall a certain percentage of the time. For example, a 95% prediction interval means that the true value of the forecasted quantity is expected to fall within the given range 95% of the time. </p>



<p>In Facebook Prophet, the prediction interval is controlled by the interval_width parameter, which is set when instantiating the Prophet model. The default value for interval_width is 0.80, which means that the true value of the forecasted quantity is expected to fall within the prediction interval 80% of the time. We can adjust interval_width to widen or narrow the prediction interval as desired. In the example below, we use a prediction interval of 0.95.</p>
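<p>To build intuition for what interval_width means, consider a simplified model in which forecast errors are roughly Gaussian: the half-width of a central prediction interval then grows with the chosen coverage level. The snippet below is an illustrative sketch under that assumption (Prophet itself estimates its intervals via simulation, not this closed form):</p>

```python
from statistics import NormalDist

def interval_halfwidth(interval_width, sigma=1.0):
    """Half-width of a central Gaussian prediction interval
    with coverage `interval_width` and error scale `sigma`."""
    z = NormalDist().inv_cdf(0.5 + interval_width / 2)
    return z * sigma

print(round(interval_halfwidth(0.80), 2))  # 1.28 -> the default band
print(round(interval_halfwidth(0.95), 2))  # 1.96 -> a noticeably wider band
```

<p>Widening the coverage from 80% to 95% thus inflates the band by roughly 50% under these assumptions.</p>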



<h4 class="wp-block-heading">3.2 Fit the Model</h4>



<p>Next, let&#8217;s fit our model and generate a one-year forecast. First, we instantiate our model by calling Prophet(). Then we use model.fit(df) to fit this model to the historical price quotes of the Coca-Cola stock. Once we have done that, we call model.make_future_dataframe() to create an extended dataframe (future_df) that appends one year of additional dates to the historical records. These new rows are empty placeholders ready to be filled with the forecast. Finally, we pass this dataframe to model.predict(), and Facebook Prophet fills the placeholder rows with the forecast values.  </p>



<p>For the sake of reusability, I have encapsulated the entire process into a wrapper function. This will allow us to run quick experiments with different parameter values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># This function fits the prophet model to the input data and generates a forecast
def fit_and_forecast(df, periods, interval_width, changepoint_range=0.8):
    # Instantiate the model with the uncertainty interval and changepoint range
    model = Prophet(interval_width=interval_width, changepoint_range=changepoint_range)
    # Fit the model
    model.fit(df)
    # Create a dataframe with a given number of dates
    future_df = model.make_future_dataframe(periods=periods)
    # Generate a forecast for the given dates
    forecast_df = model.predict(future_df)
    #print(forecast_df.head())
    return forecast_df, model, future_df
# Forecast for 365 days with full data
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 0.95)
print(forecast_df.columns)
forecast_df[['yhat_lower', 'yhat_upper', 'yhat']].head(5)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'weekly', 'weekly_lower', 'weekly_upper', 'yearly', 'yearly_lower',
       'yearly_upper', 'multiplicative_terms', 'multiplicative_terms_lower',
       'multiplicative_terms_upper', 'yhat'],
      dtype='object')
	yhat_lower	yhat_upper	yhat
0	24.468273	28.944286	26.691615
1	24.496074	29.146425	26.706924
2	24.513424	28.829159	26.682213
3	24.358048	28.767209	26.667476
4	24.487963	28.839966	26.666242</pre></div>



<p>Voila, we have generated a one-year forecast. </p>
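<p>As a quick sanity check, you can measure how often the actual closing prices fall inside the predicted interval over the historical period. The helper below is a minimal sketch in plain Python; in our case, you would feed it the y column of df_x together with yhat_lower and yhat_upper from forecast_df, aligned on ds:</p>

```python
def empirical_coverage(y_true, yhat_lower, yhat_upper):
    """Fraction of actual values that fall inside [yhat_lower, yhat_upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, yhat_lower, yhat_upper))
    return hits / len(y_true)

# Toy example: two of the three actuals fall inside their intervals
print(empirical_coverage([26.7, 63.9, 30.0], [24.4, 62.0, 31.0], [28.9, 65.0, 33.0]))
```

<p>If the empirical coverage is far below the configured interval_width, the model is overconfident on this series.</p>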



<h3 class="wp-block-heading">Step #4 Analyzing the Forecast</h3>



<p>Next, let&#8217;s visualize our forecast and discuss what we see. The simplest way is to create the plot with Prophet&#8217;s built-in plot function.</p>



<p>Also: <a href="https://www.relataly.com/regression-error-metrics-python/923/" target="_blank" rel="noreferrer noopener">Measuring Regression Errors with Python</a> </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">model.plot(forecast_df, uncertainty=True)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="10880" data-permalink="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/image-35-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-35.png" data-orig-size="989,590" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-35" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-35.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-35.png" alt="Prophet forecast for the coca-cola stock" class="wp-image-10880" width="867" height="517" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-35.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-35.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-35.png 768w" sizes="(max-width: 867px) 100vw, 867px" /></figure>



<p>So what do we see? The forecast shows that our model does not simply predict a straight line and instead has generated a more sophisticated forecast that displays an upward cyclical trend with higher highs and higher lows. </p>



<ul class="wp-block-list">
<li>The black dots are the data points from the historical data to which we have fit our model. </li>



<li>The dark blue line is the most likely path. </li>



<li>The light blue lines are the upper and lower boundaries of the prediction interval. We have set the prediction interval to 0.95, which means there is a 95% probability that the actual values will fall within this range. </li>



<li>Overall, the model seems confident that the price of Coca-Cola stock will rise within the next year (not financial advice). However, as we will see later, the forecast depends on where the model places its changepoints.</li>
</ul>



<p>In case you want to create a custom plot, you can use the function below. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualize the Forecast
def visualize_the_forecast(df_f, df_o):
    rolling_window = 20
    yhat_mean = df_f['yhat'].rolling(window=rolling_window).mean() 
    # Optionally thin out the ground truth here; we plot the full series
    df_lim = df_o
    # Print the Forecast
    fig, ax = plt.subplots(figsize=[20,7])
    sns.lineplot(data=df_f, x=df_f.ds, y=yhat_mean, ax=ax, label='predicted path', color='blue')
    sns.lineplot(data=df_lim, x=df_lim.ds, y='y', ax=ax, label='ground_truth', color='orange')
    #sns.lineplot(data=df_f, x=df_f.ds, y='yhat_lower', ax=ax, label='yhat_lower', color='skyblue', linewidth=1.0)
    #sns.lineplot(data=df_f, x=df_f.ds, y='yhat_upper', ax=ax, label='yhat_upper', color='coral', linewidth=1.0)
    plt.fill_between(df_f.ds, df_f.yhat_lower, df_f.yhat_upper, color='lightgreen')
    plt.legend(framealpha=0)
    ax.set(ylabel=stockname + &quot; stock price&quot;)
    ax.set(xlabel=None)
visualize_the_forecast(forecast_df, df_x)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="369" data-attachment-id="10879" data-permalink="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/image-11-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-11.png" data-orig-size="1614,582" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-11.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-11-1024x369.png" alt="time series forecast generated with Facebook prophet for the coca cola stock: ground truth and predicted path" class="wp-image-10879" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-11.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-11.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-11.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-11.png 1536w, https://www.relataly.com/wp-content/uploads/2022/12/image-11.png 1614w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h3 class="wp-block-heading"><strong>Step #5 Analyzing Model Components</strong></h3>



<p>We can gain a better understanding of different model components by using the plot_components function. This method creates a plot showing the trend, weekly and yearly seasonality, and any additional user-defined seasonalities of the forecast. This can be useful for understanding the underlying patterns in the data and for diagnosing potential issues with the model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">model.plot_components(forecast_df)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="897" height="890" data-attachment-id="11036" data-permalink="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/image-41-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-41.png" data-orig-size="897,890" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-41" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-41.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-41.png" alt="Illustration of the three components of our prophet model" class="wp-image-11036" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-41.png 897w, https://www.relataly.com/wp-content/uploads/2022/12/image-41.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-41.png 140w, https://www.relataly.com/wp-content/uploads/2022/12/image-41.png 768w" sizes="(max-width: 897px) 100vw, 897px" /></figure>



<p>The first chart shows the trendlines that the model has fitted within different periods. The trendlines are separated by changepoints, which we will discuss in the next section. The second plot shows no price changes during the weekend, which is plausible, considering that the stock markets are closed then. The third chart is the most interesting, as it shows that the model has recognized a yearly seasonality with two peaks in April and August, as well as lows in March and October.</p>



<h3 class="wp-block-heading">Step #6 Adjusting the Changepoints of our Facebook Prophet Model</h3>



<p>Let&#8217;s take a closer look at the changepoints in our model. Changepoints are the points in time where the trend of the time series is expected to change, and Facebook Prophet&#8217;s algorithm automatically detects these points and adapts the model accordingly. Changepoints are important to Facebook Prophet because they allow the model to capture gradual changes or shifts in the data. By identifying and incorporating changepoints into the forecasting model, Facebook Prophet can make more accurate predictions. Changepoints can also help to identify potential outliers in the data.</p>
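<p>Under the hood, Prophet&#8217;s trend is piecewise linear: it has a base slope, and each changepoint contributes a slope adjustment from that point onward, with the offset corrected so the line stays continuous. The following sketch mimics that formula with hypothetical numbers (the names k, m, and delta follow the Prophet paper, not the library API):</p>

```python
def piecewise_trend(t, k, m, changepoints, deltas):
    """Piecewise-linear trend: base slope k, offset m, and a slope
    adjustment delta applied at each changepoint, kept continuous."""
    slope = k + sum(d for cp, d in zip(changepoints, deltas) if t >= cp)
    offset = m - sum(cp * d for cp, d in zip(changepoints, deltas) if t >= cp)
    return slope * t + offset

# Slope 1.0 until t=5, then the trend flattens to slope 0.5
print(piecewise_trend(4, 1.0, 0.0, [5], [-0.5]))  # 4.0
print(piecewise_trend(8, 1.0, 0.0, [5], [-0.5]))  # 6.5
```

<p>This is why changepoint placement matters so much: the slope active at the end of the history is the one Prophet extrapolates into the future.</p>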



<h4 class="wp-block-heading">6.1 Checking Current Changepoints</h4>



<p>We can illustrate the changepoints in our model with the add_changepoints_to_plot method. The method adds vertical lines to a plot to indicate the locations of the changepoints in the data. By plotting the changepoints on a graph, we can visually identify when these changes in trend occur and potentially diagnose any issues with our model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Printing the ChangePoints of our Model
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 1.0)
axislist = add_changepoints_to_plot(model.plot(forecast_df).gca(), model, forecast_df)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="989" height="589" data-attachment-id="11038" data-permalink="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/image-42-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-42.png" data-orig-size="989,589" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-42" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-42.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-42.png" alt="Changepoints in a chart showing a Prophet forecast for the coca-cola stock. Changepoint_range = 0.8" class="wp-image-11038" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-42.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-42.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-42.png 768w" sizes="(max-width: 989px) 100vw, 989px" /></figure>



<p>The chart above shows that our model has identified several changepoints in the historical data. However, it has only searched for changepoints within the first 80% of the time series. As a result, the algorithm hasn&#8217;t identified any changepoints in the most recent years after 2020. We can adjust this behavior with the changepoint_range parameter (default = 0.8). This is what we will do in the next section. </p>



<h4 class="wp-block-heading">6.2 Adjusting Changepoints</h4>



<p>We can adjust the range within which Facebook Prophet looks for changepoints with the &#8220;changepoint_range&#8221; parameter. It is specified as a fraction of the total duration of the time series. For example, if changepoint_range is set to 0.8 and the time series spans 10 years, the algorithm will look for changepoints within the first 8 years of the series.</p>



<p>By default, changepoint_range is set to 0.8, which means that the algorithm will look for changepoints within the first 80% of the time series. We can adjust this value depending on the characteristics of our data and our desired level of flexibility in the model.</p>



<p>Increasing the value of changepoint_range will allow the algorithm to identify more changepoints and potentially improve the fit of the model, but it may also increase the risk of overfitting. Conversely, decreasing the value of changepoint_range will reduce the number of changepoints detected and may improve the model&#8217;s ability to generalize to new data, but it may also reduce the accuracy of the forecast.</p>



<p>Let&#8217;s fit our model again, but this time we let Facebook Prophet search for changepoints within the entire time series (changepoint_range=1.0).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Adjusting ChangePoints of our Model
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 1.0, 1.0)
axislist = add_changepoints_to_plot(model.plot(forecast_df).gca(), model, forecast_df)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="989" height="590" data-attachment-id="11043" data-permalink="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/image-43-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-43.png" data-orig-size="989,590" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-43" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-43.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-43.png" alt="Changepoints in a chart showing a Prophet forecast for the coca-cola stock. Changepoint_range = 1.0" class="wp-image-11043" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-43.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-43.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-43.png 768w" sizes="(max-width: 989px) 100vw, 989px" /></figure>



<p>The plot above shows that Facebook Prophet has now identified several additional breakpoints in the time series. As a result, the forecast has become rather pessimistic, as Facebook Prophet gave more weight to recent changes.</p>



<p>Finally, it is worth mentioning that it is possible to set changepoints for specific dates manually. To do this, pass a list or series of timestamps to the changepoints parameter when instantiating the model, e.g., Prophet(changepoints=[...]). Automatic changepoint detection is then skipped. </p>
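<p>A minimal sketch of how that could look for our series (the dates are hypothetical examples, and the Prophet call is shown commented out since it mirrors the fitting code above):</p>

```python
import pandas as pd

# Dates where we believe the trend shifted, e.g., around the 2020 market crash
manual_changepoints = pd.to_datetime(["2019-01-02", "2020-03-16"])

# With an explicit list, Prophet skips automatic changepoint detection:
# model = Prophet(changepoints=manual_changepoints)
# model.fit(df_x)
print(list(manual_changepoints.strftime("%Y-%m-%d")))
```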



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Get ready to dive into the world of stock market prediction with Facebook Prophet! In this article, we&#8217;ll show you how to leverage the power of this amazing tool to forecast time series data, using Coca-Cola&#8217;s stock as an example. We&#8217;ll guide you through the process of fitting a curve to univariate time series data and fine-tuning the initial breakpoints and trendlines to enhance model performance. With Facebook Prophet&#8217;s automatic trend identification algorithm, you&#8217;ll be able to easily adapt to changes in the data over time.</p>



<p>Also: <a href="https://www.relataly.com/feature-engineering-for-multivariate-time-series-models-with-python/1813/" target="_blank" rel="noreferrer noopener">Mastering Multivariate Stock Market Prediction with Python</a> </p>



<p>As a data scientist, you&#8217;ll appreciate how easy Facebook Prophet is to use and how strong a baseline it provides compared to more complex models. With its straightforward interface and solid accuracy, this tool is a valuable addition to your forecasting toolkit. And we&#8217;re always looking for feedback from our audience, so let us know what you think! We&#8217;re committed to improving our content to provide the best learning experience possible.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://peerj.com/preprints/3190/" target="_blank" rel="noreferrer noopener">Taylor and Letham, 2017, Forecasting at scale</a></li>



<li><a href="https://facebook.github.io/prophet/docs/quick_start.html" target="_blank" rel="noreferrer noopener">github.io/prophet/docs/quick_start.html</a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p>Other Methods for Time Series Forecasting</p>



<ul class="wp-block-list">
<li><a href="https://www.relataly.com/univariate-stock-market-forecasting-using-a-recurrent-neural-network/122/" target="_blank" rel="noreferrer noopener">Univariate time series forecasting with Recurrent Neural Networks</a></li>



<li><a href="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/" target="_blank" rel="noreferrer noopener">Multivariate time series forecasting with Recurrent Neural Networks</a></li>



<li><a href="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/" target="_blank" rel="noreferrer noopener">Forecasting sales data with ARIMA models</a></li>
</ul>
<p>The post <a href="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/">Univariate Stock Market Forecasting using Facebook Prophet in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">10351</post-id>	</item>
		<item>
		<title>Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</title>
		<link>https://www.relataly.com/content-based-movie-recommender-using-python/4294/</link>
					<comments>https://www.relataly.com/content-based-movie-recommender-using-python/4294/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 25 Jul 2022 11:29:00 +0000</pubDate>
				<category><![CDATA[Content-based Filtering]]></category>
		<category><![CDATA[Correlation]]></category>
		<category><![CDATA[Cosine Similarity]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Recommender Systems]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=4294</guid>

					<description><![CDATA[<p>Content-based recommender systems are a popular type of machine learning algorithm that recommends relevant articles based on what a user has previously consumed or liked. This approach aims to identify items with certain keywords, understand what the customer likes, and then identify other items that are similar to items the user has previously consumed or ... <a title="Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python" class="read-more" href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/" aria-label="Read more about Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/">Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Content-based recommender systems are a popular type of machine learning algorithm that recommends relevant articles based on what a user has previously consumed or liked. This approach aims to identify items with certain keywords, understand what the customer likes, and then identify other items that are similar to items the user has previously consumed or rated. The recommendations are based on the similarity of the items, represented by similarity scores in a vector matrix. The attributes used to describe an item are called &#8220;content.&#8221; For example, in the case of movie recommendations, content could be the genre, actors, director, year of release, etc. A well-designed content-based recommendation service will suggest movies of the same genre, actors, or keywords. This tutorial will implement a content-based recommendation service for movies using Python and Scikit-learn. </p>



<p>The rest of this tutorial proceeds as follows: After a brief introduction to content-based recommenders, we will work with a database that contains several thousand <a href="https://www.imdb.com/" target="_blank" rel="noreferrer noopener">IMDB </a>movie titles and create a feature model that uses actors, release year, and a short description for each movie. You will also learn how to deal with some challenges of building a content-based recommender. For example, we will look at how to engineer features from textual content and reduce the dimensionality of our model. Finally, we use our model to generate some sample predictions.</p>



<p>Note: Another popular type of recommender system that I have covered in a previous article is <a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/" target="_blank" rel="noreferrer noopener">collaborative filtering</a>.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="508" height="508" data-attachment-id="12673" data-permalink="https://www.relataly.com/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" data-orig-size="508,508" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="signpost decisions recommender system machine learning python tutorial midjourney (2)-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" alt="Recommendation systems can ease decision-making. Image created with Midjourney." 
class="wp-image-12673" srcset="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Recommendation systems can ease decision-making. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-what-is-content-based-filtering">What is Content-Based Filtering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The idea behind content-based recommenders is to generate recommendations based on a user&#8217;s preferences and tastes. These preferences are derived from past user choices, for example, the number of times a user has watched a movie, purchased an item, or clicked on a link. </p>



<p>Content-based filtering uses domain-specific item features to measure the similarity between items. Given the user preferences, the algorithm will recommend items similar to what the user has consumed or liked before. For movie recommendations, this content can be the genre, actors, release year, director, film length, or keywords used to describe the movies. This approach works particularly well for domains with a lot of textual metadata, such as movies and videos, books, or products.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-kadence-image kb-image_efc597-b6 size-large"><img decoding="async" width="512" height="500" data-attachment-id="12668" data-permalink="https://www.relataly.com/polaroid-film-movie-recommender-machine-learning-python-tutorial-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png" data-orig-size="518,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="polaroid film movie recommender machine learning python tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min-512x500.png" alt="" class="kb-img wp-image-12668" srcset="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 518w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption>Content-based movie recommendations will suggest more of the same, for example, actors, genres, stories, and directors.<br></figcaption></figure>
</div>
</div>



<p></p>



<h3 class="wp-block-heading">Basic Steps to Building a Content-based Recommender System</h3>



<p>The approach to building a content-based recommender involves four essential steps:</p>



<ol class="wp-block-list">
<li>The first step is to create a so-called &#8216;bag of words&#8217; model from the input data, which is a list of words used to characterize the items. This step involves selecting content that is useful for describing and differentiating the items: the more precise the information, the better the recommendations will be. </li>



<li>The next step is to turn the bag of words into a feature vector. Different algorithms can be used for this step, for example, the TF-IDF vectorizer or the count vectorizer. The result is a vector matrix with items as records and features as columns. This step often also includes applying techniques for dimensionality reduction. </li>



<li>Content-based recommendation rests on measuring item similarity. Similarity scores are assigned through pairwise comparison. Here again, we can choose between different measures, e.g., the dot product or cosine similarity. </li>



<li>Once you have the similarity scores, you can return the most similar items by sorting the data by similarity scores. Given user preferences (single or multiple items a user consumed or liked), the algorithm will then recommend the most similar items. </li>
</ol>
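These four steps can be sketched end to end with scikit-learn. The toy catalog and its tags below are made up for illustration; the tutorial later applies the same idea to the real IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: a tiny, made-up catalog with a bag of words per item
items = {
    "Toy Story": "animation comedy family toys friendship",
    "Jumanji":   "adventure fantasy family boardgame jungle",
    "Heat":      "crime thriller heist bank robbery",
}

# Step 2: turn each bag of words into a feature vector
matrix = CountVectorizer().fit_transform(items.values())

# Step 3: pairwise similarity scores between all items
sim = cosine_similarity(matrix)

# Step 4: sort by similarity to recommend items for "Toy Story"
titles = list(items.keys())
ranked = sorted(zip(titles, sim[titles.index("Toy Story")]),
                key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Here &#8220;Jumanji&#8221; ranks above &#8220;Heat&#8221; because it shares the word &#8220;family&#8221; with &#8220;Toy Story&#8221;, while &#8220;Heat&#8221; shares no words at all.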



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8937" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-11-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png" data-orig-size="1911,1227" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png" src="https://www.relataly.com/wp-content/uploads/2022/07/image-11-1024x657.png" alt="" class="wp-image-8937" width="1169" height="751" srcset="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1536w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1911w" sizes="(max-width: 1169px) 100vw, 1169px" /><figcaption class="wp-element-caption">Approach to Building a Content-based Recommender System</figcaption></figure>



<h3 class="wp-block-heading">Similarity Scoring</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The quality of the content-based recommendations is significantly influenced by how well the algorithm succeeds in measuring the similarity of the items. There are different techniques to calculate similarity, including Cosine Similarity, Pearson Similarity, Dot Product, and Euclidean Distance. They have in common that they use numerical characteristics of the text to calculate the distance between text vectors in an n-dimensional vector space.</p>



<p>It is worth noting that these techniques only measure word-level similarity: the algorithms compare items word for word without considering the semantic meaning of the sentences. In some instances, this can lead to errors. For example, how similar are <em>&#8220;now that they were sitting on a bank, he noticed she stole his heart, and he was in love&#8221;</em> and <em>&#8220;They are gangsters who love to steal from a large bank&#8221;</em>? Judging by the words alone, the two sentences appear similar because their vocabulary overlaps considerably.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>
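To see this limitation in numbers, we can score the two &#8220;bank&#8221; sentences from above with a plain word-count model; despite having unrelated meanings, they receive a clearly non-zero similarity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "now that they were sitting on a bank, he noticed she stole his heart, and he was in love",
    "They are gangsters who love to steal from a large bank",
]

# word-level vectors know nothing about semantics
vectors = CountVectorizer().fit_transform(sentences)
score = cosine_similarity(vectors)[0, 1]
print(f"word-level cosine similarity: {score:.2f}")
```

The shared words &#8220;bank&#8221;, &#8220;love&#8221;, and &#8220;they&#8221; alone push the score to roughly 0.2, even though the sentences describe entirely different situations.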



<p></p>



<h3 class="wp-block-heading">Pros and Cons of Content-based Filtering</h3>



<p>Like most machine learning algorithms, content-based recommenders have their strengths and weaknesses.</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p><strong>Advantages</strong></p>



<ul class="wp-block-list">
<li>Content-based filtering is good at capturing a user&#8217;s specific interests and will recommend more of the same (for example, genre, actors, directors, etc.). It will also recommend niche items if they match the user preferences, even if these items draw little attention.</li>



<li>Another advantage is that the model can generate recommendations for a specific user without the knowledge of other users. This is particularly helpful if you want to generate predictions for many users.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p><strong>Disadvantages</strong></p>



<ul class="wp-block-list">
<li>On the other hand, there are also a couple of downsides. The feature representation of the items has to be created manually to a certain extent, and the prediction quality strongly depends on how thoroughly the items are described. Content-based filtering therefore requires a good deal of domain expertise.</li>



<li>In addition, recommendations are based on the user&#8217;s previous interests and are unlikely to go beyond them into areas (e.g., genres) that are still unknown to the user. Content-based models thus tend to develop a kind of tunnel vision, recommending more and more of the same.</li>
</ul>
</div>
</div>



<div style="height:54px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-implementing-a-content-based-movie-recommender-in-python">Implementing a Content-based Movie Recommender in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In the following, we will implement a content-based movie recommender using Python and Scikit-learn, carrying out all the steps necessary to create a content-based recommender. The data comes from an IMDB dataset containing more than 40,000 films released between 1996 and 2018. Based on this data, we define the features we want to use for recommending movies, such as the genre, director, main actors, plot keywords, and other metadata associated with the movies. We then preprocess the data to extract these features and create a feature matrix. The feature matrix becomes the foundation for a similarity matrix that measures the similarity between items based on their feature vectors. Finally, we use the similarity matrix to generate recommendations for a given item. </p>



<p>By the end of this Python tutorial, you will have learned how to implement a content-based recommendation system for movies using Python and Scikit-learn. This knowledge can be applied to other types of recommendations, such as articles, products, or songs.</p>



<p>The code is available in the relataly GitHub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_c5cd86-dc"><a class="kb-button kt-button button kb-btn_a1ac8b-37 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/05%20Recommender%20Systems/032%20Movie%20Recommender%20using%20Content-based%20Filtering.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_f7ca5b-85 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2023/01/DALL·E-2023-01-12-19.26.12-A-plush-robot-watching-a-movie-on-tv-min.png" alt="This Robot doesn't know what to watch on tv. Let's build a recommender system for him! Image generated using DALL-E 2 by OpenAI." class="wp-image-11993"/><figcaption class="wp-element-caption">This Robot doesn&#8217;t know what to watch on tv. Let&#8217;s build a recommender system for him! Image generated using <a href="https://openai.com/dall-e-2/" target="_blank" rel="noreferrer noopener">DALL-E 2 by OpenAI</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before you start with the coding part, ensure you have set up your Python 3 environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. Follow <a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a> to set it up.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn </a>for visualization and the natural language processing library <a href="https://www.nltk.org/" target="_blank" rel="noreferrer noopener">nltk</a>. </p>



<p>You can install these packages by using one of the following commands: </p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-about-the-imdb-movies-dataset">About the IMDB Movies Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>We will train our movie recommender on the popular Movies Dataset (you can download it from <a href="https://grouplens.org/datasets/movielens/latest/" target="_blank" rel="noreferrer noopener">grouplens.org</a>). The MovieLens recommendation service collected the data from 610 users between 1996 and 2018. Unpack the data into the working folder of your project.</p>



<p>The full dataset contains metadata for over 45,000 movies and 26 million ratings from over 270,000 users. It includes the following files (source of the data description: <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset" target="_blank" rel="noreferrer noopener">Kaggle.com</a>):</p>



<ul class="wp-block-list">
<li><strong>movies_metadata.csv:</strong>&nbsp;The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens Dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.</li>



<li><strong>ratings_small.csv:</strong>&nbsp;A subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a single movie rating on a 5-star scale with half-star increments (0.5 &#8211; 5.0 stars).</li>



<li><strong>keywords.csv:</strong>&nbsp;Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.</li>



<li><strong>credits.csv:</strong>&nbsp;Consists of Cast and Crew Information for all our films. Available in the form of a stringified JSON Object.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7128" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/mdb-movie-database/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" data-orig-size="1024,537" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="recomender systems collaborative filtering imdb movies" data-image-description="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-image-caption="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" src="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" alt="MDB Movie Database Recommender Systems Collaborative Filtering " class="wp-image-7128" width="366" height="192" srcset="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 768w" sizes="(max-width: 366px) 100vw, 366px" /><figcaption class="wp-element-caption">IMDB Movie Database </figcaption></figure>
</div>
</div>



<p>Several other files are included that we won&#8217;t use, including ratings_small, links_small, and links.</p>



<p>You can download it <a href="https://grouplens.org/datasets/movielens/latest/" target="_blank" rel="noreferrer noopener">here</a> or from <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset" target="_blank" rel="noreferrer noopener">Kaggle</a>.</p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p>Our goal is to create a content-based recommender system for movie recommendations. In this case, the content is meta information on the movies, such as the genre, actors, and description.</p>



<p>We begin by making imports and loading the data from three files:</p>



<ul class="wp-block-list">
<li>movies_metadata.csv</li>



<li>credits.csv</li>



<li>keywords.csv</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords

# the IMDB movies data is available on Kaggle.com
# https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

# in case you have placed the files outside of your working directory, you need to specify the path
path = 'data/movie_recommendations/' 

# load the movie metadata
df_meta=pd.read_csv(path + 'movies_metadata.csv', low_memory=False, encoding='UTF-8') 

# some records have invalid ids, which is why we remove them
df_meta = df_meta.drop([19730, 29503, 35587])

# convert the id to type int and set id as index
df_meta = df_meta.set_index(df_meta['id'].str.strip().str.replace(',', '', regex=False).astype(int))
pd.set_option('display.max_colwidth', 20)
df_meta.head(2)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		adult	belongs_to_collection			budget		genres				homepage			id		imdb_id		original_language	original_title	overview			...	release_date	revenue		runtime	spoken_languages	status		tagline	title	video	vote_average	vote_count
id																					
862		False	{'id': 10194, 'n...				30000000	[{'id': 16, 'nam...	http://toystory....	862		tt0114709	en					Toy Story		Led by Woody, An...	...	1995-10-30		373554033.0	81.0	[{'iso_639_1': '...	Released	NaN	Toy Story	False	7.7	5415.0
8844	False	NaN								65000000	[{'id': 12, 'nam...	NaN					8844	tt0113497	en				Jumanji				When siblings Ju...	...	1995-12-15		262797249.0	104.0	[{'iso_639_1': '...	Released	Roll the dice an...	Jumanji	False	6.9	2413.0</pre></div>



<p>After we have loaded credits and keywords, we will combine the data into a single dataframe. Now we have various input fields available. However, we will only use keywords, cast, year of release, genres, and overview. If you like, you can enhance the data with additional inputs, for example, budget, running time, or film language. </p>



<p>Once we have gathered our data in a single dataframe, we print out the first rows to gain an overview of the data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># load the movie credits
df_credits = pd.read_csv(path + 'credits.csv', encoding='UTF-8')
df_credits = df_credits.set_index('id')

# load the movie keywords
df_keywords=pd.read_csv(path + 'keywords.csv', low_memory=False, encoding='UTF-8') 
df_keywords = df_keywords.set_index('id')

# merge everything into a single dataframe 
df_k_c = df_keywords.merge(df_credits, left_index=True, right_on='id')
df = df_k_c.merge(df_meta[['release_date','genres','overview','title']], left_index=True, right_on='id')
df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		keywords			cast				crew				release_date	genres				overview			title
id							
862		[{'id': 931, 'na...	[{'cast_id': 14,...	[{'credit_id': '...	1995-10-30		[{'id': 16, 'nam...	Led by Woody, An...	Toy Story
8844	[{'id': 10090, '...	[{'cast_id': 1, ...	[{'credit_id': '...	1995-12-15		[{'id': 12, 'nam...	When siblings Ju...	Jumanji
15602	[{'id': 1495, 'n...	[{'cast_id': 2, ...	[{'credit_id': '...	1995-12-22		[{'id': 10749, '...	A family wedding...	Grumpier Old Men</pre></div>



<div style="height:15px" aria-hidden="true" class="wp-block-spacer"></div>



<p>We can see cast, crew, and genres have a dictionary-like structure. To create a cosine similarity matrix, we need to extract the keywords from these columns and gather them in a single column. This is what we will do in the next step. </p>
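As a preview of that extraction: each cell holds a stringified list of dictionaries, which can be parsed safely with Python's ast.literal_eval. The sample string below is a made-up value that mimics the structure of the genres column.

```python
import ast

# a made-up cell value with the same structure as the 'genres' column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval parses Python literals without executing arbitrary code
genres = [entry['name'] for entry in ast.literal_eval(raw)]
print(genres)  # ['Animation', 'Comedy']
```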



<h3 class="wp-block-heading" id="h-step-2-feature-engineering-and-data-cleaning">Step #2: Feature Engineering and Data Cleaning</h3>



<p>A problem with modeling text is that machine learning algorithms have difficulty processing text directly. An essential step in creating content-based recommenders is bringing the text into a machine-readable form. This is what we call feature engineering. </p>



<h4 class="wp-block-heading">2.1 Creating a Bag-of-Words Model</h4>



<p>We begin the feature engineering by creating the bag of words. As mentioned, a bag of words is a list of the words that are relevant for describing and differentiating the items in a dataset, such as films. Creating a bag of words removes stopwords but preserves multiplicity, so that words can occur multiple times in the concatenated text. Later, each word can be used as a feature in calculating cosine similarities. </p>
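Stopword removal with preserved multiplicity can be sketched in a few lines. This example uses scikit-learn's built-in English stop-word list instead of nltk's (which requires a one-time nltk.download('stopwords')); the one-line description is invented.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# an invented one-line movie description
text = "led by woody the toys live happily until the new toy arrives"

# drop stopwords but keep every remaining occurrence (multiplicity)
bag = [word for word in text.split() if word not in ENGLISH_STOP_WORDS]
print(bag)
```

Note that "toy" and "toys" survive as separate features; stemming or lemmatization would be needed to merge them.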



<p>The input for a bag of words does not necessarily come from a single column. We will use keywords, genres, cast, and overview and merge them into a new single column that we call tags. Make sure the features capture the nature of each text field: we keep first and last names together rather than splitting them, unlike the words from the overview column, which are treated individually. The result of this process is our bag of words.</p>



<p>In addition, we add the movie title and a new index (id), which will later ease working with the similarity matrix. Finally, we print the first rows of our feature dataframe. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create an empty DataFrame
df_movies = pd.DataFrame()

# extract the keywords
df_movies['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['keywords'] = df_movies['keywords'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# extract the overview
df_movies['overview'] = df['overview'].fillna('')

# extract the release year 
df_movies['release_date'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

# extract the actors
df_movies['cast'] = df['cast'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['cast'] = df_movies['cast'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# extract genres
df_movies['genres'] = df['genres'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['genres'] = df_movies['genres'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# add the title
df_movies['title'] = df['title']

# merge fields into a tag field
df_movies['tags'] = df_movies['keywords'] + ' ' + df_movies['cast'] + ' ' + df_movies['genres'] + ' ' + df_movies['release_date']

# drop records with empty tags and duplicates
df_movies.drop(df_movies[df_movies['tags']==''].index, inplace=True)
df_movies.drop_duplicates(inplace=True)

# add a fresh index to the dataframe, which we will later use when referring to items in a vector matrix
df_movies['new_id'] = range(0, len(df_movies))

# Reduce the data to relevant columns
df_movies = df_movies[['new_id', 'title', 'tags']]

# display the data
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', False)
print(df_movies.shape)
df_movies.head(5)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		new_id	title							tags
id			
862		0		Toy Story						jealousy toy boy friendship friends rivalry boynextdoor newtoy toycomestolife TomHanks TimAllen DonRickles JimVarney WallaceShawn JohnRatzenberger AnniePotts JohnMorris ErikvonDetten LaurieMetcalf R.LeeErmey SarahFreeman PennJillette Animation Comedy Family 1995
8844	1		Jumanji							boardgame disappearance basedonchildren'sbook newhome recluse giantinsect RobinWilliams JonathanHyde KirstenDunst BradleyPierce BonnieHunt BebeNeuwirth DavidAlanGrier PatriciaClarkson AdamHann-Byrd LauraBellBundy JamesHandy GillianBarber BrandonObray CyrusThiedeke GaryJosephThorup LeonardZola LloydBerry MalcolmStewart AnnabelKershaw DarrylHenriques RobynDriscoll PeterBryant SarahGilson FloricaVlad JuneLion BrendaLockmuller Adventure Fantasy Family 1995
15602	2		Grumpier Old Men				fishing bestfriend duringcreditsstinger oldmen WalterMatthau JackLemmon Ann-Margret SophiaLoren DarylHannah BurgessMeredith KevinPollak Romance Comedy 1995
31357	3		Waiting to Exhale				basedonnovel interracialrelationship singlemother divorce chickflick WhitneyHouston AngelaBassett LorettaDevine LelaRochon GregoryHines DennisHaysbert MichaelBeach MykeltiWilliamson LamontJohnson WesleySnipes Comedy Drama Romance 1995
11862	4		Father of the Bride Part II		baby midlifecrisis confidence aging daughter motherdaughterrelationship pregnancy contraception gynecologist SteveMartin DianeKeaton MartinShort KimberlyWilliams-Paisley GeorgeNewbern KieranCulkin BDWong PeterMichaelGoetz KateMcGregor-Stewart JaneAdams EugeneLevy LoriAlan Comedy 1995</pre></div>



<h4 class="wp-block-heading" id="h-2-2-visualizing-text-length">2.2 Visualizing Text Length</h4>



<p>We can use a histogram to illustrate the length of each movie&#8217;s tag text. This gives us an idea of how detailed the movie descriptions are. Items with short descriptions have, in principle, a lower chance of being recommended later. Recommenders produce better results when the description lengths are reasonably balanced.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># add the tag length to the movies df
df_movies['tag_len'] = df_movies['tags'].apply(lambda x: len(x))

# illustrate the tag text length
sns.displot(data=df_movies.dropna(), bins=list(range(0, 2000, 25)), height=5, x='tag_len', aspect=3, kde=True)
plt.title('Distribution of tag text length')
plt.xlim([0, 2500])</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="346" data-attachment-id="8866" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/tags-text-length-visualization-python-content-based-recommender-system-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png" data-orig-size="1083,366" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="tags-text-length-visualization-python-content-based-recommender-system-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png" src="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2-1024x346.png" alt="content-based recommender system - illustration of the distribution of word length in our bag-of-word model" class="wp-image-8866" srcset="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 768w, 
https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 1083w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="h-step-3-vectorization-using-tfidfvectorizer">Step #3: Vectorization using TfidfVectorizer</h3>



<p>The next step is to create a vector matrix from the Bag of Words model. Each column from the matrix represents a word feature. This step is the basis for determining the similarity of the movies afterward. Before the vectorization, we will remove stop words from the text (e.g., and, it, that, or, why, where, etc.). In addition, I limited the number of features in the matrix to 5000 to reduce training time.</p>



<p>A simple vectorization approach is to determine the word frequency for each movie using a count vectorizer. However, a frequently mentioned disadvantage of this approach is that it does not consider how common a word is across the entire corpus. Some words may appear in almost all items, while others are prevalent in a few items but rare in general. We can therefore argue that observing a rare word in an item is more informative than observing a common one. Instead of a count vectorizer, we will use a more practical approach called TfidfVectorizer from the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html" target="_blank" rel="noreferrer noopener">scikit-learn package</a>. </p>



<p>Tf-idf stands for term frequency-inverse document frequency. Compared to a count vectorizer, the tf-idf vectorizer takes the overall word frequencies into account and weights the general importance of each word when spanning the vectors. This way, tf-idf can determine which words are more important than others, reducing the model&#8217;s complexity and improving performance. This <a href="https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a" target="_blank" rel="noreferrer noopener">medium article</a> explains the math behind tf-idf vectorization in more detail.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set a custom stop list from nltk
stop = list(stopwords.words('english'))

# create the tf-idf vectorizer; alternatively, you could also use CountVectorizer
tfidf =  TfidfVectorizer(max_features=5000, analyzer = 'word', stop_words=set(stop))
vectorized_data = tfidf.fit_transform(df_movies['tags'])
count_matrix = pd.DataFrame(vectorized_data.toarray(), index=df_movies['tags'].index.tolist())
print(count_matrix)

</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">			0     1     2     3     4     5     6     7     8     9     ...  4990  4991  4992  4993  4994  4995  4996  4997  4998  4999
862      	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
8844     	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
15602    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
31357    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
11862    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
...      	...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
439050   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
111109   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
67758    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
227506   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
461257   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

[45432 rows x 5000 columns]</pre></div>



<p>The vectorization process results in a feature matrix in which each feature corresponds to a word from the bag of words. </p>



<p>We can display the features with the get_feature_names_out function of the tf-idf vectorizer.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># print feature names
print(tfidf.get_feature_names_out()[940:990])</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">['climbing' 'clinteastwood' 'clinthoward' 'clive' 'cliveowen'
 'cliverevill' 'cliverussell' 'clone' 'clorisleachman' 'cloviscornillac'
 'clown' 'clugulager' 'clydekusatsu' 'co' 'coach' 'cobb' 'cocaine' 'code'
 'coffin' 'cohen' 'coldwar' 'cole' 'colehauser' 'coleman' 'colinfarrell'
 'colinfirth' 'colinhanks' 'colinkenny' 'colinsalmon' 'colleencamp'
 'college' 'colmfeore' 'colmmeaney' 'coma' 'combat' 'comedian' 'comedy'
 'comicbook' 'comingofage' 'comingout' 'common' 'communism' 'communist'
 'company' 'competition' 'composer' 'computer' 'con' 'concentrationcamp'
 'concert']</pre></div>



<p>As you can see, the features are individual words: a mix of keywords, concatenated actor names, and genres.</p>



<h3 class="wp-block-heading" id="h-step-4-dimensionality-reduction-and-calculate-consine-similarities">Step #4: Dimensionality Reduction and Cosine Similarity Calculation</h3>



<p>In the previous section, we created a vector matrix that contains movies and features. This matrix is the foundation for calculating similarity scores for all movies. Before we calculate these scores, we will apply dimensionality reduction.</p>



<h4 class="wp-block-heading">4.1 Dimensionality Reduction using SVD</h4>



<p>The matrix spans a high-dimensional vector space with 5000 feature columns. Do we need all of these features? Most likely not. Many words in the matrix probably occur only once or twice, while others occur in almost all movies. How can we deal with this issue?</p>



<p>By reducing this high-dimensional space to fewer, more essential features, we can save time training our recommender model. We will use TruncatedSVD from the scikit-learn package, a popular algorithm for dimensionality reduction. The algorithm approximates the matrix in a lower-dimensional space, thereby reducing noise and model complexity.</p>



<p>This way, we will reduce the vector space from 5000 to 3000 features. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># reduce dimensionality for improved performance
svd = TruncatedSVD(n_components=3000)
reduced_data = svd.fit_transform(count_matrix)</pre></div>
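<p>After fitting, it can be worth checking how much of the original variance the retained components preserve via the fitted explained_variance_ratio_ attribute of TruncatedSVD. Here is a small sketch with random stand-in data (the array shapes are illustrative, not the tutorial&#8217;s actual matrix):</p>

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# random stand-in for the tf-idf matrix: 100 samples, 20 features
rng = np.random.default_rng(0)
X = rng.random((100, 20))

# reduce to 10 components and check how much variance survives
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)                      # (100, 10)
print(svd.explained_variance_ratio_.sum())  # fraction of variance retained
```

<p>If the retained fraction is low, you may want to keep more components; if it is close to 1, you can often reduce further.</p>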



<h4 class="wp-block-heading">4.2 Calculate Text Similarity Scores for all Movies</h4>



<p>Now that we have reduced the complexity of our vector matrix, we can calculate the similarity scores for all movies. In this process, we assign a similarity score to all item pairs that measure content closeness according to the position of the items in the vector space. </p>



<p>We use the cosine function to calculate the similarity of the movies. Cosine similarity measures the angle between two vectors; in our case, the vectors represent the movie descriptions. The cosine similarity function uses these feature vectors to compare each movie to every other movie and assigns each pair a similarity value. </p>



<ul class="wp-block-list">
<li>A similarity value of -1 means that the two feature vectors point in opposite directions, and the movies are entirely different. </li>



<li>A value of 1 means that the two movies are identical. </li>



<li>A value of 0 lies in between and means that the feature vectors are orthogonal, i.e., the movies have nothing in common.</li>
</ul>
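<p>These three cases are easy to verify with a toy example (the vectors below are made up for illustration):</p>

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 1.0]])
b = np.array([[2.0, 2.0]])    # same direction as a
c = np.array([[1.0, -1.0]])   # orthogonal to a
d = np.array([[-1.0, -1.0]])  # opposite direction to a

print(cosine_similarity(a, b))  # [[1.]]  -> identical
print(cosine_similarity(a, c))  # [[0.]]  -> nothing in common
print(cosine_similarity(a, d))  # [[-1.]] -> entirely different
```

<p>Note that cosine similarity only depends on the direction of the vectors, not their length, which is why a and b count as identical.</p>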



<p>The cosine similarity function will calculate pairwise similarities for all movies in our vector matrix. We can approximate the number of pairwise comparisons with the formula k²/2, where k is the number of items in the vector matrix. In our case, k is about 45,000 movies, which means the cosine similarity function must calculate roughly one billion similarity scores. So don&#8217;t worry if the process takes some time to complete. </p>
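<p>Before running the computation, a quick back-of-the-envelope check (assuming roughly 45,000 movies, as in the tutorial dataset) shows why this step is expensive in both time and memory:</p>

```python
# rough cost estimate for the pairwise similarity step
k = 45_000  # approximate number of movies

# about k^2 / 2 unique pairwise comparisons
n_pairs = k ** 2 // 2
print(f'{n_pairs:,} comparisons')  # about one billion

# the dense k x k float64 similarity matrix needs k^2 * 8 bytes
mem_gb = k ** 2 * 8 / 1e9
print(f'~{mem_gb:.1f} GB of memory')
```

<p>If memory becomes a problem on your machine, computing similarities in chunks against a subset of query movies is a common workaround.</p>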



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># compute the cosine similarity matrix
similarity = cosine_similarity(reduced_data)
similarity</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">array([[ 1.00000000e+00,  9.75542082e-02,  6.00755620e-02, ...,
        -3.03965235e-04,  0.00000000e+00,  5.81243547e-05],
       [ 9.75542082e-02,  1.00000000e+00,  5.92929339e-02, ...,
        -2.97565163e-03,  0.00000000e+00,  4.57945869e-05],
       [ 6.00755620e-02,  5.92929339e-02,  1.00000000e+00, ...,
         9.40459504e-03,  0.00000000e+00, -2.22415551e-04],
       ...,
       [-3.03965235e-04, -2.97565163e-03,  9.40459504e-03, ...,
         1.00000000e+00,  0.00000000e+00, -2.60823346e-04],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 5.81243547e-05,  4.57945869e-05, -2.22415551e-04, ...,
        -2.60823346e-04,  0.00000000e+00,  1.00000000e+00]])</pre></div>



<div style="height:38px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading" id="h-step-5-generate-content-based-movie-recommendations">Step #5: Generate Content-based Movie Recommendations </h3>



<p>Once you have created the similarity matrix, it&#8217;s time to generate some recommendations. We begin by generating recommendations based on a single movie. In the cosine similarity matrix, the most similar movies have the highest similarity scores. Once we have the films with the highest scores, we can visualize the results in a bar chart that shows their cosine similarity scores. </p>



<p>The example below displays the results of the movie &#8220;The Matrix.&#8221; Oh, how I love this movie 🙂</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create a function that takes in movie title as input and returns a list of the most similar movies
def get_recommendations(title, n, cosine_sim=similarity):
    
    # get the index of the movie that matches the title
    movie_index = df_movies[df_movies.title==title].new_id.values[0]
    print(movie_index, title)
    
    # get the pairwise similarity scores of all movies with that movie and sort the movies based on the similarity scores
    sim_scores_all = sorted(list(enumerate(cosine_sim[movie_index])), key=lambda x: x[1], reverse=True)
    
    # checks if recommendations are limited
    if n &gt; 0:
        sim_scores_all = sim_scores_all[1:n+1]
        
    # get the movie indices of the top similar movies
    movie_indices = [i[0] for i in sim_scores_all]
    scores = [i[1] for i in sim_scores_all]
    
    # return the top n most similar movies from the movies df
    top_titles_df = pd.DataFrame(df_movies.iloc[movie_indices]['title'])
    top_titles_df['sim_scores'] = scores
    top_titles_df['ranking'] = range(1, len(top_titles_df) + 1)
    
    return top_titles_df, sim_scores_all

# generate a list of recommendations for a specific movie title
movie_name = 'The Matrix'
number_of_recommendations = 15
top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
 
# visualize the results
def show_results(movie_name, top_titles_df):
    fig, ax = plt.subplots(figsize=(11, 5))
    sns.barplot(data=top_titles_df, y='title', x= 'sim_scores', color='blue')
    plt.xlim((0,1))
    plt.title(f'Top {len(top_titles_df)} recommendations for {movie_name}')
    pct_values = ['{:.2}'.format(elm) for elm in list(top_titles_df['sim_scores'])]
    ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)

show_results(movie_name, top_titles_df)</pre></div>



<p>Here is an example for the movies &#8220;Spectre&#8221; and &#8220;The Lion King&#8221;:</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="830" height="329" data-attachment-id="9748" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-2-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" data-orig-size="830,329" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" src="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" alt="" class="wp-image-9748" srcset="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 830w, https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 768w" sizes="(max-width: 830px) 100vw, 830px" /></figure>
</div>
</div>



<h3 class="wp-block-heading">Step #6: Generate Recommendations Based on a User&#8217;s Watch History</h3>



<p>But what if you want to generate recommendations for a user who has already seen several movies? For this, we can aggregate the similarity scores of all films the user has seen. This way, we create a new dataframe that sums up the similarity scores. To return the top-recommended movies, we sort this dataframe by the aggregated scores and return the top elements. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># list of movies a user has seen
movie_list = ['The Lion King', 'Seven', 'RoboCop 3', 'Blade Runner', 'Quantum of Solace', 'Casino Royale', 'Skyfall']

# create a copy of the movie dataframe and add a column in which we aggregated the scores
user_scores = pd.DataFrame(df_movies['title'])
user_scores['sim_scores'] = 0.0

# top number of scores to be considered for each movie
number_of_recommendations = 10000
for movie_name in movie_list:
    top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
    # aggregate the scores
    user_scores = pd.concat([user_scores, top_titles_df[['title', 'sim_scores']]]).groupby(['title'], as_index=False).sum({'sim_scores'})
# sort and print the aggregated scores
user_scores.sort_values(by='sim_scores', ascending=False)[1:20]</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8888" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-8-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png" data-orig-size="1375,583" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png" src="https://www.relataly.com/wp-content/uploads/2022/07/image-8-1024x434.png" alt="Content-based movie recommdations for a user that has previously watched The Lion King, Seven, RoboCop 3, Blade Runner, Quantum of Solace, Casino Royale, Skyfall" class="wp-image-8888" width="805" height="341" srcset="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 1375w" sizes="(max-width: 805px) 100vw, 805px" /></figure>



<div style="height:65px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this tutorial, you have learned to implement a simple content-based recommender system for movie recommendations in Python.  We have used several movie-specific details to calculate a similarity matrix for all movies in our dataset. Finally, we have used this model to generate recommendations for two cases:</p>



<ul class="wp-block-list">
<li>Films that are similar to a specific movie</li>



<li>Films that are recommended based on the watchlist of a particular user.</li>
</ul>



<p>A downside of content-based recommenders is that you cannot test their performance unless you know how users perceived the recommendations. This is because content-based recommenders can only determine which items in a dataset are similar. To understand how well the suggestions work, you must include additional data about actual user preferences. </p>



<p>More advanced recommenders will combine content-based recommendations with user-item interactions (<a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/" target="_blank" rel="noreferrer noopener">e.g., collaborative filtering</a>). Such models are called hybrid recommenders, but this is something for another article.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="472" data-attachment-id="12683" data-permalink="https://www.relataly.com/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png" data-orig-size="532,490" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="dog with popcorn machine learning movie recommender python tutorial relataly midjourney ai-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min-512x472.png" alt="" class="wp-image-12683" srcset="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 532w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Image created with <a 
href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>






<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p>Below are some resources for further reading on recommender systems and content-based models.</p>



<p><strong>Books</strong></p>



<ol class="wp-block-list"><li><a href="https://amzn.to/3T3Sl2V" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2016) Recommender Systems: The Textbook</a></li><li><a href="https://amzn.to/3D0P8eX" target="_blank" rel="noreferrer noopener">Kin Falk (2019) Practical Recommender Systems</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p><strong>Articles</strong></p>



<ol class="wp-block-list"><li><a href="https://surprise.readthedocs.io/en/stable/getting_started.html" target="_blank" rel="noreferrer noopener">Getting started with Suprise</a></li><li><a href="https://onespire.hu/sap-news-en/history-of-recommender-systems/" target="_blank" rel="noreferrer noopener">About the history of Recommender Systems</a></li><li><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener"></a><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener">Singular value decomposition vs. matrix factorization</a></li><li><a href="https://proceedings.neurips.cc/paper/2007/file/d7322ed717dedf1eb4e6e52a37ea7bcd-Paper.pdf" target="_blank" rel="noreferrer noopener">Probabilistic Matrix Factorization</a></li></ol>
<p>The post <a href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/">Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/content-based-movie-recommender-using-python/4294/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4294</post-id>	</item>
		<item>
		<title>Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</title>
		<link>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/</link>
					<comments>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Thu, 07 Apr 2022 17:55:36 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Cross-Validation]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Hyperparameter Tuning]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Random Forest Regression]]></category>
		<category><![CDATA[Random Search]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Tuning Random Decision Forests]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=6875</guid>

					<description><![CDATA[<p>Perfecting your machine learning model&#8217;s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble ... <a title="Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python" class="read-more" href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" aria-label="Read more about Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Perfecting your machine learning model&#8217;s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble model, and heavily influence its performance. Unlike model parameters, which are discovered during training by the machine learning algorithm, hyperparameters require pre-specification.</p>



<p>In this comprehensive Python tutorial, we&#8217;ll guide you on how to harness the power of Random Search to optimize a regression model&#8217;s hyperparameters. Our illustrative example utilizes a Random Decision Forest for predicting house prices. However, the fundamental principles you&#8217;ll learn can be seamlessly applied to any model. So why painstakingly fine-tune hyperparameters manually when Random Search can handle the task efficiently?</p>



<p>Here&#8217;s a preview of what this Python tutorial entails:</p>



<ol class="wp-block-list">
<li>A brief overview of how Random Search operates and instances where it might be preferable to Grid Search.</li>



<li>A hands-on Python tutorial featuring a public house price dataset from Kaggle.com. The aim here is to train a regression model capable of predicting US house prices based on various properties.</li>



<li>Training a &#8216;best-guess&#8217; model in Python, followed by using Random Search to discover a model with enhanced performance.</li>



<li>Finally, we&#8217;ll implement cross-validation to validate our models&#8217; performance.</li>
</ol>



<p>By the end of this tutorial, you&#8217;ll be well-equipped to let Random Search efficiently fine-tune your model&#8217;s hyperparameters, freeing up your time for other crucial tasks.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-hyperparameter-tuning">Hyperparameter Tuning </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, hyperparameters must be specified before training. It is usually impossible to find the best configuration without testing several different ones. </p>



<p>Searching for a suitable model configuration is called &#8220;hyperparameter tuning&#8221; or &#8220;hyperparameter optimization.&#8221; Machine learning algorithms differ in their hyperparameters and possible parameter values. For example, a random decision forest allows us to configure parameters such as the number of trees, the maximum tree depth, and the minimum number of samples required to split a node. </p>



<p>The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it becomes to find an optimal model. We can use random search to automate this process.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="506" height="501" data-attachment-id="12416" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/random-hyperparameter-tuning-machine-learning/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" data-orig-size="506,501" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="random-hyperparameter-tuning-machine-learning" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" src="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" alt="Random search can be an efficient way to tune the hyperparameters of a machine learning model. Image generated with Midjourney. Image of exploding dices with different colors. " class="wp-image-12416" srcset="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 506w, https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 140w" sizes="(max-width: 506px) 100vw, 506px" /><figcaption class="wp-element-caption">Random search can be an efficient way to tune the hyperparameters of a machine learning model. 
Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>
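<p>To make this concrete, here is a minimal sketch that sets the random decision forest hyperparameters mentioned above before training. The parameter values are arbitrary examples for illustration, not tuned settings:</p>

```python
# Illustrative sketch only: a random decision forest regressor with the
# hyperparameters named above set explicitly before training.
# The values are arbitrary examples, not tuned settings.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=200,     # number of trees in the forest
    max_depth=10,         # maximum depth of each tree
    min_samples_split=4,  # minimum samples required to split a node
    random_state=42,
)
print(model.get_params()["n_estimators"])  # → 200
```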



<h2 class="wp-block-heading">Techniques for Tuning Hyperparameters</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning algorithm to optimize its performance on a specific dataset or task. Several techniques can be used for hyperparameter tuning, including:</p>



<ol class="wp-block-list">
<li><strong><a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">Grid Search</a>:</strong> Grid search is a brute-force search algorithm that systematically evaluates a given set of hyperparameter values by training and evaluating a model for each combination of values. It is a simple and effective technique, but it can be computationally expensive, especially for large or complex datasets.</li>



<li><strong>Random Search: </strong>As mentioned, random search is an alternative to grid search that randomly samples a given set of hyperparameter values rather than evaluating all possible combinations. It can be more efficient than grid search, but it may not find the optimal set of hyperparameters.</li>



<li><strong>Bayesian Optimization:</strong> Bayesian optimization is a probabilistic approach to hyperparameter tuning that uses Bayesian inference to model which hyperparameter values are likely to produce good performance. It can be more efficient and effective than grid search or random search, but it is more challenging to implement and interpret.</li>



<li><strong>Genetic Algorithms:</strong> Genetic algorithms are optimization algorithms inspired by the principles of natural selection and genetics. They evolve a population of candidate solutions, which are iteratively selected based on their fitness or performance, to find the optimal set of hyperparameters.</li>
</ol>



<p>In this article, we specifically look at the Random Search technique.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="892" height="512" data-attachment-id="12417" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/image-of-an-old-mechanic-tuning-a-car-hyperparameter-tuning-machine-learning-python-tutorial-scikit-learn/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" data-orig-size="892,512" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" src="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" alt="You can spend much time tuning a machine learning model. Image generated with Midjourney. portrait of an old mechanic working on a car. 
relataly.com" class="wp-image-12417" srcset="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 892w, https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 768w" sizes="(max-width: 892px) 100vw, 892px" /><figcaption class="wp-element-caption">You can spend much time tuning a machine learning model. Image generated with Midjourney.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Random Search?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The random search algorithm builds models from hyperparameter combinations randomly sampled from a grid of parameter values. The idea behind the randomized approach is that testing random configurations is an efficient way to identify a good model. We can use random search for both regression and classification models.</p>



<p>Random Search and Grid Search are the most popular techniques for hyperparameter tuning, and the two methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible combinations. The technique works well for testing a small number of configurations already known to work well. </p>



<p>As long as both the search space and the training time are small, grid search is excellent for finding the best model. However, the number of model variants grows exponentially with the size of the search space. For large search spaces or complex models, random search is often more efficient.</p>



<p>Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. However, it is much faster than grid search and usually delivers a suitable model in a short time.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-full is-resized"><img decoding="async" data-attachment-id="6924" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/image-16-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" data-orig-size="640,469" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-16" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" alt="random decision forest python,
hyperparameter tuning,
comparison between random search and grid search" class="wp-image-6924" width="409" height="301" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png 640w, https://www.relataly.com/wp-content/uploads/2022/04/image-16.png 300w" sizes="(max-width: 409px) 100vw, 409px" /><figcaption class="wp-element-caption">Random Search vs. Exhaustive Grid Search</figcaption></figure>
</div></div>
</div>
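<p>As an illustration of the behavior described above, here is a minimal, self-contained sketch using Scikit-learn&#8217;s <em>RandomizedSearchCV</em>. The synthetic dataset and parameter ranges are arbitrary placeholders, not the tutorial&#8217;s data:</p>

```python
# Sketch of random search with Scikit-learn (synthetic placeholder data).
# The grid below spans 3 * 4 * 3 = 36 combinations; random search samples
# only n_iter of them, whereas grid search would evaluate all 36.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 16, None],
    "min_samples_split": [2, 4, 8],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,  # evaluate only 10 random combinations
    cv=3,       # 3-fold cross-validation per combination
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Increasing <em>n_iter</em> trades runtime for a more thorough exploration of the search space.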



<div style="height:48px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-tuning-the-hyperparameters-of-a-random-decision-forest-regressor-in-python-using-random-search">Tuning the Hyperparameters of a Random Decision Forest Regressor in Python using Random Search</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this tutorial, we delve into the use of the Random Search algorithm in Python, specifically for predicting house prices. We&#8217;ll be using a dataset rich in diverse house characteristics. Various elements, such as data quality and quantity, model intricacy, the selection of machine learning algorithms, and housing market stability, significantly influence the accuracy of house price predictions.</p>



<p>Our initial model employs a Random Decision Forest algorithm, which we&#8217;ll optimize using a random search approach for hyperparameter tuning. By identifying and implementing a more advantageous configuration, we aim to improve our model&#8217;s performance.</p>



<p>Here&#8217;s a concise outline of the steps we&#8217;ll undertake:</p>



<ol class="wp-block-list">
<li>Loading the house price dataset</li>



<li>Exploring the dataset intricacies</li>



<li>Preparing the data for modeling</li>



<li>Training a baseline Random Decision Forest model</li>



<li>Implementing a random search approach for model optimization</li>



<li>Measuring and evaluating the performance of our optimized model</li>
</ol>



<p>Through this step-by-step guide, you&#8217;ll learn to enhance model performance, further refining your understanding of Random Search algorithm implementation in Python.</p>



<p>The Python code is available in the relataly GitHub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_f9d778-26"><a class="kb-button kt-button button kb-btn_b5d394-00 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/016%20Hyperparameter%20Tuning%20of%20Random%20Decision%20Forests%20using%20Grid%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_ce738a-fc kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="504" data-attachment-id="12422" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/isometric-house-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" data-orig-size="505,504" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-house-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" alt="Now that we have trained a house price prediction model, we can use it to assess the price of new houses. Image generated with Midjourney. Python machine learning tutorial. relataly.com" class="wp-image-12422" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Once we have trained a house price prediction model, we can use it to assess the price of new houses. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Before starting the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using the Python machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> to implement the random forest and the random search technique. </p>



<p>You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-house-price-prediction-about-the-use-case-and-the-data">House Price Prediction: About the Use Case and the Data</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>House price prediction is the process of using statistical and machine learning techniques to predict the future value of a house. This can be useful for a variety of applications, such as helping homeowners and real estate professionals to make informed decisions about buying and selling properties. In order to make accurate predictions, it is important to have access to high-quality data about the housing market.</p>



<p>In this tutorial, we will work with a house price dataset from the <a href="//www.kaggle.com/c/house-prices-advanced-regression-techniques" target="_blank" rel="noreferrer noopener">house price regression challenge on Kaggle.com</a>. The dataset is available via a GitHub repository. It contains information on 1,460 houses sold between 2006 and 2010 in Ames, Iowa. The data includes the sale price and 79 house characteristics, such as:</p>



<ul class="wp-block-list">
<li>YearBuilt &#8211; the year of construction</li>



<li>YrSold &#8211; the year in which the house was sold</li>



<li>LotArea &#8211; the lot size in square feet</li>



<li>OverallQual &#8211; the overall quality of the house from one (lowest) to ten (highest)</li>



<li>Street &#8211; the type of road access, e.g., paved or gravel</li>



<li>Utilities &#8211; the type of utilities available</li>



<li>GarageArea &#8211; the garage size in square feet</li>



<li>TotRmsAbvGrd &#8211; the number of rooms above ground</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="505" data-attachment-id="12421" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/isometric-house-collection-python-machine-learning-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" data-orig-size="505,505" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-house-collection-python-machine-learning-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" alt="Predicting house prices with machine learning. Image generated with Midjourney. Isometric view of houses. relataly.com" class="wp-image-12421" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Predicting house prices with machine learning. 
Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p>We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn import svm

# Source: 
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Load train and test datasets
path = &quot;https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv&quot;
df = pd.read_csv(path)
print(df.columns)
df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60			RL			65.0		8450	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2008	WD			Normal			208500
1	2	20			RL			80.0		9600	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		5		2007	WD			Normal			181500
2	3	60			RL			68.0		11250	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		9		2008	WD			Normal			223500
3	4	70			RL			60.0		9550	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2006	WD			Abnorml			140000
4	5	60			RL			84.0		14260	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		12		2008	WD			Normal			250000
5 rows × 81 columns</pre></div>



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p>Before jumping into preprocessing and model training, let&#8217;s quickly explore the data. A distribution plot helps us understand how the regression target, the sale price, is distributed across our dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot the distribution of the sale price (the regression target)
ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2)
plt.title('Sale Price Distribution')</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="944" height="440" data-attachment-id="8424" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/histplot/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" data-orig-size="944,440" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="histplot" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" src="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" alt="random search hyperparameter tuning python. random forest regression,
sale price distribution" class="wp-image-8424" srcset="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 944w, https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 768w" sizes="(max-width: 944px) 100vw, 944px" /></figure>



<p>For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between the sale price, lot area, and the house&#8217;s overall quality.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a scatterplot of sale price vs. lot area, colored by overall quality
plt.figure(figsize=(16,6))
df_features = df[['SalePrice', 'LotArea', 'OverallQual']]
sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual')
plt.title('Sale Price vs. Lot Area by Overall Quality')</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="966" height="387" data-attachment-id="8430" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/overall-quality-vs-lotarea-depending-on-sale-price/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" data-orig-size="966,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Overall-Quality-vs-LotArea-depending-on-Sale-Price" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" src="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" alt="random search hyperparameter tuning python. random forest regression,
scatter plot, feature selection" class="wp-image-8430" srcset="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 966w, https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 768w" sizes="(max-width: 966px) 100vw, 966px" /></figure>



<p>As expected, the scatterplot shows that the sale price increases with the overall quality. On the other hand, the LotArea has only a minor effect on the sale price. </p>
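<p>To back up this visual impression with a number, we can compute correlation coefficients. The sketch below is illustrative only: it uses a tiny made-up frame (values modeled on the first rows of the dataset) in place of the full dataframe, and ranks features by their Pearson correlation with the sale price.</p>

```python
import pandas as pd

# Tiny illustrative frame standing in for the house-price data
# (hypothetical values modeled on the first rows of the dataset)
df_small = pd.DataFrame({
    'SalePrice':   [208500, 181500, 223500, 140000, 250000, 143000],
    'OverallQual': [7, 6, 7, 7, 8, 5],
    'LotArea':     [8450, 9600, 11250, 9550, 14260, 14115],
})

# Pearson correlation of each feature with the target
corr = df_small.corr()['SalePrice'].drop('SalePrice')
print(corr.sort_values(ascending=False))
```

A feature with a correlation close to zero, like LotArea here, is unlikely to carry much predictive power on its own.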



<h3 class="wp-block-heading">Step #3 Data Preprocessing</h3>



<p>Next, we prepare the data for use as input to train a regression model. To keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables as one-hot dummy variables using pandas&#8217; get_dummies.</p>



<p>To ensure that our regression model does not know the target variable, we separate house price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def preprocessFeatures(df):   
    # Define a list of relevant features
    feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond']
    df_dummy = pd.get_dummies(df[feature_list])
    # Optionally drop records with missing values:
    # df_dummy = df_dummy.dropna()
    return df_dummy

df_base = preprocessFeatures(df)

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(df_base.drop(columns=['SalePrice']), df_base['SalePrice'].copy(), train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		OverallQual	GarageArea	LotArea	OverallCond	Utilities_AllPub	Utilities_NoSeWa
682		6			431			2887	5			1					0
960		5			0			7207	7			1					0
1384	6			280			9060	5			1					0
1100	2			246			8400	5			1					0
416		6			440			7844	7			1					0</pre></div>



<h3 class="wp-block-heading" id="h-step-4-train-different-regression-models-using-random-search">Step #4 Train Different Regression Models using Random Search</h3>



<p>Now that the dataset is ready, we can train the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then selects n random permutations from the grid and uses them to train the model. </p>
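<p>To make the sampling step concrete, the sketch below uses scikit-learn&#8217;s ParameterSampler, the same utility RandomizedSearchCV uses internally, to draw five random configurations from a small grid. The grid values mirror a subset of the ranges used later in this tutorial.</p>

```python
from sklearn.model_selection import ParameterSampler

# The search space from which random search draws configurations
param_distributions = {
    'max_leaf_nodes': [2, 3, 4, 5, 6, 7],
    'min_samples_split': [5, 10, 20, 50],
    'max_depth': [5, 10, 15, 20],
}

# Draw n = 5 random permutations from the grid
sampler = ParameterSampler(param_distributions, n_iter=5, random_state=0)
for config in sampler:
    print(config)
```

Each printed dictionary is one candidate configuration that the search would train and evaluate.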



<p>We use the RandomizedSearchCV algorithm from the scikit-learn package. The &#8220;CV&#8221; in the name stands for cross-validation. Cross-validation splits the training data into subsets (folds) and rotates them between training and validation runs, so each candidate configuration is trained and evaluated multiple times on different data partitions. The search algorithm then summarizes these fold results into a mean test score for each configuration.</p>
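<p>The cross-validation mechanism can be illustrated in isolation with scikit-learn&#8217;s cross_val_score. The snippet below is a minimal sketch on synthetic data (not the house-price dataset): one model configuration is scored five times, once per fold, and the fold scores are averaged.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation: the model is trained and scored five times,
# each time holding out a different fold as validation data
scores = cross_val_score(RandomForestRegressor(n_estimators=20, random_state=0), X, y, cv=5)
print(scores)          # one R^2 score per fold
print(scores.mean())   # the summary score reported for this configuration
```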



<p>We use a Random Decision Forest &#8211; a robust machine learning algorithm that can handle both classification and regression tasks. As a so-called ensemble model, the Random Forest aggregates the predictions of a set of independent estimators (decision trees). The estimator is an important parameter to pass to the RandomizedSearchCV function. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:</p>



<ul class="wp-block-list">
<li>max_leaf_nodes = [2, 3, 4, 5, 6, 7]</li>



<li>min_samples_split = [5, 10, 20, 50]</li>



<li>max_depth = [5,10,15,20]</li>



<li>max_features = [3,4,5]</li>



<li>n_estimators = [50, 100, 200]</li>
</ul>



<p>These parameter ranges define the search space from which the randomized search algorithm (RandomizedSearchCV) selects random configurations. All other parameters keep the default values defined by <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier" target="_blank" rel="noreferrer noopener">scikit-learn</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define the Estimator and the Parameter Ranges
dt = RandomForestRegressor()
number_of_iterations = 20
max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

# Define the param distribution dictionary
param_distributions = dict(max_leaf_nodes=max_leaf_nodes, 
                           min_samples_split=min_samples_split, 
                           max_depth=max_depth,
                           max_features=max_features,
                           n_estimators=n_estimators)

# Build the randomized search
grid = RandomizedSearchCV(estimator=dt, 
                          param_distributions=param_distributions, 
                          n_iter=number_of_iterations, 
                          cv = 5)

grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print(&quot;Best score: {0}, using {1}&quot;.format(np.round(grid_results.best_score_, 4), grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Best score: 0.7091, using {'n_estimators': 100, 'min_samples_split': 5, 'max_leaf_nodes': 7, 'max_features': 3, 'max_depth': 15}
	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_n_estimators	param_min_samples_split	param_max_leaf_nodes	param_max_features	param_max_depth	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.049196		0.002071		0.004074		0.000820		50					20						5	4	15	{'n_estimators': 50, 'min_samples_split': 20, ...	0.662973	0.705533	0.669520	0.702608	0.696280	0.687383	0.017637	7
1	0.041115		0.000554		0.003046		0.000094		50					50						2	3	10	{'n_estimators': 50, 'min_samples_split': 50, ...	0.490984	0.527231	0.426270	0.523086	0.511513	0.495817	0.036978	20
2	0.043325		0.000779		0.003486		0.000447		50					50						2	5	20	{'n_estimators': 50, 'min_samples_split': 50, ...	0.484524	0.559358	0.485459	0.517253	0.560343	0.521388	0.033545	18
3	0.162083		0.005665		0.012420		0.004788		200					5						3	3	20	{'n_estimators': 200, 'min_samples_split': 5, ...	0.586586	0.638341	0.573437	0.626793	0.636608	0.612353	0.027021	14
4	0.166659		0.003026		0.010958		0.000084		200					10						4	3	15	{'n_estimators': 200, 'min_samples_split': 10,...	0.633305	0.679161	0.623236	0.661864	0.670481	0.653609	0.021636	13</pre></div>



<p>The table shows the first five tested configurations along with their cross-validation scores; the full results_df contains all 20 candidates.</p>



<h3 class="wp-block-heading" id="h-step-5-select-the-best-model-and-measure-performance"><strong>Step #5 Select the best Model and Measure Performance</strong></h3>



<p>Finally, we retrieve the best model from the search results via the &#8220;best_estimator_&#8221; attribute. We then calculate the MAE and the MAPE to understand how the model performs on the overall test dataset, and print a comparison between actual and predicted sale prices.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Select the best Model and Measure Performance
best_model = grid_results.best_estimator_
y_pred = best_model.predict(x_test)
y_df = pd.DataFrame(y_test)
y_df['PredictedPrice']=y_pred
y_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	SalePrice	PredictedPrice
529	200624		166037.831002
491	133000		135860.757958
459	110000		123030.336177
279	192000		206488.444327
655	88000		130453.604206</pre></div>



<p>Next, let&#8217;s take a look at the prediction errors.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean Absolute Error (MAE): 29591.56 
Mean Absolute Percentage Error (MAPE): 15.57 %</pre></div>



<p>On average, the model&#8217;s predictions deviate from the actual sale price by about 16%. Considering that we used only a fraction of the available features and defined a small search space, there is much room for improvement.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>This article has shown how to use random search in Python to efficiently search for a good hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how random search samples configurations from a predefined parameter grid instead of exhaustively trying all permutations. The second part was a hands-on Python tutorial, in which you used random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price of houses from several characteristics. We then defined parameter ranges and tested random permutations. In this way, we quickly identified a configuration that outperforms our initial baseline model.</p>



<p>Remember that random search efficiently identifies a well-performing model but does not necessarily return the best-performing one. Random search can be used to tune the hyperparameters of both regression and classification models.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p></p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p>I hope this article was helpful. If you have any questions or suggestions, please write them in the comments. </p>



<div style="display: inline-block;">
  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1999579577&amp;asins=1999579577&amp;linkId=91d862698bf9010ff4c09539e4c49bf4&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1839217715&amp;asins=1839217715&amp;linkId=356ba074068849ff54393f527190825d&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6875</post-id>	</item>
		<item>
		<title>Stock Market Forecasting Neural Networks for Multi-Output Regression in Python</title>
		<link>https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/</link>
					<comments>https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Tue, 13 Jul 2021 21:10:23 +0000</pubDate>
				<category><![CDATA[Finance]]></category>
		<category><![CDATA[Keras]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Recurrent Neural Networks]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Stock Market Forecasting]]></category>
		<category><![CDATA[Time Series Forecasting]]></category>
		<category><![CDATA[Yahoo Finance API]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multi-output Neural Network]]></category>
		<category><![CDATA[Multi-Step Time Series Forecasting]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Stock Market Prediction]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=5800</guid>

					<description><![CDATA[<p>Multi-output time series regression can forecast several steps of a time series at once. The number of neurons in the final output layer determines how many steps the model can predict. Models with one output return single-step forecasts. Models with various outputs can return entire series of time steps and thus deliver a more detailed ... <a title="Stock Market Forecasting Neural Networks for Multi-Output Regression in Python" class="read-more" href="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/" aria-label="Read more about Stock Market Forecasting Neural Networks for Multi-Output Regression in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/">Stock Market Forecasting Neural Networks for Multi-Output Regression in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Multi-output time series regression can forecast several steps of a time series at once. The number of neurons in the final output layer determines how many steps the model can predict. Models with one output return single-step forecasts. Models with multiple outputs can return entire series of time steps and thus deliver a more detailed projection of how a time series could develop in the future. This article is a hands-on Python tutorial that shows how to design a neural network architecture with multiple outputs. The goal is to create a multi-output model for stock-price forecasting using Python and Keras. The same approach can be applied to other time series forecasting tasks, such as weather or sales forecasting.</p>



<p>This article proceeds as follows: We briefly discuss the architecture of a multi-output neural network. After familiarizing ourselves with the model architecture, we develop a Keras neural network for multi-output regression. For data preparation, we perform various steps, including cleaning, splitting, selecting, and scaling the data. Afterward, we define a model architecture with multiple LSTM layers and ten output neurons in the last layer. This architecture enables the model to generate projections for ten consecutive steps. After configuring the model architecture, we train the model with the historical daily prices of the Apple stock. Finally, we use this model to generate a ten-day forecast.</p>
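<p>The multi-step targets described above can be sketched as a simple sliding-window transformation. The snippet below is a minimal illustration with a synthetic series (not the Apple data): each training sample pairs a window of past values (the model input) with the ten subsequent values, which correspond to the ten output neurons&#8217; targets.</p>

```python
import numpy as np

def make_windows(series, n_input, n_output):
    """Split a 1-D series into (input window, multi-step target) pairs."""
    X, y = [], []
    for i in range(len(series) - n_input - n_output + 1):
        X.append(series[i : i + n_input])
        y.append(series[i + n_input : i + n_input + n_output])
    return np.array(X), np.array(y)

# Synthetic stand-in for a daily price series
prices = np.linspace(100, 199, 100)

X, y = make_windows(prices, n_input=50, n_output=10)
print(X.shape, y.shape)  # (41, 50) and (41, 10): ten target steps per sample
```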



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<div class="wp-block-kadence-infobox kt-info-box_317393-a1"><span class="kt-blocks-info-box-link-wrap info-box-link kt-blocks-info-box-media-align-top kt-info-halign-left"><div class="kt-infobox-textcontent"><h2 class="kt-blocks-info-box-title">Disclaimer</h2><p class="kt-blocks-info-box-text">This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.</p></div></span></div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:5px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-multi-output-regression-vs-single-output-regression">Multi-Output Regression vs. Single-Output Regression</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In time series regression, we train a statistical model on the past values of a time series to make statements about how the time series develops further. During model training, we feed the model with so-called mini-batches and the corresponding target values. The model then creates forecasts for all input batches and compares these predictions to the actual target values to calculate the residuals (prediction errors). In this way, the model can adjust its parameters iteratively and learn to make better predictions.</p>
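<p>The loop of predicting, computing residuals, and adjusting parameters can be sketched in a few lines. In the illustration below, a linear model stands in for the neural network to keep the sketch short; the data and learning rate are arbitrary choices.</p>

```python
import numpy as np

# Minimal sketch of the training loop described above, with a linear model
# standing in for the neural network: predict on a mini-batch, compute the
# residuals (prediction errors), and adjust the parameters accordingly.
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 3))             # one mini-batch of 32 samples
true_w = np.array([2.0, -1.0, 0.5])      # the relationship we want to learn
y = X @ true_w                           # target values

w = np.zeros(3)                          # model parameters, initially zero
for _ in range(300):
    residuals = X @ w - y                # prediction errors on the batch
    w -= 0.1 * X.T @ residuals / len(X)  # gradient step on the squared error

print(np.round(w, 2))                    # parameters approach true_w
```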



<p>Multivariate forecasting models take into account multiple input variables, such as historical time series data and additional features like moving averages or momentum indicators, to improve the accuracy of their predictions. The idea is that these various variables can help the model identify patterns in the data that suggest future price movements.</p>
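<p>Such additional features can be derived directly from the price series itself. The sketch below is a minimal illustration with a synthetic series (not real market data) showing how a moving average and a simple momentum indicator could be computed with pandas; the window lengths are arbitrary choices.</p>

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices as a stand-in for real market data
rng = np.random.default_rng(0)
close = pd.Series(100 + np.cumsum(rng.normal(size=60)), name='close')

# Derive additional input features from the raw series
features = pd.DataFrame({
    'close': close,
    'sma_10': close.rolling(window=10).mean(),  # 10-day simple moving average
    'momentum_5': close.diff(5),                # 5-day price change
})
print(features.dropna().head())
```

Rows without a full look-back window are dropped before such features are fed to a model.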
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-large is-resized"><img decoding="async" data-attachment-id="7569" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/multi-output-neural-networks-time-series-regression-architecture/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png" data-orig-size="2017,1342" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multi-output-neural-networks-time-series-regression-architecture" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png" src="https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture-1024x681.png" alt="An exemplary architecture of a neural network with five input neurons (blue) and four output neurons (red), keras, python, tutorial, stock market prediction" class="wp-image-7569" width="371" height="247" srcset="https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 768w, 
https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 2017w" sizes="(max-width: 371px) 100vw, 371px" /><figcaption class="wp-element-caption">An exemplary architecture of a neural network with five input neurons (blue) and four output neurons (red)</figcaption></figure>
</div></div>
</div>



<h2 class="wp-block-heading">The Architecture of a Neural Network with Multiple Outputs</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Next, we will discuss the architecture of a neural network with multiple outputs. The architecture consists of several layers: an input layer, several hidden layers, and an output layer. The number of neurons in the input layer must match the shape of the input data, and the number of neurons in the output layer determines how many future time steps the model predicts. </p>



<p>Models with a single neuron in the output layer predict a single time step. It is still possible to predict multiple price steps with a single-output model, but this requires a <a href="https://www.relataly.com/multi-step-time-series-forecasting-a-step-by-step-guide/275/" target="_blank" rel="noreferrer noopener">rolling forecasting approach</a> in which the outputs are iteratively fed back into the input to make further-reaching predictions. However, this approach is cumbersome, and prediction errors can accumulate with each step. A more elegant way is to train a multi-output model right away.</p>
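<p>The rolling approach can be sketched as follows; <code>predict_one_step</code> is a hypothetical stand-in for a trained single-output model, not part of this tutorial's code:</p>

```python
import numpy as np

def predict_one_step(window):
    # Hypothetical stand-in for a trained single-output model;
    # here it simply returns the mean of the window (illustration only)
    return float(np.mean(window))

history = [1.0, 2.0, 3.0, 4.0, 5.0]
window_size = 3
horizon = 4  # how many future steps we want

window = list(history[-window_size:])
forecast = []
for _ in range(horizon):
    next_value = predict_one_step(window)
    forecast.append(next_value)
    # feed the prediction back in to reach one step further
    window = window[1:] + [next_value]

print(forecast)
```

Because each step consumes the previous prediction, any error made early on propagates into all later steps.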



<figure class="wp-block-image size-large is-resized is-style-default"><img decoding="async" data-attachment-id="7586" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/architecture-neural-network-multi-output-regression-model/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png" data-orig-size="3395,1503" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="architecture-neural-network-multi-output-regression-model" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png" src="https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model-1024x453.png" alt="The inputs and outputs of a neural network for time series regression with five input neurons and four outputs. 
Stock market forecasting" class="wp-image-7586" width="755" height="334" srcset="https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 2048w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 2475w" sizes="(max-width: 755px) 100vw, 755px" /><figcaption class="wp-element-caption">The inputs and outputs of a neural network for time series regression with five input neurons and four outputs</figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p> </p>



<h2 class="wp-block-heading">Training Neural Networks with Multiple Outputs</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>A model with multiple neurons in the output layer predicts several future steps in a single pass. Multi-output regression models train on sequences of past values, each paired with the consecutive output sequence. The model architecture thus contains multiple neurons in the input layer and multiple neurons in the output layer (as illustrated). </p>



<p>In a multi-output regression model, each neuron in the output layer is responsible for predicting a different time step in the future. To train such a model, you need to provide a sequence of input data followed by the corresponding sequence of output data. For example, if you want to predict the stock price for the next ten days, you would provide a sequence of input data containing the historical stock prices for the past 50 days, followed by a sequence of output data containing the stock prices for the next 10 days.</p>
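<p>The pairing of 50-step input sequences with 10-step output sequences can be sketched on a toy univariate series (the hands-on section below builds the full multivariate version):</p>

```python
import numpy as np

series = np.arange(100, dtype=float)  # toy price series with 100 steps
input_len, output_len = 50, 10

X, y = [], []
for i in range(input_len, len(series) - output_len + 1):
    X.append(series[i - input_len:i])    # 50 past values as model input
    y.append(series[i:i + output_len])   # 10 future values as targets
X, y = np.array(X), np.array(y)

print(X.shape, y.shape)  # (41, 50) (41, 10)
```

Each row of X is a sliding window shifted by one step, and the matching row of y contains the ten values that immediately follow it.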



<p>The model will then learn to map the input sequence to the output sequence so that it can make predictions for multiple time steps in the future based on the input data. </p>



<p>In the next part of this tutorial, we will walk through the process of developing a multi-output regression model in more detail.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading has-contrast-color has-text-color" id="h-implementing-a-neural-network-model-for-multi-output-multi-step-regression-in-python">Implementing a Neural Network Model for Multi-Output Multi-Step Regression in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Let&#8217;s get started with the hands-on Python part. In the following, we will develop a neural network with Keras and TensorFlow that forecasts the Apple stock price. Most of the effort goes into preparing the data and bringing it into the right shape for a neural network with multiple outputs. Broadly, this involves the following steps:</p>



<ol class="wp-block-list">
<li>Load the time series data that we want to use as input and output for our model. We use historical price data that is available via the Yahoo Finance API.</li>



<li>Then we split our data into training and testing sets. We will use the training set to fit the model and the testing set to evaluate the model&#8217;s performance.</li>



<li>Preprocess the data: This includes scaling the data and selecting relevant features.</li>



<li>Reshape the data into a format that can be fed into the neural network. For time series data, this involves converting the data into a 3D array.</li>



<li>Finally, we will train our model and generate the forecast.</li>
</ol>



<p>The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_e5fd46-d1"><a class="kb-button kt-button button kb-btn_25b4a6-dd kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/01%20Time%20Series%20Forecasting%20%26%20Regression/006%20Multi-Output%20Regression.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_3ee95e-9c kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="337" data-attachment-id="12797" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/multiple_waterfalls_coming_from_a_single_waterfall/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png" data-orig-size="768,505" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multiple_waterfalls_coming_from_a_single_waterfall" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png" src="https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall-512x337.png" alt="Neural network architectures with multiple outputs allow for more potent solutions but are more complex to train. Image created with Midjourney. 
Stock market forecasting, multi-output multi-step  regression, python" class="wp-image-12797" srcset="https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png 768w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Neural network architectures with multiple outputs allow for more potent solutions but are more complex to train. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>



<p></p>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. If you don&#8217;t have a Python environment, consider <a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using the machine learning libraries Keras, Scikit-learn, and TensorFlow. For visualization, we will be using the Seaborn package.</p>



<p>Please also have either the <a href="https://pandas-datareader.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener">pandas_datareader</a> or the <a href="https://pypi.org/project/yfinance/" target="_blank" rel="noreferrer noopener">yfinance</a> package installed. You will use one of these packages to retrieve the historical stock quotes.</p>



<p>You can install these packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p>The Pandas DataReader library is our first choice for interacting with the Yahoo Finance API. If the library causes problems (it sometimes does), you can also use the yfinance package, which returns the same data. We begin by loading historical price quotes of the Apple stock from the public Yahoo Finance API. Running the code below loads the data into a Pandas DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># import pandas_datareader as webreader # Remote data access for pandas
import math # Mathematical functions 
import numpy as np # Fundamental package for scientific computing with Python
import pandas as pd # Additional functions for analysing and manipulating data
from datetime import date, timedelta, datetime # Date Functions
from pandas.plotting import register_matplotlib_converters # This function adds plotting functions for calendar dates
import matplotlib.pyplot as plt # Important package for visualization - we use this to plot the market data
import matplotlib.dates as mdates # Formatting dates
from sklearn.metrics import mean_absolute_error, mean_squared_error # Packages for measuring model performance / errors
from keras.models import Sequential # Deep learning library, used for neural networks
from keras.layers import LSTM, Dense, Dropout # Deep learning classes for recurrent and regular densely-connected layers
from keras.callbacks import EarlyStopping # EarlyStopping during model training
from sklearn.preprocessing import RobustScaler, MinMaxScaler # This Scaler removes the median and scales the data according to the quantile range to normalize the price data 
import seaborn as sns

# from pandas_datareader.nasdaq_trader import get_nasdaq_symbols
# symbols = get_nasdaq_symbols()

# Setting the timeframe for the data extraction
today = date.today()
date_today = today.strftime(&quot;%Y-%m-%d&quot;)
date_start = '2010-01-01'

# Getting NASDAQ quotes
stockname = 'Apple'
symbol = 'AAPL'
# df = webreader.DataReader(
#     symbol, start=date_start, end=date_today, data_source=&quot;yahoo&quot;
# )

import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=date_start, end=date_today)

# Create a quick overview of the dataset
df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Tensorflow Version: 2.6.0
Num GPUs: 1
[*********************100%***********************]  1 of 1 completed
			Open		High		Low			Close		Adj Close	Volume
Date						
2010-01-04	7.622500	7.660714	7.585000	7.643214	6.515213	493729600
2010-01-05	7.664286	7.699643	7.616071	7.656429	6.526477	601904800
2010-01-06	7.656429	7.686786	7.526786	7.534643	6.422666	552160000
2010-01-07	7.562500	7.571429	7.466071	7.520714	6.410791	477131200
2010-01-08	7.510714	7.571429	7.466429	7.570714	6.453413	447610800</pre></div>



<p>The data should comprise the following columns:</p>



<ul class="wp-block-list">
<li>Close</li>



<li>Open</li>



<li>High</li>



<li>Low</li>



<li>Adj Close</li>



<li>Volume</li>
</ul>



<p>The target variable that we are trying to predict is the Closing price (Close).</p>



<h3 class="wp-block-heading">Step #2: Explore the Data</h3>



<p>Once we have loaded the data, we print a quick overview of the time series using line graphs. The following code plots a line chart for each column in <code>df_plot</code> using the <code>seaborn</code> library. The charts are arranged in a grid with <code>nrows</code> rows and <code>ncols</code> columns. Setting <code>sharex=True</code> makes the subplots share their x-axes, and the <code>figsize</code> parameter determines the size of the figure in inches.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot line charts
df_plot = df.copy()

ncols = 2
nrows = int(round(df_plot.shape[1] / ncols, 0))

fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(14, 7))
for i, ax in enumerate(fig.axes):
        sns.lineplot(data = df_plot.iloc[:, i], ax=ax)
        ax.tick_params(axis=&quot;x&quot;, rotation=30, labelsize=10, length=0)
        ax.xaxis.set_major_locator(mdates.AutoDateLocator())
fig.tight_layout()
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5805" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/image-5-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/07/image-5.png" data-orig-size="999,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/07/image-5.png" src="https://www.relataly.com/wp-content/uploads/2021/07/image-5.png" alt="The Apple Stock's historical price data, including quotes, highs, lows, and volume" class="wp-image-5805" width="817" height="406" srcset="https://www.relataly.com/wp-content/uploads/2021/07/image-5.png 999w, https://www.relataly.com/wp-content/uploads/2021/07/image-5.png 300w, https://www.relataly.com/wp-content/uploads/2021/07/image-5.png 768w" sizes="(max-width: 817px) 100vw, 817px" /></figure>



<p>The line plots look as expected and reflect the Apple stock price history. Because we fetch daily data from an API, note that the line plots will look different depending on when you run the code. </p>



<h3 class="wp-block-heading" id="h-step-3-preprocess-the-data">Step #3: Preprocess the Data</h3>



<p>Next, we prepare the data for the training process of our multi-output forecasting model. Preparing the data for multivariate forecasting involves several steps: </p>



<ul class="wp-block-list">
<li>Selecting features for model training</li>



<li>Scaling and splitting the data into separate sets for training and testing</li>



<li>Slicing the time series into several shifted training batches</li>
</ul>



<p>Remember that these steps are specific to our data and use case. The preparation required for a neural network with multiple outputs in time series forecasting depends on the characteristics of your data and the requirements of your model. Consider these factors carefully and tailor your data preparation accordingly.</p>



<h4 class="wp-block-heading">3.1 Basic Preparations</h4>



<p>We begin by creating a copy of the initial data and resetting the index.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Indexing Batches
df_train = df.sort_values(by=['Date']).copy()

# We save a copy of the date index before we reset it to numbers
date_index = df_train.index

# We reset the index, so we can convert the date-index to a number-index
df_train = df_train.reset_index(drop=True).copy()
df_train.head(5)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">			Open		High		Low			Close		Adj Close	Volume
Date						
2022-11-29	144.289993	144.809998	140.350006	141.169998	141.169998	83763800
2022-11-30	141.399994	148.720001	140.550003	148.029999	148.029999	111224400
2022-12-01	148.210007	149.130005	146.610001	148.309998	148.309998	71250400
2022-12-02	145.960007	148.000000	145.649994	147.809998	147.809998	65421400
2022-12-05	147.770004	150.919998	145.770004	146.630005	146.630005	68732400</pre></div>



<h4 class="wp-block-heading" id="h-3-2-feature-selection-and-scaling">3.2 Feature Selection and Scaling</h4>



<p>We proceed with feature selection. To keep things simple, we will use the features from the input data without any modifications. After selecting the features, we scale them to a range between 0 and 1. To ease unscaling the predictions after training, we create two different scalers: One for the training data, which takes five columns, and one for the output data that scales a single column (the Close Price). I have covered <a href="https://www.relataly.com/feature-engineering-for-multivariate-time-series-models-with-python/1813/" target="_blank" rel="noreferrer noopener">feature engineering in a separate article</a> if you want to learn more about this topic.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def prepare_data(df):

    # List of considered Features
    FEATURES = ['Open', 'High', 'Low', 'Close', 'Volume']

    print('FEATURE LIST')
    print([f for f in FEATURES])

    # Create the dataset with features and filter the data to the list of FEATURES
    df_filter = df[FEATURES]
    
    # Convert the data to numpy values
    np_filter_unscaled = np.array(df_filter)
    #np_filter_unscaled = np.reshape(np_unscaled, (df_filter.shape[0], -1))
    print(np_filter_unscaled.shape)

    np_c_unscaled = np.array(df['Close']).reshape(-1, 1)
    
    return np_filter_unscaled, np_c_unscaled, df_filter
    
np_filter_unscaled, np_c_unscaled, df_filter = prepare_data(df_train)
                                          
# Creating a separate scaler that works on a single column for scaling predictions
# Scale each feature to a range between 0 and 1
scaler_train = MinMaxScaler()
np_scaled = scaler_train.fit_transform(np_filter_unscaled)
    
# Create a separate scaler for a single column
scaler_pred = MinMaxScaler()
np_scaled_c = scaler_pred.fit_transform(np_c_unscaled)   </pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">FEATURE LIST
['Open', 'High', 'Low', 'Close', 'Volume']
(3254, 5)</pre></div>



<p>The final step of the data preparation is to create the structure for the input data. This structure needs to match the input layer of the model architecture.</p>
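<p>As a minimal illustration (the shape values here are arbitrary), a recurrent input layer in Keras expects batches shaped (samples, time steps, features), which we can mimic with NumPy:</p>

```python
import numpy as np

# Toy batch: 8 samples, each a window of 50 time steps with 5 features
samples, time_steps, features = 8, 50, 5
x_batch = np.zeros((samples, time_steps, features))

print(x_batch.shape)  # (8, 50, 5)
```

The slicing step in the next section produces exactly this kind of 3D array from the scaled price data.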



<h4 class="wp-block-heading">3.3 Slicing the Data for a Model with Multiple In- and Outputs</h4>



<p>The code below starts a sliding window process that cuts the initial time series data into multiple slices, i.e., mini-batches. Each batch is a smaller fraction of the initial time series shifted by a single step. Because we will feed our model with multivariate input data, the time series consists of five input columns/features. Each batch comprises a period of 50 steps from the time series and an output sequence of ten consecutive values. To validate that the batches have the right shape, we visualize mini-batches in a line graph with their consecutive target values. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Set the input_sequence_length length - this is the timeframe used to make a single prediction
input_sequence_length = 50
# The output sequence length is the number of steps that the neural network predicts
output_sequence_length = 10

# Prediction Index
index_Close = df_train.columns.get_loc(&quot;Close&quot;)

# Split the data into train and test data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_length = math.ceil(np_scaled.shape[0] * 0.8)

# Create the training and test data
train_data = np_scaled[:train_data_length, :]
test_data = np_scaled[train_data_length - input_sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, input_sequence_length time steps per sample, and f features
def partition_dataset(input_sequence_length, output_sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]
    for i in range(input_sequence_length, data_len - output_sequence_length):
        x.append(data[i-input_sequence_length:i,:]) # the input_sequence_length preceding rows across all columns
        y.append(data[i:i + output_sequence_length, index_Close]) # the next output_sequence_length values of the Close column as prediction targets
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(input_sequence_length, output_sequence_length, train_data)
x_test, y_test = partition_dataset(input_sequence_length, output_sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
nrows = 3 # number of shifted plots
fig, ax = plt.subplots(nrows=nrows, ncols=1, figsize=(16, 8))
for i, ax in enumerate(fig.axes):
    xtrain = pd.DataFrame(x_train[i][:,index_Close], columns={f'x_train_{i}'})
    ytrain = pd.DataFrame(y_train[i][:output_sequence_length-1], columns={f'y_train_{i}'})
    ytrain.index = np.arange(input_sequence_length, input_sequence_length + output_sequence_length-1)
    xtrain_ = pd.concat([xtrain, ytrain[:1].rename(columns={ytrain.columns[0]:xtrain.columns[0]})])
    df_merge = pd.concat([xtrain_, ytrain])
    sns.lineplot(data = df_merge, ax=ax)
plt.show()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(2544, 50, 5) (2544, 10)
(640, 50, 5) (640, 10)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8670" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png" data-orig-size="939,465" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png" src="https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png" alt="batch test visualizations, deep neural networks for multi-output stock market forecasting" class="wp-image-8670" width="891" height="441" srcset="https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png 939w, https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png 300w, 
https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png 768w" sizes="(max-width: 891px) 100vw, 891px" /></figure>
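<p>The windowing logic can also be checked in isolation. The following standalone sketch applies the same partitioning to a small synthetic array (instead of the scaled stock data) to show how the number of samples and the x/y shapes follow from the sequence lengths; the array values and the target column index are illustrative:</p>

```python
import numpy as np

# Hypothetical stand-in for the scaled dataset: 100 rows, 5 features
data = np.arange(100 * 5, dtype=float).reshape(100, 5)

input_sequence_length = 50   # window fed into the model
output_sequence_length = 10  # steps the model predicts
index_close = 3              # column used as the prediction target (illustrative)

def partition_dataset(input_len, output_len, data, target_col):
    """Cut a 2-D time series into (samples, input_len, features) windows
    and matching (samples, output_len) target sequences."""
    x, y = [], []
    for i in range(input_len, data.shape[0] - output_len):
        x.append(data[i - input_len:i, :])
        y.append(data[i:i + output_len, target_col])
    return np.array(x), np.array(y)

x, y = partition_dataset(input_sequence_length, output_sequence_length, data, index_close)
print(x.shape)  # (40, 50, 5): 100 - 50 - 10 = 40 samples
print(y.shape)  # (40, 10)
```

Each additional row of the source series yields exactly one extra window, which is why the tutorial's 3254-row dataset produces 2544 training and 640 test samples.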



<h3 class="wp-block-heading" id="h-step-4-prepare-the-neural-network-architecture-and-train-the-multi-output-regression-model">Step #4: Prepare the Neural Network Architecture and Train the Multi-Output Regression Model</h3>



<p>Now that the training data is prepared, the next step is to configure the architecture of the multi-output neural network. Because we feed the model multiple input series, the architecture is multivariate and must match the shape of the input training batches. </p>



<h4 class="wp-block-heading" id="h-4-1-configuring-and-training-the-model">4.1 Configuring and Training the Model</h4>



<p>We choose a comparatively simple architecture with only two LSTM layers followed by two dense layers. The first dense layer has 20 neurons, and the second is the output layer with ten neurons. If you wonder how I arrived at the number of neurons in the third layer: I conducted several experiments and found that this number leads to solid results. </p>



<p>To ensure that the architecture matches the structure of our input data, we reuse the variables from the previous code section (n_input_neurons, n_output_neurons). The input sequence length is 50, and the output sequence length (the number of steps we want to predict) is ten.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Configure the neural network model
model = Sequential()
n_output_neurons = output_sequence_length

# Model with n_neurons = inputshape Timestamps, each with x_train.shape[2] variables
n_input_neurons = x_train.shape[1] * x_train.shape[2]
print(n_input_neurons, x_train.shape[1], x_train.shape[2])
model.add(LSTM(n_input_neurons, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))) 
model.add(LSTM(n_input_neurons, return_sequences=False))
model.add(Dense(20))
model.add(Dense(n_output_neurons))

# Compile the model
model.compile(optimizer='adam', loss='mse')</pre></div>



<p>After configuring the model architecture, we can initiate the training process and illustrate how the loss develops over the training epochs. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Training the model
epochs = 5
batch_size = 16
early_stop = EarlyStopping(monitor='loss', patience=5, verbose=1)
history = model.fit(x_train, y_train, 
                    batch_size=batch_size, 
                    epochs=epochs,
                    validation_data=(x_test, y_test)
                   )
                    # To enable early stopping, pass callbacks=[early_stop] to model.fit</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Epoch 1/5
159/159 [==============================] - 7s 14ms/step - loss: 0.0047 - val_loss: 0.0262
Epoch 2/5
159/159 [==============================] - 2s 11ms/step - loss: 3.6759e-04 - val_loss: 0.0097
Epoch 3/5
159/159 [==============================] - 2s 11ms/step - loss: 1.5222e-04 - val_loss: 0.0056
Epoch 4/5
159/159 [==============================] - 2s 11ms/step - loss: 1.0327e-04 - val_loss: 0.0031
Epoch 5/5
159/159 [==============================] - 2s 11ms/step - loss: 1.1690e-04 - val_loss: 0.0026</pre></div>



<h4 class="wp-block-heading" id="h-4-2-loss-curve">4.2 Loss Curve</h4>



<p>Next, we plot the loss curve, which represents the amount of error between the model&#8217;s predicted values and the actual values in the training data. A lower loss value indicates that the model makes more accurate predictions on the training data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot training &amp; validation loss values
fig, ax = plt.subplots(figsize=(10, 5), sharex=True)
plt.plot(history.history[&quot;loss&quot;])
plt.plot(history.history[&quot;val_loss&quot;])
plt.title(&quot;Model loss&quot;)
plt.ylabel(&quot;Loss&quot;)
plt.xlabel(&quot;Epoch&quot;)
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend([&quot;Train&quot;, &quot;Test&quot;], loc=&quot;upper left&quot;)
plt.grid()
plt.show()</pre></div>



<figure class="wp-block-image size-full is-resized is-style-default"><img decoding="async" data-attachment-id="5856" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/image-4-16/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/08/image-4.png" data-orig-size="628,333" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/08/image-4.png" src="https://www.relataly.com/wp-content/uploads/2021/08/image-4.png" alt="loss curve after training the multi-output neural network, keras, multi-step time series regression" class="wp-image-5856" width="720" height="381" srcset="https://www.relataly.com/wp-content/uploads/2021/08/image-4.png 628w, https://www.relataly.com/wp-content/uploads/2021/08/image-4.png 300w" sizes="(max-width: 720px) 100vw, 720px" /></figure>



<p>As we can see, the loss curve drops quickly during training, which typically means that the model is quickly learning to make accurate predictions.</p>
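
<p>The training cell above defines an EarlyStopping callback but leaves it commented out. When enabled, it stops training once the monitored loss has not improved for a given number of epochs (the patience). The stopping rule itself is simple and can be sketched in plain Python; the function name and the loss values below are illustrative, not Keras API:</p>

```python
def early_stop_epoch(losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop,
    or None if the patience budget is never exhausted."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            # Loss improved: remember it and reset the patience counter
            best = loss
            wait = 0
        else:
            # No improvement: consume one unit of patience
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Loss improves for three epochs, then plateaus: training stops
# five stagnant epochs later, at epoch 8
print(early_stop_epoch([0.5, 0.3, 0.2, 0.21, 0.22, 0.2, 0.25, 0.2, 0.3, 0.2]))  # → 8
```

In Keras, the same logic runs inside `model.fit` when you pass `callbacks=[early_stop]`, and `restore_best_weights=True` additionally rolls the model back to the epoch with the best monitored value.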



<h3 class="wp-block-heading" id="h-step-5-evaluate-model-performance">Step #5: Evaluate Model Performance</h3>



<p>Now that we have trained the model, we can make forecasts on the test data and use traditional regression metrics such as the MAE, MAPE, or MDAPE to measure the performance of our model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Get the predicted values
y_pred_scaled = model.predict(x_test)

# Unscale the predicted values
y_pred = scaler_pred.inverse_transform(y_pred_scaled)
y_test_unscaled = scaler_pred.inverse_transform(y_test).reshape(-1, output_sequence_length)

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred)
print(f'Mean Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')


def prepare_df(i, x, y, y_pred_unscaled):
    # Undo the scaling on x, reshape the testset into a one-dimensional array, so that it fits to the pred scaler
    x_test_unscaled_df = pd.DataFrame(scaler_pred.inverse_transform((x[i]))[:,index_Close]).rename(columns={0:'x_test'})
    
    y_test_unscaled_df = []
    # Undo the scaling on y
    if type(y) == np.ndarray:
        y_test_unscaled_df = pd.DataFrame(scaler_pred.inverse_transform(y)[i]).rename(columns={0:'y_test'})

    # Create a dataframe for the y_pred at position i, y_pred is already unscaled
    y_pred_df = pd.DataFrame(y_pred_unscaled[i]).rename(columns={0:'y_pred'})
    return x_test_unscaled_df, y_pred_df, y_test_unscaled_df


def plot_multi_test_forecast(x_test_unscaled_df, y_test_unscaled_df, y_pred_df, title): 
    # Package y_pred_unscaled and y_test_unscaled into a dataframe with columns pred and true   
    if type(y_test_unscaled_df) == pd.core.frame.DataFrame:
        df_merge = y_pred_df.join(y_test_unscaled_df, how='left')
    else:
        df_merge = y_pred_df.copy()
    
    # Merge the dataframes 
    df_merge_ = pd.concat([x_test_unscaled_df, df_merge]).reset_index(drop=True)
    
    # Plot the linecharts
    fig, ax = plt.subplots(figsize=(20, 8))
    plt.title(title, fontsize=12)
    ax.set(ylabel = stockname + &quot;_stock_price_quotes&quot;)
    sns.lineplot(data = df_merge_, linewidth=2.0, ax=ax)

# Creates a linechart for a specific test batch_number and corresponding test predictions
batch_number = 50
x_test_unscaled_df, y_pred_df, y_test_unscaled_df = prepare_df(batch_number, x_test, y_test, y_pred)
title = f&quot;Predictions vs y_test - test batch number {batch_number}&quot;
plot_multi_test_forecast(x_test_unscaled_df, y_test_unscaled_df, y_pred_df, title) </pre></div>



<figure class="wp-block-image size-large is-style-default"><img decoding="async" width="1024" height="419" data-attachment-id="8667" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/multioutput-regression-deep-neural-networks-test-predictions/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png" data-orig-size="1170,479" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multioutput-regression-deep-neural-networks-test-predictions" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png" src="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions-1024x419.png" alt="A line chart showing predictions and test data, multioutput regression deep neural networks test predictions" class="wp-image-8667" srcset="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png 1170w" sizes="(max-width: 1024px) 100vw, 1024px" 
/></figure>



<p>The quality of the predictions is acceptable, considering that this tutorial aimed not to achieve excellent predictions but to demonstrate the process and architecture of training a multi-output regression. So, there is certainly room for improvement. Feel free to experiment with different features or try other hyperparameters and neural network layers.</p>
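
<p>One simple way to put the error metrics into perspective is to compare the model against a naive persistence baseline that repeats each window's last known close price over the full ten-step horizon; a model that cannot beat this baseline adds little value. A minimal numpy sketch follows (the random arrays only mimic the shapes of the tutorial's test set and are not its actual data):</p>

```python
import numpy as np

def persistence_forecast(x, index_close, horizon):
    """Repeat each window's last close price across the forecast horizon."""
    last_close = x[:, -1, index_close]                      # shape: (samples,)
    return np.repeat(last_close[:, None], horizon, axis=1)  # shape: (samples, horizon)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Illustrative shapes matching the tutorial: 640 test windows, 50 steps, 5 features
rng = np.random.default_rng(0)
x_test = rng.uniform(100, 200, size=(640, 50, 5))
y_test = rng.uniform(100, 200, size=(640, 10))

baseline = persistence_forecast(x_test, index_close=3, horizon=10)
print(baseline.shape)  # (640, 10)
print(round(mape(y_test, baseline), 2))
```

If the trained network's MAPE on the unscaled test data is not clearly below the baseline's, more feature engineering or hyperparameter tuning is warranted.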



<h3 class="wp-block-heading" id="h-step-6-create-a-new-forecast">Step #6: Create a New Forecast</h3>



<p>Finally, let&#8217;s create a forecast on new data. We take the scaled dataset from section 2 (np_scaled) and extract a series with the latest 50 values. The data is reshaped into a 3D array with shape (1, 50, 5) to match the expected input shape of the model. Calling the predict method on these values generates a forecast for the next ten days, which we store in the y_pred_scaled variable. Because the model operates on scaled data, we then transform the predictions back to the original scale using the inverse_transform method of the scaler_pred object, which was fit on the training data. Finally, we visualize the multi-step forecast in another line chart. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Get the latest input batch from the dataset; it contains the price values for the last 50 trading days
x_test_latest_batch = np_scaled[-50:,:].reshape(1,50,5)

# Predict on the batch
y_pred_scaled = model.predict(x_test_latest_batch)
y_pred_unscaled = scaler_pred.inverse_transform(y_pred_scaled)

# Prepare the data and plot the input data and the predictions
x_test_unscaled_df, y_pred_df, _ = prepare_df(0, x_test_latest_batch, '', y_pred_unscaled)
plot_multi_test_forecast(x_test_unscaled_df, '', y_pred_df, &quot;x_new Vs. y_new_pred&quot;)</pre></div>



<figure class="wp-block-image size-large is-style-default"><img decoding="async" width="1024" height="420" data-attachment-id="8666" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/multioutput-regression-deep-neural-networks/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png" data-orig-size="1167,479" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multioutput-regression-deep-neural-networks" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png" src="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-1024x420.png" alt="multioutput regression deep neural networks - x_text and y_pred" class="wp-image-8666" srcset="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png 1167w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>In this tutorial, we demonstrated how to use a multi-output neural network to predict multiple time steps at once. We first discussed the architecture of a recurrent neural network and how it processes sequential data. We then showed how to preprocess the data and split it into training and test sets for training a multi-output regression model.</p>



<p>Next, we trained a model to predict the stock price of Apple ten steps into the future using historical data. We also discussed how to use the trained model to make multi-step predictions on new data and how to visualize the results.</p>



<p>To further improve the performance of the model, you can experiment with different hyperparameters and adjust the model architecture. For example, adding more neurons to the output layer extends the prediction horizon, but remember that the prediction error also grows as the horizon lengthens. You can also try different activation functions or add more layers to the model to see how this affects performance.</p>



<p>I hope this article was helpful in understanding multi-output neural networks better. If you have any questions or comments, please let me know.</p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3MyU6Tj" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2018) Neural Networks and Deep Learning</a></li>



<li><a href="https://amzn.to/3yIQdWi" target="_blank" rel="noreferrer noopener">Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python</a></li>



<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p>If you want to learn about an alternative approach to univariate stock market forecasting, consider taking a look <a href="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/" target="_blank" rel="noreferrer noopener">at Facebook Prophet</a> or <a href="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/" target="_blank" rel="noreferrer noopener">ARIMA models</a>.</p>
<p>The post <a href="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/">Stock Market Forecasting Neural Networks for Multi-Output Regression in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/feed/</wfw:commentRss>
			<slash:comments>31</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5800</post-id>	</item>
		<item>
		<title>Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</title>
		<link>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/</link>
					<comments>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 07 Mar 2021 16:16:19 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Decision Trees]]></category>
		<category><![CDATA[Fighting Crime]]></category>
		<category><![CDATA[Gradient Boosting]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[mplleaflet]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Crime Data]]></category>
		<category><![CDATA[Geographic Maps]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Kaggle]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Smart City]]></category>
		<category><![CDATA[Spatial Data]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2960</guid>

					<description><![CDATA[<p>In this tutorial, we&#8217;ll be using machine learning to predict and map out crime in San Francisco. We&#8217;ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the powerful Scikit-Learn library, we&#8217;ll train a classification model using ... <a title="Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python" class="read-more" href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/" aria-label="Read more about Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/">Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this tutorial, we&#8217;ll be using machine learning to predict and map out crime in San Francisco. We&#8217;ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the powerful Scikit-Learn library, we&#8217;ll train a classification model using the <a href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/" target="_blank" rel="noreferrer noopener">XGBoost algorithm</a> to predict 39 types of crimes based on when and where they occurred. We&#8217;ll then use the Plotly library to visualize the results on a map of the city, highlighting areas with higher rates of certain crimes. This type of prediction and mapping is similar to what the San Francisco Police Department uses in their practice of predictive policing, where they allocate resources to at-risk areas in an effort to prevent crime.</p>



<p>As we embark on this thrilling journey, we&#8217;ll start by downloading and preprocessing the San Francisco crime data. Next, we&#8217;ll channel the data to train two distinct classification models. The first model will utilize a standard Random Forest Classifier, while the second will leverage the exceptional XGBoost package. We&#8217;ll experiment with various models that boast different hyperparameters. Ultimately, we&#8217;ll visualize our predictions on a striking SF crime map and assess the performance of our diverse models. So, buckle up and let&#8217;s dive into the exhilarating world of crime prediction and mapping!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="508" height="513" data-attachment-id="12478" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" data-orig-size="508,513" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" alt="crime prediction san francisco city map xgboost python tutorial. Image generated using Midjourney. 
relataly.com" class="wp-image-12478" srcset="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 297w, https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Predictive policing can make police work much more efficient and effective. Image generated using <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Predictive Policing?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The use case we are looking at in this article falls into predictive policing. Predictive policing uses data, algorithms, and other technological tools to predict where and when crimes are likely to occur. The goal of predictive policing is to help law enforcement agencies better allocate their resources and focus their efforts on areas where crime is likely to happen, with the ultimate goal of reducing crime and improving public safety. This approach to policing is based on the idea that by using data and other tools to identify patterns and trends, law enforcement agencies can better anticipate where crimes are likely to occur and take steps to prevent them from happening.</p>



<p>The benefits of predictive policing include the ability to allocate law enforcement resources better, the potential to reduce crime and improve public safety, and the ability to identify trends and patterns that may not be immediately obvious to law enforcement officers. Additionally, by using data and other tools to anticipate where crimes are likely to occur, law enforcement agencies can take proactive steps to prevent those crimes from happening, which can save time and money.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-creating-a-crime-map-for-predictive-policing-using-xgboost-in-python">Creating a Crime Map for Predictive Policing using XGBoost in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this practical tutorial, we&#8217;ll construct an XGBoost multi-class classifier to predict crime types in San Francisco. Urban crime in a city like San Francisco is a dynamic and multifaceted issue that can vary dramatically by location, time, and other factors. Our aim is to develop a predictive algorithm capable of forecasting specific crime types from given location and time parameters. The end product is an interactive San Francisco crime map providing a snapshot of crime hotspots throughout the city.</p>



<p>Law enforcement agencies, like the San Francisco Police Department, use similar maps for strategic resource allocation to curb crime rates effectively. Additionally, this SF crime map will underscore crime clusters &#8211; areas notorious for particular types of crime incidents. By the end of this tutorial, you&#8217;ll have a deeper understanding of using machine learning in practical scenarios and aiding real-world decision-making.</p>



<p>The code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_262c97-e5"><a class="kb-button kt-button button kb-btn_e6ce86-27 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/018%20Forecasting%20Criminal%20Activity%20in%20San%20Francisco.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_3b4a4f-fe kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="500" height="487" data-attachment-id="12476" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" data-orig-size="500,487" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" src="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" alt="" class="wp-image-12476" srcset="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png 500w, https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png 300w" sizes="(max-width: 500px) 100vw, 500px" /><figcaption class="wp-element-caption">Crime doesn&#8217;t sleep in San Francisco. That&#8217;s why predictive policing can make a real impact. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Before starting the Python coding part, ensure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>



<li><em><a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn</a></em></li>
</ul>



<p>In addition, we will be using XGBoost (&#8216;xgboost&#8217;) and the machine learning library scikit-learn. </p>



<p>You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p>We begin by downloading the San Francisco crime challenge data from kaggle.com. Once you have downloaded the dataset, place the CSV file (train.csv) into your Python working folder.</p>



<p>The dataset was collected by the San Francisco Police Department between 2003 and 2015. According to the data description of the SF crime challenge, the dataset contains the following variables:</p>



<ul class="wp-block-list">
<li><strong>Dates</strong>: timestamp of the crime incident</li>



<li><strong>Category</strong>: Category of the crime incident (only in train.csv) that we will use as the target variable</li>



<li><strong>Descript</strong>: detailed description of the crime incident&nbsp;(only in train.csv)</li>



<li><strong>DayOfWeek:</strong> the day of the week</li>



<li><strong>PdDistrict</strong>: the name of the Police Department District</li>



<li><strong>Resolution</strong>: how the crime incident was resolved&nbsp;(only in train.csv)</li>



<li><strong>Address</strong>: the approximate street address of the crime incident&nbsp;</li>



<li><strong>X</strong>: Longitude</li>



<li><strong>Y</strong>: Latitude</li>
</ul>



<p>The next step is to load the data into a dataframe. Then we use the head() command to print the first five records and verify that the data has loaded correctly.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import plotly.express as px

# The Data is part of the Kaggle Competition: https://www.kaggle.com/c/sf-crime/data
df_base = pd.read_csv(&quot;data/crime/sf-crime/train.csv&quot;)

print(df_base.describe())
df_base.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		X              Y
count  	878049.000000  878049.000000
mean     -122.422616      37.771020
std         0.030354       0.456893
min      -122.513642      37.707879
25%      -122.432952      37.752427
50%      -122.416420      37.775421
75%      -122.406959      37.784369
max      -120.500000      90.000000</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Dates				Category	Descript			DayOfWeek	PdDistrict	Resolution	Address				X			Y
0	2015-05-13 23:53:00	WARRANTS	WARRANT ARREST		Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
1	2015-05-13 23:53:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
2	2015-05-13 23:33:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	VANNESS AV... ST	-122.424363	37.800414
3	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT...	Wednesday	NORTHERN	NONE		1500 Block... ST	-122.426995	37.800873
4	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT ...	Wednesday	PARK		NONE		100 Block... ST		-122.438738	37.771541</pre></div>



<p>If the data was loaded correctly, you should see the first five records of the dataframe, as shown above.</p>



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p>At the beginning of a new project, we usually don&#8217;t understand the data well and need to acquire that understanding. Therefore, next, we will explore the data and familiarize ourselves with its characteristics. </p>



<p>The following examples highlight different characteristics of the data. For instance, box plots and a correlation matrix can reveal how variables relate to one another, such as weekdays and crime categories. Feel free to create more charts.</p>
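<p>To make this concrete, the sketch below cross-tabulates weekdays against crime categories. It uses a tiny hypothetical frame for illustration; in the tutorial, you would pass <code>df_base</code> instead.</p>

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_base (the real data has ~878k rows)
df = pd.DataFrame({
    'DayOfWeek': ['Monday', 'Monday', 'Sunday', 'Friday'],
    'Category':  ['LARCENY/THEFT', 'ASSAULT', 'LARCENY/THEFT', 'LARCENY/THEFT'],
})

# Relative frequency of each category per weekday (each row sums to 1)
ct = pd.crosstab(df['DayOfWeek'], df['Category'], normalize='index')
print(ct)
```

<p>On the full dataset, passing the resulting table to <code>sns.heatmap(ct)</code> visualizes the same relationship at a glance.</p>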



<h4 class="wp-block-heading" id="h-2-1-prediction-labels">2.1 Prediction Labels</h4>



<p>Running the code below shows a bar plot of the prediction labels. The plot shows the frequency in which the class labels occur in the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># print the value counts of the categories
plt.figure(figsize=(15,5))
ax = sns.countplot(x = df_base['Category'], orient='v', order = df_base['Category'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3167" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-13-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" data-orig-size="910,467" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-13" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" alt="crime types in San Francisco" class="wp-image-3167" width="914" height="470" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 910w, https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 300w, https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 768w" sizes="(max-width: 914px) 100vw, 914px" /></figure>



<p>As shown above, our class labels are highly imbalanced: a few categories such as LARCENY/THEFT dominate, while many others occur rarely. We need to keep this in mind when evaluating model performance, because plain accuracy favors the majority classes.</p>
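<p>One way to put a number on the imbalance is to compare the counts of the most and least frequent classes. The sketch below uses a small hypothetical label column; on the real data, you would pass <code>df_base['Category']</code>.</p>

```python
import pandas as pd

# Hypothetical label column standing in for df_base['Category']
labels = pd.Series(['LARCENY/THEFT'] * 6 + ['ASSAULT'] * 3 + ['ARSON'] * 1)

# value_counts sorts descending, so the first/last entries are the extremes
counts = labels.value_counts()
imbalance_ratio = counts.iloc[0] / counts.iloc[-1]
print(imbalance_ratio)  # 6.0: the top class occurs six times as often as the rarest
```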



<h4 class="wp-block-heading" id="h-2-2-when-a-crime-occured-considering-dates-and-time">2.2 When a Crime Occurred &#8211; Considering Dates and Time</h4>



<p>We assume that the time at which a crime occurs influences the type of crime. For this reason, we look at how crimes are distributed across different days of the week and times of day. First, we look at crime numbers per weekday.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print Crime Counts per Weekday
plt.figure(figsize=(6,3))
# Horizontal bar plot: weekday labels sit on the y-axis, so no tick rotation is needed
ax = sns.countplot(y = df_base['DayOfWeek'], orient='h', order = df_base['DayOfWeek'].value_counts().index)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3164" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-12-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" data-orig-size="437,238" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" alt="" class="wp-image-3164" width="669" height="364" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png 437w, https://www.relataly.com/wp-content/uploads/2021/04/image-12.png 300w" sizes="(max-width: 669px) 100vw, 669px" /></figure>



<p>Fewer crimes happen on Sundays, and most happen on Fridays. So it seems that even criminals like to have a weekend. Next, let&#8217;s look at the time of day at which certain crimes are reported. For the sake of clarity, we limit the plot to a subset of categories.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Convert the time to minutes
df_base['Hour_Min'] = pd.to_datetime(df_base['Dates']).dt.hour  + pd.to_datetime(df_base['Dates']).dt.minute / 60

# Print Crime Counts per Time and Category
df_base_filtered = df_base[df_base['Category'].isin([
    'PROSTITUTION', 
    'VEHICLE THEFT', 
    'DRUG/NARCOTIC', 
    'WARRANTS', 
    'BURGLARY', 
    'FRAUD', 
    'ASSAULT',
    'LARCENY/THEFT',
    'VANDALISM'])]

# displot is figure-level and creates its own figure; height and aspect control its size
ax = sns.displot(x = 'Hour_Min', hue=&quot;Category&quot;, data = df_base_filtered, kind=&quot;kde&quot;, height=8, aspect=1.5)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3174" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-17-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" data-orig-size="983,568" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" alt="different crime types in San Francisco and how often they occur during the day" class="wp-image-3174" width="906" height="522"/></figure>



<p>The time at which a crime happens also affects the likelihood of certain types. FRAUD, for example, rarely occurs at night and mostly during the day. Overall, criminal activity peaks in the afternoon and around midnight. Other crimes, such as VEHICLE THEFT, occur mainly at night and in the late afternoon, but less often in the morning.</p>



<p>If you want to gain an overview of additional features, you can use the pair plot function. Because our dataset is large, we reduce the computation time by plotting 1/100 of the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sns.pairplot(data = df_base_filtered[0::100], height=4, aspect=1.5, hue='Category')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8386" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/pairplot-by-category-san-francisco-crime-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png" data-orig-size="1464,844" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="pairplot-by-category-san-francisco-crime-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png" src="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map-1024x590.png" alt="pairplot by category, san francisco crime map" class="wp-image-8386" width="1080" height="622" srcset="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 1464w" sizes="(max-width: 1080px) 100vw, 1080px" /></figure>



<h4 class="wp-block-heading" id="h-2-3-where-a-crime-occured-considering-address">2.3 Where a Crime Occurred &#8211; Considering Address</h4>



<p>Next, we look at the address information, from which we can often extract additional information. We do this by printing some sample address values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Extracting information from the streetnames
for i in df_base['Address'][0:10]:
    print(i)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">OAK ST / LAGUNA ST
OAK ST / LAGUNA ST
VANNESS AV / GREENWICH ST
1500 Block of LOMBARD ST
100 Block of BRODERICK ST
0 Block of TEDDY AV
AVALON AV / PERU AV
KIRKWOOD AV / DONAHUE ST
600 Block of 47TH AV
JEFFERSON ST / LEAVENWORTH ST</pre></div>



<p>The street names alone are not very helpful. However, the address data does provide additional information. For example, it tells us whether the location is a street intersection, and it contains the type of street. This information is valuable because we can extract parts of the text and use them as separate features.</p>
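<p>The extraction itself boils down to simple substring matching on the address column. The sketch below, with a few made-up sample addresses, shows the idea that Step #3 applies to the full dataset:</p>

```python
import pandas as pd

# Made-up sample addresses in the dataset's two formats
addr = pd.Series(['OAK ST / LAGUNA ST', '1500 Block of LOMBARD ST', 'AVALON AV / PERU AV'])

is_crossing = addr.str.contains(' / ')     # street intersection
is_block = addr.str.contains(' Block')     # block-level address
is_avenue = addr.str.contains(' AV')       # street-type hint

print(list(is_crossing))  # [True, False, True]
```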



<p>We could do a lot more, but we&#8217;ve got a good enough idea of the data.</p>



<h3 class="wp-block-heading" id="h-step-3-data-preprocessing">Step #3 Data Preprocessing</h3>



<p>Probably the most exciting and important aspect of model development is feature engineering. The right features often yield greater leaps in performance than tuning model parameters.</p>



<h4 class="wp-block-heading" id="h-3-1-remarks-on-data-preprocessing-for-xgboost">3.1 Remarks on Data Preprocessing for XGBoost </h4>



<p>When preprocessing the data, it is helpful to know which algorithms we will use, because some algorithms are picky about the shape of the data. We will prepare the data to train a gradient-boosting model (XGBoost). This algorithm uses an ensemble of decision trees that operates on numeric and Boolean inputs, but not on raw categorical data. Therefore, we need to encode our categorical features, and we also need to map the categorical labels to integer values.</p>
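<p>As a minimal sketch of the label encoding, <code>pd.factorize</code> maps each distinct label to an integer code (shown here on a hypothetical label series):</p>

```python
import pandas as pd

# factorize returns the integer codes plus an index of the unique labels,
# in order of first appearance
codes, uniques = pd.factorize(pd.Series(['THEFT', 'ASSAULT', 'THEFT', 'FRAUD']))

print(list(codes))    # [0, 1, 0, 2]
print(list(uniques))  # ['THEFT', 'ASSAULT', 'FRAUD']
```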



<p>We don&#8217;t need to scale the continuous feature variables because gradient boosting and decision trees, generally, are not sensitive to variables that have different scales.</p>
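<p>If you want to convince yourself of this, the small experiment below (synthetic data, with a plain scikit-learn decision tree standing in for the boosted trees) fits the same model on raw and rescaled features and compares the predictions:</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 200 samples, 3 features, a simple linear decision rule
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit one tree on the raw features and one on features scaled by 1000
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X * 1000.0, y)

# A multiplicative rescaling only shifts the split thresholds,
# so both trees make identical predictions
print((tree_raw.predict(X) == tree_scaled.predict(X * 1000.0)).all())
```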



<h4 class="wp-block-heading" id="h-3-2-feature-engineering">3.2 Feature Engineering</h4>



<p>Based on the data exploration that we have done in the previous section, we create three feature types:</p>



<ul class="wp-block-list">
<li><strong>Date &amp; Time:</strong> When a crime happens is an essential signal. When there is a lot of traffic on the street, for example, traffic-related crimes become more likely. Similarly, on Saturdays, more people visit the nightlife districts, which attracts certain crimes, e.g., drug-related ones. Therefore, we will create separate features for the time, the day, the month, and the year.</li>



<li><strong>Address</strong>: As mentioned, we will extract additional features from the address column. First, we create separate features for the street type (for example, ST, AV, WY, TR, DR). We also check whether the address contains the word &#8220;Block&#8221; and whether the address is a street crossing.</li>



<li><strong>Latitude &amp; Longitude</strong>: We will transform the latitude and longitude values into polar coordinates. We will also remove some outliers from the dataset whose latitude is far off the grid. This will make it easier for our model to make sense of the location.</li>
</ul>



<p>Considering these features, the primary input to our crime-type prediction model is the information on when and where a crime occurs. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Processing Function for Features
def cart2polar(x, y):
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

def preprocessFeatures(dfx):
    
    # Time Feature Engineering
    df = pd.get_dummies(dfx[['DayOfWeek' , 'PdDistrict']])
    df['Hour_Min'] = pd.to_datetime(dfx['Dates']).dt.hour + pd.to_datetime(dfx['Dates']).dt.minute / 60
    # We add a feature that contains the exponential of the time of day
    df['Hour_Min_Exp'] = np.exp(df['Hour_Min'])
    
    df['Day'] = pd.to_datetime(dfx['Dates']).dt.day
    df['Month'] = pd.to_datetime(dfx['Dates']).dt.month
    df['Year'] = pd.to_datetime(dfx['Dates']).dt.year

    month_one_hot_encoded = pd.get_dummies(pd.to_datetime(dfx['Dates']).dt.month, prefix='Month')
    df = pd.concat([df, month_one_hot_encoded], axis=1, join=&quot;inner&quot;)
    
    # Convert Cartesian Coordinates to Polar Coordinates
    df[['X', 'Y']] = dfx[['X', 'Y']] # we keep the original coordinates as additional features
    df['dist'], df['phi'] = cart2polar(dfx['X'], dfx['Y'])
  
    # Extracting Street Types
    df['Is_ST'] = dfx['Address'].str.contains(&quot; ST&quot;, case=True)
    df['Is_AV'] = dfx['Address'].str.contains(&quot; AV&quot;, case=True)
    df['Is_WY'] = dfx['Address'].str.contains(&quot; WY&quot;, case=True)
    df['Is_TR'] = dfx['Address'].str.contains(&quot; TR&quot;, case=True)
    df['Is_DR'] = dfx['Address'].str.contains(&quot; DR&quot;, case=True)
    df['Is_Block'] = dfx['Address'].str.contains(&quot; Block&quot;, case=True)
    df['Is_crossing'] = dfx['Address'].str.contains(&quot; / &quot;, case=True)
    
    return df

# Processing Function for Labels
def encodeLabels(dfx):
    # pd.factorize returns a tuple of (integer codes, unique category labels)
    factor = pd.factorize(dfx['Category'])
    return factor

# Remove Outliers by Latitude
df_cleaned = df_base[df_base['Y']&lt;70]

# Encode Labels as Integer
factor = encodeLabels(df_cleaned)
y_df = factor[0]
labels = list(factor[1])
# for val, i in enumerate(labels):
#     print(val, i)</pre></div>
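<p>As a quick sanity check of the cart2polar helper (illustrative only, using a made-up point rather than real coordinates), the classic 3-4-5 triangle should yield a radius of 5:</p>

```python
import numpy as np

def cart2polar(x, y):
    # Same helper as above: Cartesian (x, y) -> polar (radius, angle)
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

dist, phi = cart2polar(3.0, 4.0)
print(dist)  # 5.0
print(phi)   # ~0.927 radians
```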



<p>We could also try to further improve our features by using additional data sources, such as weather data. However, there is no guarantee that this will improve the model results, and in the case of this crime dataset, it did not. Therefore, we have omitted this part.</p>



<h3 class="wp-block-heading" id="h-step-4-visualize-crime-types-on-a-map-of-san-francisco">Step #4 Visualize Crime Types on a Map of San Francisco</h3>



<p>Next, we create a San Francisco crime map using the cartesian coordinates that indicate where a crime has occurred. First, we plot only the data points without a geographical map. Later, we will use these spatial data to create a dot plot and overlay it with a map of San Francisco. Visualizing the crime types on a map helps us understand how they are distributed across the city. </p>



<h4 class="wp-block-heading" id="h-4-1-plot-crime-types-using-a-scatter-plot">4.1 Plot Crime Types using a Scatter Plot</h4>



<p>Next, we want to gain an overview of possible spatial patterns and hotspots. We expect certain crimes to cluster in particular streets and neighborhoods, while occurring only rarely in other parts of the city, such as the more expensive residential areas. To gain this overview of the crime distribution in San Francisco, we use a scatter plot to display the crime coordinates on a blank chart. </p>



<p>Running the code below creates the crime map of San Francisco with all crime types. Depending on the speed of your machine, the creation of the map may take several minutes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot Criminal Activities by Lat and Long
df_filtered = df_cleaned.sample(frac=0.05)  
#df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05) # to filter 

groups = df_filtered.groupby('Category')

fig, ax = plt.subplots(sharex=False, figsize=(20, 12))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name, alpha=0.9)
ax.legend()
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8389" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/output-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/output.png" data-orig-size="1170,683" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/output.png" src="https://www.relataly.com/wp-content/uploads/2022/05/output-1024x598.png" alt="Crime Map of San Francisco (sf crime map) - Kaggle Crime Prediction Challenge. Classification with XGBoost" class="wp-image-8389" width="1111" height="649" srcset="https://www.relataly.com/wp-content/uploads/2022/05/output.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 1170w" sizes="(max-width: 1111px) 100vw, 1111px" /></figure>



<p>The plot shows that certain streets in San Francisco are more prone to specific crime types than others. It is also clear that there are certain crime hotspots in the city, especially in the center. We can also see that few crimes are reported in public park areas. </p>



<h4 class="wp-block-heading" id="h-4-2-create-a-crime-map-of-san-francisco-using-plotly">4.2 Create a Crime Map of San Francisco using Plotly</h4>



<p>Next, we will create a San Francisco crime map using the Plotly Python library. Because the library can only render a limited number of points at once, we will reduce our data to a 1% sample (optionally, you can also restrict it to a few selected crime types). </p>



<p>Running the code below renders the SF crime map in your browser or notebook. The result is a zoomable geographic map of San Francisco that shows how the selected crime types are distributed across the city.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 4.2 Create a Crime Map of San Francisco using Plotly
# Limit the data to a fraction and selected categories
df_filtered = df_cleaned.sample(frac=0.01) 
fig = px.scatter_mapbox(df_filtered, lat=&quot;Y&quot;, lon=&quot;X&quot;, hover_name=&quot;Category&quot;, color='Category', hover_data=[&quot;Y&quot;, &quot;X&quot;], zoom=12, height=800)
fig.update_layout(mapbox_style=&quot;open-street-map&quot;)
fig.update_layout(margin={&quot;r&quot;:0,&quot;t&quot;:0,&quot;l&quot;:0,&quot;b&quot;:0})
fig.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8401" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/map-sf-crime/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png" data-orig-size="1600,650" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="map-sf-crime" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png" src="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime-1024x416.png" alt="Crime Map of San Francisco (SF Crime Map) - Kaggle Crime Prediction Challenge. Crime Classification XGBoost" class="wp-image-8401" width="1097" height="445" srcset="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1536w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1600w" sizes="(max-width: 1097px) 100vw, 1097px" /></figure>



<p>The SF crime map shows different types of crimes, including prostitution, vehicle theft, and fraud. The interactive map allows you to change zoom levels and filter the type of crime displayed on the map. For example, if you filter DRUG/NARCOTIC-related crimes, you can see that these crimes mainly occur in the city center near the financial district and the nightlife area.</p>
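<p>Restricting the map to a few crime types is a simple pandas filter. The snippet below sketches this on a made-up miniature of the crime dataframe (df_demo and its rows are hypothetical, standing in for df_cleaned):</p>

```python
import pandas as pd

# Hypothetical miniature of the crime dataframe
df_demo = pd.DataFrame({
    'Category': ['DRUG/NARCOTIC', 'FRAUD', 'VEHICLE THEFT', 'DRUG/NARCOTIC'],
    'X': [-122.41, -122.42, -122.40, -122.41],
    'Y': [37.77, 37.78, 37.76, 37.77],
})

# Keep only the crime types we want to display on the map
subset = df_demo[df_demo['Category'].isin(['DRUG/NARCOTIC', 'FRAUD'])]
```

The filtered frame could then be passed to px.scatter_mapbox in place of the full sample.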



<h3 class="wp-block-heading" id="h-step-5-split-the-data">Step #5 Split the Data</h3>



<p>Before training our predictive model, we will split our data into separate datasets for training and testing. For this purpose, we use the train_test_split function of scikit-learn and configure a split ratio of 70%. Then we output the data, which we employ in the next step to train and validate a model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create train_df &amp; test_df
x_df = preprocessFeatures(df_cleaned).copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		DayOfWeek_Friday	DayOfWeek_Monday	DayOfWeek_Saturday	DayOfWeek_Sunday	DayOfWeek_Thursday	DayOfWeek_Tuesday	DayOfWeek_Wednesday	PdDistrict_BAYVIEW	PdDistrict_CENTRAL	PdDistrict_INGLESIDE	...	Y			dist		phi			Is_ST	Is_AV	Is_WY	Is_TR	Is_DR	Is_Block	Is_crossing
276998	0					0					0					0					0					1					0					0					0					0						...	37.785023	128.110900	2.842200	True	False	False	False	False	True		False
81579	0					0					0					0					0					1					0					0					0					0						...	37.748470	128.185052	2.842677	False	True	False	False	False	True		False
206676	0					0					0					1					0					0					0					0					0					0						...	37.762744	128.113657	2.842389	True	False	False	False	False	True		False
732006	0					0					0					0					0					0					1					0					0					0						...	37.784140	128.109653	2.842204	True	False	False	False	False	False		True
796194	1					0					0					0					0					0					0					0					0					0						...	37.791333	128.125982	2.842185	True	False	False	False	False	True		False
5 rows × 45 columns</pre></div>
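<p>Because the crime categories are heavily imbalanced, it can also help to pass stratify=y_df to train_test_split so that both splits preserve the class proportions. This goes beyond the code above; a minimal sketch with synthetic, imbalanced labels:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 14 samples of class 0, 6 of class 1
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 14 + [1] * 6)

# stratify=y keeps roughly the original 70/30 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)
```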



<h3 class="wp-block-heading" id="h-step-6-train-a-random-forest-classifier">Step #6 Train a Random Forest Classifier</h3>



<p>Now that we have prepared the data, we can train the predictive models. In the first step, we train a basic model based on the Random Forest algorithm. The Random Forest is a robust algorithm that can handle both regression and classification problems. One of our <a href="https://www.relataly.com/anyone-about-to-leave-predicting-the-customer-churn-of-a-telecommunications-provider/2378/" target="_blank" rel="noreferrer noopener">recent articles provides more information on Random Forests</a> and how you can find the optimal configuration of their hyperparameters. In this tutorial, we use the Random Forest to establish a baseline against which we can measure the performance of our XGBoost model. We therefore use the Random Forest with a simple parameter configuration without tuning the hyperparameters.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a single random forest classifier - parameters are a best guess
clf = RandomForestClassifier(max_depth=100, random_state=0, n_estimators = 200)
clf.fit(x_train, y_train.ravel())
y_pred = clf.predict(x_test)

results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output exceeds the size limit. Open the full output data in a text editor
              precision    recall  f1-score   support

           0       0.15      0.10      0.12     12657
           1       0.29      0.35      0.32     37898
           2       0.38      0.63      0.47     52237
           3       0.46      0.40      0.43     16136
           4       0.16      0.08      0.10     13426
           5       0.25      0.21      0.23     27798
           6       0.10      0.04      0.06      6850
           7       0.23      0.22      0.23     23087
           8       0.19      0.12      0.15      2586
           9       0.20      0.13      0.15     10942
          10       0.08      0.03      0.05      9559
          11       0.00      0.00      0.00      1300
          12       0.20      0.10      0.14      3200
          13       0.37      0.43      0.40     16282
          14       0.02      0.02      0.02      1350
          15       0.01      0.00      0.00      2912
          16       0.05      0.03      0.04      2217
          17       0.61      0.52      0.56      7865
          18       0.11      0.06      0.08      4954
          19       0.04      0.03      0.03       723
          20       0.28      0.19      0.23       581
          21       0.05      0.02      0.03       708
          22       0.25      0.13      0.17      1333
...
    accuracy                           0.31    263395
   macro avg       0.15      0.12      0.13    263395
weighted avg       0.28      0.31      0.28    263395</pre></div>



<p>The baseline model is a random forest classifier that achieves 31% accuracy on the test dataset.</p>



<h3 class="wp-block-heading" id="h-step-7-train-an-xgboost-classifier">Step #7 Train an XGBoost Classifier</h3>



<p>Now that we have a baseline model, we can train our gradient boosting classifier using the XGBoost package. We expect this model to perform better than the baseline. </p>



<h4 class="wp-block-heading" id="h-7-1-about-gradient-boosting">7.1 About Gradient Boosting</h4>



<p>XGBoost is an implementation of a <a href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/" target="_blank" rel="noreferrer noopener">gradient-boosting algorithm</a> that builds a decision-tree-based ensemble. The algorithm searches for an optimal ensemble of trees: it iteratively adds trees to the model, each new tree fitted to reduce the prediction error of the ensemble built so far. The algorithm repeats these steps until it can make no further improvements. Thus, each new tree is not trained against the original targets but against the previous model&#8217;s residuals (prediction errors).</p>
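<p>To make the residual-fitting idea concrete, here is a tiny, hand-rolled boosting loop on synthetic regression data. This is an illustration of the principle only, not how XGBoost is implemented internally:</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel()  # synthetic target

pred = np.zeros_like(y)  # start from a zero model
eta = 0.3                # learning rate, analogous to XGBoost's eta
for _ in range(30):
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, y - pred)          # fit the residuals of the current ensemble
    pred += eta * tree.predict(X)  # add the new tree with a learning rate

mse = np.mean((y - pred) ** 2)
```

After 30 rounds, the ensemble's training error is far below that of a constant-mean predictor, even though each individual tree is very weak.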



<p id="h-xgboost-an-implementation-of-a-gradient-boosting-algorithm-that-creates-a-random-forest-which-is-an-ensemble-of-decision-trees-the-basic-idea-of-a-decision-forest-is-to-have-an-ensemble-of-decision-trees-that-come-to-a-conclusion-on-the-prediction-label-by-casting-a-vote-gradient-boosting-is-used-to-find-an-optimal-ensemble-of-trees-and-parameters-boosting-is-the-process-of-iteratively-adding-trees-to-correct-the-prediction-error-of-a-previous-ensemble-of-trees-until-no-further-improvements-can-be-achieved-the-models-are-thereby-trained-to-predict-the-residuals-prediction-errors-of-the-previous-model">But XGBoost does more! It is an extreme version of gradient boosting that uses additional optimization techniques to achieve the best result with minimal effort. In contrast to the random decision forest, the XGBoost classification algorithm determines an optimal number of trees in the training process. We do not have to specify this number in advance.</p>



<p id="h-xgboost-an-implementation-of-a-gradient-boosting-algorithm-that-creates-a-random-forest-which-is-an-ensemble-of-decision-trees-the-basic-idea-of-a-decision-forest-is-to-have-an-ensemble-of-decision-trees-that-come-to-a-conclusion-on-the-prediction-label-by-casting-a-vote-gradient-boosting-is-used-to-find-an-optimal-ensemble-of-trees-and-parameters-boosting-is-the-process-of-iteratively-adding-trees-to-correct-the-prediction-error-of-a-previous-ensemble-of-trees-until-no-further-improvements-can-be-achieved-the-models-are-thereby-trained-to-predict-the-residuals-prediction-errors-of-the-previous-model">A disadvantage of XGBoost is that it tends to overfit the data. Therefore, testing against unseen data is essential. This tutorial will test only against a single test sample for simplicity, but using cross-validation would be a better choice.</p>



<h4 class="wp-block-heading" id="h-7-2-train-the-xgboost-classifier">7.2 Train the XGBoost Classifier</h4>



<p>Various gradient boosting implementations are available for Python, including one from scikit-learn. However, scikit-learn&#8217;s classic gradient boosting classifier does not support multi-threaded training, which makes the training process slower than necessary. For this reason, we will use the gradient boosting classifier from the XGBoost package.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Configure the XGBoost model
param = {'booster': 'gbtree', 
         'tree_method': 'gpu_hist',
         'predictor': 'gpu_predictor',
         'max_depth': 140, 
         'eta': 0.3, 
         'objective': 'multi:softmax', 
         'eval_metric': 'mlogloss', 
         'n_estimators': 30,
         'feature_selector': 'cyclic'
        }

# Unpack the parameter dict into keyword arguments
xgb_clf = XGBClassifier(**param)
xgb_clf.fit(x_train, y_train.ravel())
score = xgb_clf.score(x_test, y_test.ravel())
print(score)

# Create predictions on the test dataset
y_pred = xgb_clf.predict(x_test)

# Print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output exceeds the size limit. Open the full output data in a text editor
0.30852142219859907
              precision    recall  f1-score   support

           0       0.17      0.01      0.02     12657
           1       0.30      0.42      0.35     37898
           2       0.33      0.72      0.46     52237
           3       0.31      0.27      0.29     16136
           4       0.21      0.03      0.05     13426
           5       0.24      0.18      0.21     27798
           6       0.17      0.01      0.01      6850
           7       0.21      0.19      0.20     23087
           8       0.26      0.01      0.02      2586
           9       0.22      0.08      0.12     10942
          10       0.13      0.00      0.00      9559
          11       0.07      0.00      0.01      1300
          12       0.20      0.08      0.11      3200
          13       0.34      0.43      0.38     16282
          14       0.00      0.00      0.00      1350
          15       0.12      0.00      0.01      2912
          16       0.15      0.02      0.03      2217
          17       0.57      0.34      0.43      7865
          18       0.19      0.03      0.05      4954
          19       0.00      0.00      0.00       723
          20       0.50      0.24      0.32       581
          21       0.10      0.01      0.01       708
...
    accuracy                           0.31    263395
   macro avg       0.18      0.11      0.11    263395
weighted avg       0.27      0.31      0.25    263395
</pre></div>



<p>Now that we have trained our classification model, let&#8217;s see how it performs. For this purpose, we generate predictions (y_pred) on the test dataset (x_test). Afterward, we use the predictions and the true values (y_test) to create a classification report.</p>



<p>Our model achieves an accuracy score of 31%. At first glance, this might not look impressive, but considering that we have 39 categories and only sparse information available, this performance is quite respectable. </p>
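<p>One way to put the 31% into perspective is to compare it with the naive baseline of always predicting the most frequent class, using the support counts from the classification report above:</p>

```python
# Support of the most frequent class (class 2) and the total test size,
# taken from the classification report above
majority_support = 52237
total = 263395

# Accuracy of always predicting the most frequent class
baseline_accuracy = majority_support / total
print(round(baseline_accuracy, 3))  # ~0.198, i.e. about 20%
```

So the model's 31% is clearly better than the roughly 20% a constant majority-class predictor would achieve.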



<h3 class="wp-block-heading" id="h-step-8-measure-model-performance">Step #8 Measure Model Performance</h3>



<p>So how well does our XGBoost model perform? To measure the performance of our model, we create a confusion matrix that visualizes the performance of the XGBoost classifier. If you want to learn more about measuring the performance of classification models, check out<a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener"> this tutorial on measuring classification performance</a>.</p>



<p>Running the code below creates the confusion matrix that shows the number of correct and false predictions for each crime category.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test.reshape(-1), y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index = np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (16,12))
plt.tight_layout()
sns.set(font_scale=1.4) #for label size
sns.heatmap(df_cm, cbar=True, cmap= &quot;inferno&quot;, annot=False, fmt='.0f' #, annot_kws={&quot;size&quot;: 13}
           )</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3213" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-22-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" data-orig-size="903,709" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-22" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" alt="Evaluating the performance of our XGboost classifier; sfo crime map" class="wp-image-3213" width="669" height="525" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 903w, https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 300w, https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 768w" sizes="(max-width: 669px) 100vw, 669px" /></figure>



<p>The confusion matrix shows that our model frequently predicts crime category two and neglects the other crime types. The reason is the uneven distribution of crime types in the training data. As a result, when we evaluate the model, we need to pay attention to the importance of the different crime types. For example, we might train the model to predict certain crime types accurately, although this might come at a lower accuracy when predicting other crime types. However, such optimizations depend on the technical context and the goals one wants to achieve with the prediction model. </p>
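<p>One common countermeasure for such class imbalance is to weight training samples inversely to their class frequency. A sketch using scikit-learn's compute_sample_weight (the labels below are hypothetical; the resulting weights could be passed to a classifier's fit(..., sample_weight=...), though whether this helps depends on the evaluation goal):</p>

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical, heavily skewed labels: class 2 dominates
y_demo = np.array([2, 2, 2, 2, 1, 1, 0])

# 'balanced' assigns each sample n_samples / (n_classes * class_count)
weights = compute_sample_weight(class_weight='balanced', y=y_demo)
# Samples of rare classes receive larger weights than the majority class
```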



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>This tutorial has presented the machine learning use case &#8220;Predictive Policing&#8221; and showed how to implement it in Python. We have trained an XGBoost model that predicts crime types in San Francisco based on the information on when and where specific crimes have occurred. We also illustrated our data on an interactive crime map of San Francisco with the Plotly Python library. The Crime Map is an intuitive way of visualizing crime in a city and highlighting particular hotspots. Finally, we have used the prediction model to make test predictions and evaluate the model performance against other algorithms, such as a classic Random Decision Forest. The XGBoost model achieves a prediction accuracy of about 31%—a respectable performance, considering that the prediction problem involves 39 crime classes.</p>



<p>We hope this tutorial was helpful. If you have any questions or suggestions on what we could improve, feel free to post them in the comments. We appreciate your feedback.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8410" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-12-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" data-orig-size="658,592" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" alt="Crime Map of San Francisco - Kaggle Crime Prediction Challenge. Crim Classification XGBoost" class="wp-image-8410" width="346" height="311" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png 658w, https://www.relataly.com/wp-content/uploads/2022/05/image-12.png 300w" sizes="(max-width: 346px) 100vw, 346px" /><figcaption class="wp-element-caption">Predictive policing with machine learning &#8211; Crime map of San Francisco, created with Python and Plotly</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p>Looking for more exciting map visualizations? Consider the relataly tutorial on <a href="https://www.relataly.com/visualize-covid-19-data-on-a-geographic-heat-maps/291/" target="_blank" rel="noreferrer noopener">visualizing COVID-19 data on geographic heatmaps using GeoPandas</a>.</p>



<p>The post <a href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/">Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2960</post-id>	</item>
		<item>
		<title>Forecasting Beer Sales with ARIMA in Python</title>
		<link>https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/</link>
					<comments>https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Wed, 03 Feb 2021 22:23:08 +0000</pubDate>
				<category><![CDATA[ARIMA]]></category>
		<category><![CDATA[Logistics]]></category>
		<category><![CDATA[Manufacturing]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[statsmodels]]></category>
		<category><![CDATA[Stock Market Forecasting]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Time Series Forecasting]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[AI in Business]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Univariate Time Series Forecasting]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2884</guid>

					<description><![CDATA[<p>Time series analysis and forecasting is a tough nut to crack, but the ARIMA model has been cracking it for decades. ARIMA, short for &#8220;Auto-Regressive Integrated Moving Average,&#8221; is a powerful statistical modeling technique for time series analysis. It&#8217;s particularly effective when the time series you&#8217;re analyzing follows a clear pattern, like seasonal changes in ... <a title="Forecasting Beer Sales with ARIMA in Python" class="read-more" href="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/" aria-label="Read more about Forecasting Beer Sales with ARIMA in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/">Forecasting Beer Sales with ARIMA in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Time series analysis and forecasting is a tough nut to crack, but the ARIMA model has been cracking it for decades. ARIMA, short for &#8220;Auto-Regressive Integrated Moving Average,&#8221; is a powerful statistical modeling technique for time series analysis. It&#8217;s particularly effective when the time series you&#8217;re analyzing follows a clear pattern, like seasonal changes in weather or sales. ARIMA has been used to forecast everything from beer sales to order quantities, and this tutorial will show you how to build your own ARIMA model in Python. You&#8217;ll be making predictions like a pro in no time!</p>



<p>This tutorial proceeds in two parts: The first part covers the concepts behind ARIMA. You will learn how ARIMA works, what stationarity means, and when it is appropriate to use ARIMA. The second part is a hands-on Python tutorial that applies auto-ARIMA to the sales forecasting domain. We&#8217;ll be working with a time series of beer sales, and our goal is to predict how the beer sales quantities will evolve in the coming years. First, we check whether the time series is stationary. Then we train an ARIMA forecasting model. Finally, we use the model to produce a sales forecast and measure the model&#8217;s performance.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:12px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">About Sales Forecasting</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Sales forecasting is a crucial business strategy that involves predicting future sales volumes for a product (for example, beer) or service. It leverages sophisticated statistical and analytical techniques, such as time series analysis or machine learning algorithms, to scrutinize historical sales data. By identifying trends and patterns within this data, businesses can make informed predictions about their future sales performance.</p>



<p>This strategic forecasting plays a pivotal role in business operations. It is instrumental in guiding key decisions surrounding production, inventory management, staffing, and various other operational elements. By honing in on accurate sales forecasting, businesses can strike the perfect balance &#8211; maintaining enough inventory to meet customer demand without overproducing or overstocking. This equilibrium ensures a smooth flow in the supply chain and avoids unnecessary costs tied to excess production or storage.</p>



<p>Furthermore, sales forecasting serves as a roadmap for business growth. It aids in identifying potential market opportunities and predicting future sales revenue. This valuable foresight enables businesses to strategically plan their expansion, ensuring resources are optimally utilized and future goals are met. With this in-depth understanding of sales forecasting, businesses can stay ahead of market trends, navigate through business challenges, and ultimately steer towards success.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="506" height="508" data-attachment-id="12602" data-permalink="https://www.relataly.com/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png" data-orig-size="506,508" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="brewery arima beer sales forecasting python tutorial machine learning relataly-min" data-image-description="&lt;p&gt;Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects.  Image created with Midjourney&lt;/p&gt;
" data-image-caption="&lt;p&gt;Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects.  Image created with Midjourney&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png" alt="Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects.  Image created with Midjourney" class="wp-image-12602" srcset="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png 506w, https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly-min.png 140w" sizes="(max-width: 506px) 100vw, 506px" /><figcaption class="wp-element-caption">Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<div style="height:26px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-introduction-to-arima-time-series-modelling">Introduction to ARIMA Time Series Modelling</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>ARIMA models provide an alternative approach to time series forecasting that differs significantly from machine learning methods. Working with ARIMA requires a good understanding of stationarity and of the transformations used to make time-series data stationary. The concept of stationarity is, therefore, first on our schedule.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-the-concept-of-stationarity">The Concept of Stationarity</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Stationarity is an essential concept in stochastic processes that describes the nature of a time series. We consider a time series strictly stationary if its statistical properties do not change over time. In this case, summary statistics, such as the mean and variance, do not change over time. However, the time-series data we encounter in the real world often show a trend or significant irregular fluctuations, making them non-stationary or weakly stationary.</p>



<p>So why is stationarity such an essential concept for ARIMA? If a time series is stationary, we can assume that its past values are predictive of its future development. In other words, a stationary time series exhibits consistent behavior that makes it predictable. A non-stationary time series, on the other hand, is characterized by a kind of random behavior that is difficult to capture in a model. After all, if random movements characterized the past, the future will most likely be no different.</p>



<p>Fortunately, in many cases, it is possible to transform a time series that is non-stationary into a stationary form and, in this way, build better prediction models.</p>



<div style="height:56px" aria-hidden="true" class="wp-block-spacer"></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="802" height="751" data-attachment-id="8592" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/image-25-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-25.png" data-orig-size="802,751" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-25" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-25.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-25.png" alt="arima time series forecasting, stationary vs non-stationary data, python tutorial " class="wp-image-8592" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-25.png 802w, https://www.relataly.com/wp-content/uploads/2022/05/image-25.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/image-25.png 768w" sizes="(max-width: 802px) 100vw, 802px" /><figcaption class="wp-element-caption">A stationary Vs. a non-stationary time series</figcaption></figure>
</div>
</div>
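<p>To make this transformation concrete, here is a minimal sketch of first-order differencing on a synthetic trended series (illustrative only, not this tutorial&#8217;s beer dataset): subtracting each value from its successor removes a linear trend and leaves a series that fluctuates around a constant level.</p>

```python
# Sketch: removing a linear trend by first-order differencing (d = 1).
# The synthetic series is for illustration and is not the tutorial's dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
trend = np.arange(120) * 2.0              # deterministic upward trend with slope 2
noise = rng.normal(0, 1, 120)
series = pd.Series(trend + noise)         # non-stationary: the mean grows over time

diffed = series.diff().dropna()           # first difference

# The original series has a drifting mean; the differenced series
# fluctuates around a constant level (roughly the trend slope of 2).
print(series.iloc[:60].mean(), series.iloc[60:].mean())
print(diffed.mean())
```

<p>The half-series means of the original series drift apart, while the differenced series stays near a constant level &#8211; exactly the behavior an ARIMA model with d = 1 exploits.</p>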



<h3 class="wp-block-heading">How to Test Whether a Time Series is Stationary</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The first step in the ARIMA modeling approach is determining whether a time series is stationary. There are different ways to determine whether a time series is stationary:</p>



<ul class="wp-block-list">
<li><strong>Plotting:</strong> We can plot the time series and visually check if it shows consistent behavior or changes over a more extended period.</li>



<li><strong>Summary statistics</strong>: We can split the time series into different periods and calculate summary statistics, such as the mean and variance. If these metrics change significantly between periods, the time series is non-stationary. However, the results depend on the chosen periods, which can lead to false conclusions.</li>



<li><strong>Statistical tests</strong>: There are various tests to determine the stationarity of a time series, such as Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller, or Phillips–Perron. These tests systematically check a time series against a null hypothesis and provide an indicator of how trustworthy the result is.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>
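<p>The &#8220;summary statistics&#8221; check described above can be sketched as a simple heuristic. The function and data below are illustrative and not part of this tutorial&#8217;s code: we split a series into two halves and compare mean and variance, where large relative shifts suggest non-stationarity.</p>

```python
# Sketch of the summary-statistics check: compare mean and variance
# of the two halves of a series. Synthetic data, for illustration only.
import numpy as np

def looks_stationary(series, rel_tol=0.5):
    """Crude heuristic: flag large relative shifts in mean or variance."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    mean_shift = abs(first.mean() - second.mean()) / (abs(first.mean()) + 1e-9)
    var_shift = abs(first.var() - second.var()) / (first.var() + 1e-9)
    return bool(mean_shift < rel_tol and var_shift < rel_tol)

rng = np.random.default_rng(0)
white_noise = rng.normal(10, 1, 200)                # fluctuates around a constant mean
trending = np.arange(200) + rng.normal(0, 1, 200)   # mean drifts upward over time

print(looks_stationary(white_noise))   # expected: True
print(looks_stationary(trending))      # expected: False
```

<p>Note that such a heuristic depends on where the series is split, which is why the statistical tests listed above are the more reliable choice in practice.</p>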



<div style="height:56px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading" id="h-what-is-an-s-arima-model">What is an (S)ARIMA Model?</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>As the name implies, ARIMA uses autoregression (AR), integration (differencing), and moving averages (MA) to fit a linear regression model to a time series. </p>



<h4 class="wp-block-heading" id="h-arima-parameters">ARIMA Parameters</h4>



<p>The default notation for ARIMA is a model with parameters p, d, and q, whereby each parameter takes an integer value: </p>



<ul class="wp-block-list">
<li><strong>d (differencing): </strong>If a time series is non-stationary, we can often remove a trend from the data by differencing once or several times, thus bringing the data to a stationary state. The model parameter d determines the order of differencing. A value of d = 0 simplifies the ARIMA model to an ARMA model, which lacks the integration aspect. In that case, no differencing is needed because the time series is already stationary.</li>



<li><strong>p (order of the AR terms):</strong> The autoregressive process describes the dependent relationship between an observation and several lagged observations (lags). Predictions are then based on past data from the same time series using linear functions. p = 1 means the model uses values that lag by one period. </li>



<li><strong> q (order of the MA terms):</strong> The parameter q determines the number of lagged forecast errors in the prediction equation. In contrast to the AR process, the MA process assumes that future values depend on the errors made by predictions at current and past points in time. In other words, it is not the previous observations themselves but the previous prediction errors that are used to calculate the next time series value.</li>
</ul>
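<p>To make the autoregressive component tangible, the following sketch fits an AR(1) coefficient by ordinary least squares on a synthetic series. This is an illustration of the AR idea only, not part of this tutorial&#8217;s code, and real ARIMA libraries estimate the coefficients for us.</p>

```python
# Sketch: the AR part of ARIMA predicts the next value from lagged values.
# Here we recover the AR(1) coefficient of a synthetic series by regressing
# x[t] on x[t-1] (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
phi_true = 0.7                       # true autoregressive coefficient
x = np.zeros(500)
for t in range(1, 500):
    x[t] = phi_true * x[t - 1] + rng.normal()

# Least-squares estimate: phi_hat = sum(x[t-1] * x[t]) / sum(x[t-1]^2)
phi_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
print(round(phi_hat, 2))             # should land close to 0.7
```

<p>With p = 1, an ARIMA model uses exactly this kind of one-lag relationship; higher values of p add further lagged terms.</p>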



<h4 class="wp-block-heading" id="h-sarima">SARIMA</h4>



<p>In the real world, many time series have seasonal effects. Examples are monthly retail sales figures, temperature reports, weekly airline passenger data, etc. To account for this, we can specify a seasonal period (e.g., m=12 for monthly data) and additional seasonal AR or MA components that deal with seasonality. Such a model is called a SARIMA model, written as ARIMA(p, d, q)(P, D, Q)[m]. </p>



<h4 class="wp-block-heading" id="h-auto-s-arima">Auto-(S)ARIMA </h4>



<p>When working with ARIMA, we can set the model parameters manually or use auto-ARIMA and let the search find suitable parameters. Auto-ARIMA first conducts differencing tests to determine the order of differencing,&nbsp;<code>d</code>, and then fits candidate models with parameters in defined ranges, e.g., <code>start_p</code>,&nbsp;<code>max_p</code>&nbsp;as well as <code>start_q</code>,&nbsp;<code>max_q</code>, selecting among them with an information criterion such as AIC. If our model has a seasonal component, we can also define parameter ranges for the seasonal part of the model, and the search will try to identify the optimal seasonal hyperparameters as well. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:56px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-creating-a-sales-forecast-with-arima-in-python">Creating a Sales Forecast with ARIMA in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Having grasped the fundamental concepts behind ARIMA (AutoRegressive Integrated Moving Average), we&#8217;re now ready to dive into the practical part: crafting a sales forecasting model in Python. ARIMA is a popular choice for forecasting sales data because it can model seasonal changes combined with long-term trends &#8211; a pattern commonly exhibited by sales data.</p>



<p>In this tutorial, we&#8217;ll be employing a dataset representing the monthly beer sales across the United States from 1992 through 2018, recorded in millions of US dollars. Our objective is to construct a robust time series model using ARIMA to accurately predict future sales trends.</p>



<p>When it comes to the technological aspect, we&#8217;ll be using the Python-based &#8216;statsmodels&#8217; and &#8216;pmdarima&#8217; libraries to build our ARIMA sales forecasting model. So, if you&#8217;re ready to harness the power of Python and ARIMA for sales prediction, let&#8217;s get started!</p>



<p>The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_87ac48-a9"><a class="kb-button kt-button button kb-btn_a01710-14 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/01%20Time%20Series%20Forecasting%20%26%20Regression/001%20Forecasting%20US%20Beer%20Sales%20with%20(auto)%20ARIMA.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_ff77d5-62 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="12612" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/a-car-drinking-beer-after-creating-an-arima-sales-forecast-min-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1.png" src="https://www.relataly.com/wp-content/uploads/2023/03/A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1-512x512.png" alt="A fluffy cat drinking beer after creating an ARIMA sales forecast. 
Image created with Midjourney" class="wp-image-12612" srcset="https://www.relataly.com/wp-content/uploads/2023/03/A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/A-car-drinking-beer-after-creating-an-ARIMA-sales-forecast-min-1.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">A fluffy cat drinking beer after creating an ARIMA sales forecast. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before we start coding, ensure you have set up your Python 3 environment and required packages. If you don&#8217;t have an environment, you can follow&nbsp;this tutorial&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using the <a href="https://www.statsmodels.org/stable/index.html" target="_blank" rel="noreferrer noopener">statsmodels </a>library and <a href="https://pypi.org/project/pmdarima/" target="_blank" rel="noreferrer noopener">pmdarima</a>. </p>



<p>You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-sales-data-to-our-python-project">Step #1 Load the Sales Data to Our Python Project</h3>



<p>In the initial step of this tutorial, we commence by setting up the necessary Python environment. We import several packages that we&#8217;ll be using for data manipulation, visualization, and implementing machine learning models. We then fetch the dataset we&#8217;ll be working with &#8211; the monthly beer sales in the United States from 1992 through 2018. This data is sourced from a publicly accessible URL and loaded into a pandas DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com
# Tested with Python 3.8.8

# Setting up packages for data manipulation and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pmdarima as pm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose
import seaborn as sns
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# Link to the dataset: 
# https://www.kaggle.com/bulentsiyah/for-simple-exercises-time-series-forecasting

path = &quot;https://raw.githubusercontent.com/flo7up/relataly_data/main/alcohol_sales/BeerWineLiquor.csv&quot;
df = pd.read_csv(path)
df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		date	beer
0	1/1/1992	1509
1	2/1/1992	1541
2	3/1/1992	1597
3	4/1/1992	1675
4	5/1/1992	1822</pre></div>



<p>As shown above, the sales figures in this dataset stem from the first day of each month.</p>
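<p>Because the dates arrive as strings in month/day/year format, a typical next step is to parse them into a DatetimeIndex with an explicit monthly frequency. The following is a small self-contained sketch using an inline sample with the column names shown above, rather than the full dataset:</p>

```python
# Sketch: parse the 'date' column into a DatetimeIndex so that time-series
# tools can work with the monthly frequency. Inline sample data only.
import pandas as pd
from io import StringIO

csv = StringIO("date,beer\n1/1/1992,1509\n2/1/1992,1541\n3/1/1992,1597\n")
df = pd.read_csv(csv)
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df = df.set_index('date').asfreq('MS')    # 'MS' = month-start frequency

print(df.index.freqstr)                   # MS
```

<p>With the frequency set, downstream functions such as seasonal decomposition can infer the monthly period from the index.</p>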



<h3 class="wp-block-heading" id="h-step-2-visualize-the-time-series-and-check-it-for-stationarity">Step #2 Visualize the Time Series and Check it for Stationarity</h3>



<p>Before modeling the sales data, we visualize the time series and test it for stationarity. Visualization helps us choose the parameters for our ARIMA model, which makes it an essential step.</p>



<p>First, we will look at the different components of the time series. We do this by using the seasonal_decompose function of the statsmodels library. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Decompose the time series
plt.rcParams[&quot;figure.figsize&quot;] = (10,6)
result = seasonal_decompose(df['beer'], model='multiplicative', period = 12)
result.plot()
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4890" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/image-62-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-62.png" data-orig-size="712,568" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-62" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-62.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-62.png" alt="forecasting us beer sales - time series decomposition" class="wp-image-4890" width="754" height="602" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-62.png 712w, https://www.relataly.com/wp-content/uploads/2021/06/image-62.png 300w" sizes="(max-width: 754px) 100vw, 754px" /></figure>



<p>To test for stationarity, we use the Augmented Dickey-Fuller (ADF) test. It is common to run this test multiple times throughout a data science project. Therefore, we create a function that we can reuse later.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def check_stationarity(df_sales, title_string, labels):
    # Visualize the data
    fig, ax = plt.subplots(figsize=(16, 8))
    plt.title(title_string, fontsize=14)
    if df_sales.index.size &gt; 12:
        df_sales['ma_12_month'] = df_sales['beer'].rolling(window=12).mean()
        df_sales['ma_25_month'] = df_sales['beer'].rolling(window=25).mean()
        sns.lineplot(data=df_sales[['beer', 'ma_25_month', 'ma_12_month']], palette=sns.color_palette(&quot;mako_r&quot;, 3))
        plt.legend(loc='upper left', labels=labels)
    else:
        sns.lineplot(data=df_sales[['beer']])
    
    plt.show()
    
    sales = df_sales['beer'].dropna()
    # Perform an Augmented Dickey-Fuller test
    # alpha = 0.05 corresponds to a 5% significance level (i.e., 95% confidence)
    adf_test = pm.arima.ADFTest(alpha = 0.05) 
    print(adf_test.should_diff(sales))
    
df_sales = pd.DataFrame(df['beer'], columns=['beer'])
df_sales.index = pd.to_datetime(df['date']) 
title = &quot;Beer sales in the US between 1992 and 2018 in million US$/month&quot;
labels = ['beer', 'ma_25_month', 'ma_12_month'] # order matches the columns passed to lineplot
check_stationarity(df_sales, title, labels)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="940" height="495" data-attachment-id="8582" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/arima-time-series/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/ARIMA-Time-series.png" data-orig-size="940,495" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ARIMA-Time-series" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/ARIMA-Time-series.png" src="https://www.relataly.com/wp-content/uploads/2022/05/ARIMA-Time-series.png" alt="arima plot seaborn beer sales forecasting python" class="wp-image-8582" srcset="https://www.relataly.com/wp-content/uploads/2022/05/ARIMA-Time-series.png 940w, https://www.relataly.com/wp-content/uploads/2022/05/ARIMA-Time-series.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/ARIMA-Time-series.png 768w" sizes="(max-width: 940px) 100vw, 940px" /></figure>



<p>The data does not appear to be stationary. We can see that our time series is steadily increasing and shows annual seasonality. The steady increase indicates a continuous growth in beer consumption over the last decades. The seasonality in the sales data likely results from people drinking more beer in summer than in other seasons.</p>
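<p>To build intuition for why differencing helps with such a trend, here is a minimal, self-contained sketch. It uses synthetic data (not the beer sales) and shows how first-order differencing turns a steadily growing series into one that fluctuates around a constant mean:</p>

```python
import numpy as np

# Hypothetical example: a linear trend plus noise is non-stationary,
# but its first difference fluctuates around a constant (the slope).
rng = np.random.default_rng(42)
t = np.arange(120)
y = 5.0 * t + rng.normal(0, 1, size=t.size)  # upward trend, loosely mimicking the sales series

diff = np.diff(y)  # first-order differencing: y[t] - y[t-1]

# The original series drifts upward without bound; the differenced
# series hovers around the slope (5.0) with small variance.
print(round(diff.mean(), 1))
```

<p>The same principle underlies the d and D parameters of ARIMA: differencing removes the trend so that the remaining structure can be modeled by the AR and MA terms.</p>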



<h3 class="wp-block-heading" id="h-step-3-exemplary-differencing-and-autocorrelation">Step #3 Exemplary Differencing and Autocorrelation</h3>



<p>The chart from the previous section shows that our time series is non-stationary. The reason is that it follows a clear upward trend. We also know that the time series has a seasonal component. Therefore, we need to define additional parameters and construct a SARIMA model.</p>



<p>Before we use autocorrelation to determine the optimal parameters, we will try manual differencing to make the time series stationary. There is no guarantee that differencing helps; it can sometimes even worsen prediction performance, so be careful not to overdifference! We could also trust the auto-ARIMA model to choose the best parameters for us. However, we should always validate the selected parameters.</p>



<p>The ideal differencing parameter is the smallest number of differencing steps that yields a stationary time series. We will monitor the results with autocorrelation plots to check whether differencing was successful.</p>
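<p>As a quick sanity check on the differencing order (this heuristic is not part of the tutorial's ARIMA workflow itself), a common rule of thumb is that the correctly differenced series has the lowest standard deviation: overdifferencing inflates the variance again. A small sketch with synthetic data:</p>

```python
import numpy as np

# Heuristic sketch: pick the differencing order d (0..max_d) that
# minimizes the standard deviation of the differenced series.
def pick_d_by_std(y, max_d=2):
    candidates = {}
    series = np.asarray(y, dtype=float)
    for d in range(max_d + 1):
        candidates[d] = series.std()  # std after d differencing steps
        series = np.diff(series)
    return min(candidates, key=candidates.get)

# A random walk with drift needs exactly one difference;
# differencing a second time only adds noise.
rng = np.random.default_rng(0)
trend = np.cumsum(rng.normal(1.0, 0.1, size=200))
print(pick_d_by_std(trend))  # → 1
```

<p>This mirrors the advice above: stop at the least number of differencing steps that achieves stationarity.</p>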



<p>We print the autocorrelation for the original time series and after the first and second-order differencing.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 3.1 Non-seasonal part
def auto_correlation(df, prefix, lags):
    plt.rcParams.update({'figure.figsize':(7,7), 'figure.dpi':120})
    
    # Define the plot grid
    fig, axes = plt.subplots(3,2, sharex=False)

    # Original series
    axes[0, 0].plot(df)
    axes[0, 0].set_title('Original' + prefix)
    plot_acf(df, lags=lags, ax=axes[0, 1])

    # First Difference
    df_first_diff = df.diff().dropna()
    axes[1, 0].plot(df_first_diff)
    axes[1, 0].set_title('First Order Difference' + prefix)
    plot_acf(df_first_diff, lags=lags - 1, ax=axes[1, 1])

    # Second Difference
    df_second_diff = df.diff().diff().dropna()
    axes[2, 0].plot(df_second_diff)
    axes[2, 0].set_title('Second Order Difference' + prefix)
    plot_acf(df_second_diff, lags=lags - 2, ax=axes[2, 1])
    plt.tight_layout()
    plt.show()
    
auto_correlation(df_sales['beer'], '', 10)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7076" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/image-8-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-8.png" data-orig-size="826,824" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-8.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-8.png" alt="ARIMA Python Time Series Forecasting Sales Data, checking for stationarity" class="wp-image-7076" width="561" height="560" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-8.png 826w, https://www.relataly.com/wp-content/uploads/2022/04/image-8.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-8.png 150w, https://www.relataly.com/wp-content/uploads/2022/04/image-8.png 768w" sizes="(max-width: 561px) 100vw, 561px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(0.019143247561160443, False)</pre></div>



<p id="block-db542169-554a-4820-a6a7-a0df5b3b9b6e">The charts above show that the time series becomes stationary after one order differencing. However, we can see that the lag goes into the negative very quickly, which indicates overdifferencing.</p>



<p>Next, we perform the same procedure for the seasonal part of our time series.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 3.2 Seasonal part

# Reduce the timeframe to a single seasonal period
df_sales_s = df_sales['beer'][0:12]

# Autocorrelation for the seasonal part
auto_correlation(df_sales_s, '', 10)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3102" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/image-20-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/03/image-20.png" data-orig-size="831,827" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-20" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/03/image-20.png" src="https://www.relataly.com/wp-content/uploads/2021/03/image-20.png" alt="arima time series components" class="wp-image-3102" width="571" height="568" srcset="https://www.relataly.com/wp-content/uploads/2021/03/image-20.png 831w, https://www.relataly.com/wp-content/uploads/2021/03/image-20.png 150w, https://www.relataly.com/wp-content/uploads/2021/03/image-20.png 768w, https://www.relataly.com/wp-content/uploads/2021/03/image-20.png 120w" sizes="(max-width: 571px) 100vw, 571px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Check if the first difference of the seasonal period is stationary
df_diff = pd.DataFrame(df_sales_s.diff())
df_diff.index = pd.date_range(df_sales_s.index[0], periods=12, freq='MS') 
check_stationarity(df_diff, &quot;First Difference (Seasonal)&quot;, ['difference'])</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3010" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/image-8-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/03/image-8.png" data-orig-size="1569,441" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/03/image-8.png" src="https://www.relataly.com/wp-content/uploads/2021/03/image-8-1024x288.png" alt="seasonality after differencing, arima time series forecasting" class="wp-image-3010" width="687" height="193" srcset="https://www.relataly.com/wp-content/uploads/2021/03/image-8.png 1024w, https://www.relataly.com/wp-content/uploads/2021/03/image-8.png 300w, https://www.relataly.com/wp-content/uploads/2021/03/image-8.png 768w, https://www.relataly.com/wp-content/uploads/2021/03/image-8.png 1536w, https://www.relataly.com/wp-content/uploads/2021/03/image-8.png 1569w" sizes="(max-width: 687px) 100vw, 687px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(0.99, True)</pre></div>



<p id="block-3731e774-0091-46b9-afa1-2d55e1512fdd">After first order differencing, the seasonal part of the time series is stationary. The autocorrelation plot shows that the values go into the negative but remain within acceptable boundaries. Second-order differencing does not seem to improve these values. Consequently, we conclude that first-order differencing is a good choice for the D parameter. </p>



<h3 class="wp-block-heading" id="h-step-4-finding-an-optimal-model-with-auto-arima">Step #4 Finding an Optimal Model with Auto-ARIMA</h3>



<p>Next, we auto-fit an ARIMA model to our time series. To ensure that we can later measure the performance of our model against a fresh set of data it has not yet seen, we first split our dataset into train and test sets. </p>



<p>Once we have created the train and test data sets, we can configure the parameters for the auto_arima stepwise optimization. By setting max_d=3, we allow the model to test up to third-order differencing. Likewise, we set max_p and max_q to 3. </p>



<p>To deal with the seasonality in our time series, we set the &#8220;seasonal&#8221; parameter to True and the &#8220;m&#8221; parameter to 12 data points (one seasonal cycle per year). This turns our model into a SARIMA model with additional seasonal parameters P, D, and Q. We cap P and Q at 3 as well. Since we have already seen that first-order seasonal differencing is sufficient, we set max_D to 2 so that auto-ARIMA can confirm this.</p>



<p>After configuring the parameters, we fit the model to the time series. Auto-ARIMA searches for the optimal parameters and selects the model with the lowest AIC.</p>
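<p>To make the selection criterion concrete: for least-squares fits with Gaussian errors, the AIC can be written (up to an additive constant) as n·ln(RSS/n) + 2k, where k is the number of estimated parameters. The 2k penalty means extra parameters must buy a real reduction in residual error. A hedged sketch, independent of pmdarima's internals:</p>

```python
import numpy as np

# Simplified AIC for a least-squares fit with Gaussian errors
# (up to an additive constant): n * ln(RSS / n) + 2k.
def aic(residuals, n_params):
    resid = np.asarray(residuals, dtype=float)
    n = resid.size
    rss = np.sum(resid ** 2)
    return n * np.log(rss / n) + 2 * n_params

# Hypothetical comparison: two candidate models with identical residuals --
# the one with fewer parameters wins, mirroring the stepwise search below.
resid = np.random.default_rng(7).normal(size=300)
print(aic(resid, n_params=3) < aic(resid, n_params=6))  # → True
```

<p>This is why the stepwise trace below keeps some small models despite slightly larger errors: the penalty term rewards parsimony.</p>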



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># split into train and test
pred_periods = 30
split_number = df_sales['beer'].count() - pred_periods # corresponds to a prediction horizon of 2.5 years
df_train = pd.DataFrame(df_sales['beer'][:split_number]).rename(columns={'beer':'y_train'})
df_test = pd.DataFrame(df_sales['beer'][split_number:]).rename(columns={'beer':'y_test'})

# auto_arima
model_fit = pm.auto_arima(df_train, test='adf', 
                         max_p=3, max_d=3, max_q=3, 
                         seasonal=True, m=12,
                         max_P=3, max_D=2, max_Q=3,
                         trace=True,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

# summarize the model characteristics
print(model_fit.summary())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=inf, Time=3.89 sec
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=3383.210, Time=0.02 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=3351.655, Time=0.38 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=3364.350, Time=1.09 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=3604.145, Time=0.02 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=3349.908, Time=0.11 sec
 ARIMA(1,0,0)(0,1,1)[12] intercept   : AIC=3351.532, Time=0.29 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=3353.520, Time=1.24 sec
 ARIMA(2,0,0)(0,1,0)[12] intercept   : AIC=3312.656, Time=0.10 sec
 ARIMA(2,0,0)(1,1,0)[12] intercept   : AIC=3314.483, Time=0.57 sec
 ARIMA(2,0,0)(0,1,1)[12] intercept   : AIC=3314.378, Time=0.30 sec
 ARIMA(2,0,0)(1,1,1)[12] intercept   : AIC=3305.552, Time=3.02 sec
 ARIMA(2,0,0)(2,1,1)[12] intercept   : AIC=3291.425, Time=4.19 sec
 ARIMA(2,0,0)(2,1,0)[12] intercept   : AIC=3306.914, Time=3.06 sec
 ARIMA(2,0,0)(3,1,1)[12] intercept   : AIC=3276.501, Time=4.67 sec
 ARIMA(2,0,0)(3,1,0)[12] intercept   : AIC=3282.240, Time=5.24 sec
 ARIMA(2,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=7.39 sec
 ARIMA(2,0,0)(2,1,2)[12] intercept   : AIC=inf, Time=4.74 sec
 ARIMA(1,0,0)(3,1,1)[12] intercept   : AIC=3313.877, Time=5.17 sec
 ARIMA(3,0,0)(3,1,1)[12] intercept   : AIC=3246.820, Time=5.72 sec
 ARIMA(3,0,0)(2,1,1)[12] intercept   : AIC=3255.313, Time=5.33 sec
 ARIMA(3,0,0)(3,1,0)[12] intercept   : AIC=3249.998, Time=6.77 sec
 ARIMA(3,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=8.39 sec
 ARIMA(3,0,0)(2,1,0)[12] intercept   : AIC=3259.938, Time=3.55 sec
...
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).</pre></div>



<p>Auto-ARIMA has determined that the best model is ARIMA(3,0,0)(3,1,1)[12]. This matches our findings from Step #3, in which we performed the differencing manually. </p>



<h3 class="wp-block-heading" id="h-step-5-simulate-the-time-series-using-in-sample-forecasting">Step #5 Simulate the Time Series using in-sample Forecasting</h3>



<p>Now that we have trained our model, we want to use it to simulate the entire time series. We do this by calling the model&#8217;s predict_in_sample method. The prediction covers the same period as the original time series with which we trained the model. Because the model predicts only one step ahead at a time, using the actual lagged values, the prediction results will naturally be close to the actual values.</p>
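<p>To see why one-step-ahead predictions track the data so closely, consider a toy AR(1) model (a simplification, not the SARIMA model itself): each prediction uses the observed previous value, so the prediction error at each step is just the noise term.</p>

```python
import numpy as np

# Hypothetical AR(1) process: y_t = phi * y_{t-1} + noise.
rng = np.random.default_rng(3)
phi = 0.9
y = np.zeros(200)
for t in range(1, 200):
    y[t] = phi * y[t - 1] + rng.normal()

# One-step-ahead (non-dynamic) prediction: each forecast uses the
# *observed* previous value, analogous to dynamic=False.
one_step_pred = phi * y[:-1]
errors = y[1:] - one_step_pred  # equals the noise terms by construction

print(round(errors.std(), 1))  # close to the noise std of 1
```

<p>With dynamic predictions, errors would instead compound from step to step, which is what happens in the out-of-sample forecast later on.</p>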



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Generate in-sample Predictions
# With dynamic=False, each prediction is based on the actual lagged values,
# i.e., the model predicts one step ahead from the observed history.
pred = model_fit.predict_in_sample(dynamic=False) # works only with auto-arima
df_train['y_train_pred'] = pred

# Calculate the percentage difference
df_train['diff_percent'] = abs((df_train['y_train'] - pred) / df_train['y_train']) * 100

# Print the predicted time-series
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title(&quot;In Sample Sales Prediction&quot;, fontsize=14)
sns.lineplot(data=df_train[['y_train', 'y_train_pred']], linewidth=1.0)

# Print percentage prediction errors on a separate axis (ax2)
ax2 = ax1.twinx() 
ax2.set_ylabel('Prediction Errors in %', color='purple', fontsize=14)  
ax2.set_ylim([0, 50])
ax2.bar(height=df_train['diff_percent'][20:], x=df_train.index[20:], width=20, color='purple', label='absolute errors')
plt.legend()
plt.show()</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="521" data-attachment-id="11758" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/image-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-6.png" data-orig-size="1620,825" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-6.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-6-1024x521.png" alt="arima forecast for us bear sales created in python" class="wp-image-11758" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-6.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-6.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-6.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-6.png 1536w, https://www.relataly.com/wp-content/uploads/2022/12/image-6.png 1620w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The purple bars in the chart above show the percentage prediction errors, which we will analyze in more detail in Step #7.</p>



<h3 class="wp-block-heading" id="h-step-6-generate-and-visualize-a-sales-forecast">Step #6 Generate and Visualize a Sales Forecast</h3>



<p>Now that we have trained an optimal model, we are ready to generate a sales forecast. First, we specify the number of periods we want to predict. The predictions continue directly from the last date of the training data, so their index extends the index of the original time series.</p>
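<p>If you need to build such a continuation index yourself (for example, when the prediction output comes back without dates), a small pandas sketch does the job. The names here (train_index, prediction_index) are illustrative, not taken from the tutorial code:</p>

```python
import pandas as pd

# Continue a monthly ('MS' = month start) index beyond the last
# training date, so forecasts can be aligned next to the history.
train_index = pd.date_range('1992-01-01', periods=6, freq='MS')
pred_periods = 3

# Start one month after the last known date, keep the same frequency.
prediction_index = pd.date_range(
    train_index[-1] + pd.DateOffset(months=1),
    periods=pred_periods,
    freq='MS',
)
print(list(prediction_index.strftime('%Y-%m-%d')))
# → ['1992-07-01', '1992-08-01', '1992-09-01']
```

<p>Because our df_test already carries the correct dates from the original series, we can simply assign the forecast values to it below.</p>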



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Generate prediction for n periods, 
# Predictions start from the last date of the training data
test_pred = model_fit.predict(n_periods=pred_periods, dynamic=False)
df_test['y_test_pred'] = test_pred
df_union = pd.concat([df_train, df_test])

# Print the predicted time-series
fig, ax = plt.subplots(figsize=(16, 8))
plt.title(&quot;Test/Pred Comparison&quot;, fontsize=14)
sns.despine();
sns.lineplot(data=df_union[['y_train', 'y_train_pred', 'y_test', 'y_test_pred']], linewidth=1.0, dashes=False, palette='muted')
ax.set_xlim([df_union.index[150],df_union.index.max()])
plt.legend()
plt.show()</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="539" data-attachment-id="8529" data-permalink="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1.png" data-orig-size="1566,825" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1.png" src="https://www.relataly.com/wp-content/uploads/2022/05/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1-1024x539.png" alt="time series forecast on us beer sales with arima Test Pred Comparison, python tutorial" class="wp-image-8529" srcset="https://www.relataly.com/wp-content/uploads/2022/05/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1.png 1536w, 
https://www.relataly.com/wp-content/uploads/2022/05/arima-plot-seaborn-beer-sales-forecasting-python-test-predictions-1.png 1566w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>As shown above, our model&#8217;s forecast continues the seasonal pattern of the beer sales time series. On the one hand, this indicates that US beer sales will continue to rise and, on the other hand, that our model works just fine 🙂</p>



<h3 class="wp-block-heading" id="h-step-7-measure-the-performance-of-the-sales-forecasting-model">Step #7 Measure the Performance of the Sales Forecasting Model</h3>



<p>In this section, we will measure the performance of our ARIMA model. To learn more about this topic, check out <a href="https://www.relataly.com/regression-error-metrics-python/923/" target="_blank" rel="noreferrer noopener">this relataly article measuring regression performance</a>.</p>



<p>The previous section&#8217;s simulation chart shows a few outliers among the prediction errors. Therefore, we focus our analysis on the percentage errors. Two helpful metrics are the mean absolute percentage error (MAPE) and the median absolute percentage error (MDAPE).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(df_test['y_test'], df_test['y_test_pred'])/ df_test['y_test']))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(df_test['y_test'], df_test['y_test_pred'])/ df_test['y_test'])) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean Absolute Percentage Error (MAPE): 3.94 %  Median Absolute Percentage Error (MDAPE): 3.49 %</pre></div>



<p>The percent errors show that our ARIMA model achieves a decent predictive performance.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>This Python tutorial has shown how to use SARIMA (seasonal ARIMA) for sales forecasting. Sales forecasting is important for businesses because it helps them make informed decisions about production, inventory management, and staffing, among other things. By accurately forecasting sales, businesses can ensure that they have the right amount of product available to meet customer demand, avoid overproduction and excess inventory, and plan for future growth. The use case presented was forecasting beer sales, and we used ARIMA to analyze seasonal sales data. </p>



<p>In the first part, we learned how ARIMA works, what stationarity is, and how to check whether a time series is stationary. In the second part, we developed an ARIMA model in Python to create a forecast for US beer sales. For this purpose, we created an in-sample forecast and used Auto-ARIMA to find the optimal parameters for our sales forecasting model.</p>



<p>If you have any questions or suggestions, please let me know in the comments, and I will do my best to answer. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="506" height="496" data-attachment-id="12603" data-permalink="https://www.relataly.com/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png" data-orig-size="506,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Now that we have learned to use ARIMA to forecast beer sales, you really deserved yourself a beer. Cheers! Image created with Midjourney" data-image-description="&lt;p&gt;Now that we have learned to use ARIMA to forecast beer sales, you really deserved yourself a beer. Cheers! Image created with Midjourney&lt;/p&gt;
" data-image-caption="&lt;p&gt;Now that we have learned to use ARIMA to forecast beer sales, you really deserved yourself a beer. Cheers! Image created with Midjourney&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png" alt="Now that we have learned to use ARIMA to forecast beer sales, you really deserved yourself a beer. Cheers! Image created with Midjourney" class="wp-image-12603" srcset="https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png 506w, https://www.relataly.com/wp-content/uploads/2023/03/brewery-arima-beer-sales-forecasting-python-tutorial-machine-learning-relataly2-min.png 300w" sizes="(max-width: 506px) 100vw, 506px" /><figcaption class="wp-element-caption">Now that you have learned to use ARIMA to forecast beer sales, you really earned yourself a beer. Cheers! Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div></div>
</div>



<p></p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list"><li><a href="https://amzn.to/3MyU6Tj" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2018) Neural Networks and Deep Learning</a></li><li><a href="https://amzn.to/3yIQdWi" target="_blank" rel="noreferrer noopener">Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p>Want to learn more about time series analysis and prediction?<br>Check out these recent relataly tutorials:</p>



<ul class="wp-block-list">
<li><a href="https://www.relataly.com/stock-market-prediction-using-a-recurrent-neural-network/122/" target="_blank" rel="noreferrer noopener">Stock Market Prediction &#8211; Building a&nbsp;Univariate Model using Keras Recurrent Neural Networks in Python</a>.</li>



<li><a href="https://www.relataly.com/stock-market-prediction-with-multivariate-time-series-in-python/1815/" target="_blank" rel="noreferrer noopener">Building Multivariate Time Series Models for Stock Market Prediction with Python</a></li>



<li><a href="https://www.relataly.com/multi-step-time-series-forecasting-a-step-by-step-guide/275/" target="_blank" rel="noreferrer noopener">Time Series Forecasting &#8211; Creating a Multi-Step Forecast in Python</a></li>



<li><a href="https://www.relataly.com/measuring-prediction-errors-in-time-series-forecasting/809/" target="_blank" rel="noreferrer noopener">Python Cheat Sheet: Measuring Prediction Errors in Time Series Forecasting</a></li>



<li><a href="https://www.relataly.com/evaluating-time-series-forecasting-models/923/" target="_blank" rel="noreferrer noopener">Evaluate Time Series Forecasting Models with Python</a></li>
</ul>
<p>The post <a href="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/">Forecasting Beer Sales with ARIMA in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2884</post-id>	</item>
		<item>
		<title>Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python</title>
		<link>https://www.relataly.com/image-classification-with-deep-learning/2485/</link>
					<comments>https://www.relataly.com/image-classification-with-deep-learning/2485/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 13 Dec 2020 14:09:31 +0000</pubDate>
				<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Convolutional Neural Network (CNN)]]></category>
		<category><![CDATA[Data Sources]]></category>
		<category><![CDATA[Image Recognition]]></category>
		<category><![CDATA[Keras]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Tensorflow]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Image Dataset]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Two-Label Classification]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2485</guid>

					<description><![CDATA[<p>This tutorial shows how to use Convolutional Neural Networks (CNNs) with Python for image classification. CNNs belong to the field of deep learning, a subarea of machine learning, and have become a cornerstone to many exciting innovations. There are endless applications, from self-driving cars over biometric security to automated tagging in social media. And the ... <a title="Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python" class="read-more" href="https://www.relataly.com/image-classification-with-deep-learning/2485/" aria-label="Read more about Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/image-classification-with-deep-learning/2485/">Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>This tutorial shows how to use Convolutional Neural Networks (CNNs) with Python for image classification. CNNs belong to the field of deep learning, a subarea of machine learning, and have become a cornerstone to many exciting innovations. There are endless applications, from self-driving cars over biometric security to automated tagging in social media. And the importance of CNNs grows steadily! So there are plenty of reasons to understand how this technology works and how we can implement it. </p>



<p>This article proceeds as follows: The first part introduces the core concepts behind CNNs and explains their use in image classification. The second part is a hands-on tutorial in which you will build your own CNN to distinguish images of cats and dogs. This tutorial develops a model that achieves around 82% validation accuracy. We will work with TensorFlow and Python to integrate different layers, such as Convolution Layers, Dense layers, and MaxPooling. Furthermore, we will prevent the network from overfitting the training data by using Dropout between the layers. We will also load the model and make predictions on a fresh set of images. Finally, we analyze and illustrate the performance of our image classifier. </p>



<p>Also: <a href="https://www.relataly.com/automated-prompt-generation-for-dall-e-using-chatgpt-in-python-a-step-by-step-api-tutorial/12143/" target="_blank" rel="noreferrer noopener">Generating Detailed Images with OpenAI DALL-E and ChatGPT in Python: A Step-By-Step API Tutorial</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-image-classification-with-convolutional-neural-networks">Image Classification with Convolutional Neural Networks</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The history of image recognition dates back to the mid-1960s when the first attempts were made to identify objects by coding their characteristic shapes and lines. However, this task turned out to be incredibly complex. Our human brain is trained so well to recognize things that one can easily forget how diverse the observation conditions can be. Here are some examples:</p>



<ul class="wp-block-list">
<li>Photos can be taken from various viewpoints</li>



<li>Living things can have multiple forms and poses</li>



<li>Objects come in different forms, colors, and sizes</li>



<li>Objects may be partially hidden or cut off in the picture</li>



<li>Lighting conditions vary from image to image</li>



<li>There may be one or multiple objects in the same image</li>
</ul>



<p>At the beginning of the 1990s, the focus of research shifted to statistical approaches and learning algorithms.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="13345" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/machine_learning_computer_vision_dazzling_magic_neural_network-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="machine_learning_computer_vision_dazzling_magic_neural_network-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min-512x512.png" alt="The idea of computer vision is inspired by the fact that the visual cortex has cells activated by specific shapes and their orientation in the visual field. 
" class="wp-image-13345" srcset="https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">The idea of computer vision is inspired by the fact that the visual cortex has cells activated by specific shapes and their orientation in the visual field. </figcaption></figure>
</div>
</div>



<p></p>



<h3 class="wp-block-heading" id="h-the-emergence-of-cnns">The Emergence of CNNs</h3>



<p>The basic concept of a neural network in computer vision has existed since the 1980s. It goes back to research from Hubel and Wiesel on the emergence of a cat&#8217;s visual system. They found that the visual cortex has cells activated by specific shapes and their orientation in the visual field. Some of their findings inspired the development of crucial computer vision techniques, such as hierarchical features with different levels of abstraction [1, 2]. However, it took another three decades of research and the availability of faster computers before the emergence of modern CNNs.</p>



<p>The year 2012 was a defining moment for the use of CNNs in image recognition. That year, a CNN won the <a href="http://www.image-net.org/challenges/LSVRC/" target="_blank" rel="noreferrer noopener">ILSVRC</a> competition for computer vision for the first time. The challenge was classifying more than a hundred thousand images into 1000 object categories. With an error rate of only 15.3%, the winning model was a CNN called &#8220;AlexNet.&#8221;</p>



<p>AlexNet was the first model to achieve more than 75% accuracy in this challenge. In the following years, CNNs succeeded in several other competitions. In 2015, for example, the CNN ResNet exceeded human performance in the ILSVRC competition. Only a decade earlier, this achievement was considered almost impossible. So how was this performance increase possible? To understand this surge in performance, let us first look at what a picture is.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2653" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-15-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-15.png" data-orig-size="1081,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-15" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-15.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-15-1024x479.png" alt="" class="wp-image-2653" width="848" height="395" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-15.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-15.png 768w" sizes="(max-width: 848px) 100vw, 848px" /><figcaption class="wp-element-caption">Top-performing models in the ImageNet image classification challenge (Alyafeai &amp; Ghouti, 2019)</figcaption></figure>



<h3 class="wp-block-heading" id="h-what-is-an-image">What is an Image?</h3>



<p>A digital image is a three-dimensional array of integer values. One dimension of this array represents the pixel width, and one dimension represents the height of the picture. The third dimension contains the color depth, defined by the image format. As shown below, we can thus represent the format of a digital image as &#8220;width x height x depth.&#8221; Next, let&#8217;s have a quick look at different image formats.</p>
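<p>To make this concrete, here is a minimal NumPy sketch (the pixel dimensions are made up for illustration) that builds such a &#8220;width x height x depth&#8221; array:</p>

```python
import numpy as np

# A synthetic 8-bit image in the "width x height x depth" format
# described above: 640 x 480 pixels with 3 color channels
image = np.random.randint(0, 256, size=(640, 480, 3), dtype=np.uint8)

print(image.shape)  # (640, 480, 3)
print(image.dtype)  # uint8
```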



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2649" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-11-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-11.png" data-orig-size="1152,437" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-11.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-11-1024x388.png" alt="an image is a multidimensional integer array" class="wp-image-2649" width="861" height="326" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-11.png 1024w, https://www.relataly.com/wp-content/uploads/2020/12/image-11.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-11.png 768w, https://www.relataly.com/wp-content/uploads/2020/12/image-11.png 1152w" sizes="(max-width: 861px) 100vw, 861px" /><figcaption class="wp-element-caption">A digital image is a multidimensional integer array.</figcaption></figure>



<h3 class="wp-block-heading" id="h-overview-of-different-image-formats">Overview of Different Image Formats</h3>



<p>We can train CNNs with different image formats, but the input data are always multidimensional arrays of integer values. One of the most commonly used color formats in deep learning is &#8220;RGB,&#8221; which stands for the three color channels: &#8220;Red,&#8221; &#8220;Green,&#8221; and &#8220;Blue.&#8221; RGB images are divided into three layers of integer values, one layer for each color channel. In a standard 8-bit-per-channel RGB image, the values in each layer range from 0 to 255, so the three channels together can reproduce about 16.7 million different colors. </p>



<p>In contrast to RGB images, grey-scale images have only a single color layer, which represents the brightness of each pixel. Consequently, the format of a grey-scale image is width x height x 1. Using grey-scale or black-and-white images instead of RGB images can speed up the training process because less data needs to be processed. However, image data with multiple color channels provide the model with more information, which can lead to better predictions. The RGB format is therefore often a good compromise between prediction quality and performance. Next, let&#8217;s look at how CNNs handle digital images in the learning process.</p>
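<p>As a small illustration of the difference between the two formats, the following sketch collapses the three RGB channels into a single brightness layer (the luminance weights are a common convention, assumed here for illustration):</p>

```python
import numpy as np

# Synthetic RGB image: width x height x 3
rgb = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

# A weighted sum of the three channels yields a single brightness layer
grey = (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]).astype(np.uint8)
grey = grey[..., np.newaxis]  # keep the width x height x 1 format

print(rgb.shape)   # (100, 100, 3)
print(grey.shape)  # (100, 100, 1)
```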



<h3 class="wp-block-heading" id="h-convolutional-neural-networks">Convolutional Neural Networks</h3>



<p>As mentioned before, a CNN is a specific form of an artificial neural network. The main difference between the CNN and the standard multi-layer perceptron is their convolutional layers. CNNs can have other layers, but the convolutions make a CNN so good at detecting objects. They allow the network to identify patterns based on features that work regardless of where in the image they occur. Let&#8217;s see how this works in more detail.</p>



<h4 class="wp-block-heading" id="h-convolutional-layers">Convolutional Layers</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>Convolutional layers use a rasterizing technique that breaks down an image into smaller groups of pixels called filters. Filters act as feature detectors from the original image. The primary purpose is to extract meaningful features from the input images.</p>



<p>During the training, the CNN slides the filter over image locations and calculates the dot product for each feature at a time. The results of these calculations are stored in a so-called feature map (sometimes called an activation map). A feature map represents where in the image a particular feature was identified. Subsequently, the values from the feature map are transformed with an activation function (usually ReLu), and the algorithm uses them as input to the next layer.</p>
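<p>The sliding dot product described above can be sketched in a few lines of NumPy (the toy image and filter values are assumptions for illustration, not taken from the tutorial model):</p>

```python
import numpy as np

# Toy 5x5 single-channel image (bright on the left, dark on the right)
# and a 3x3 vertical-edge filter
image = np.array([[1, 1, 0, 0, 0]] * 5, dtype=float)
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

# Slide the filter over the image (stride 1, no padding) and store
# each dot product in the feature map
h, w = image.shape
fh, fw = kernel.shape
feature_map = np.zeros((h - fh + 1, w - fw + 1))
for i in range(feature_map.shape[0]):
    for j in range(feature_map.shape[1]):
        feature_map[i, j] = np.sum(image[i:i + fh, j:j + fw] * kernel)

# Apply the ReLU activation, as described above
feature_map = np.maximum(feature_map, 0)
print(feature_map.shape)  # (3, 3) -- the filter responds near the edge
```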


<div class="wp-block-image">
<figure class="alignleft size-large is-resized"><img decoding="async" data-attachment-id="6596" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-1.png" data-orig-size="1237,502" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-1.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-1-1024x416.png" alt="Illustration of operations in the convolutional layers" class="wp-image-6596" width="811" height="332"/><figcaption class="wp-element-caption">Illustration of operations in the convolutional layers</figcaption></figure>
</div></div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>Features become more complex with the increasing depth of the network. In the first layer of the network, convolutions will detect generic geometric forms and low-level features based on edges, corners, squares, or circles. The subsequent layers of the network will look at more sophisticated shapes and may, for example, include features that resemble the form of an eye of a cat or the nose of a dog. In this way, convolutions provide the network with features at different levels of detail that enable powerful detection patterns.</p>


<div class="wp-block-image">
<figure class="alignleft size-large is-resized"><img decoding="async" data-attachment-id="2661" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-16-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-16.png" data-orig-size="1908,819" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-16" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-16.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-16-1024x440.png" alt="Convolutions at the example of an image that contains the number &quot;3&quot;" class="wp-image-2661" width="874" height="374" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-16.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-16.png 768w, https://www.relataly.com/wp-content/uploads/2020/12/image-16.png 1536w, https://www.relataly.com/wp-content/uploads/2020/12/image-16.png 1908w" sizes="(max-width: 874px) 100vw, 874px" /><figcaption class="wp-element-caption">Exemplary convolutions of an image that contains the number &#8220;3.&#8221;</figcaption></figure>
</div></div>
</div>



<h4 class="wp-block-heading" id="h-pooling-downsampling">Pooling / Downsampling</h4>



<p>A convolutional layer is usually followed by a pooling operation, which reduces the amount of data by filtering out unnecessary information. This process is also called downsampling or subsampling. There are various forms of pooling. In the most common variant &#8211; max-pooling &#8211; only the highest value in a predefined grid (e.g., 2&#215;2) is processed, and the remaining values are discarded. For example, imagine a 2&#215;2 grid with values 0.1, 0.5, 0.4, and 0.8. The algorithm would only pass the 0.8 on for this grid and use it as part of the input to the next layer. The advantages of pooling are reduced data and faster training times. Because pooling minimizes the complexity of the network, it allows for the construction of deeper architectures with more layers. In addition, pooling offers a certain protection against overfitting during training.</p>
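<p>The 2&#215;2 max-pooling step from the example above can be sketched with NumPy (the 4&#215;4 feature map values are made up for illustration):</p>

```python
import numpy as np

# The 2x2 grid from the example above: max-pooling keeps only 0.8
grid = np.array([[0.1, 0.5],
                 [0.4, 0.8]])
print(grid.max())  # 0.8

# Max-pooling a 4x4 feature map with a 2x2 window and stride 2:
# reshape into 2x2 blocks and take the maximum of each block
feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[ 5.  7.]
#  [13. 15.]]
```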



<h4 class="wp-block-heading" id="h-dropout">Dropout</h4>



<p>Dropout is another technique that helps prevent the network from overfitting the training data. When we activate Dropout for a layer, the algorithm will remove a random number of neurons from the layer per training step. As a result, the network needs to learn patterns that give less weight to individual layers and thus generalize better. The dropout rate controls the percentage of switched-off neurons in each training iteration. We can configure Dropout for each layer separately. </p>



<p>CNNs with many layers and training epochs tend to overfit the training data. Here, Dropout is especially crucial to avoid overfitting and to achieve good prediction results on data that the network has not seen yet. A typical value for the dropout rate lies between 10% and 30%.</p>



<h4 class="wp-block-heading" id="h-multi-layer-perceptron-mlp">Multi-Layer Perceptron (MLP)</h4>



<p>The CNN architecture ends with multiple fully connected dense layers. These layers form a multi-layer perceptron (MLP), which has the task of condensing the results from the previous convolutions and outputting one of the target classes. Consequently, the number of neurons in the final dense layer usually corresponds to the number of classes to be predicted. For two-class prediction problems, it is also possible to use a single neuron in the final layer, which outputs a value that is interpreted as a binary label of 0 or 1.</p>
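<p>Putting these building blocks together, a minimal CNN for a two-class problem might look like the following Keras sketch. The layer sizes and input shape here are illustrative assumptions, not the exact architecture developed later in the tutorial:</p>

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),  # assumed input size: 128x128 RGB images
    # Convolution + pooling blocks extract features at growing abstraction levels
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # The fully connected head condenses the features into a class prediction
    layers.Flatten(),
    layers.Dropout(0.2),  # switch off 20% of neurons per training step
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # single neuron for the two-class case
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```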



<h2 class="wp-block-heading" id="h-building-a-cnn-with-tensorflow-that-classifies-cats-and-dogs">Building a CNN with Tensorflow that Classifies Cats and Dogs</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Now that you are familiar with the basic concepts behind convolutional neural networks, we can commence with the practical part and build an image classifier. In the following, we will train a CNN to distinguish images of cats and dogs. We first define a CNN model and then feed it a few thousand photos from a public dataset with labeled images of cats and dogs.</p>



<p>Distinguishing cats and dogs may not sound difficult, but many challenges exist. Imagine the almost infinite circumstances in which animals can be photographed, not to mention the many forms a cat can take. These variations lead to the fact that even humans sometimes confuse a cat with a dog or vice versa. So don&#8217;t expect our model to be perfect right from the start. Our model will score around 82% accuracy on the validation dataset.</p>



<p>The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_d70aa4-6e"><a class="kb-button kt-button button kb-btn_b926ba-d4 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/06%20Computer%20Vision/200%20Classifying%20Cats%20%26%20Dogs%20Binary.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_80b142-4f kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image is-resized"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2022/04/Image-Recognition-Convolutional-Neural-Networks.png" alt="Image Recognition Convolutional Neural Networks - classifying cats and dogs python " width="382" height="125"/><figcaption class="wp-element-caption">Cat or Dog? That&#8217;s what our CNN will predict.</figcaption></figure>
</div>
</div>



<div style="height:29px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, you can follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using <em><a href="https://keras.io/" target="_blank" rel="noreferrer noopener">Keras&nbsp;</a></em>(2.0 or higher) with <a href="https://www.tensorflow.org/" target="_blank" rel="noreferrer noopener"><em>Tensorflow</em> </a>backend and the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a>.</p>



<p>You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h4 class="wp-block-heading" id="h-download-the-dataset">Download the Dataset</h4>



<p>We will train our image classification model with a public dataset from <a href="http://www.kaggle.com" target="_blank" rel="noreferrer noopener">Kaggle.com</a>. The dataset contains more than 25,000 JPG pictures of cats and dogs. The images are uniformly named and numbered, for example, dog.1.jpg, dog.2.jpg, dog.3.jpg, cat.1.jpg, cat.2.jpg, and so on. You can download the picture set directly from Kaggle: <a href="https://www.kaggle.com/c/dogs-vs-cats/overview" target="_blank" rel="noreferrer noopener">cats-vs-dogs</a>. </p>



<h4 class="wp-block-heading" id="h-setup-the-folder-structure">Set Up the Folder Structure</h4>



<p>There are different ways the data can be structured and loaded during model training. One approach (1) is to split the images by class and create a separate folder for each class (class_a, class_b, etc.). Another approach (2) is to put all images into a single folder and define a DataFrame that splits the data into train and test. Because the filenames in the cats-and-dogs dataset already contain the class, I decided to go for the second approach. </p>



<p>Before we begin with the coding part, we create a folder structure that looks as follows:</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2676" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-17-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-17.png" data-orig-size="532,286" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-17.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-17.png" alt="structure of the data that we will use to train the convolutional neural network" class="wp-image-2676" width="409" height="220" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-17.png 532w, https://www.relataly.com/wp-content/uploads/2020/12/image-17.png 300w" sizes="(max-width: 409px) 100vw, 409px" /><figcaption class="wp-element-caption">The folder structure of our cats and dogs prediction project</figcaption></figure>



<p>If you want to use the standard paths given in this Python tutorial, make sure that your notebook resides in the parent folder of the &#8220;data&#8221; folder.</p>



<p>After you have created the folder structure, open the cats-vs-dogs zip file. The ZIP file contains the folders &#8220;train,&#8221; &#8220;test,&#8221; and &#8220;sample.&#8221; Unzip the JPG files from the &#8220;train&#8221; folder (20,000 images) and the &#8220;test&#8221; folder (5,000 images) to the &#8220;train&#8221; folder of your project. Afterward, the train folder should contain 25,000 images. The sample folder is intended to include your own sample images, for example, of your pet. We will later use the images from the sample folder to test the model on new real-world data. </p>
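<p>If you prefer to do the unzipping from Python rather than by hand, here is a sketch using the standard-library zipfile module (the paths in the usage example are assumptions matching the folder structure above):</p>

```python
import os
import zipfile

def extract_to_train(zip_path, train_dir):
    """Extract all JPG files from a ZIP archive flat into train_dir."""
    os.makedirs(train_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".jpg"):
                # drop any internal folder structure, keep only the file name
                target = os.path.join(train_dir, os.path.basename(member))
                with archive.open(member) as src, open(target, "wb") as dst:
                    dst.write(src.read())
    return len(os.listdir(train_dir))

# usage (paths are assumptions):
# n = extract_to_train("dogs-vs-cats.zip", "data/images/cats-and-dogs/train/")
# print(n)
```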



<p>We have fulfilled all requirements and can start with the coding part.</p>



<h3 class="wp-block-heading" id="h-step-1-make-imports-and-check-training-device">Step #1 Make Imports and Check Training Device</h3>



<p>We begin by setting up the imports for this project. I have put the package imports at the beginning to give you a quick overview of the packages you need to install.</p>



<p>Using the GPU instead of the CPU allows for faster training times. However, setting up Tensorflow to work with GPUs can cause problems. Not everyone has a GPU; in this case, TensorFlow will usually run all code on the CPU automatically. However, should you for any reason prefer to force CPU training, uncomment the line that sets os.environ[&#8220;CUDA_VISIBLE_DEVICES&#8221;] to &#8220;-1&#8221;. As a result, Tensorflow will ignore all available GPUs and run all code on the CPU. </p>
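<p>A minimal sketch of forcing CPU-only execution. One subtlety worth highlighting: the environment variable must be set before TensorFlow is first imported in the process, because TensorFlow reads it at import time.</p>

```python
import os

# Force CPU-only execution: hide all GPUs from TensorFlow.
# This must run BEFORE the first `import tensorflow` in the process,
# because TensorFlow reads CUDA_VISIBLE_DEVICES at import time.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf   # imported afterwards, TensorFlow sees no GPUs
```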



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import os
#os.environ[&quot;CUDA_VISIBLE_DEVICES&quot;]=&quot;-1&quot; 

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D
from tensorflow.keras.layers import Conv2D, Activation, Dropout, Flatten, Dense, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import Accuracy
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# optionally let TensorFlow allocate GPU memory on demand (TF2 API)
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

from random import randint
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from PIL import Image
import random as rdn</pre></div>



<p>Running the command below checks the TensorFlow version and the number of available GPUs in our system. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check the tensorflow version
print('Tensorflow Version: ' + tf.__version__)

# check the number of available GPUs
physical_devices = tf.config.list_physical_devices('GPU')
print(&quot;Num GPUs:&quot;, len(physical_devices))</pre></div>



<pre class="wp-block-preformatted">Tensorflow Version: 2.4.0-rc3
Num GPUs: 1</pre>



<p>My GPU is an RTX 3080. When I wrote this article, this GPU was not yet supported by the standard TensorFlow release, so I used the pre-release version of TensorFlow (2.4.0-rc3). I expect the upcoming standard release (2.4) to work fine as well. </p>



<p>In my case, the GPU check returns 1 because I have a single GPU in my computer. If TensorFlow doesn&#8217;t recognize any GPU, this command will return 0, and Tensorflow will run on the CPU instead.</p>



<h3 class="wp-block-heading" id="h-step-2-define-the-prediction-classes">Step #2 Define the Prediction Classes</h3>



<p>Next, we will define the path to the folder that contains our train and validation images. In addition, we will define a DataFrame &#8220;image_df&#8221; that lists all the pictures from the &#8220;train&#8221; folder. With the help of this DataFrame, we can later split the data simply by defining which images from the train folder belong to the training dataset and which belong to the test dataset. Important note: the DataFrame &#8220;image_df&#8221; only contains the names of the images and their classes, not the photos themselves.</p>



<p>It&#8217;s good practice to check the distribution of classes in the training dataset. For this purpose, we create a bar plot that illustrates the number of images in each class. And yes, I admit, I chose some custom colors to make it look fancy.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set the directory for train and validation images
train_path = 'data/images/cats-and-dogs/train/'
#test_path = 'data/cats-and-dogs/test/'

# function to create a list of image labels 
def createImageDf(path):
    filenames = os.listdir(path)
    categories = []

    for fname in filenames:
        category = fname.split('.')[0]
        if category == 'dog':
            categories.append(1)
        else:
            categories.append(0)
    df = pd.DataFrame({
        'filename':filenames,
        'category':categories
    })
    return df

# display the header of the train_df dataset
image_df = createImageDf(train_path)
image_df.head(5)

sns.countplot(y='category', data=image_df, palette=['#2FE5C7',&quot;#2F8AE5&quot;], orient=&quot;h&quot;)</pre></div>



<p></p>



<figure class="wp-block-image size-full"><img decoding="async" width="376" height="262" data-attachment-id="11572" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-9-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-9.png" data-orig-size="376,262" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-9" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-9.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-9.png" alt="" class="wp-image-11572" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-9.png 376w, https://www.relataly.com/wp-content/uploads/2022/12/image-9.png 300w" sizes="(max-width: 376px) 100vw, 376px" /></figure>



<p>The number of images in the two classes is balanced, so we don&#8217;t need to rebalance the data. That&#8217;s nice!</p>
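<p>If you want the exact numbers behind the bar plot, a small sketch using pandas value_counts does the job (shown here with a tiny hand-made DataFrame standing in for the real image_df):</p>

```python
import pandas as pd

def class_balance(df, label_col="category"):
    """Return the relative frequency of each class as a Series."""
    return df[label_col].value_counts(normalize=True)

# demo with a small hand-made frame (the real image_df has 25,000 rows)
demo = pd.DataFrame({"category": [0, 1, 0, 1, 0, 1, 0, 1]})
print(class_balance(demo))  # 0.5 / 0.5 -> perfectly balanced
```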



<h3 class="wp-block-heading" id="h-step-3-plot-sample-images">Step #3 Plot Sample Images</h3>



<p>I prefer not to jump directly into preprocessing without first checking that the data has been loaded correctly. We do this by plotting some random images from the train folder. This step is not strictly necessary, but it&#8217;s a best practice.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">n_pictures = 16 # number of pictures to be shown
columns = int(n_pictures / 2)
rows = 2
plt.figure(figsize=(40, 12))
for i in range(n_pictures):
    num = i + 1
    ax = plt.subplot(rows, columns, i + 1)
    if i &lt; columns:
        image_name = 'cat.' + str(rdn.randint(1, 1000)) + '.jpg'
    else: 
        image_name = 'dog.' + str(rdn.randint(1, 1000)) + '.jpg'
    plt.xlabel(image_name)    
    plt.imshow(load_img(train_path + image_name)) 

#if you get a deprecated warning, you can ignore it</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="1024" height="315" data-attachment-id="7123" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/cats-and-dogs-neural-networks-classification/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png" data-orig-size="1024,315" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cats-and-dogs-neural-networks-classification" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png" src="https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png" alt="classifying cats and dogs convolutional neural networks" class="wp-image-7123" srcset="https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>I never expected to have so many pictures of cats and dogs one day, but I guess neither did you 🙂 Neural networks require a fixed input shape in which each input neuron corresponds to a pixel value. </p>



<p>As we can see from the sample images, the images in our dataset have different sizes and aspect ratios. For the images to fit into the input shape of our neural network, we need to put the images into a standard format. But before that, we split the data into two datasets for train and test.</p>
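<p>To illustrate what this standard format means for a single image, here is a minimal sketch using Pillow, with a synthetic image standing in for a real photo. The data generator in Step #5 will do this resizing for us via its target_size argument; note that, like the generator, a plain resize does not preserve the aspect ratio.</p>

```python
from PIL import Image

def to_standard_format(img, size=(128, 128)):
    """Resize an image to a fixed size, ignoring the aspect ratio."""
    return img.resize(size)

# demo with a synthetic image instead of a real photo
original = Image.new("RGB", (500, 375), color=(120, 80, 40))
resized = to_standard_format(original)
print(resized.size)  # (128, 128)
```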



<h3 class="wp-block-heading" id="h-step-4-split-the-data">Step #4 Split the Data</h3>



<p>Image classification requires splitting the data into a train and a validation set. We define a split ratio of 1/5 so that 80% of the data goes into the training dataset and 20% goes into the validation dataset. We shuffle the data to create two DataFrames with a random mix of cat and dog pictures. In addition, we transform the classes of the images into categorical values: 0 -&gt; &#8220;cat&#8221; and 1 -&gt; &#8220;dog&#8221;. The result is two new DataFrames: train_df (20,000 images) and validate_df (5,000 images).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">image_df[&quot;category&quot;] = image_df[&quot;category&quot;].replace({0:'cat',1:'dog'})

train_df, validate_df = train_test_split(image_df, test_size=0.20, random_state=42)
train_df = train_df.reset_index(drop=True)
total_train = train_df.shape[0]

validate_df = validate_df.reset_index(drop=True)
total_validate = validate_df.shape[0]
train_df.head()

print(len(train_df), len(validate_df))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output: 20000 5000</pre></div>



<h3 class="wp-block-heading" id="h-step-5-preprocess-the-images">Step #5 Preprocess the Images</h3>



<p>The next step is to define two data generators for these DataFrames, which use the names given in the train and validation DataFrames to feed the images from the &#8220;train&#8221; path into our neural network. The data generator has various configuration options. We will perform the following operations:</p>



<ul class="wp-block-list">
<li>Rescale the images by dividing their RGB color values (0-255) by 255</li>



<li>Shuffle the images (again)</li>



<li>Bring the images into a uniform shape of 128 x 128 pixels</li>



<li>We define a batch size of 32, so 32 images are processed at a time.</li>



<li>The class mode is &#8220;binary&#8221; so our two prediction labels are encoded as&nbsp;float32&nbsp;scalars with values 0 or 1. As a result, we will only have a single end neuron in our network.</li>



<li>We perform some data augmentation on the training data (incl. horizontal flip, shearing, and zoom). In this way, the model sees slightly different variants of the images in each epoch, which helps to prevent overfitting.</li>
</ul>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2700" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-19-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-19.png" data-orig-size="833,262" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-19" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-19.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-19.png" alt="" class="wp-image-2700" width="753" height="236" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-19.png 833w, https://www.relataly.com/wp-content/uploads/2020/12/image-19.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-19.png 768w" sizes="(max-width: 753px) 100vw, 753px" /><figcaption class="wp-element-caption">Some augmentation techniques</figcaption></figure>



<p>It is essential to mention that the input shape of the first layer of the neural network must correspond to the image shape of 128 x 128. The reason is that each pixel becomes an input to a neuron.</p>
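<p>To make the correspondence between pixels and input values concrete, here is a quick NumPy check: each of the 3 color channels of each of the 128 x 128 pixels is one input value.</p>

```python
import numpy as np

# one preprocessed image: 128 x 128 pixels, 3 color channels (RGB)
img = np.zeros((128, 128, 3), dtype=np.float32)

# total number of input values per image
print(img.size)  # 49152 = 128 * 128 * 3
```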



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set the dimensions to which we will convert the images
img_width, img_height = 128, 128
target_size = (img_width, img_height)
batch_size = 32
rescale=1.0/255

# configure the train data generator
# note: the augmentation options (shear, zoom, flip) are arguments of
# ImageDataGenerator itself, not of flow_from_dataframe
print('Train data:')
train_datagen = ImageDataGenerator(
    rescale=rescale,
    shear_range=0.2, # random shearing
    zoom_range=0.2, # random zooming
    horizontal_flip=True) # random horizontal flips
train_generator = train_datagen.flow_from_dataframe(
    train_df, 
    train_path,
    shuffle=True, # shuffle the image data
    x_col='filename', y_col='category',
    classes=['dog', 'cat'],
    target_size=target_size,
    batch_size=batch_size,
    color_mode=&quot;rgb&quot;,
    class_mode='binary')

# configure test data generator
# only rescaling
print('Test data:')
validation_datagen = ImageDataGenerator(rescale=rescale)
validation_generator = validation_datagen.flow_from_dataframe(
    validate_df, 
    train_path,    
    shuffle=True,
    x_col='filename', y_col='category',
    classes=['dog', 'cat'],
    target_size=target_size,
    batch_size=batch_size,
    color_mode=&quot;rgb&quot;,
    class_mode='binary')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Train data:
Found 20000 validated image filenames belonging to 2 classes.
Test data:
Found 5000 validated image filenames belonging to 2 classes.</pre></div>



<p>At this point, we have already completed the data preprocessing part. The next step is to define and compile the convolutional neural network.</p>



<h3 class="wp-block-heading" id="h-step-6-define-and-compile-the-convolutional-neural-network">Step #6 Define and Compile the Convolutional Neural Network</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The architecture of our image classification CNN is inspired by the famous VGGNet. In this section, we will define and compile our CNN model. We do this by defining multiple layers and stacking them on top of each other. However, to lower the amount of time needed to train the network, I reduced the number of layers.</p>



<p>The first layer of our network is the input layer, which receives the preprocessed images. As already noted, the shape of the input layer must match the shape of our images. Given how we defined the image format in our data generators, the input shape is 128 x 128 x 3. </p>



<p>The subsequent layers are four convolutional layers, each followed by a pooling layer. In addition, we define a dropout rate of 20% after each convolutional block. </p>



<p>Finally, a fully connected layer with 128 neurons and a single-neuron sigmoid output layer complete the structure of the CNN.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2608" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-8-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-8.png" data-orig-size="539,493" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-8.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-8.png" alt="3-dimensional Input Shape of our Neural Network " class="wp-image-2608" width="423" height="386" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-8.png 539w, https://www.relataly.com/wp-content/uploads/2020/12/image-8.png 300w" sizes="(max-width: 423px) 100vw, 423px" /><figcaption class="wp-element-caption">3-dimensional Input Shape of our Neural Network </figcaption></figure>



<p></p>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-kadence-infobox kt-info-box_4a47ba-1f"><span class="kt-blocks-info-box-link-wrap info-box-link kt-blocks-info-box-media-align-top kt-info-halign-left"><div class="kt-blocks-info-box-media-container"><div class="kt-blocks-info-box-media kt-info-media-animate-drawborder"><div class="kadence-info-box-icon-container kt-info-icon-animate-drawborder"><div class="kadence-info-box-icon-inner-container"><span class="kb-svg-icon-wrap kb-svg-icon-fe_cpu kt-info-svg-icon"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="1" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><rect x="4" y="4" width="16" height="16" rx="2" ry="2"/><rect x="9" y="9" width="6" height="6"/><line x1="9" y1="1" x2="9" y2="4"/><line x1="15" y1="1" x2="15" y2="4"/><line x1="9" y1="20" x2="9" y2="23"/><line x1="15" y1="20" x2="15" y2="23"/><line x1="20" y1="9" x2="23" y2="9"/><line x1="20" y1="14" x2="23" y2="14"/><line x1="1" y1="9" x2="4" y2="9"/><line x1="1" y1="14" x2="4" y2="14"/></svg></span></div></div></div></div><div class="kt-infobox-textcontent"><h4 class="kt-blocks-info-box-title">Additional Info</h4><p class="kt-blocks-info-box-text"><strong><em>Loss function</em>:</strong> measures model accuracy during training. We try to minimize this function to &#8220;steer&#8221; the model in the right direction. We use binary_crossentropy.<br/><strong><em>Optimizer</em>:</strong> defines how the model weights are updated based on the data it sees and its loss function.<br/><strong><em>Metrics</em></strong> are<strong> </strong>used to monitor the steps during training and testing. The following example uses <em>accuracy</em>, which is the fraction of the correctly classified images.</p></div></span></div>
</div>
</div>
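<p>To make the binary cross-entropy loss from the info box concrete, here is a small NumPy sketch of how the mean loss is computed for a batch of four predictions (a simplified re-implementation for illustration, not the Keras internals):</p>

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of predictions."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1.0 - y_true) * np.log(1.0 - y_pred))))

y_true = np.array([1.0, 0.0, 1.0, 0.0])   # dog = 1, cat = 0
y_pred = np.array([0.9, 0.1, 0.8, 0.3])   # sigmoid outputs of the last neuron

print(round(binary_crossentropy(y_true, y_pred), 4))  # 0.1976
```

The better the predictions match the true labels, the closer this value gets to 0; minimizing it during training "steers" the model in the right direction.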



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># define the input format of the model
input_shape = (img_width, img_height, 3)
print("input_shape:", input_shape)

# define the model
model = Sequential()
model.add(Conv2D(32, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=input_shape))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(128, (3, 3),  strides=(1, 1),activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Flatten())
model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# compile the model and print its architecture
opt = SGD(learning_rate=0.001, momentum=0.9)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">input_shape: (100, 100, 3)
Model: &quot;sequential&quot;
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 100, 100, 32)      896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 50, 50, 32)        0         
_________________________________________________________________
dropout (Dropout)            (None, 50, 50, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 50, 50, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 25, 25, 64)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 25, 25, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 25, 25, 64)        36928     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 64)        0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 12, 12, 64)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 12, 12, 128)       73856     
_________________________________________________________________
...
Trainable params: 720,257
Non-trainable params: 0
_________________________________________________________________
None</pre></div>



<p>At this point, we have defined and compiled our convolutional neural network. Next, it is time to train the model.</p>
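<p>As a sanity check on the summary above, we can reproduce the parameter counts by hand: a Conv2D layer has (kernel height * kernel width * input channels + 1) * filters parameters (the +1 is the bias), each 2x2 max-pooling halves the spatial size (integer division), and a Dense layer has (inputs + 1) * units parameters. A short pure-Python sketch:</p>

```python
# reproduce the parameter counts from model.summary() by hand
def conv_params(k, in_ch, filters):
    # one filter has k*k weights per input channel, plus a bias
    return (k * k * in_ch + 1) * filters

def dense_params(inputs, units):
    return (inputs + 1) * units

size, channels, total = 100, 3, 0
for filters in (32, 64, 64, 128):
    total += conv_params(3, channels, filters)  # Conv2D, padding='same' keeps the size
    channels = filters
    size //= 2                                  # MaxPooling2D((2, 2)) halves it

flat = size * size * channels                   # Flatten: 6 * 6 * 128 = 4608
total += dense_params(flat, 128)                # Dense(128, ...)
total += dense_params(128, 1)                   # Dense(1, activation='sigmoid')
print(f"{total:,}")  # → 720,257
```

<p>The per-layer values (896, 18496, 36928, 73856) and the total of 720,257 trainable parameters match the Keras summary.</p>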



<h3 class="wp-block-heading" id="h-step-7-train-the-model">Step #7 Train the Model</h3>



<p>Before we train the image classifier, we still have to choose the number of epochs. More epochs can improve model performance, but they also lengthen training time and increase the risk of overfitting. Finding the optimal number of epochs is difficult and often requires a trial-and-error approach. I typically start with a small number, such as 5 epochs, and then increase it until further increases no longer lead to significant improvements.</p>
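<p>In addition, the training code below uses the Keras EarlyStopping callback, which stops training once the monitored loss has not improved for <em>patience</em> consecutive epochs. The following simplified pure-Python sketch (ignoring Keras options such as min_delta and restore_best_weights) illustrates that logic with a made-up loss curve:</p>

```python
# simplified sketch of EarlyStopping(monitor='loss', patience=6)
def stop_epoch(losses, patience):
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:          # loss improved: remember it, reset the counter
            best, wait = loss, 0
        else:                    # no improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch     # stop training here
    return len(losses)           # ran through all epochs

# hypothetical loss curve that flattens out after epoch 4
losses = [0.70, 0.65, 0.60, 0.58, 0.59, 0.60, 0.59,
          0.60, 0.61, 0.59, 0.60, 0.62]
print(stop_epoch(losses, patience=6))  # → 10
```

<p>With early stopping in place, setting the epoch count a bit too high is harmless: training simply ends once the loss stops improving.</p>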



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># train the model
epochs = 35
early_stop = EarlyStopping(monitor='loss', patience=6, verbose=1)

history = model.fit(
    train_generator,
    epochs=epochs,
    callbacks=[early_stop],
    steps_per_epoch=len(train_generator),
    verbose=1,
    validation_data=validation_generator,
    validation_steps=len(validation_generator))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Epoch 1/35
625/625 [==============================] - 121s 194ms/step - loss: 0.7050 - accuracy: 0.5282 - val_loss: 0.6902 - val_accuracy: 0.5824
Epoch 2/35
625/625 [==============================] - 115s 183ms/step - loss: 0.6853 - accuracy: 0.5469 - val_loss: 0.6856 - val_accuracy: 0.5806
Epoch 3/35
625/625 [==============================] - 115s 184ms/step - loss: 0.6744 - accuracy: 0.5752 - val_loss: 0.6746 - val_accuracy: 0.5806
Epoch 4/35
625/625 [==============================] - 112s 180ms/step - loss: 0.6569 - accuracy: 0.5987 - val_loss: 0.6593 - val_accuracy: 0.6110
Epoch 5/35
625/625 [==============================] - 115s 185ms/step - loss: 0.6423 - accuracy: 0.6194 - val_loss: 0.6474 - val_accuracy: 0.6134
Epoch 6/35
625/625 [==============================] - 116s 185ms/step - loss: 0.6309 - accuracy: 0.6370 - val_loss: 0.6386 - val_accuracy: 0.6260
Epoch 7/35
625/625 [==============================] - 115s 183ms/step - loss: 0.6139 - accuracy: 0.6539 - val_loss: 0.6082 - val_accuracy: 0.6682</pre></div>



<p>A quick comment on the time required to train the model. Although the model is not overly complex and the dataset is of moderate size, training can take some time. I made two training runs &#8211; the first on my GPU (NVIDIA GeForce RTX 3080) and the second on my CPU (AMD Ryzen 7 3700X). On the GPU, training took approximately 10 minutes. The CPU training was much slower and took about 30 minutes, three times as long as on the GPU.</p>



<p>After training, you may want to save the classification model and load it again later. You can do this with the code below. Note that because we save only the weights, the model architecture must be defined exactly as it was during training before the weights can be loaded.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># save the weights
model.save_weights('cats-and-dogs-weights-v1.h5')

# recreate the model architecture exactly as during training
# (model definition as shown above)

# load the weights into the recreated model
model.load_weights('cats-and-dogs-weights-v1.h5')</pre></div>



<h3 class="wp-block-heading" id="h-step-8-visualize-model-performance">Step #8 Visualize Model Performance</h3>



<p>After training the model, we want to check its performance. For this purpose, we can apply the same performance measures as in traditional classification projects. The code below plots the loss and accuracy of our image classifier on the training and validation datasets. </p>



<p>To learn more about measuring model performance, check out my <a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">previous post on Measuring Model Performance</a>. </p>
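<p>For a binary task such as cats vs. dogs, the most common measures can be derived directly from the counts of true and false positives and negatives. The following pure-Python sketch uses hypothetical labels and predictions to compute accuracy, precision, and recall by hand:</p>

```python
# hypothetical labels and predictions: 1 = dog, 0 = cat
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted dogs, how many are dogs?
recall = tp / (tp + fn)      # of the actual dogs, how many were found?
print(accuracy, precision, recall)  # → 0.75 0.75 0.75
```

<p>scikit-learn's classification_report computes the same figures directly from the predicted and true labels.</p>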



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def plot_history(history, value1, value2, title, ylabel):
    fig, ax = plt.subplots(figsize=(15, 5), sharex=True)
    plt.plot(history.history[value1], 'b')
    plt.plot(history.history[value2], 'r')
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel(&quot;Epoch&quot;)
    ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
    plt.legend([&quot;Train&quot;, &quot;Validation&quot;], loc=&quot;upper left&quot;)
    plt.grid()
    plt.show()

# plot training &amp; validation loss values
plot_history(history, &quot;loss&quot;, &quot;val_loss&quot;, &quot;Model loss&quot;, &quot;Loss&quot;)
# plot training &amp; validation accuracy values
plot_history(history, &quot;accuracy&quot;, &quot;val_accuracy&quot;, &quot;Model accuracy&quot;, &quot;Accuracy&quot;)
</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2725" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-25-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-25.png" data-orig-size="894,333" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-25" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-25.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-25.png" alt="" class="wp-image-2725" width="865" height="322" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-25.png 894w, https://www.relataly.com/wp-content/uploads/2020/12/image-25.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-25.png 768w" sizes="(max-width: 865px) 100vw, 865px" /><figcaption class="wp-element-caption"><img decoding="async" 
src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA34AAAFNCAYAAABfWL0+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAABxYUlEQVR4nO3deZyN5f/H8ddl7Ft2kbWItFhTiGiRSlQIbbRY2jctolX7vpNKRUmifFu0R+uvkKyhJNmyC4Nhluv3x+cMY8yMWc5yz8z7+Xicx1nu5bzPPefMnM9c131dznuPiIiIiIiIFFxFYh1AREREREREIkuFn4iIiIiISAGnwk9ERERERKSAU+EnIiIiIiJSwKnwExERERERKeBU+ImIiIiIiBRwKvxERKRQcM7Vc85551zRbKzb3zn3QzRyiYiIRIMKPxERCRzn3HLn3B7nXJV0j88JFW/1YhRNREQkX1LhJyIiQfU30Df1jnPuWKBU7OIEQ3ZaLEVERNJT4SciIkE1Drg0zf1+wNi0KzjnDnHOjXXObXDO/eOcG+6cKxJaFuece8I5t9E5tww4O4NtX3PO/eucW+2ce8A5F5edYM6595xza51zW51z3znnjk6zrJRz7slQnq3OuR+cc6VCy05yzv3knPvPObfSOdc/9Ph059yVafaxX1fTUCvnNc65P4E/Q489G9rHNufcr8659mnWj3PO3emc+8s5tz20vLZz7kXn3JPpXstHzrkbs/O6RUQk/1LhJyIiQfUzUN45d1SoIOsNvJVuneeBQ4DDgZOxQvGy0LIBQFegOdAK6Jlu2zeBJKBBaJ3OwJVkz6dAQ6AaMBt4O82yJ4CWQFugEnAbkOKcqxPa7nmgKtAMmJPN5wM4FzgBaBK6PzO0j0rAeOA951zJ0LKbsdbSs4DywOXATuw1901THFcBTgXeyUEOERHJh1T4iYhIkKW2+p0OLAZWpy5IUwwO9d5v994vB54ELgmtcgHwjPd+pfd+M/Bwmm2rA2cCN3rvd3jv1wNPA32yE8p7Pyb0nLuBe4GmoRbEIliRdYP3frX3Ptl7/1NovYuAr7z373jvE733m7z3c3JwLB723m/23u8KZXgrtI8k7/2TQAmgUWjdK4Hh3vsl3swNrTsD2IoVe4Re73Tv/boc5BARkXxI5wmIiEiQjQO+A+qTrpsnUAUoDvyT5rF/gMNCt2sCK9MtS1UXKAb865xLfaxIuvUzFCo4HwR6YS13KWnylABKAn9lsGntTB7Prv2yOeduwQq8moDHWvZSB8PJ6rneBC4GvgxdP5uHTCIikk+oxU9ERALLe/8PNsjLWcD76RZvBBKxIi5VHfa1Cv6LFUBpl6VaCewGqnjvK4Qu5b33R3NwFwLdgdOwbqb1Qo+7UKYE4IgMtluZyeMAO4DSae4fmsE6PvVG6Hy+27FWzYre+wpYS15qFZvVc70FdHfONQWOAqZksp6IiBQgKvxERCTorgBO8d7vSPug9z4ZmAg86Jwr55yri53blnoe4ETgeudcLedcReCONNv+C3wBPOmcK++cK+KcO8I5d3I28pTDisZNWLH2UJr9pgBjgKecczVDg6y0cc6VwM4DPM05d4FzrqhzrrJzrllo0znA+c650s65BqHXfLAMScAGoKhz7m6sxS/Vq8AI51xDZ45zzlUOZVyFnR84Dpic2nVUREQKNhV+IiISaN77v7z3szJZfB3WWrYM+AEb5GRMaNkrwOfAXGwAlvQthpdiXUV/B7YAk4Aa2Yg0Fus2ujq07c/plg8B5mPF1WbgUaCI934F1nJ5S+jxOUDT0DZPA3uAdVhXzLfJ2ufYQDF/hLIksH9X0KewwvcLYBvwGvtPhfEmcCxW/ImISCHgvPcHX0tEREQKDOdcB6xltF6olVJERAo4tfiJiIgUIs65YsANwKsq+kRECg8VfiIiIoWEc+4o4D+sS+szMQ0jIiJRpa6eIiIiIiIiB
Zxa/ERERERERAo4FX4iIiIiIiIFXNFYBwinKlWq+Hr16u29v2PHDsqUKRO7QAHKEYQMQckRhAxByRGEDMoRvAxByRGEDEHJEYQMyhG8DEHJEYQMQckRhAzKEbwM0c7x66+/bvTeVz1ggfe+wFxatmzp05o2bZoPgiDkCEIG74ORIwgZvA9GjiBk8F45gpbB+2DkCEIG74ORIwgZvFeOoGXwPhg5gpDB+2DkCEIG75UjaBm8j24OYJbPoFZSV08REREREZECToWfiIiIiIhIAafCT0REREREpIArUIO7ZCQxMZFVq1aRkJAQswyHHHIIixYtitnzhzNDyZIlqVWrFsWKFQtDKhERERERiYYCX/itWrWKcuXKUa9ePZxzMcmwfft2ypUrF5PnDmcG7z2bNm1i1apV1K9fP0zJREREREQk0gp8V8+EhAQqV64cs6KvIHHOUbly5Zi2noqIiIiISM4V+MIPUNEXRjqWIiIiIiL5T6Eo/GJl06ZNNGvWjHbt2nHooYdy2GGH0axZM5o1a8aePXuy3HbWrFlcf/31UUoqIiIiIiIFWYE/xy+WKleuzJw5c9i+fTtPPvkkZcuWZciQIXuXJyUlUbRoxj+CVq1a0apVq2hFFRERERGRAiyiLX7OuS7OuSXOuaXOuTsyWH6Ic+4j59xc59xC59xlaZYtd87Nd87Ncc7NimTOaOrfvz8333wznTp14vbbb2fGjBm0bduW5s2b07ZtW5YsWQLA9OnT6dq1KwD33nsvl19+OR07duTwww/nueeei+VLEBEREREpVBIS4J9/4Jdf4MMPYfRoWL061qlyJmItfs65OOBF4HRgFTDTOfeh9/73NKtdA/zuvT/HOVcVWOKce9t7n9oPspP3fmOkMsbKH3/8wVdffUVcXBzbtm3ju+++o2jRonz11VfceeedTJ48+YBtFi9ezLRp09i+fTuNGjXiqquu0pQKIiIiIiK5lJgIGzbA2rV2Wbdu3+3097duPXD7jz+Gww6Lfu7cimRXz9bAUu/9MgDn3ASgO5C28PNAOWcjhpQFNgNJkQp0440wZ05499msGTzzTM626dWrF3FxcQBs3bqVfv368eeff+KcIzExMcNtzj77bEqUKEGJEiWoVq0a69ato1atWnkLLyIiIiJSwCQlwapVsHw5fPFFdWbO3L+IS729MZPmpfLl4dBDoXp1OO446Nx53/1DD913u3r1qL6sPItk4XcYsDLN/VXACenWeQH4EFgDlAN6e+9TQss88IVzzgMve+9HRzBrVJUpU2bv7bvuuotOnTrxwQcfsHz5cjp27JjhNiVKlNh7Oy4ujqSkiNXHIiIiIiL78x5WroQ6dWKdhJQUK9z+/tsuy5fvf3vFCkhOTl37KABKldpXtDVoACeddGAxl3q/VKkYvbAIi2Thl9G4/z7d/TOAOcApwBHAl865773324B23vs1zrlqoccXe++/O+BJnBsIDASoXr0606dP37ssPj6eQw45hO3btwMwYkSeX1OGQrvPVHJyMrt376ZYsWIkJiaya9euvZk2bdpEpUqV2L59Oy+//DLee7Zv387OnTtJSkpi+/bte7dN3SYlJYX4+Pi997MjOTk5R+tnJSEhYb/jnBPx8fG53jZcgpAhKDmCkEE5gpchKDmCkCEoOYKQQTmClyEoOYKQISg5gpAhIjm858gnn6TmJ5/wz8UX8/dll0GRgw8Vktsc3sPWrcVYu7Yk//5bkrVrS4Zul9p7OzFx/+evVGk3NWokUL9+Am3bJlCjRgKHHppA2bKbqVWrKKVLJ5PVrGQJCVY0Ll+e47jZEoT3RiQLv1VA7TT3a2Ete2ldBjzivffAUufc30BjYIb3fg2A9369c+4DrOvoAYVfqCVwNECrVq182haz6dOnU7JkScqVKxe2F5Ub27dv39tNs1ixYpQqVWpvpjvvvJN+/foxcuRITjnlFJxzlCtXjtKlS1O0aFHKlSu3d9vUbYoUKULZsmVz9Lq2b
98etuNQsmRJmjdvnqttp0+fnmmrZrQEIUNQcgQhg3IEL0NQcgQhQ1ByBCGDcgQvQ1ByBCFDUHIEIUNEctx5J3zyCTRvTt233qJuQgK8+SaULp3rHNu372ulW7bswNa7HTv2X79yZahfH044wa7r14d69ey6bl0oVaoEUAI4JIMM7XP7ysMmCO+NSBZ+M4GGzrn6wGqgD3BhunVWAKcC3zvnqgONgGXOuTJAEe/99tDtzsD9Ecwacffee2+Gj7dp04Y//vhj7/0RoWbJjh077n1zpN92wYIFkYgoIiIiIrK/J5+Ehx+GQYNg5Eh4+mkYMsQqtP/9D2rWzHCzxERYs6YkX321f3GXep3+/Lpy5ayIO+IIOO20fUVdaoEX43acAiFihZ/3Psk5dy3wORAHjPHeL3TODQ4tHwWMAN5wzs3Huobe7r3f6Jw7HPjAxnyhKDDee/9ZpLKKiIiIiEg6b7xhRV6vXvDii+Ac3HwzNGyI79uXlFatWfjwRywo1vyAlrsVKyAl5cS9uypadF8x16OHXR9++L7irlIlsuyKKXkX0QncvfdTganpHhuV5vYarDUv/XbLgKaRzCYiIiIiIvskJMCWLfDff8CH/6PRnVey9pjTmdJmHJseimPDBmvoW7bsHMom/8jEf8/hiP4ncTdv8z/O5dBDrYhr1w4uvhj27FnMmWc25vDDbdqD0KD2EiMRLfxERERERCQ6vIf4eFi/vgTz5lkBt2XL/pf0j6W9n5Bg++nAt3xOb2bSklMXvM+Om210+XLlrNWuQQOof3pTvqwyg/PHdueDP88n8b6HKT78tv2a7aZPX0vHjo2jfRgkEyr8RERERERiLDERtm2zicK3bt3/dvr7md3evt2mOoA2GT6Hc3DIIVChAlSsaJejjtr/fsP43zj3mXPYXeVwir46ldn1ylKhgq1TvHj6PR4KN0+Hyy6j+N13wLLF8PLLGa0oAaDCT0REREQiY88emD8fZs2CmTNhzRq4/35o1SrWyaJm1y473y11xMrUyz//wObN+wq3XbsOvq/ixa1wS72UL2+DoaTeTn187dolnHhio73FXMWKVriVL3+Q7pZ//AEnnQFVK1Lsxy9oWavywUOVKgXvvAONG8N999mJfpMnQ5Uq2To+Ej0q/EREREQk75KS4PffrchLLfTmzbPiD2w8/iJF4OST4d13oWvX2OYNk9279xV2n39egy++2DclwfLlNtF4WsWK2fQDdevaXOjpi7a0t9PfL1Eie5mmT/+Xjh0b5eyFrF4NnUNDb3z5JdSqlf1tnYN774Ujj4TLL7c5Fz7+OGfPLxGnwi/COnbsyA033MB5552397FnnnmGP/74g5deeinD9Z944glatWrFWWedxfjx46lQocJ+69x7772ULVuWIUOGZPq8U6ZM4cgjj6RJkyYAPPDAA5x++umcdtpp4XlhIiIiUnilpFjr0MyZNJgyBYYNg99+29dsVb68terdeKNdH3+8VTrr1lnB1707vPACXHVVLF9FtuzZAytX7l/Mpb29Zr9ZqhtRtKgVdPXqwVln7RvJsl49u9SoEcBBTjZvtqJv82aYNs0KuNy48EJ7seeeC23aUHH4cIjl3HWLFnHI3LnQoUO2Jpwv6FT4RVjfvn2ZPHnyfoXfhAkTePzxxw+67dSpUw+6TmamTJlC165d9xZ+w4cPj/lE9iIiIpIPeW/d91Jb8WbNgtmz7YQyoEbJklbcDR5s161a2egfGX3RPvRQ+PZb6NMHrr7aKqeHH475l/KdO+Gvv+yydOn+l5UrU8+bM3FxULu2FXGdO+9f2P377//Ro0cbiuanb9jx8Vah/vUXfPYZtGyZt/21aQMzZkDXrhx3++1Qtqy9N6IlIQEmTbJzDX/4geZg/2S4+mq47DLr81pI5ae3Zb7Us2dPhg0bxu7duylRogTLly9nzZo1jB8/nptuuoldu3bRs2dP7rvvvgO2rVevHrNmzaJKlSo8+OCDjB07ltq1a1O1a
lVahj6Ur7zyCqNHj2bPnj00aNCAcePGMWfOHD788EO+/fZbHnjgASZPnszdd9/NeeedR8+ePfn6668ZMmQISUlJHH/88YwcOZISJUpQr149+vXrx0cffURiYiLvvfcejRtrJCYREZFCZdMmK85Su2zOmmVDPoL1NWzWDC69dG9L3vdr19Lx1FMz3d2//8JPP8GPP9rpfuXLl+HQmh/Qr9X1tH7sMVb8sILl975B1VolqF7dvpdHog7cti3jwm7p0vStdtYrtUEDOOmk/eeaq1fPpiUoVizj55g+fXf+Kvr27LFJ9WbOtPPywtU6V7cu/PQTmzt3pvJVV8GiRTYRfCQPzuLFMHo0vPmmtVw2bAiPP87vW7bQ5Ntvbf7B4cNtnolrroHjjotcloDKT2/NfKly5cq0bNmSzz77jO7duzNhwgR69+7N0KFDqVSpEsnJyZx66qnMmzeP4zJ5A/76669MmDCB3377jaSkJFq0aLG38Dv//PMZMGAAYK16r732Gtdddx3dunWja9eu9OzZc799JSQk0L9/f77++muOPPJILr30UkaOHMmNN94IQJUqVZg9ezYvvfQSTzzxBK+++mrkDo6IiIgEw5498Omn9qX5449tiMmiReHYY23y7tSWvGOOObDq2bBh783kZFiwwIq8n36yy99/27KSJW3zNWvgu/VFGbnxRW6hPo//dBvLO6+mHVPYQiWKFoWqVaFatexdSpfeF2Xz5gOLutRib/36/WMfeqgVd6efbteplyOOsMFQCrzkZCvgv/gCxoyx7pnhVK4c8x94gI6ffAJPPw1//gkTJlg34HDZvdsK1pdfhu++s/fmeefBoEHQqRM4x/rp02ny4IPWFfnFF2HsWCsQO3SwAvC88zKv5AuYwlX43XgjzJkT3n02awbPPJPlKj179mTChAl7C78xY8YwceJERo8eTVJSEv/++y+///57poXf999/z3nnnUfp0G+2bt267V22YMEChg8fzn///Ud8fDxnnHFGllmWLFlC/fr1OTLUd7tfv368+OKLewu/888/H4CWLVvy/vvvZ+MAiIiISL7kvXXZfPNNG5Vx40arpK691oq95s2tWsvCtm0wc2ZFpk+3Yu/nn63nIFhh1a4dXHcdtG1ru0s7yn9SkmPTpltZMbYOJw27lOVV2jL5yk9Zmlyf9evZe/nrL7tO3W96ZcpYobhpU7vU3qd71aplxVy3bvsXd4cfbnPSFVre28/53Xfh8cetC2QkxMXBU09Bo0ZWZLVtCx99ZM2nefHHH1a8vfGGtVAffjg88oi9jmrVMt6meXN49VV47DF4/XUrAnv3hpo1rVAcONDetAVY4Sr8YqRr164MGzaM2bNns2vXLipWrMgTTzzBzJkzqVixIv379ychdcbMTLg0k2Gm1b9/f6ZMmULTpk154403mD59epb78d5nubxEaLiouLg4kpKSslxXRERE8qE1a+Ctt6zlY+FCq8a6d4d+/eCMMzLtjue9td6ldtv86Sfruul9U4oUscbBSy+17/bt2llvv0y+vgD2NNWrA7f2hhNrUr57dy57+URrcTz++APW37nTGhfXrWO/wjD1sn37ejp0OGy/4q5UqTAds4Lmnntg1Ci4/XbIYrDAsBk0yJpSe/WyET+nTLE3Sk7s2QMffGCte9Om2Ruoe3fb96mnZr9/cKVKcMst1iD02Wd2/t8998ADD0DPnlYQt2mT9Zs3nypchd9BWuYipWzZsnTs2JHLL7+cvn37sm3bNsqUKcMhhxzCunXr+PTTT+mYRZ/qDh060L9/f+644w6SkpL46KOPGDRoEADbt2+nRo0aJCYm8vbbb3PYYYcBUK5cOban/7cX0LhxY5YvX87SpUv3nhN48sknR+R1i4iISEDs3An/+5+17n35pY1W0qaNffm/4IIM+zbu3m2949J220ydmqBcOdv8/POhbNm5DBjQNG89+Nq3tyc46yw7z+ydd6yZLo3SpfdNg5CR6dP/pGPHw
/IQopB49lkYMQKuuMIG1omW006zJuGuXa0b5pgxcNFFB99u6VJ45RVrpduwwU60fOgha93LSwtdXBycfbZd/vwTRo60TO+8Y62D11wDffvu35c4nytchV8M9e3bl/PPP58JEybQuHFjmjdvztFHH83hhx9Ou3btsty2RYsW9O7dm2bNmlG3bl3at2+/d9mIESM44YQTqFu3Lscee+zeYq9Pnz4MGDCA5557jkmTJu1dv2TJkrz++uv06tVr7+Aug6M50pKIiIhEh/fwww+kvP4mbtJ7uO3b2FOjDusvHco/HS5l3SFHsn07bH/bBuhMe1m61MZ02b3bdnX44fa9vV07a6g5+uh9UxJMn74lPKdtNW4M//d/cM45dt7Vc8/Zl28Jn7fespau88+3oj/arVqNGlnx16OHDbKyeLFN+p6+tW7PHvtHxejR8NVX9mbr1s1a904/Pfyj/zRsaF1SR4yAt9+2VsArr4Rbb7UC+aqr7EOQz6nwi5Lzzjtvv26Wb7zxRobrpe2quXz58r23hw0bxrBhww5Y/6qrruKqDObAadeuHb///vve+6NGjdo7ncOpp57Kb7/9dsA2aZ+vVatWB+02KiIiItG1e7c1TixebKc5zZlzBG+nK9zKb1xG57Vj6b51LHVT/mYnZZhET96kH9/+ezL+jSLwxoH7jouzlrzy5W3kymuvtSKvbdsonvpUvbp14+vb1wL884+du6U52PLuk0+gf3845RQrbmI1/GjlyjagzFVXWffKxYutJbp0aZs2JLV1b906mxBxxAibFL5mzchnK1PGzvUbMAC+/94KwKefthFJzzrL3pOdO+fb96MKPxEREZGA2bjRvg+nv/z99/5zypUsWZMKFaBG6a2cl/Qe3baOpenW70nBseSwU3jnuPtY3uJ8SlYuw0XlYHA5K+4yupQsGZDTmsqUsXO5brjBBh755x8rDA4y0Ixk4fvv7fy15s3t/LpYH8vixW2glaOOgttuszd2akFYpIh1Bx00yM45jcVs987ZqJ8dOsDq1dby+PLLcOaZdgLpNddYEZ3P5gRU4SciIiISA8nJ9n03owJv06Z965UoYT3kWra0U6IaN7bLkUck89eoJzluzhwrlBIS4Mgj4bYHKXLxxRxVpw5HxezV5VFcHDz/vI3+OGSIDUgzZYoVB5Izc+da99m6dWHq1OAMZ+qc/WyPPBIuvNDOM733XutaWatWrNPtc9hh1h112DCbOuKFF+Cmm+z+pElWDOYTKvxEREREIig+HpYsObC4++MPO5UpVbVqVtD16LGvuGvc2Hq7HdDosWwZtDyD45YutS/Ml11mo3K2bh2QZrswcM5GX6xTBy65xPqcfvppgTjXKmr++stazcqVs9a0qlVjnehA3bpZYV+mTGxa97KreHHrgty3r4169NJLNrdlPlIoCj/vfabTIUjOHGw6CBERkXwnKckGEkktoHJpzx4r5ubP3//yzz/71omLs1HtGze2U4ZSi7tGjWyU+WxZvtxGRYyPZ+Hdd3P0nXdas2BB1asX1KhhQ/efGJruoXXrWKcKvn//tYFQkpLsvMk6dWKdKHPhnNQ9Gpo3t3MR85kCX/iVLFmSTZs2UblyZRV/eeS9Z9OmTZSMdb9wERGRcFm2zFqTfvrJ7q9ZY124suA9rF1bgo8/3r/AW7IEEhNtnaJFrZhr08bGiTjqKLscccT+k5jn2D//WNG3fTt8/TUbtm4t2EVfqpNOsp/RmWfum+6he/dYpwqsotu3W0vf+vVW9B2Vbzv9ShgV+MKvVq1arFq1ig0bNsQsQ0JCQsyLpXBlKFmyJLWC1O9aREQkN7yHceNslD7n7PYXX8Dw4bBjBzz4IDjHli0HtuAtWADbtrXZu6s6dWzy8q5d7frYY63oy1OBl5GVK21Exi1b4OuvrdWhMI3AnToVQNrpHq69NtapgmfnTo4dOtSGf/3kEzj++FgnkoAo8IVfsWLFqF+/fkwzTJ8+nebNmxf6DCIiI
kHgN2/BDx5Mkfcmkty2PfEjx5FYsy4rj7qQ0qtK0+jhh5ny9g6uSXyGNf/u6y1UoYIVdRdfDCVK/EGPHkdyzDFwyCFRCL16tRV9GzfaBOwtW0bhSQOoWjVrwbrwQrjuOuv2+thj+XZ4/RzbuRPWrrVunJldr1hB+S1bYOJEm3xRJKTAF34iIiKSv3kPixbZuB7ffgurVjWlbFk7dSkx0a6ze2m3Zxqvp1zKoaxlOA/y6E+3k9I0dUCJIsBInilSmhtWPE2FBjuZ9cgojmkaxzHH2OB+qWeNTJ++hnbtjozOAfj3Xyv61q2zVsnCfn5b6dI2uuJNN9n8aitWwNixsU6VeykpNozrwQq6tWth27YDty9SxOY/rFHDLi1asKB+fY7t0SP6r0UCTYWfiIiIBE58vPVm/PRTu6xYYY83bgzFizuKFYNSpexcuswuxYrtu13C7eHM/7uLDr88zpZKDRh7wf9Rvm4rHk2zfrVqcOyxjoYNnoT7y9DxgQfoOG8n3PJm7Ca7XrvWir41a+Dzz21wE7FRcp59FurVs5E/V6+m2JAhsU6VPQkJMH48vPGGnWO6bp39VyK9smXh0EOtmGvaFLp02Xc/7XWVKgeMhrmpMHUBlmxT4SciIiIx5z38/vu+Qu/77601r2xZ6602bJh9761TB6ZPn0PHjh2zv/PFi20CvNmzYeBAKj/1FFeUKZPFBg5GjLDh5YcOhV27bDCRaA+isn49nHqqVb2ffWbTGcg+zsHNN9ub4uKLObFPHxtq/6qrgjmtxZo1NgXAyy9bl90mTWzUzfTFXOrtsmVjnVgKGBV+IiIiEhOZteodcwzceKMN4NiuXR4GSfEeRo2yFqHSpW0C8JyMBHnHHbbdDTfAuefC++9bM2M0bNhgRd/ff9uk2+3bR+d586OePeHoo1l7xx0cNnkyvPkmNGtmBeCFF8a+gJoxw1onJ06E5GQbnOaGG2x01qAVp1KgFZIzYUVERCTWvIeFC+GJJ6ymqVTJ6qm334YWLawhZMUKGznzscfse3Gui771621i6Kuvhg4dbKe5Gf7/+uvh1Vetm+VZZ9k0CpG2aZM1cy5danPW5aR1s7A66ij+vOkma1UbNcrebIMGQc2acM019vOPpsREmDDB5vM44QT46CPL8eef8L//WfddFX0SZWrxExERKYz++QdGjrTWkJ497eS5CAhNN8enn1pvxbC36mVk6lSbiH3rVmtpufbavI36eMUV1tJ36aXQubO9mEjZvNmKviVLrFg45ZTIPVdBVK6cFXwDB9rUD6NGwWuvWRfLdu1g8GB7v0domq1iW7fCQw/Z861eDQ0a2Huwf//8N0m5FDgq/ERERAqTpUvh4Yf3jYKYlAR33WXnG/XsCT162JwFuWyN2LMH5s610TczOldv+HA7V6927TC+plS7dsGtt8KLL9pr+Ooruw6HCy+04q93bzjlFIrdc0949pvWli1WWP7+u7UKnX56+J+jsHDOWtvatIGnnrLun6NGwSWX2H8cLr/cisMGDcLzfPPnw7PPcuK4cfYhOO00e76zzio8U01I4EX0neic6+KcW+KcW+qcuyOD5Yc45z5yzs11zi10zl2W3W1FREQkB37/3Saga9TIRhS86iobUXDVKnj+eRvS8oEHbPTAI4+089tmzbIuc5nw3nquvfWW9Yg88URrcGnd2uqv9evtO/Y331jvxQ8+gAEDIlT0zZkDrVpZ0XfTTXZeVbiKvlTnnQcffgiLFtHshhtsiP1w2boVzjgD5s2zA9WlS/j2XdhVrmyDwCxebP8M6NjRisGGDe2Yf/BBxqNqHkxy8r5um8cdB2+/zbrOnWHBAptrsWtXFX0SKBFr8XPOxQEvAqcDq4CZzrkPvfe/p1ntGuB37/05zrmqwBLn3NtAcja2FRERkYP57Td48EGb96xMGRvo5OabbdTAVNdea5f1620AlMmT7US8Rx+FunXh/POhZ0+2bIrj44+tpkq9bNliuyhd2uqu6
6+3U5pOPBFq1YrC60tJsS/xw4bZF/zPP7dWs0jp0gU+/ZSSZ55p5w5+/bWNKpkX27bZfufMsWN/1llhiSrpFCliJ5eeeqqdC/jaazB6tL2/a9a0/0oMGGATNmZl61Z4/XX7h8myZfZGf/hhGDCAP+bPp+bRR0fn9YjkUCT/DdEaWOq9X+a93wNMANKfVe2Bcs45B5QFNgNJ2dxWREREMvPzz9bi0KKFtXLcdZed1/fYY/sXfWlVq2bd3z7/nJ3L1/PH0NdZVvoYEp99Edq1o3XPPvx9znX88MB01v+bTM+e8Mor1ki1dat173z8cesxGpWib/VqK/JuvRXOPtuCRLLoS9WxI3OffNJG3mzf3rrP5tb27Xai46xZNurjOeeEL6dkrmZN+0z8/be12h13HNx/v/2j47zz4Isv7J8Kaf35p/1no1Yta1U+9FB4910r/u64w/7xIBJgkTzH7zBgZZr7q4AT0q3zAvAhsAYoB/T23qc457KzrYiIiKTlPXz3nc1B9/XX9kX0gQdsNMEKFTLdLDkZFi2CX37Z15I3f34lkpP7A/05pvZWrqz5CSevf4Or17zGdbtfgH+rgjsP6vSAxp2gaLFovUozaZIVqbt3W/V5xRVRHSVxW5MmMG2aFZodOlhx3aRJznYSH2+te7/8YgXEuedGJKtkoWhRG/21Wzcr4EaPhjFjrOX7iCNsoJgmTWwgpKlTbf3evW06hlatYp1eJEecz6Lvfp527Fwv4Azv/ZWh+5cArb3316VZpyfQDrgZOAL4EmgKnHGwbdPsYyAwEKB69eotJ0yYsHdZfHw8ZWM9d0tAcgQhQ1ByBCFDUHIEIYNyBC9DUHIEIUNQchw0g/dUnDmTum+9RYX589lTsSIr+vTh33POITk071xyMmzeXJwNG0qyfn2J0KUkf/1VhiVLyrFrl/0vuGzZRBo33s5RR22jcePtNG68jUqVEvfmKB8XR+UZM6j67bdU+vlniu7aRWK5cmxs146NHTqwuWVLfFiH6Nxf3M6d1H36aep89RXbGjdm0bBh7IpK8+L+Un8mpZcvp+ktt+CSk5n3+OPEN2yYre2L7NrFcUOHcsj8+fw+fDgbOnXKU45YCkKGcOZwe/ZQ9fvvqfnhh1SYNw+APRUrsuacc1jTrRt7smjZK2jHoiDkCEKGaOfo1KnTr977A/4zEcnCrw1wr/f+jND9oQDe+4fTrPMJ8Ij3/vvQ/W+AO4C4g22bkVatWvlZs2btvT99+nQ6BmDumyDkCEKGoOQIQoag5AhCBuUIXoag5AhChqDkyDRDSgp89BH+gQdws2axp3ptFnW7nR8bXc7ydaVYuZK9lzVrDhy/okwZOPpoG4zlhBPsukGDzMejOCBHQoJ1iZs0yQY92brVRnc55xw4+WRrgUxMtEtS0r7bWV0Ott6KFfj163FDh8I990CxKLc0ZnQsli6188a2bbM5K044SCelnTvtGE2fbiPj9O0bnhwxEoQMEcuxcKH9fM84I1tTQBToY5FPcwQhQ7RzOOcyLPwi2dVzJtDQOVcfWA30AS5Mt84K4FTge+dcdaARsAz4LxvbioiIFArx8XHMm7eviFv1TzKH/jiZM2c/yBE75vE3h/MQrzB23aUkvmKtbSVK2KlItWtbDVa79oGXChXy2DuyZMl93eT27LHhOydNsm5y48dnvl1cnBVsqZeiRfe/n9GlZEkrKmvVYs7JJ9P8+uvzEDzMGjSwLrannmrD+H/8sR30jOzaZRPJT5tmU2rkoeiTKDj6aLuIFAARK/y890nOuWuBz7EWvDHe+4XOucGh5aOAEcAbzrn5gANu995vBMho20hlFRERCZrERGtEGzUKvvqqPQBFSaQv73AnD9GYJSwv1ZgXTxzHynZ9aFq3KBPTFHVVq0b1lDebgb1LF7uMGmVNjBkVdEWL5nmI+63Tp4cnczjVrWuTFp52mh2DKVOslSithAQbOOTrr21UyIsvjklUESmcIjqBu/d+KjA13
WOj0txeA2Q4/FZG24qIiBR0//xjY5W89hqsXWtF3GUXLuHqst9yzMePUHLN3/imTWH4e9Q77zyuiYuLdeQDFS2a9ykO8qMaNaz7ZufO1gqadsCW3buhRw+bbuK116Bfv1gmFZFCKKKFn4iIiBxccrINGPjyy3YNNjvB4MHQpc7vJJ7akZIbNthJeKOexXXtGuXmPMm2qlWtG+eZZ9q8FuPGWcHXq9e+H/Lll8c6pYgUQir8REREYiR1DulXXrFz9w491OYhv/JK6zlISgp0GEjK7t02iMppp6ngyw8qVLCfV7ducNFFNsH8rFnw0ks2BYWISAyo8BMREYmilBSb8m3UKDuHLzkZTj8dnnnGBnrcb5DKsWPhxx/569ZbaXz66bGKLLlRrhx88om19n32GTz/PFx1VaxTiUghpsJPREQkCtavt/E8Ro+2eaKrVIFbboEBA2xQyANs3gy33gpt27K2SxcaRz2x5Fnp0lbdL1sGjRrFOo2IFHIq/ERERCLEexvlf9QomDzZRurs0AEeeADOP9+mXMjUnXfCli0wcqQVgZI/FSumok9EAkGFn4iISJht3my9NF9+GRYvtlO+rr7aTu9q0iQbO/jlF2savPFGOO44GylSREQkD1T4iYiIhIH38PPP1ro3caJN2Xbiida984ILrNdftiQn27lgNWrAvfdGMrKIiBQiKvxERETSSUy0VrvNm2H+/PJs22a3N23a93jq7bTXO3ZA2bLQvz8MGgTNmuXiyV96CX77zeaAK18+zK9MREQKKxV+IiJS4HlvhdnSpfDXXzbQSkbFW+rt7dvTbt1iv33FxUGlSnapXBlq1bLemJUrWzfO3r1tQMdc+fdfGD7chvns1Su3L1dEROQAKvxERKRA8B42bLDibulS+PPPfbeXLoX//tt//SJFoGJFK9gqVbKelUcfva+gS71euXIunTo13ftY+fIRnEpvyBDrI/rii5qvT0REwkqFn4iI5BveW2td2qIu7e1t2/atW6QI1KtnUyVceCE0bGi3jzjCirzy5W2dg5k+fQutWkXsJe3zzTcwfjzcfbeFFRERCSMVfiIiEjgJCTB//iEsW3ZggRcfv2+9uDgr7ho2hLZtrbBLLfDq1YPixWP1CnJozx4b9vPww+GOO2KdRkRECiAVfiIiEgjx8TB1Krz/PnzyCcTHNwegaFGoX9+KuQ4d7Dq1wKtb16ZJy/eefBKWLLEDUKpUrNOIiEgBpMJPRERiZssW+Ogjm9z8889h926oVs26ZtatO5/evY+lbl0r/gqs5cthxAib0f3MM2OdRkRECqiC/KdUREQCaN06mDLFWva++QaSkqB2bRg82Gqfdu2sC+f06Zs44ohYp42CG26wkw2feSbWSUREpABT4SciIhG3cqUVepMnww8/2CAtDRrALbdAjx7QqlW6QSw/+YTDx42DNm2gRImY5Y64Dz+0y2OPWfUrIiISISr8REQkIv780wq999+HmTPtsWOPhXvusZa9Y47JYMaCtWutBWziROqAzafw4otRTh4lO3fC9dfb5H833hjrNCIiUsCp8BMRkbDwHubP39eyt2CBPX788fDII1bsZTpLQUoKvPoq3HabDek5YgQr586l9ksvWd/PCy+M2uuImgcfhH/+gW+/LSAj1IiISJCp8BMRkVzz3lrzUlv2li61Vrz27eHZZ+Hcc6FOnYPsZNEiGDjQ+oB27AgvvwxHHsmyr76i9tq1MGAANGtmLWMFxeLF8PjjcOmlNlSpiIhIhKnwExGRbEtKgrlz4fvvrU774QcbrKVoUTjlFLj1VujeHapXz8bOEhLg4YftUq4cvP469Ou3t/+nL1oU3n0Xmje3EwFnzLD18jvvbc6+MmWs+BMREYkCFX4iIpKpHTvgl1/2FXn/93/7JlCvXx86d4bTToNzzoGKFXOw42+/hUGDbO66iy6Cp56yeRzSq1kTJkywJxkwAN55J4MTA/OZd96BadNg5MiMX7OIiEgEqPATE
ZG9Nm60Au+dd47g9tth9mxr5XMOjjvOGuTat7fT7mrVysUTbNli5/G9+qpVjp99BmeckfU2nTrBAw/AnXfCSSfBtdfm6rUFwtatcPPNduLjgAGxTiMiIoWICj8RkULKe/j7732ted9/b6eeARQrdhgnnmg12kkn2awKFSrk8cnefddG7Ny0yXZ8zz1QunT2tr/9dvjpJyuaWrWCE0/MQ5gYuusuWL8ePvnEJisUERGJEhV+IiJBkJgI27fj9uyJ2FMkJ9uom6lF3g8/wJo1tqxCBWvF69/fCr0dO76nc+eTw/PEy5fDVVdZ616rVvD55zZYS04UKQJjx0KLFnDBBdYUWaVKePJFy+zZNjXF1VdDy5axTiMiIoWMCj8Rkdzw3k6A274988u2bVkvT3tJSACgTYUKVhi1bh2WmLt3wwcfwLhxVuht22aP16oFJ59s3TZPOgmOPtpqq1TTp/u8P3lSkg3teffd1lf0mWesm2ZuW7oqVoRJk6BtWzsvcOrU/NNqlpJixW+VKtZtVUREJMpU+ImI5MSePTbQyA8/WPGXHWXL2miU5cvbdblyULfuvtuplzJlSH7iCRse84MP4PTTcx1z8WJ45RV4803rWVmvnk2Fd9JJVuwddIqFvPr1VzuH7bffoGtXa+kKx5O2bAnPP28DwzzwgHUXzQ9efdVGJR03Lo99ZkVERHInooWfc64L8CwQB7zqvX8k3fJbgYvSZDkKqOq93+ycWw5sB5KBJO99q0hmFRHJltdft36S11yTcfGWQTG3X1PaQfxWty5t778fzj7bujb26ZPtbXftsvn0Ro+2iEWL2jx6AwfCqafmKEbuxcdbC9+zz9qIle+9Z1MxhHMkzgEDrPC+7z471+9gg8PE2oYNcMcd1sR60UUHX19ERCQCIlb4OefigBeB04FVwEzn3Ife+99T1/HePw48Hlr/HOAm7/3mNLvp5L3fGKmMIiI5sns3PPigjXTy/PMRmVZgT+XKNtVBt27WRLdx40FHsVywwFr3xo2zQTMbNIBHH7UROLM1n164fPKJnb+2YgUMHmzz80Widcs5GDXKWhMvusiua9cO//OEy+23W3fel17K/1NRiIhIvhXJ//+2BpZ675d57/cAE4DuWazfF3gngnlERPLmtddg5Uq4//7IfoFPPc/vnHPguuusBS1dt9KdO+GNN+x0t2OPtTrojDPgm29sarzbboti0bd2LfTubV06y5Sx5saRIyPbpbF0aWve3LMHevWy6yD64QdrJb7lFmjSJNZpRESkEItk4XcYsDLN/VWhxw7gnCsNdAEmp3nYA1845351zg2MWEoRkexISICHHrKT5E49NfLPV6qUFTaXXw4jRtjAIMnJzJ1rDYA1a8Jll8HmzfDEE7Bqlc0L3qlTlLp0ghWjr74KRx0FU6ZYQfzbb3aMouHII2HMGJthfsiQ6DxnTiQm2s+tTh2bxkFERCSGnM/u4AQ53bFzvYAzvPdXhu5fArT23l+Xwbq9gYu99+ekeaym936Nc64a8CVwnff+uwy2HQgMBKhevXrLCRMm7F0WHx9P2bJlw/zKci4IOYKQISg5gpAhKDmCkCG/5Djs/fdp+PzzzHnySf5r0SJ6Gbyn9sjXOOK9t/m8XDe6b3+XlGLFOfnkDXTtuobjjtsakcbHg/1MSqxdS6MnnqDSr7/yX9OmLLn5ZnaFecSY7L4vjnjxRWpPmsTCu+5iwymnhDVDTnKkV2viRBqMHMn8ESPYlMdiOD98RgpbjiBkCEqOIGQISo4gZFCO4GWIdo5OnTr9muH4KN77iFyANsDnae4PBYZmsu4HwIVZ7OteYMjBnrNly5Y+rWnTpvkgCEKOIGTwPhg5gpDB+2DkCEIG7/NBjp07va9Rw/uTT/Y+JSVqGX791ftBg7wvW9b7G3nKe/ArGnbym/7eGtEM6XPsJyXF+1GjLFSZMt6/9JL3ycnRzZDenj3et21rmRYtil2OtFautDxdu4blPRP4z0iUB
SFHEDJ4H4wcQcjgfTByBCGD98oRtAzeRzcHMMtnUCtFskPQTKChc66+c6440Af4MP1KzrlDgJOB/6V5rIxzrlzqbaAzsCCCWUVEMjd6NPz7r40iGeHBOXbsiOPll23WgpYtbTqGHj2g14834ceOo/bf31Pp/I6wbl1Ec2To779tKovBg22ewQULrCtj1PqWZqJYMZg40brH9uhhI4vG2k032TyGzz2nAV1ERCQQIjaqp/c+yTl3LfA5Np3DGO/9Qufc4NDyUaFVzwO+8N7vSLN5deADZ38siwLjvfefRSqriEimdu600SlPOcWG4w+j3bth0SKYP3/fZfr0tiQk2IAtzz9vg1ZWrBjaoO3FUKWyFTft2sEXX8Dhh4c1U4ZSUmz0mNtusyLm5ZdtSoUgFTSHHQbjx0PnzlaYjhsXu3yffWYTzT/wANSvH5sMIiIi6UR0Hj/v/VRgarrHRqW7/wbwRrrHlgFNI5lNRCRbRo2y1rVJk3K9i5QUWL58/wJv/nz44w9ITrZ1ihe3MVJOO20dw4fXpHXrTOqWM8+Er7+2ef7atbMio2kEf10uWwZXXAHTp9uE8q+8YvMXBtFpp9kAM3fdZcfmqquinyEhwUbfOfLIYA44IyIihVZECz8RkXxtxw545BErKLI5OMeGDQcWeAsX2q5S1a9vLXrnn2/Xxx4LDRtaj8Xp0//ghBNqZv0kbdrYNAFnnAEdOsBHH9l1OKWkwAsv2Bx0cXFW8F1xRbBa+TJy553w009w443QqhUcf3x0n//RR+Gvv+DLL6FEieg+t4iISBZU+ImIZOall6ySu+++Axbt3Am//35gkZf21LsqVayou/zyfQXe0UdDuXJhyNakCfz4oxV/nTvDu+9C96ymSs2Bv/6i2U03wbx5tv/Ro21KgvygSBHr5tmiBfTsCbNnQ+XK0XnupUutW3CfPvbPAhERkQBR4ScikpH4eHjsMSt82rYFrLvm+PFWY82fv29O9ZIlraA788x9Bd6xx9oE6hFtIKtTxyZLP/tsaz585RWrMnMrtZVv6FDKOmcT1l92WfBb+dKrXNm65p50ElxyCXz8ceQGoElMhG++scFl3n/f+uw++WRknktERCQPVPiJiGTkhRdg40a23nwf40fC229bAxtYPXH33fsKvCOOsN6QMVGlip3z17OndcVcv966Z+a0WPvzT9v+++/hzDOZedlltOnVKzKZo+H44+GZZ+Dqq+Ghh2D48PDtOykJvv3W/gPw/vuwaZM14557rp3fV/MgXXVFRERiQIWfiEg6O9duI+7Bx5lX7Szann0CSUnWs/Khh6BvX6hXL9YJ0ylbFj78EPr3h6FDrfh74onstXIlJ9uUA8OGWWvV669Dv37s/vbbiMeOuMGD7VzIu++GE0/MW/fL5GQriidOhMmT7RiXKQPdukHv3tYyXLJk+LKLiIiEmQo/ERGsEWfGjIqMGQNHvPs89+zZzL2l7uXGG21KhaZNA97jsXhxeOstqFoVnn7azk0cM8ZGjMnMH39YV86ffrLuoi+/bNMiFBTO2fmJc+bAhRfCb7/l7PWlpMBPP9Hguees4l+71uYK7NrVir2zzrL7IiIi+YAKPxEptLyHmTOtG+e778K6dU2pXX4rI3mSjW3O4cPvj49dF87cKFLEujdWr24teJs2wXvvWctUWsnJtt7w4dZKNXYsXHxxwCvbXCpTxlroWrWCCy6waSmyKoa9h19+sTfEe+/B6tXUKF7cir0LLrDr9MdTREQkH1DhJyKFzp9/WrE3frzdTv1e37TpAoYmvU+xEVso88K9kJ+KvlTO2ZQGVataV8fTTrPBTVJHtly82Fr5fv4ZzjnHWvlq1Iht5khr3NgGqunTxyahf/rp/Zd7D7NmWTfOiRNhxQp7U3TpAo89xk8VKtD+rLNik11ERCRMVPiJSKGwbp014rz1lrXyOQcdO9o4KD16QIUK8MPHyyl28VM2SEeLFjFOnEcDBtjAL337Qvv2MHWqt
WDddReULm0H4sILC2YrX0Z697bReZ55xiZ379HDuoCmFnvLlkHRojY1xogRdu5ehQoAJE+fHsPgIiIi4aHCT0QKrO3bYcoUa9376ivr4di0qc3S0Lcv1Kq1//q13nsPtm6Fe++NRdzwO+88+PxzK2IaNrQTGc89F0aOhEMPjXW66HviCZgxw1o877zTmnvj4qxVdNgwOzaVKsU6pYiISESo8BORAsF7WLnSvtf/8otdz5gBCQlQt6718LvoIptvL0ObN1Nr8mRrCWraNKrZI+rkk23qgSFDbLqGPn0KTytfesWLW6vnqafaHIi33mrFcZUqsU4mIiIScSr8RCRf+u8/67KZWuD98ot15wT7ft+8OQwaZNPbtW2bjZkNnnqKojt2wD33RDp69DVrZk2eArVr22imIiIihYwKPxEJvN27Yd68/VvylizZt7xRI5tGrXVruzRtasVftm3aBM8+y/qTT6basceGPb+IiIhIrKnwE5GDS0mxwUGKF7cqq3bt7E0Ongve26lXaVvy5syBPXtsefXqcMIJcMkldt2q1d4xOHLvySdhxw6W9+tHtTzuSkRERCSIVPiJSNY2bIB+/eDTT/c9VqqUDRbSuLEVgmkv5crlaPfbtsH//V9lvv7aCr2ZM2HLFltWurQVdjfcsK81r3btMJ+itmEDPPcc9O7Nzvr1w7hjERERkeBQ4Scimfv+exv+cuNGeP55OO4462O5eLFdz54NkyZZi2CqGjUyLgjr1iXtbOjx8fDsszbQ4n//HUuRInDMMXZOXmqR16SJjbAfUU88Abt22bl9a9dG+MlEREREYkOFn4gcKCUFHnkE7r4bDj/cJvtu1syWdeiw/7q7d8Nff1khmPby7rv7mu4ASpSAhg1JbtCIWdsbMfaXRsyKb0SXLo1o3Xk5Awc2o0yZqL1Cs349vPCCFbeNG6vwExERkQJLhZ+I7G/9ejuB7osvrCB6+eWsu2+WKGFNc02a7P+499ZSGCoEkxcuZuVXS0j+eD4tk6ZwAsm23mfw3+pjKdPjYyhTJ3KvKyOPPWbzPdx9d3SfV0RERCTKVPiJyD7Tp8OFF1pL3ejRcOWVuT+hzjmoWpXkSlV566+TuG8K/P23Ta3w0H2JnFx7mXUZnT+fsg8/DC1awPjx0LlzOF9R5tauhZdegosvhiOPjM5zioiIiMRIZIblE5H8JTkZ7r/fJrYuX96G0hwwIE+jqKSk2FzZxxwD/ftDxYo2MOgPP8DJpxWz8/66d4fhw/l11Cg7N7BLF8uR9pzBSHn0URsq9K67Iv9cIiIiIjGmwk+ksFu71ibBu+cea+2bNcsGcckl7+Hjj6FlS7jgApv1YdIk2+2ZZ2ZcS+6qXdvOI7z4YsvRtavNrRcpa9bAqFFw6aXQoEHknkdEREQkIFT4iRRiFX791QZt+ekneO01GDsWypbN9f6+/tq6cp5zjk3TMG6cTbzeo0c2Gg/LlIE337SC7OuvrXKcNSvXWbL0yCOQlATDh0dm/yIiIiIBo8JPpDBKToZ77qHprbdCpUo2ed7ll+e6a+f//R+ccgqcdhqsWmXjwSxebA14aWZwODjnYNAg6w/qPbRrZzvzPle5MrRqlZ2/2K+fjVgqIiIiUgio8BMpbNassQrt/vtZe8YZVvQdfXSudvXbb9Yrs21bWLgQnnkG/vwTBg6EYsXykPH4422OwFNOgcGD7STBnTvzsMM0Hn7YCl+19omIiEghosJPpDD54gvr2jljBrzxBktuv53cTJ63aBH06mUDcf74Izz0kE3ld8MNULJkmLJWrgyffAL33Wd9Rk880arKvFixAl591Vo369ULS0wRERGR/ECFn0hhkJQEw4bZqJnVqlkrX79+Od7NsmW22THHwGef2YCYf/8NQ4fm6dTAzBUpYnPsffoprF4NrVrBBx/kfn8PPWTdRocNC19GERERkXxAhZ9IQbdqlXWZfOgha+maMePAydYz4b0Vdu+8Y5s2agQTJ8LNN1sRe
P/9UKFCZOMDNurob79B48Zw/vlw661WzObE8uUwZozNTVgnyhPFi4iIiMRYRCdwd851AZ4F4oBXvfePpFt+K3BRmixHAVW995sPtq2IZMOnn8Ill0BCArz1Flx0UZarb9tmjYG//GKzK/z8M2zYYMtKl7Zz94YNg5o1o5A9vTp14LvvrOp84gkrYCdMsPn/suPBB23wmDvvjGxOERERkQCKWOHnnIsDXgROB1YBM51zH3rvf09dx3v/OPB4aP1zgJtCRd9BtxWRLCQm2uAljz1mc/JNnGjNdWkkJ8OyZWX48899hd7vv+8bQLNxYzjrLDu17oQTrHtnngZsCYcSJeDFF200mYED7STDd9+FDh2y3u7vv+GNN2ygmFq1ohJVREREJEgi2eLXGljqvV8G4JybAHQHMive+gLv5HJbEUm1YgX07Wtz8w0aBE8/DaVKsXatFXipRd7MmRAffzxgMzqccIJNuH7iiTaoZsWKMX4dWbnoImja1CYIPOUUm5fvllsyn47igQdsXomhQ6ObU0RERCQgIln4HQasTHN/FXBCRis650oDXYBrc7qtiISkpFh3zptuwu/Zw5/3vcMn5frwy2VW6P3zj61WtKgN7NmvHxxyyCL69z+KBg1yPYVf7BxzzL75B2+91Qrd11+HQw7Zf72lS21i+GuvjVEfVREREZHYcz6cEyOn3bFzvYAzvPdXhu5fArT23l+Xwbq9gYu99+fkYtuBwECA6tWrt5wwYcLeZfHx8ZSNyFCDOROEHEHIEJQcQcgQ7hwVZ82izgujqfjPn8wt2ZI+iW+zONm6dlavnsBRR23jqKO20aTJNho2jKdEiZSwZ8iLPOXwnlqTJnHEqFHsqlGDhffdx44jjti7uPEjj1B12jR+GT+ePZUrRy5HmAQhQ1ByBCFDUHIEIYNyBC9DUHIEIUNQcgQhg3IEL0O0c3Tq1OlX732rAxZ47yNyAdoAn6e5PxQYmsm6HwAX5mbbtJeWLVv6tKZNm+aDIAg5gpDB+2DkCEIG78OTY/uPc/3KY87wHvwy6vnevONPapvsb7/d+w8+8H7NmshnCIew5PjuO+9r1PC+VCnvx461x5Ys8b5IEe9vvjl6OfIoCBm8D0aOIGTwPhg5gpDBe+UIWgbvg5EjCBm8D0aOIGTwXjmClsH76OYAZvkMaqVIdvWcCTR0ztUHVgN9gAvTr+ScOwQ4Gbg4p9uKFEbJyfD9O6tw99xF+2VvsocKPFzlSdy11/Bo/xLUrRvrhDHSvj3Mnm3nN156qc0s/99/NiDMbbfFOp2IiIhITEWs8PPeJznnrgU+x6ZkGOO9X+icGxxaPiq06nnAF977HQfbNlJZRfKDBQtg4itbqfLaowzY8TQOz9fNhlDh0aHccXrF/HeOXiQceih8+aWNaProo/bYkCFQvXpsc4mIiIjEWETn8fPeTwWmpntsVLr7bwBvZGdbkcJmwwYYPx7Gv7GH1nNe5m7upyobWdH+Iqq/+iCnH1lYm/eyULSojfLZpo1N2K7WPhEREZHsFX7OuTLALu99inPuSKAx8Kn3PjGi6UQKod274aOPYOxY+HSqp3vyZN4rMZQ6LGVP+1Pgmcep06JFrGMGX/fudhERERERimRzve+Aks65w4CvgcvIoJVORHLHe5ty4aqroEYN6NUL3E8/srR6WybRizoNS8LUqRT/9iubtFxEREREJAey29XTee93OueuAJ733j/mnPstksFECoN//oFx46x1788/oVQpuOa0JQzZPJTqP35g88699ppNuhcXF+u4IiIiIpJPZbvwc861AS4CrsjhtiKSRmIivPsuPPlkU+bMscc6doT7rl5HjwX3UfyN0VC6NDz4INx4o90WEREREcmD7BZvN2Jz6X0QGpnzcGBaxFKJFEC7dlnj3eOPw4oVUKtWCUaMgEt77KDOpKfgrscgIQEGD4a774Zq1WIdWUREREQKiGwVft77b4FvAZxzRYCN3vvrIxlMpKDYuhVee
gmeftpG6WzXzu6XLvF/dFr+N5x6N/z7L/ToAQ89BEceGevIIiIiIlLAZGtwF+fceOdc+dDonr8DS5xzt0Y2mkj+tm4dDB0KderAnXdCy5bw3Xfwww9wdqlvaD3gShgwAOrXt8nGJ01S0SciIiIiEZHdUT2beO+3Aedic+vVAS6JVCiR/Gz5crj2WqhXz+YQP+MMmD0bPv0U2rcHfvkFzjwTl5gIkydbJdi2bYxTi4iIiEhBlt3Cr5hzrhhW+P0vNH+fj1gqkXzo99/h0kuhQQMYPRouuggWL4aJE6F589BK69ZZl87DDmP2Sy/B+eeDczHNLSIiIiIFX3YHd3kZWA7MBb5zztUFtkUqlEh+MmMGPPwwTJliA3Bedx3ccgvUqpVuxcRE6N0bNm+Gn34i6b//YpBWRERERAqjbLX4ee+f894f5r0/y5t/gE4RziYSWN7D11/DaafBCSfA9Ok2EOc//9ggLgcUfQC33QbffguvvALNmkU5sYiIiIgUZtlq8XPOHQLcA3QIPfQtcD+wNUK5RAIpJQU+/NBa+GbMgBo1bHqGQYOgXLksNhw/Hp55Bm64wfqAioiIiIhEUXbP8RsDbAcuCF22Aa9HKpRI0CQmwtixcOyxcN55sHEjjBoFy5bBkCEHKfrmzoUrr4QOHaxKFBERERGJsuye43eE975Hmvv3OefmRCCPSKDs2gVjxli99s8/VviNHw+9ekHR7Hx6Nm+2SrFiRRvlpVixiGcWEREREUkvu4XfLufcSd77HwCcc+2AXZGLJRJ7U6fC5ZfbQJxt28ILL8DZZ+dgEM7kZOvWuWqVTeBXvXpE84qIiIiIZCa7hd9gYGzoXD+ALUC/yEQSia2UFLj/frscd5w11LVvn4tZF+65Bz77DF5+GU48MSJZRURERESyI1uFn/d+LtDUOVc+dH+bc+5GYF4Es4lE3ZYtcPHF1tp3ySV2Hl/p0rnY0ZQp8OCDcMUVMGBAuGOKiIiIiORIdgd3Aazg896nzt93cwTyiMTMnDnQqhV8+SW89BK8+WYui77Fi20m9+OPt/6hmqBdRERERGIsR4VfOvo2K8GQkmKjsOTBuHHQpg0kJNhUe1ddlct6bft2G8ylZEmYPNmuRURERERiLC+Fnw9bCpHcSkiAzp3hsMOs0MqhPXvgmmusge7EE2H2bCsAc8V76N8f/vzTTgysXTuXOxIRERERCa8sCz/n3Hbn3LYMLtuBmlHKKJKxxES44AL4+msbMbNnTzufbseObG2+ejWcfLJ167zlFuvimaeBNx95BN5/3+Z+6NgxDzsSEREREQmvLAs/73057335DC7lvPfZHRFUJPxSUqx17aOP4MUXYd48GDoUXnsNWrSAX3/NcvNvv7XV5s+3xrknnsjmvHyZ+eILGDYM+vSBG2/Mw45ERERERMIvL109RWLDe7j2WptJ/aGH4OqrbWL0hx6Cb76BnTutv+Zjj1mBmG7TiRNrceqpNqf6jBk2GXue/P23FXzHHguvvqrBXEREREQkcFT4Sf4zbBiMHAm33QZ33LH/so4dYe5c6N4dbr8dTjvNJlAH4uOtPhs5sgHdu1vR16RJHrPs3Annn28V5fvvQ5kyedyhiIiIiEj4qfCT/OXRR+Hhh2HQIDunLqPWtUqVrP/ma69ZdXfccax+/n1at4ZJk2DgwL+YNAnKl89jFu8tx9y51vp4xBF53KGIiIiISGSo8JP8Y9Qoa+Hr29fO68uqS6VzcPnl8NtvbKl0BIdd34Ohywbw1f920LfvyvD0xnz+eXjrLbj/fjjzzDDsUEREREQkMlT4Sf4wfrydy3f22TazelzcQTdJSoKhYxpS/a8feePQO7h4z2t0uqUFZZcsyXue776zoUC7dYM778z7/kREREREIiiihZ9zrotzbolzbqlz7o5M1unonJvjnFvonPs2zePLnXPzQ8tmRTKnBNxHH9lEex06wHvv2UAuB7FhA3TpYr1BLx9UnL7LH8Z9/TXs2EGLa
6/NcOCXbFu92kaEOfxwGDsWiuj/JyIiIiISbBH7xuqciwNeBM4EmgB9nXNN0q1TAXgJ6Oa9PxpIP75iJ+99M+99q0jllICbNs2KrBYt4MMPoVSpg24yYwa0bAk//ABjxlgP0RIlgE6dYN48NrVtawO/nH66FXE5sXu3zRe4cyd88AEcckjuXpeIiIiISBRFsqmiNbDUe7/Me78HmAB0T7fOhcD73vsVAN779RHMI/nNjBnWlfKII+DTTw86Gov3MHo0tG9vjXA//QSXXZZupUqVWHjvvTbtws8/w3HHWQGXXTfcYNu98UYYhgQVEREREYmOSBZ+hwEr09xfFXosrSOBis656c65X51zl6ZZ5oEvQo8PjGBOCaIFC2zAlKpV4csvoXLlLFdPSIArr7RBNjt1svnbW7TIZGXn4Ior4LffoH59m45h4EDYsSPrTK+9Bi+/bAPM9OiRu9clIiIiIhIDznsfmR071ws4w3t/Zej+JUBr7/11adZ5AWgFnAqUAv4PONt7/4dzrqb3fo1zrhrwJXCd9/67DJ5nIDAQoHr16i0nTJiwd1l8fDxly5aNyOvLiSDkCEKG7OYouXo1zW+4AZzjt2efJaFmzSzXX7u2BPfccwx//FGOSy5ZTr9+y7Mc+yVtBpeYSP3XX6f2hAnsqlWL34cNI75RowO2KbdoEc1vuIH/mjZl3iOPZGtwmYMJws8kCBmUI3gZgpIjCBmCkiMIGZQjeBmCkiMIGYKSIwgZlCN4GaKdo1OnTr9meKqc9z4iF6AN8Hma+0OBoenWuQO4N83914BeGezrXmDIwZ6zZcuWPq1p06b5IAhCjiBk8D4bOVat8r5ePe8rV/Z+4cKD7m/RIu8PO8z78uW9/9//8pDhm29sR8WKef/oo94nJ+9btm6d97VqWa6NG7P3JLnNEWVByOC9cgQtg/fByBGEDN4HI0cQMnivHEHL4H0wcgQhg/fByBGEDN4rR9AyeB/dHMAsn0GtFMmunjOBhs65+s654kAf4MN06/wPaO+cK+qcKw2cACxyzpVxzpUDcM6VAToDCyKYVYJg40YbcGXTJvjss4OeQzdvHpx8MiQm2kAu3brl4bk7dbKJ2M85xwZ+6dzZBn5JSoLevS3b++8ftMupiIiIiEgQFY3Ujr33Sc65a4HPgThgjPd+oXNucGj5KO/9IufcZ8A8IAV41Xu/wDl3OPCBs1m2iwLjvfefRSqrBMC2bTb/wt9/W9HXKuuBXGfOhDPOgNKl4euvIYPemTlXuTJMmmTn8t1wgw380r49TJ9u0zY0bx6GJxERERERib6IFX4A3vupwNR0j41Kd/9x4PF0jy0DmkYymwTIrl3W0jZ3LkyZYs14WfjhBzjrLKhSxYq++vXDmMU5GyWmfXu46CL43//guuvgkkvC+CQiIiIiItEV0cJP5KD27LF58b7/HsaPh7PPznL1r76yLp116tjtWrUilKtRI5sPYto0OOWUCD2JiIiIiEh0RPIcP5GsJSfDpZfC1Kk2y3qfPlmu/tFH0LUrNGgA334bwaIvVfHi1p+0WLEIP5GIiIiISGSp8JPY8B6uugrefRcee8zm0cvCxIk23d5xx9kpd9WrRyemiIiIiEhBoMJPos97uO02eOUVGDYMbr01y9XffBP69oUTT7TunZUqRSmniIiIiEgBocJPou/hh+GJJ+Caa2DEiCxXHTkS+ve30+w++wzKl49ORBERERGRgkSFn0TVYR98YK18F18Mzz1no2hm4skn4eqrbcDPjz6CMmWiGFREREREpABR4SfRM24cDZ97Drp3h9dfhyIZv/28h/vvhyFD4IILYPJkKFkyyllFRERERAoQFX4SHT/9BJdfzpbmzWHCBCia8Uwi3sMdd8A990C/fjbDgwbVFBERERHJGxV+EnkbN0Lv3lCnDgvvvz/T5ruUFJsr/bHHbMDPMWMgLi7KWUVERERECiBN4C6RlZICl1wC69fD//0fSdu2ZbhacjIMGGA9Q
IcMseIvi9P/REREREQkB9TiJ5H16KM2HOczz0CLFhmukpgIF11kRd8996joExEREREJN7X4SeR8+y0MHw59+sDgwRmukpBgvUA//NAKvoNM6SciIiIiIrmgwk8iY906K/gaNIDRozNswtu5E849F778El54wab1ExERERGR8FPhJ+GXnAwXXgj//QdffAHlyh2wyvbt0LUr/PCDDeJy2WXRjykiIiIiUlio8JPwu/9++OYbq+iOPfaAxVu2QJcuMHu2TdfQu3cMMoqIiIiIFCIq/CS8vvwSRoywSfgyaMbbsqUYnTrBokU2MXu3bjHIKCIiIiJSyKjwk/BZvdqG52zSBF58McPFN97YjA0b4KOPoHPnGGQUERERESmENJ2DhEdSkg3msnMnvPcelCmzd5H3Nmpn27awYUMJPvtMRZ+IiIiISDSp8JPwGD7cRmoZPRqOOmrvw4sXw5lnQvfuULYsPP30XDp0iGFOEREREZFCSIWf5N0nn9hE7YMG2WiewNatcMstNrbLzz/b/O1z5kCjRttjGlVEREREpDDSOX6SN//8A5dcAs2awTPPkJICb74Jd9wBGzbAlVfCgw9C1aqxDioiIiIiUnip8JPc27PH5mJISoL33uOXuSW5/nqYMQPatIGpU6Fly1iHFBERERERdfWU3Lv9dvjlF7Y8NYbLHmzAiSfCypUwbhz8+KOKPhERERGRoFCLn+TO++/DM88w+6Tr6XhzTxISrA4cNgzKlYt1OBERERERSUuFn+TcX3+ReOnl/F6yNSf+8Didz4ann4aGDWMdTEREREREMqLCT3Lkr4UJ+HYXUHmH48Z67/LBC8U5++xYpxIRERERkazoHD/Jlvh4uPNO+Oq4m2mwdTbT+r3J50vqqegTEREREckHIlr4Oee6OOeWOOeWOufuyGSdjs65Oc65hc65b3OyrUSe9zB+PDRqBH8//A6DUkYSf9WtnP9GN4oXj3U6ERERERHJjogVfs65OOBF4EygCdDXOdck3ToVgJeAbt77o4Fe2d1WIu+336B9e7joIjix4hLeKj0Q2rWj7LMPxjqaiIiIiIjkQCRb/FoDS733y7z3e4AJQPd061wIvO+9XwHgvV+fg20lQjZsgEGDbDqGP/6A11/cyaQivYgrXRImTIBixWIdUUREREREcsB57yOzY+d6Al2891eG7l8CnOC9vzbNOs8AxYCjgXLAs977sdnZNs0+BgIDAapXr95ywoQJe5fFx8dTtmzZiLy+nAhCjuxmmDPnEO666xh27izK+eevol+/f2j50kMc+tlnzHvkEba0bh2VHJEUhAxByRGEDMoRvAxByRGEDEHJEYQMyhG8DEHJEYQMQckRhAzKEbwM0c7RqVOnX733rQ5Y4L2PyAXrtvlqmvuXAM+nW+cF4GegDFAF+BM4MjvbZnRp2bKlT2vatGk+CIKQIzsZVqzwvkoV7xs18n7hwtCDr7/uPXg/fHjUckRaEDJ4H4wcQcjgvXIELYP3wcgRhAzeByNHEDJ4rxxBy+B9MHIEIYP3wcgRhAzeK0fQMngf3RzALJ9BrRTJ6RxWAbXT3K8FrMlgnY3e+x3ADufcd0DTbG4rYZSQAD16wO7dMGUKNG4MLFgAV18NnTrBvffGOKGIiIiIiORWJM/xmwk0dM7Vd84VB/oAH6Zb539Ae+dcUedcaeAEYFE2t5Uwuv56mDkT3nwzVPTFx0OvXlC+vA3rGRcX64giIiIiIpJLEWvx894nOeeuBT4H4oAx3vuFzrnBoeWjvPeLnHOfAfOAFKx75wKAjLaNVNbC7pVX7DJ0KJx3HjaHw6BBNrLLV1/BoYfGOqKIiIiIiORBJLt64r2fCkxN99iodPcfBx7PzrYSfjNmwLXXwumnw4gRoQdfecVa+UaMsG6eIiIiIiKSr0V0AncJtvXr7by+GjXgnXdCvTkXLLB+n2ecAXfeGeuIIiIiIiISBhFt8ZPgSkqC3r1h40b48UeoXDm04J57oGRJG
DcOiuj/AiIiIiIiBYG+2RdSd9wB06fDqFHQokXowd9/h/ffh+uug6pVYxlPRERERETCSIVfITRxIjz5JFxzDfTrl2bBww9D6dJwww0xyyYiIiIiIuGnwq+QWbAALr8c2raFp55Ks2DZMjvRb/BgqFIlZvlERERERCT8VPgVIv/9B+efD+XKwXvvQfHiaRY+9piN7nLLLbGKJyIiIiIiEaLBXQqJlBS49FL4+2/45huoWTPNwtWr4fXX4bLL0i0QEREREZGCQIVfIfHWW3X56CN47jlo3z7dwqeeguRkuP32mGQTEREREZHIUlfPQuDTT+GNN+px8cU2Wft+Nm60oT0vvBDq149JPhERERERiSwVfgXcX39ZTXf44Tt4+WVwLt0Kzz4LO3fa/A4iIiIiIlIgqfArwHbutMFcnIP7719A6dLpVti2DZ5/3lZq0iQmGUVEREREJPJ0jl8B5T0MGADz58PUqVCyZMKBK730EmzdCnfeGf2AIiIiIiISNWrxK6Cefx7Gj4f774cuXTJYYedOG9TljDOgZcuo5xMRERERkehR4VcAffedTcfXrVsWjXmvvQYbNsCwYVHNJiIiIiIi0afCr4BZvRouuMAG6Bw7Fopk9BPes8cmbG/fPoO5HUREREREpKDROX4FyJ490LMnxMfD11/DIYdksuK4cbBqFbzySlTziYiIiIhIbKjwK0Buugl+/hkmToSjj85kpeRkeOQRO6/vjDOimk9ERERERGJDhV8B8cYbNkjnrbdCr15ZrPjee7B0KUyenMGkfiIiIiIiUhDpHL8CYPZsGDwYTjkFHnooixVTUmyFo46Cc8+NVjwREREREYkxtfjlcxs32vzr1arBhAlQNKuf6Mcf28R+mY76IiIiIiIiBZEKv3wsORn69oV//4UffoCqVbNY2Xt48EEb7rNv36hlFBERERGR2FPhl48NHw5ffQWvvgrHH5/1uhVmz4YZM2DUqIM0C4qIiIiISEGj/n751Pvv2+CcAwfCFVccfP26b78NNWpAv36RDyciIiIiIoGipp98aOFCq99at4bnnsvGBv/3f1T87Td48kkoWTLi+UREREREJFjU4pfPbN4M3btD2bLW6leiRDY2eughEsuXt+ZBEREREREpdFT45SNJSTYuy4oVNg3fYYdlY6O5c+Hjj1nVo4dViyIiIiIiUuhEtPBzznVxzi1xzi11zt2RwfKOzrmtzrk5ocvdaZYtd87NDz0+K5I584uhQ+GLL2yi9rZts7nRww9DuXKsPu+8iGYTEREREZHgitg5fs65OOBF4HRgFTDTOfeh9/73dKt+773vmsluOnnvN0YqY37y9tvwxBNw9dVw5ZXZ3OiPP2DiRLj9dpLKlYtoPhERERERCa5Itvi1BpZ675d57/cAE4DuEXy+AuvXX63Y69ABnnkmBxs+8oidBHjjjRFKJiIiIiIi+UEkC7/DgJVp7q8KPZZeG+fcXOfcp865o9M87oEvnHO/OucK7agk69fDeedBtWrw3ntQrFg2N1yxAsaNgwEDoHr1iGYUEREREZFgc977yOzYuV7AGd77K0P3LwFae++vS7NOeSDFex/vnDsLeNZ73zC0rKb3fo1zrhrwJXCd9/67DJ5nIDAQoHr16i0nTJiwd1l8fDxlAzCgSW5zJCY6hgxpypIl5Xj++d9o2DA+29s2eO45an70Eb+8/Ta7q1XL98eioGUISo4gZFCO4GUISo4gZAhKjiBkUI7gZQhKjiBkCEqOIGRQjuBliHaOTp06/eq9b3XAAu99RC5AG+DzNPeHAkMPss1yoEoGj98LDDnYc7Zs2dKnNW3aNB8Euc1x1VXeg/fjx+dww7VrvS9Z0vsrrshzhnALQo4gZPA+GDmCkMF75QhaBu+DkSMIGbwPRo4gZPBeOYKWwftg5AhCBu+DkSMIGbxXjqBl8D66OYBZPoNaKZJdPWcCDZ1z9Z1zxYE+wIdpV3DOHeqcc6HbrbGup5ucc2Wcc+VCj
5cBOgMLIpg1cF55BUaOhNtusykccuTpp2HPHrj99ohkExERERGR/CVio3p675Occ9cCnwNxwBjv/ULn3ODQ8lFAT+Aq51wSsAvo4733zrnqwAehmrAoMN57/1mksgbNjz/CNddAly7w0EM53HjLFpvv4YILoGHDiOQTEREREZH8JWKFH4D3fiowNd1jo9LcfgF4IYPtlgFNI5ktqFatgh49oG5dGD8e4uJyuIPnn4ft223SPxERERERESJc+EnO7NplI3ju2AHffAMVK+ZwB/Hx8OyzcM45cNxxEckoIiIiIiL5jwq/gPAeBg2CWbNgyhRo0iQXO3n5Zdi8GYYNC3c8ERERERHJxyI5uIvkwDPP2LR7990H3XMzzX1CAjzxBJx6KpxwQrjjiYiIiIhIPqYWvwD46isYMsS6eQ4fnsudvP46rF0Lb78d1mwiIiIiIpL/qcUvxpYtg9694aij4M03oUhufiKJifDYY3DiidCpU9gzioiIiIhI/qYWvxiKj7dund7D//4H5crlckfvvAPLl9uInjYFhoiIiIiIyF4q/GLEe+jfH37/HT77DI44Ipc7SkmBhx+2UTzPPjucEUVEREREpIBQ4RcjDz4IkyfDk0/C6afnYUcffACLF8OECWrtExERERGRDOkcvxj46CO46y64+GK46aY87Mh7qyCPPBJ69gxbPhERERERKVjU4hdlixbBRRdBq1YwenQeG+k++wx++w3GjIG4uLBlFBERERGRgkUtflH03382mEupUvD++3adJw89BLVrWyUpIiIiIiKSCbX4RUlyMlx4oQ2++c03Vq/lyXffwQ8/2EiexYuHI6KIiIiIiBRQavGLkuHD4dNPrU476aQ87uzzz+GKK6BaNbsWERERERHJggq/KPjmm6o88ggMGmSXXFu0CM46C7p0sYFdJkwIQ39REREREREp6FT4RdicOfDYY4056SR47rlc7mTTJrjuOjj2WPjpJ3jiCVi4EDp1CmdUEREREREpoHSOXwRt2ADnngvlyycyaVJczk/F27MHXnwR7r8ftm2DwYPh3nuhatUIpBURERERkYJKhV8E7doFNWpA//4LqV69ZfY39B4+/hhuuQX+/BM6d4annoKjj45cWBERERERKbDU1TOC6tSxnpmNGm3P/kbz5sHpp0O3bjY33yef2Hx9KvpERERERCSXVPhFWLYnaF+3DgYOhObNbVL255+3IvCss/I4y7uIiIiIiBR26uoZawkJ8Oyz8OCD1jf0+uvhrrugUqVYJxMRERERkQJChV+seA+TJ8Ntt8Hff8M558Djj0OjRrFOJiIiIiIiBYy6esbCr7/CySdDr15Qpgx8+SV8+KGKPhERERERiQgVftG0Zg307w/HHw+LF8OoUXY+32mnxTqZiIiIiIgUYOrqGQVFEhJgxAh45BFISoJbb4U774RDDol1NBERERERKQRU+EVSSgq88w6tb7rJZnPv0QMeewwOPzzWyUREREREpBBR4RdJ8+bBxReT2LAhJSdNgg4dYp1IREREREQKIRV+kdSsGUybxq8pKXRU0SciIiIiIjES0cFdnHNdnHNLnHNLnXN3ZLC8o3Nuq3NuTuhyd3a3zTc6doQiGkNHRERERERiJ2Itfs65OOBF4HRgFTDTOfeh9/73dKt+773vmsttRURERERE5CAi2RTVGljqvV/mvd8DTAC6R2FbERERERERSSOShd9hwMo091eFHkuvjXNurnPuU+fc0TncVkRERERERA7Cee8js2PnegFneO+vDN2/BGjtvb8uzTrlgRTvfbxz7izgWe99w+xsm2YfA4GBANWrV285YcKEvcvi4+MpW7ZsRF5fTgQhRxAyBCVHEDIEJUcQMihH8DIEJUcQMgQlRxAyKEfwMgQlRxAyBCVHEDIoR/AyRDtHp06dfvXetzpggfc+IhegDfB5mvtDgaEH2WY5UCU323rvadmypU9r2rRpPgiCkCMIGbwPRo4gZPA+GDmCkMF75
QhaBu+DkSMIGbwPRo4gZPBeOYKWwftg5AhCBu+DkSMIGbxXjqBl8D66OYBZPoNaKZJdPWcCDZ1z9Z1zxYE+wIdpV3DOHeqcc6HbrbGup5uys62IiIiIiIhkT8RG9fTeJznnrgU+B+KAMd77hc65waHlo4CewFXOuSRgF9AnVKVmuG2ksoqIiIiIiBRkEZ3A3Xs/FZia7rFRaW6/ALyQ3W1FREREREQk5zSzuIiIiIiISAGnwk9ERERERKSAi9h0DrHgnNsA/JPmoSrAxhjFSSsIOYKQAYKRIwgZIBg5gpABlCNoGSAYOYKQAYKRIwgZQDmClgGCkSMIGSAYOYKQAZQjaBkgujnqeu+rpn+wQBV+6TnnZvmM5rAohDmCkCEoOYKQISg5gpBBOYKXISg5gpAhKDmCkEE5gpchKDmCkCEoOYKQQTmClyEoOdTVU0REREREpIBT4SciIiIiIlLAFfTCb3SsA4QEIUcQMkAwcgQhAwQjRxAygHKkFYQMEIwcQcgAwcgRhAygHGkFIQMEI0cQMkAwcgQhAyhHWkHIAAHIUaDP8RMREREREZGC3+InIiIiIiJS6BWows8518s5t9A5l+Kcy3TUHOfccufcfOfcHOfcrDA9dxfn3BLn3FLn3B0ZLHfOuedCy+c551qE43nTPccY59x659yCTJZ3dM5tDb3uOc65uyOQoaRzboZzbm7oZ3FfButE/Fikea4459xvzrmPM1gW8eMRep4KzrlJzrnFzrlFzrk26ZZH9Hg45xqleY1znHPbnHM3plsnWsfiBufcgtB748YMlkfkWGT02XDOVXLOfemc+zN0XTGTbcPy+yKTDI+H3hfznHMfOOcqZLJtlr9fwpBjRCjDHOfcF865mplsG8ljca9zbnWa9+BZmWwb0WMRevy60HMsdM49lsm2kTwWzZxzP6fu2znXOpNtI/2+aOqc+7/Q6/zIOVc+k23DdSxqO+emhX5PLnTO3RB6PLt/28NyPLLIkd3Pa56PR2YZ0iwf4pzzzrkqmWwf6WOR3c9rxI6Fc+7dNM+/3Dk3J5PtI30ssvt5DdfnJMPvWS77f9PyfDyyyBDV7+FZ5Mju37SIHYs0yw/2WQ17TZIl732BuQBHAY2A6UCrLNZbDlQJ4/PGAX8BhwPFgblAk3TrnAV8CjjgROCXCLz+DkALYEEmyzsCH0f4Z+CAsqHbxYBfgBOjfSzSPNfNwPiMXnc0jkfoed4ErgzdLg5UiOHxiAPWYvO7RPu9cQywACgNFAW+AhpG41hk9NkAHgPuCN2+A3g0k23D8vsikwydgaKh249mlCE7v1/CkKN8mtvXA6NicCzuBYZk4/0b6WPRKfTeLBG6Xy0Gx+IL4MzQ7bOA6TE6FjOBk0O3LwdGRPhY1ABahG6XA/4AmpCNv+3hPB5Z5Djo5zVcxyOzDKH7tYHPsXmLD3ieKB2Lg35eo3Es0qzzJHB3jI7FQT+vYf6cZPg9i2z8TQvX8cgiQ1S/h2eR46B/0yJ9LEL3s/yshvNYZPdSoFr8vPeLvPdLYvDUrYGl3vtl3vs9wASge7p1ugNjvfkZqOCcqxHOEN7774DN4dxnLjJ473186G6x0CX9iaQRPxYAzrlawNnAq+Hedw4ylMe+UL0G4L3f473/L91qUTkeIacCf3nv/4nQ/rNyFPCz936n9z4J+BY4L906ETkWmXw2umNFOaHrc/P6PDnN4L3/InQsAH4GamWwaXZ+v+Q1x7Y0d8tw4Gc2rPLwuyrixwK4CnjEe787tM763O4/Dxk8kNq6dgiwJoNNo3EsGgHfhW5/CfTI7f6zmeFf7/3s0O3twCLgsGz+bQ/b8cgiR3Y+r2GRWYbQ4qeB28j8cxrxY5GbfeXWwTI45xxwAfBOBptH41hk5/MaNll8z8rO37SwHI/MMkT7e3gWObLzNy2ixyJ0/2Cf1agrUIVfDnjgC+fcr865gWHY32HAyjT3V3HgL8bsrBMNbULN0Z86546OxBM46145B
1gPfOm9/yXdKtE6Fs9gH7iULNaJ9PE4HNgAvO6sy+mrzrky6daJ5nujDxn/cYTIH4sFQAfnXGXnXGnsP6O1060TzWNR3Xv/L9gfdKBaJuuF+/dFZi7HWjvTi8oxcc496JxbCVwEZNbVN9LH4tpQ95wxmXRTisaxOBJo75z7xTn3rXPu+EzWi+SxuBF4PPTzeAIYmsE60TgWC4Buodu9OPDzmirsx8I5Vw9ojv33PDsicjyyyJHZ5xXCfDzSZnDOdQNWe+/nZrFJtI7FwT6vEMFjkebh9sA67/2fGWwSjWNxIwf/vEIYj0Um37Oy8zctbMcjG9/1shLpY5Gdv2kRPRbZ/KxC9L5jAPmw8HPOfeXsHKH0l5xU6e289y2AM4FrnHMd8horg8fSV/fZWSfSZmNd/JoCzwNTIvEk3vtk730z7L+hrZ1zx6RbJeLHwjnXFVjvvf81i9WicTyKYt2nRnrvmwM7sC4Y+8XNYLuwvzecc8WxL3HvZbA44sfCe78I6x71JfAZ1q0iKd1qQficpBfu3xcHcM4Nw47F2xktzuCxsB8T7/0w733tUIZrM1ktksdiJHAE0Az4F+u6lV40jkVRoCLWXehWYGKoRSG9SB6Lq4CbQj+Pmwj1GEgnGsficuy1/Yp1bduTyXphPRbOubLAZODGdP+5z3KzDB7L0/HILMdBPq8QxuORNkPoOYeR+T9m9m6WwWPhPhbZ+bxChI5FuvdFXzL/h2Y0jkV2Pq8QxmORje9ZmcbPaHdRzgBROBbZ+JsWyWNxHNn7rEIUvmOkle8KP+/9ad77YzK4/C8H+1gTul4PfIA19+bFKvb/T2gtDmzqz846EeW935baHO29nwoUy+xk0zA9339YP+8u6RZF41i0A7o555ZjzfenOOfeSpcvGsdjFbAqzX/CJmGFYPp1ovHeOBOY7b1fl35BtN4b3vvXvPctvPcdsG5l6f9DG83PybrUbqSh6wy79EXg98V+nHP9gK7ARd77jP7oRPt3x3gy6dIXyWPhvV8X+uOZArySyb6jcSxWAe+Huu/MwHoMHPBZiPD7oh/wfuj2e5nsO+LHwnu/2Hvf2XvfEvti/Vcm64XtWDjnimFfqt/23r9/sPXTCOvxyCxHNj6vYTseGWQ4AqgPzA39basFzHbOHZpu04gfi2x+XiN5LFIfLwqcD7ybyabReF9k5/Makd8Z6b5nZedvWth/b2TxXS+rbSJ9LNLK7G9aJI9Fd7L3WY34d4z08l3hl1fOuTLOuXKpt7GTtTMcBTMHZgINnXP1Q60qfYAP063zIXCpMycCW1Ob5KPFOXdo6n+unY06VQTYFObnqOpCI50550oBpwGL060W8WPhvR/qva/lva+H/Ty+8d5fnC5rxI+H934tsNI51yj00KnA7+lWi9Z7I9P/ikbjWIT2XS10XQf7Y50+TzQ/Jx9if7AJXR/wz6MI/b5Iu/8uwO1AN+/9zkxWy87vl7zmaJjmbjcO/MxG41ikPZfzvEz2HfFjgbV2nxLKdCR20v/GdFkjeiywLx8nh26fwoH/IIHovC9SP69FgOHAqAzWCduxCP0Oeg1Y5L1/Koebh+14ZJYjO5/XcB2PjDJ47+d776t57+uF/ratwgYbWZtu82gci4N+XiN5LNI4DVjsvV+VyeYRPxZk4/Ma5s9JZt+zDvo3jTAdj2x+18ts24gfi+z8TSOyx+K37HxWo/C35EA+SqPIROOC/fJZBewG1gGfhx6vCUwN3T4c62I2F1gIDAvTc5+FjfD0V+o+gcHAYL9v1J8XQ8vnk8VoR3nI8A7W5SIxdByuSJfh2tBrnoudmN42AhmOA34D5mFv3rtjcSzSZepIaMTKaB+P0PM0A2aFjskUrBtZtN8bpbFC7pA0j8XiWHyPFb5zgVOj9d7I5LNRGfga+yP9NVAptG5Efl9kkmEpdo7BnNBlVPoMofsH/H4Jc47Joc/rPOAjbCCLaB+LcaGf+Tzsj2+NGB2L4sBboeMxGzglB
sfiJODX0P5/AVrG6FjcENr/H8AjgIvwsTgJ62o1L81n4iyy8bc9nMcjixwH/byG63hkliHdOssJjQYYg2Nx0M9rNI4F8Aahvx9p1o/2sTjo5zVcxyK0r8y+Zx30b1q4jkcWGaL6PTyLHAf9mxbpY5Gdz2o4j0V2L6m/xEVERERERKSAKnRdPUVERERERAobFX4iIiIiIiIFnAo/ERERERGRAk6Fn4iIiIiISAGnwk9ERERERKSAU+EnIiKSjnMu2Tk3J83ljjDuu55zLrJzNYmIiKRTNNYBREREAmiX975ZrEOIiIiEi1r8REREssk5t9w596hzbkbo0iD0eF3n3NfOuXmh6zqhx6s75z5wzs0NXdqGdhXnnHvFObfQOfeFc65UzF6UiIgUCir8REREDlQqXVfP3mmWbfPetwZeAJ4JPfYCMNZ7fxzwNvBc6PHngG+9902BFsDC0OMNgRe990cD/wE9IvpqRESk0HPe+1hnEBERCRTnXLz3vmwGjy8HTvHeL3POFQPWeu8rO+c2AjW894mhx//13ldxzm0Aannvd6fZRz3gS+99w9D924Fi3vsHovDSRESkkFKLn4iISM74TG5ntk5Gdqe5nYzOuRcRkQhT4SciIpIzvdNc/1/o9k9An9Dti4AfQre/Bq4CcM7FOefKRyukiIhIWvoPo4iIyIFKOefmpLn/mfc+dUqHEs65X7B/nvYNPXY9MMY5dyuwAbgs9PgNwGjn3BVYy95VwL+RDi8iIpKezvETERHJptA5fq289xtjnUVERCQn1NVTRERERESkgFOLn4iIiIiISAGnFj8REREREZECToWfiIiIiIhIAafCT0REREREpIBT4SciIiIiIlLAqfATEREREREp4FT4iYiIiIiIFHD/D9ULPX7n9fTTAAAAAElFTkSuQmCC"></figcaption></figure>



<p>Next, let&#8217;s print the accuracy and a confusion matrix for the model&#8217;s predictions on the validation dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># function that returns the label for a given probability
def getLabel(prob):
    if prob &gt; 0.5:
        return 'dog'
    else:
        return 'cat'

# get the predictions for the validation data
val_df = validate_df.copy()
val_df['pred'] = &quot;&quot;
val_pred_prob = model.predict(validation_generator)

for i in range(val_pred_prob.shape[0]):
    # use .loc instead of chained indexing to avoid a SettingWithCopyWarning
    val_df.loc[i, 'pred'] = getLabel(val_pred_prob[i])
          
# create a confusion matrix
y_val = val_df['category']
y_pred = val_df['pred']

print('Accuracy: {:.2f}'.format(accuracy_score(y_val, y_pred)))
cnf_matrix = confusion_matrix(y_val, y_pred)

# plot the confusion matrix in form of a heatmap

%matplotlib inline
class_names = ['cat', 'dog']  # label order used by confusion_matrix (alphabetical)
fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names),
            annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')</pre></div>



<pre class="wp-block-preformatted">Accuracy: 0.82</pre>
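<p>As a side note, the per-image labeling loop above can be replaced by a single vectorized operation. A minimal sketch with NumPy, using made-up probabilities in place of the model output:</p>

```python
import numpy as np

# hypothetical predicted probabilities (stand-in for model.predict output)
val_pred_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.51])

# probability > 0.5 -> 'dog', otherwise 'cat', applied to the whole array at once
labels = np.where(val_pred_prob > 0.5, 'dog', 'cat')
print(labels.tolist())  # ['dog', 'cat', 'dog', 'cat', 'dog']
```

<p>On large validation sets this is both faster and less error-prone than assigning row by row.</p>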



<figure class="wp-block-image size-large"><img decoding="async" width="479" height="496" data-attachment-id="2716" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-24-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-24.png" data-orig-size="479,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-24" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-24.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-24.png" alt="confusion matrix for an image classification model" class="wp-image-2716" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-24.png 479w, https://www.relataly.com/wp-content/uploads/2020/12/image-24.png 290w" sizes="(max-width: 479px) 100vw, 479px" /></figure>
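<p>The confusion matrix also lets you derive per-class metrics such as precision and recall, which accuracy alone hides. A small sketch with an assumed 2x2 matrix (the counts are illustrative, not the ones from this run):</p>

```python
import numpy as np

# assumed confusion matrix: rows = actual (cat, dog), columns = predicted (cat, dog)
cnf = np.array([[410,  90],
                [ 90, 410]])

tp_dog = cnf[1, 1]
precision_dog = tp_dog / cnf[:, 1].sum()  # of all 'dog' predictions, fraction correct
recall_dog = tp_dog / cnf[1, :].sum()     # of all actual dogs, fraction found
accuracy = np.trace(cnf) / cnf.sum()      # diagonal = correct predictions
print(precision_dog, recall_dog, accuracy)  # 0.82 0.82 0.82
```

<p>Comparing precision and recall per class tells you whether the model confuses cats for dogs more often than the reverse.</p>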



<h3 class="wp-block-heading" id="h-step-9-image-classification-on-sample-images">Step #9 Image Classification on Sample Images</h3>



<p>Now that we have trained the model, I bet you can&#8217;t wait to test the image classifier on some sample data. For this purpose, ensure that you have some sample images in the &#8220;sample&#8221; folder. The code below feeds these images to the classifier, predicts a label for each one, and then prints the images in a grid together with their predicted labels.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set the path to the sample images
sample_path = &quot;data/images/cats-and-dogs/sample/&quot;
sample_df = createImageDf(sample_path)
sample_df['category'] = sample_df['category'].replace({0:'cat',1:'dog'})
sample_df['pred'] = &quot;&quot;

# create an image data generator for the sample images - we will only rescale the images
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_dataframe(
    sample_df, 
    sample_path,    
    shuffle=False,
    x_col='filename', y_col='category',
    target_size=target_size)

# make the predictions 
pred_prob = model.predict(test_generator)
image_number = pred_prob.shape[0]

# assign the predicted label to each sample image
for i in range(pred_prob.shape[0]):
    sample_df.loc[i, 'pred'] = getLabel(pred_prob[i])
    
print('Accuracy: {:.2f}'.format(accuracy_score(sample_df['category'], sample_df['pred'])))

nrows = 6
ncols = int(round(image_number / nrows, 0))
fig, axs = plt.subplots(nrows, ncols, figsize=(15, 15))
for i, ax in enumerate(fig.axes):
    if i &lt; sample_df.shape[0]:
        filepath = sample_path + sample_df.at[i, 'filename']
        img = Image.open(filepath).resize(target_size)
        ax.imshow(img)
        ax.set_title(sample_df.at[i, 'filename'] + '\n' + ' predicted: ' + str(sample_df.at[i, 'pred']))
        correct = sample_df.at[i, 'pred'] == sample_df.at[i, 'category']
        ax.set_xlabel(str(correct))
        ax.set_xticks([]); ax.set_yticks([])</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="862" height="864" data-attachment-id="6516" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/output-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output.png" data-orig-size="862,864" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output.png" alt="image classification - the image shows several dogs and cats" class="wp-image-6516" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output.png 862w, https://www.relataly.com/wp-content/uploads/2022/04/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/output.png 150w, https://www.relataly.com/wp-content/uploads/2022/04/output.png 768w" sizes="(max-width: 862px) 100vw, 862px" /></figure>



<p>Our image classifier achieves an accuracy of around 83% on the validation set. The model is not perfect, but it labels most images correctly. With deeper architectures, more data, and longer training, you can build classification models that reach accuracies above 95%.</p>
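<p>One common lever for better accuracy is data augmentation, which enlarges the training set with transformed copies of the images. At the array level, a horizontal flip, one of the standard augmentations, is simply a reversed column axis. A minimal NumPy sketch of the idea (not the Keras API):</p>

```python
import numpy as np

# a tiny 2x3 single-channel "image"
img = np.array([[1, 2, 3],
                [4, 5, 6]])

# horizontal flip: reverse the column axis
flipped = img[:, ::-1]
print(flipped.tolist())  # [[3, 2, 1], [6, 5, 4]]
```

<p>In practice you would let the training data generator apply such transformations randomly on the fly, so the network rarely sees the exact same image twice.</p>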



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>In this tutorial, you learned how to train an image classification model. We prepared a dataset and performed several transformations to bring the data into shape for training. Finally, we trained a convolutional neural network to distinguish between dogs and cats. You can now use this knowledge to train image classification models that recognize other objects.</p>



<p>There are many other cool things you can do with CNNs, for example, object localization in images and videos, or even stock market prediction. But these are topics for future articles.</p>



<p>I am always happy to receive feedback. I hope you enjoyed the article and would be happy if you left a comment. Cheers</p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list"><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov Machine Learning Engineering</a></li><li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li><li><a href="https://amzn.to/3MyU6Tj" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2018) Neural Networks and Deep Learning</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li><li>[1] D. H. Hubel and T. N. Wiesel &#8211; Receptive Fields of Neurons in the Cat&#8217;s Striate Cortex, The Journal of physiology (1959)</li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/image-classification-with-deep-learning/2485/">Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/image-classification-with-deep-learning/2485/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2485</post-id>	</item>
	</channel>
</rss>
