Sales Forecasting Archives - relataly.com

Feature Engineering and Selection for Regression Models with Python and Scikit-learn

Florian Follonier — Mon, 26 Sep 2022 22:20:29 +0000

Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients – in this case, carefully selected input features – you can create a model that’s both accurate and powerful. This is where feature engineering comes in. It’s the process of exploring, creating, and selecting the most relevant and useful features to use in your model. And just like a chef experimenting with different spices and flavors, the process of feature engineering is iterative and tailored to the problem at hand. In this guide, we’ll walk you through a step-by-step process using Python and Scikit-learn to create a strong set of features for a regression problem. By the end, you’ll have the skills to tackle any feature engineering challenge that comes your way.

The remainder of this article proceeds as follows: We begin with a brief intro to feature engineering and describe valuable techniques. We then turn to the hands-on part, in which we develop a regression model for car sales. We apply various techniques that show how to handle outliers and missing values, perform correlation analysis, and discover and manipulate features. You will also find information about common challenges and helpful sklearn functions. Finally, we will compare our regression model to a baseline model that uses the original dataset.

What is Feature Engineering?

Feature engineering is the process of using domain knowledge of the data to create features (variables) that make machine learning algorithms work. This is an important step in the machine learning pipeline because the choice of good features can greatly affect the performance of the model. The goal is to identify features, tweak them, and select the most promising ones into a smaller feature subset. We can break this process down into several action items.

Data Scientists can easily spend 70% to 80% of their time on feature engineering. The time is well spent, as changes to input data have a direct impact on performance. This process is often iterative and requires repeatedly revisiting the various tasks as understanding the data and the problem evolves. Knowing techniques and associated challenges helps in adequate feature engineering.

Feature engineering is about carefully choosing features instead of taking all the features at once. Image created with Midjourney.

Core Tasks

The goal of feature engineering is to create a set of features that are representative of the underlying data and that can be used by the machine learning algorithm to make accurate predictions. Several tasks are commonly performed as part of the feature engineering process, including:

Data discovery: To solve real-world problems with analytics, it is crucial to understand the data. Once you have gathered your data, describing and visualizing the data are means to familiarize yourself with it and develop a general feel for the data.
Data structuring: The data needs to be structured into a unified and usable format. Variables may have a wrong datatype, or the data is distributed across different data frames and must first be merged. In these cases, we first need to bring the data together and into the right shape.
Data cleansing: Besides being structured, data needs to be cleaned. Records may be redundant or contaminated with errors and missing values that can hinder our model from learning effectively. The same goes for outliers that can distort statistics.
Data transformation: We can increase the predictive power of our input features by transforming them. Activities may include applying mathematical functions, removing specific data, or grouping variables into bins. Or we create entirely new features out of several existing ones.
Feature selection: Of the many available variables, only some may contain valuable information. By sorting out variables that are less relevant and selecting the most promising features, we can create models that are less complex and yield better results.

Exploratory Feature Engineering Toolset

Exploratory analysis for identifying and assessing relevant features draws on several tools:

Data Cleansing
Descriptive statistics
Univariate Analysis
Bi-variate Analysis
Multivariate Analysis

Data Cleansing

Educational data is often remarkably perfect, without any errors or missing values. However, it is important to recognize that most real-world data has data quality issues. Some reasons for data quality issues are

Standardization issues because the data was recorded by different people, sensor types, etc.
Sensor or system outages can lead to gaps in the data or create erroneous data points.
Human errors

An important part of feature engineering is to inspect the data and ensure its quality before use. This is what we understand as “data cleansing.” It includes several tasks that aim to improve the data quality, remove erroneous data points and bring the data into a more useful form.

Cleaning errors, missing values, and other issues.
Handling possible imbalanced data
Removing obvious outliers
Standardization, e.g., of dates or addresses

Accomplishing these tasks requires a good understanding of the data. We, therefore, carry out data cleansing activities closely intertwined with other exploratory tasks, e.g., univariate and bivariate data analysis. Also, remember that visualizations can aid in the process, as they can greatly enhance your ability to analyze and understand the data.
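As an illustration of the "removing obvious outliers" task, here is a minimal sketch based on the interquartile range (IQR) rule. The readings are hypothetical toy values, and the factor of 1.5 is a common rule of thumb rather than a setting taken from this tutorial.

```python
import pandas as pd

# Hypothetical toy data: odometer readings with one implausible entry
readings = pd.Series([42_000, 55_500, 61_200, 48_900, 70_300, 999_999], name="odometer")

# IQR fences: values far outside the middle 50% are treated as outliers
q1, q3 = readings.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the readings that fall inside the fences
cleaned = readings[readings.between(lower, upper)]
print(cleaned.tolist())
```

Whether a flagged value is really an error is a judgment call that depends on domain knowledge, so it is worth inspecting the flagged rows before dropping them.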

Descriptive Statistics

One of the first steps in familiarizing oneself with a new dataset is to use descriptive statistics. Descriptive statistics help understand the data and how the sample represents the real-world population. We can use several statistical measures to analyze and describe a dataset, including the following:

Measures of Central Tendency represent a typical value of the data.
- The mean: The average, calculated by adding together all values in the sample and dividing the sum by the number of values.
- The median: The value that lies in the middle of all sample values when they are sorted by size.
- The mode: The most frequently occurring value in a sample (also applicable to categorical variables).
Measures of Variability tell us something about the spread of the data.
- Range: The difference between the minimum and maximum value.
- Variance: The average of the squared differences from the mean.
- Standard Deviation: The square root of the variance.
Measures of Frequency inform us how often we can expect a value to be present in the data, e.g., value counts.
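The measures listed above can be computed with a few pandas calls. The following sketch uses hypothetical prices. Note that pandas divides by n - 1 by default when computing variance and standard deviation, so `ddof=0` is passed here to match the "average of squared differences" definition.

```python
import pandas as pd

# Hypothetical toy sample of car prices
prices = pd.Series([8_000, 9_500, 9_500, 12_000, 41_000], name="price")

summary = {
    "mean": prices.mean(),                  # sum of values / number of values
    "median": prices.median(),              # middle value of the sorted sample
    "mode": prices.mode()[0],               # most frequent value
    "range": prices.max() - prices.min(),   # max minus min
    "variance": prices.var(ddof=0),         # mean squared deviation from the mean
    "std": prices.std(ddof=0),              # square root of the variance
}
print(summary)

# Measure of frequency: how often each value occurs
print(prices.value_counts())
```

In this sample, the single expensive car pulls the mean well above the median, which is a typical sign of a skewed distribution.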

Univariate Analysis

As “uni” suggests, the univariate analysis focuses on a single variable. Rather than examining the relationships between the variables, univariate analysis employs descriptive statistics and visualizations to understand individual columns better.

Which illustrations and measures we use depends on the type of the variable.

Categorical variables (incl. binary)

Descriptive measures include counts in percent and absolute values
Visualizations include pie charts, bar charts (count plots)

Continuous variables

Descriptive measures include min, max, median, mean, variance, standard deviation, and quantiles.
Visualizations include box plots, line plots, and histograms.

Normal distribution, univariate analysis

Bi-variate Analysis

Bi-variate (two-variate) analysis is a kind of statistical analysis that focuses on the relationship between two variables, for example, between a feature column and the target variable. In the case of machine learning projects, bivariate analysis can help to identify features that are potentially predictive of the label or the regression target.

Model performance will benefit from strong linear dependencies. In addition, we are also interested in examining the relationships among the features used to train the model. Different types of relations exist that can be examined using various plots and statistical measures:

Numerical/Numerical

Both variables have numerical values. We can illustrate their relation using lineplots or dot plots. We can examine such relations with correlation analysis.

The ideal feature subset contains features that are not correlated with each other but are heavily correlated with the target variable. We can use dimensionality reduction to reduce a dataset with many features to a lower-dimensional space in which the remaining features are less correlated.
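A correlation matrix gives a quick view of both kinds of relations: features versus target and features versus each other. The sketch below uses synthetic data with made-up column names (`age`, `mileage`, `noise`, `price`), so the numbers only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (assumption): price falls with age, mileage grows with age
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(0, 15, n)
mileage = age * 12_000 + rng.normal(0, 5_000, n)
noise = rng.normal(0, 1, n)  # unrelated to everything else
price = 30_000 - 1_500 * age + rng.normal(0, 2_000, n)
toy = pd.DataFrame({"age": age, "mileage": mileage, "noise": noise, "price": price})

# Pearson correlation matrix
corr = toy.corr()

# Rank features by the absolute strength of their correlation with the target
target_corr = corr["price"].drop("price").abs().sort_values(ascending=False)
print(target_corr)

# Correlation between two features: a high value hints at redundancy
print(corr.loc["age", "mileage"])
```

Here `age` and `mileage` both correlate strongly with the target but also with each other, so one of them adds little information once the other is in the feature subset.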

Traditional correlation analysis (e.g., Pearson) cannot consider non-linear relations. We can identify such a relation manually by visualizing the data, for example, using line plots. Once we detect a non-linear relation, we could try to apply mathematical transformations to one of the variables to make their relation more linear.
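To see how a transformation can linearize a relation, consider a small sketch with a deterministic exponential relation. The data is synthetic and chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Synthetic example: y depends on x exponentially, without any noise
x = pd.Series(np.linspace(1, 10, 200))
y = np.exp(x)

# Pearson correlation understates the (perfect) dependency
pearson_raw = x.corr(y)

# After a log transform of y, the relation becomes linear
pearson_log = x.corr(np.log(y))

print(round(pearson_raw, 3), round(pearson_log, 3))
```

Even though y is fully determined by x, the Pearson coefficient on the raw values stays clearly below 1. After the log transform it is practically 1.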

For pairwise analysis, we must understand which variables we deal with. We can differentiate between three categories:

Numerical/Categorical
Numerical/Numerical
Categorical/Categorical

Heatmaps illustrate the relation between features and a target variable.

Numerical/Categorical

Plots that visualize the relationship between a categorical and a numerical variable include barplots and lineplots.

Especially helpful are histograms (count plots). They can highlight differences in the distribution of the numerical variable for different categories.

A specific subcase is a numerical/date relation. Such relations are typically visualized using line plots. In addition, we want to look out for linear or non-linear dependencies.

Line charts are useful when examining trends.

Categorical/Categorical

The relation between two categorical variables can be studied using plots such as density plots, histograms, and bar plots.

For example, with car types (attributes: sedan and coupe) and colors (characteristics: red, blue, yellow), we can use a barplot to see if sedans are more often red than coupes. Differences in the distribution of characteristics can be a starting point for attempts to manipulate the features and improve model performance.
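Before plotting, the car type and color example can be checked numerically with a contingency table. The following sketch uses a handful of hypothetical records. `pd.crosstab` with `normalize="index"` returns the share of each color within each body type.

```python
import pandas as pd

# Hypothetical toy records
cars = pd.DataFrame({
    "body_type": ["sedan", "sedan", "sedan", "coupe", "coupe", "coupe", "sedan", "coupe"],
    "color":     ["red", "blue", "red", "yellow", "red", "blue", "red", "yellow"],
})

# Row-normalized contingency table: color shares per body type
shares = pd.crosstab(cars["body_type"], cars["color"], normalize="index")
print(shares)
```

In this toy sample, three of the four sedans are red, compared to one of the four coupes.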

Bar and column charts are a great way to compare numeric values for discrete categories visually.

Multivariate Analysis

Multivariate analysis encompasses the simultaneous analysis of more than two variables. The approach can uncover multi-dimensional dependencies and is often used in advanced feature engineering. For example, you may find that two variables are weakly correlated with the target variable, but when combined, their relation intensifies. So you might try to create a new feature that uses the two variables as input. Plots that can visualize relations between several variables include dot plots and violin plots.
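The idea of combining two weak features can be illustrated with synthetic data. In this sketch, the target is constructed (by assumption) as the product of two variables, so neither variable is linearly related to the target on its own, while their product is.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): the target depends on the product of a and b
rng = np.random.default_rng(42)
a = rng.uniform(-1, 1, 2_000)
b = rng.uniform(-1, 1, 2_000)
target = a * b + rng.normal(0, 0.05, 2_000)

toy = pd.DataFrame({"a": a, "b": b, "a_times_b": a * b, "target": target})

# Correlation of each candidate feature with the target
corr_with_target = toy.corr()["target"].drop("target")
print(corr_with_target)
```

Real data rarely shows such a clean effect, but the pattern is the same: a derived feature can expose a dependency that is invisible in pairwise correlations.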

In addition, multivariate analysis refers to techniques to reduce the dimensionality of a dataset. For example, principal component analysis (PCA) or factor analysis can condense the information in a data set into a smaller number of synthetic features.
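A minimal PCA sketch with scikit-learn is shown below. The three input columns are synthetic, and two of them are deliberately built to be almost redundant. In practice, features are usually standardized before PCA, which is skipped here for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): the first two columns carry nearly the same information
rng = np.random.default_rng(1)
base = rng.normal(size=500)
X = np.column_stack([
    base + rng.normal(scale=0.1, size=500),
    2 * base + rng.normal(scale=0.1, size=500),
    rng.normal(size=500),
])

# Condense three features into two synthetic components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance retained by each component
print(pca.explained_variance_ratio_)

# The resulting components are uncorrelated with each other
print(np.corrcoef(X_reduced, rowvar=False).round(3))
```

The trade-off is interpretability: the components are linear combinations of the original columns and no longer correspond to a single real-world attribute.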

Now that we have a good understanding of what feature selection techniques are available, we can start the practical part and apply them.

Scatter charts are useful when you want to compare two numeric quantities and see a relationship or correlation between them.

Feature Engineering for Car Price Regression with Python and Scikit-learn

The value of a car on the market depends on various factors. The distance traveled with the vehicle and the year of manufacture are obvious ones. But beyond that, we can use many other factors to train a machine learning model that predicts selling prices on the used car market. The following hands-on Python tutorial will create such a model. We will work with a dataset containing the characteristics of used cars. For marketing, it is crucial to understand which car characteristics determine the price of a vehicle. Our goal is to model the car price from the available independent variables. We aim to build a model that performs well on a small but powerful input subset.

Exploring and creating features varies between different application domains. For example, feature engineering in computer vision will differ greatly from feature engineering for regression or classification models or NLP models. So the example provided in this article is just for regression models.

We follow an exploratory process that includes the following steps:

Loading the data
Cleaning the data
Univariate analysis
Bivariate analysis
Selecting features
Data preparation
Model training
Measuring performance

Finally, we compare the performance of our model, which was trained on a minimal set of features, to a model that uses the original data.

Yes, you can judge by the length of the beard that this guy is a legendary feature engineer. Image created with Midjourney.

The Python code is available in the relataly GitHub repository.

Prerequisites

Before you proceed, ensure that you have set up your Python environment (3.8 or higher) and the required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy
matplotlib
Seaborn
Scikit-learn

You can install packages using console commands:

pip install
conda install (if you are using the Anaconda package manager)

About the Dataset

In this tutorial, we will be working with a dataset containing listings for 111763 used cars. The data includes 13 variables, including the dependent target variable:

prod_year: The year of production
maker: The manufacturer’s name
model: The car edition
trim: Different versions of the model
body_type: The body style of a vehicle
transmission_type: The way the power is brought to the wheels
state: The state in which the car is auctioned
condition: The condition of the cars
odometer: The distance the car has traveled since it was manufactured
exterior_color: Exterior color
interior: Interior color
sellingprice (target variable): The price at which the car was sold
date: The date on which the car was sold

The dataset is available for download from Kaggle.com, but you can execute the code below and load the data from the relataly GitHub repository.

Car price prediction is a solid use case for machine learning. Image created with Midjourney.

Step #1 Load the Data

We begin by importing the necessary libraries and loading the dataset from the relataly GitHub repository into a pandas DataFrame. Our regression target is the selling price, which is stored in the column “sellingprice” in the initial dataset. We will rename this column and store its name in a variable in the next step. The “.head()” method displays the first records of our DataFrame.

# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from pandas.api.types import is_string_dtype, is_numeric_dtype
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.inspection import permutation_importance
from sklearn.model_selection import ShuffleSplit
# Original Data Source: 
# https://www.kaggle.com/datasets/tunguz/used-car-auction-prices
# Load the dataset from the relataly GitHub repository
df = pd.read_csv("https://raw.githubusercontent.com/flo7up/relataly_data/main/car_prices2/car_prices.csv")
df.head(3)

   prod_year  maker   model    trim                   body_type  transmission_type  state  condition  odometer  exterior_color  interior  sellingprice  date
0  2015       Kia     Sorento  LX                     SUV        automatic          ca     5.0        16639.0   white           black     21500         2014-12-16
1  2015       Nissan  Altima   2.5 S                  Sedan      automatic          ca     1.0        5554.0    gray            black     10900         2014-12-30
2  2014       Audi    A6       3.0T Prestige quattro  Sedan      automatic          ca     4.8        14414.0   black           black     49750         2014-12-16

We now have a dataframe that contains 12 columns and the dependent target variable we want to predict.

Step #2 Data Cleansing

Now that we have loaded the data, we begin with the exploratory analysis. First, we will put it into shape.

2.1 Check Names and Datatypes

If the names in a dataset are not self-explanatory, it is easy to get confused with all the data. Therefore, we will rename some of the columns and provide clearer names. There is no default naming convention, but striving for consistency, simplicity, and understandability is generally a good idea.

The following code line renames some of the columns.

# rename some columns for consistency
df.rename(columns={'exterior_color': 'ext_color', 
                   'interior': 'int_color', 
                   'sellingprice': 'sale_price'}, inplace=True)
df.head(1)

	prod_year	maker	model	trim	body_type	transmission_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			automatic			ca		5.0			16639.0		white		black		21500		2014-12-16

Next, we will check and remove possible duplicates.

# check and remove duplicates
print(len(df))
df = df.drop_duplicates()
print(len(df))

OUT: 111763, 111763

There were no duplicates in the data, which is good.

# check datatypes
df.dtypes

prod_year              int64
maker                 object
model                 object
trim                  object
body_type             object
transmission_type     object
state                 object
condition            float64
odometer             float64
ext_color             object
int_color             object
sale_price             int64
date                  object
dtype: object

We compare the datatypes to the first records we printed in the previous section. Be aware that categorical variables (e.g., of type “string”) are shown as “objects.” The data types look as expected.
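Note that the date column is read as a plain string ("object"). The tutorial leaves it that way, but if you wanted to derive date-based features, a conversion along the following lines would be a starting point. The sample values and the derived column names (`sale_year`, `age_at_sale`) are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical sample in the same date format as the dataset
sample = pd.DataFrame({"date": ["2014-12-16", "2014-12-30"], "prod_year": [2012, 2010]})

# Parse the strings into proper datetime values
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# Derive simple date-based features
sample["sale_year"] = sample["date"].dt.year
sample["age_at_sale"] = sample["sale_year"] - sample["prod_year"]
print(sample.dtypes)
print(sample)
```

A derived feature such as the car's age at the time of sale may carry more signal than the raw sale date, although this would need to be verified on the actual data.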

Finally, we define our target variable’s name, “sale_price.” The target variable will be our regression target, and we will use its name often.

# consistently define the target variable
target_name = 'sale_price'

2.2 Checking Missing Values

Some machine learning algorithms are sensitive to missing values. Handling missing values is, therefore, a crucial step in exploratory feature engineering.

Let’s first gain an overview of null values. With a larger DataFrame, it would be inefficient to review all the rows and columns individually for missing values. Instead, we use the sum function and visualize the results to get a quick overview of missing data in the DataFrame.

# check for missing values
null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
fig, ax = plt.subplots(figsize=(16, 6))
sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue', ax=ax)
# label each bar with the absolute count and the share of all rows
pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)
ax.set_title('Overview of missing values')

Overview of missing values in the car price regression dataset

The bar chart shows that there are several variables with missing values. Variables with many missing values can negatively affect model performance, which is why we should try to treat them.

2.3 Overview of Techniques for Handling Missing Values

There are various ways to handle missing data. The most common options to handle missing values are:

Custom substitution value: Sometimes, the information that a value is missing can be important information to a predictive model. We can substitute missing values with a placeholder value such as “missing” or “unknown.” The approach works particularly well for variables with many missing values.
Statistical filling: We can fill in a statistically chosen measure, such as the mean or median for numeric variables, or the mode for categorical variables.
Replace using Probabilistic PCA: PCA uses a linear approximation function that tries to reconstruct the missing values from the data.
Remove entire rows: Sometimes it is crucial to ensure that we only use data we know is correct. In those cases, we can drop an entire row if it contains a missing value. This also solves the problem but comes at the cost of losing potentially important information, especially if the data quantity is small.
Remove the entire column: This is another way of resolving missing values. It is typically the last resort, as we lose an entire feature.

How we handle missing values can dramatically affect our prediction results. To find the ideal method, it is often necessary to experiment with different techniques. Sometimes, the information that a value is missing can also be important. This occurs when the missing values are not randomly distributed in the data and show a pattern. In such a case, you should create an additional feature that states whether values are missing.
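A missing-value indicator can be created in one line before filling the gaps. The sketch below uses a hypothetical toy column. The indicator column name `odometer_missing` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing readings
toy = pd.DataFrame({"odometer": [12_000.0, np.nan, 58_000.0, np.nan, 31_000.0]})

# 1) Record which rows were missing BEFORE filling them
toy["odometer_missing"] = toy["odometer"].isna().astype(int)

# 2) Fill the gaps, here with the median of the observed values
toy["odometer"] = toy["odometer"].fillna(toy["odometer"].median())
print(toy)
```

The order matters: once the values are filled, the information about which rows were missing is gone.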

2.4 Handle Missing Values

In this example, we will use the median value to fill in the missing values of our numeric variables and the mode to replace the missing values of categorical variables. When we check again, we can see that odometer and condition have no more missing values.

# fill missing values with the median for numeric columns
for col_name in df.columns:
    if (is_numeric_dtype(df[col_name])) and (df[col_name].isna().sum() > 0):
        # assign the result back instead of using inplace=True on a column selection
        df[col_name] = df[col_name].fillna(df[col_name].median())
        # alternatively, you could drop a column with missing values using df.drop(columns=[col_name])
print(df.isna().sum())

prod_year                0
maker                 2078
model                 2096
trim                  2157
body_type             2641
transmission_type    13135
state                    0
condition                0
odometer                 0
ext_color              173
int_color              173
sale_price               0
date                     0
dtype: int64

Next, we handle the missing values of transmission_type by filling them with the mode.

# check the distribution of values for transmission type
print(df['transmission_type'].value_counts())
# fill missing values with the mode
df['transmission_type'] = df['transmission_type'].fillna(df['transmission_type'].mode()[0])
print(df['transmission_type'].isna().sum())

automatic    108198
manual         3565
Name: transmission_type, dtype: int64
0

We could handle body_type analogously to transmission_type and fill the missing values with the mode, the value that appears most often in the data. The mode of body_type is “Sedan.” However, this value is not that prevalent, as many cars have other body types, e.g., “SUV.” Therefore, we will replace the missing values with “Unknown” instead.

# check the distribution of values for body type
print(df['body_type'].value_counts())
# fill missing values with 'Unknown'
df['body_type'] = df['body_type'].fillna("Unknown")
print(df['body_type'].isna().sum())

Sedan                 39955
SUV                   23836
sedan                  8377
suv                    4934
Hatchback              4241
                      ...  
cts-v coupe               2
Ram Van                   1
Transit Van               1
CTS Wagon                 1
beetle convertible        1
Name: body_type, Length: 74, dtype: int64
0
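The value counts above list both “Sedan” and “sedan” as well as “SUV” and “suv” as separate categories. The tutorial does not merge them, but a possible standardization step, sketched here on a hypothetical toy series, would be to normalize case and whitespace.

```python
import pandas as pd

# Hypothetical toy values with inconsistent spelling
body = pd.Series(["Sedan", "sedan", "SUV", "suv", " Hatchback"], name="body_type")

# Strip surrounding whitespace and convert everything to lower case
normalized = body.str.strip().str.lower()
print(normalized.value_counts())
```

Applying such a step to the real column would change the category counts and therefore the plots and results shown later, so treat it as an optional experiment.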

Now we have handled most of the missing values in our data. However, some variables are still left, with a few missing values. We will make things easy and simply drop all remaining records with missing values. Considering that we have more than 100k records and only a few variables, we can afford to do this without fear of a severe impact on our model performance.

# remove all other records with missing values
df.dropna(inplace=True)
print(df.isna().sum())

prod_year            0
maker                0
model                0
trim                 0
body_type            0
transmission_type    0
state                0
condition            0
odometer             0
ext_color            0
int_color            0
sale_price           0
date                 0
dtype: int64

Finally, we check again for missing values and see that everything has been filled. Now, we have a cleansed dataset with 13 columns.

2.5 Save a Copy of the Cleaned Data

Before exploring the features, let’s make a copy of the cleaned data. We will later use this “full” dataset to compare the performance of our model with a baseline model.

# Create a copy of the dataset with all features for comparison reasons
df_all = df.copy()

Step #3 Getting started with Statistical Univariate Analysis

Now it’s time to analyze the data and explore potentially useful features for our subset. Although the process follows a linear flow in this example, you may notice in practice that you must go back and forth between different steps of the feature exploration and engineering process.

First, we will look at the variance of the features in the initial dataset. Machine learning models can only learn from variables that have adequate variance. So, low-variance features are often candidates to exclude from the feature subset.
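For numeric features, low-variance candidates can also be flagged programmatically. The sketch below uses scikit-learn's `VarianceThreshold` on hypothetical toy data. It is not part of the tutorial's own workflow, and the column `constant_flag` is made up.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with one constant column
toy = pd.DataFrame({
    "odometer": [12_000, 58_000, 31_000, 90_000],
    "condition": [4.5, 2.0, 3.5, 1.0],
    "constant_flag": [1, 1, 1, 1],
})

# threshold=0.0 removes only features that have the same value in every row
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)

# Names of the columns that pass the variance filter
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

Because variance depends on the scale of a feature, thresholds above zero are only comparable across features after scaling.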

We use the .describe() method to display univariate descriptive statistics about the numerical columns in our dataset.

# show statistics for numeric variables
print(df.columns)
df.describe()

Next, we check the categorical variables. All variables seem to have a good variance. We can measure the variance with statistical measures or observe it manually using bar charts and scatterplots.

We can use histplots to visualize the distributions of the numeric variables. The example below shows the histplot for our target variable sale_price.

# Explore the variance of the target variable
variable_name = 'sale_price'
fig, ax = plt.subplots(figsize=(14,5))
sns.histplot(data=df[[variable_name]].dropna(), ax=ax, color='royalblue', kde=True)
ax.get_legend().remove()
ax.set_title(variable_name + ' Distribution')
ax.set_xlim(0, df[variable_name].quantile(0.99))

The histplot shows that sale prices are skewed to the right (positively skewed): most of the mass sits at low prices, with a long tail toward high prices. This means there are many cheap cars and fewer expensive ones, which makes sense.
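Skewness can also be quantified. In pandas, `Series.skew()` returns a positive value for a distribution with a long tail toward high values. The sketch below uses synthetic, log-normally distributed prices as an assumption, not the tutorial's data, and shows how a log transform reduces the skew.

```python
import numpy as np
import pandas as pd

# Synthetic prices (assumption): many cheap cars, a few expensive ones
rng = np.random.default_rng(7)
prices = pd.Series(rng.lognormal(mean=9.5, sigma=0.6, size=5_000))

# Positive skewness indicates a long tail toward high prices
skew_raw = prices.skew()

# log1p compresses large values and makes the distribution more symmetric
skew_log = np.log1p(prices).skew()

print(round(skew_raw, 2), round(skew_log, 2))
```

Whether transforming the target helps depends on the model. Tree-based models are generally less sensitive to skew than linear models.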

Next, we create box plots for the numeric variables.

# 3.2 Illustrate the Variance of Numeric Variables 
f_list_numeric = [x for x in df.columns if (is_numeric_dtype(df[x]) and df[x].nunique() > 2)]
f_list_numeric
# box plot design
PROPS = {
    'boxprops':{'facecolor':'none', 'edgecolor':'royalblue'},
    'medianprops':{'color':'coral'},
    'whiskerprops':{'color':'royalblue'},
    'capprops':{'color':'royalblue'}
    }
sns.set_style('ticks', {'axes.edgecolor': 'grey',  
                        'xtick.color': '0',
                        'ytick.color': '0'})
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(len(f_list_numeric) / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(14, nrows*1))
for i, ax in enumerate(fig.axes):
    if i < len(f_list_numeric):
        column_name = f_list_numeric[i]
        sns.boxplot(data=df[column_name], orient="h", ax = ax, color='royalblue', flierprops={"marker": "o"}, **PROPS)
        ax.set(yticklabels=[column_name])
        fig.tight_layout()

We can observe two things: First, the variance of transmission type is low, as most cars have an automatic transmission. So transmission_type is the first variable that we exclude from our feature subset.

# Drop features with low variety
df = df.drop(columns=['transmission_type'])
df.head(2)

	prod_year	maker	model	trim	body_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			ca		5.0			16639.0		white		black		21500		2014-12-16
1	2015		Nissan	Altima	2.5 S	Sedan		ca		1.0			5554.0		gray		black		10900		2014-12-30

Second, int_color and ext_color have many categorical values. By grouping some of these values that hardly ever occur, we can help the model to focus on the most relevant patterns. However, before we do that, we need to take a closer look at how the target variable differs between the categories.

Step #4 Bi-variate Analysis

Now that we have a general understanding of our dataset’s individual variables, let’s look at pairwise dependencies. We are particularly interested in the relationship between the features and the target variable. Our goal is to keep features whose dependence on the target variable shows some pattern, whether linear or non-linear. On the other hand, we want to exclude features whose relationship with the target variable looks arbitrary.

Visualizations have to take the datatypes of our variables into account. To illustrate the relation between categorical features and the target, we create boxplots and kdeplots. For numeric (continuous) features, we use scatterplots.

4.1 Analyzing the Relation between Features and the Target Variable

We begin by taking a closer look at the int_color and ext_color. We use kdeplots to highlight the distribution of prices depending on different colors.

def make_kdeplot(column_name):
    fig, ax = plt.subplots(figsize=(20,8))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis="x", rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()
    
make_kdeplot('ext_color')

make_kdeplot('int_color')

In both cases, a few colors are prevalent and account for most observations. Moreover, distributions of the car price differ for these prevalent colors. These differences look promising as they may help our model to differentiate cheaper cars from more expensive ones. To simplify things, we group the colors that hardly occur into a color category called “other.”

# Binning features
# keep the most frequent colors and group all remaining ones as 'other'
main_colors = ['black', 'gray', 'white', 'silver', 'blue', 'red']
df['int_color'] = df['int_color'].where(df['int_color'].isin(main_colors), 'other')
df['ext_color'] = df['ext_color'].where(df['ext_color'].isin(main_colors), 'other')
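Instead of hard-coding the list of colors to keep, the frequent categories could also be derived from the data. The following sketch shows one way to do this on a hypothetical toy series. The 5% cutoff is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical toy colors: three common values and two rare ones
colors = pd.Series(["black"] * 50 + ["white"] * 30 + ["gray"] * 15 + ["lime"] * 3 + ["turquoise"] * 2)

# Relative frequency of each category
shares = colors.value_counts(normalize=True)

# Categories that cover at least 5% of the rows are kept as they are
frequent = shares[shares >= 0.05].index

# Everything else is grouped into 'other'
binned = colors.where(colors.isin(frequent), "other")
print(binned.value_counts())
```

A frequency-based rule adapts automatically when the data changes, whereas a fixed list makes the result easier to reproduce and explain.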

Next, we create plots for all remaining features.

# Visualizing distributions
f_list = [x for x in df.columns if ((is_numeric_dtype(df[x])) and x != target_name) or (df[x].nunique() < 50)]
f_list_len = len(f_list)
print(f'features to plot: {f_list_len}')
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(f_list_len / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(18, nrows*5))
for i, ax in enumerate(fig.axes):
    if i < f_list_len:
        column_name = f_list[i]
        print(column_name)
        # If a numeric variable has more than 100 unique values draw a scatterplot, else draw a boxplot
        if df[column_name].nunique() > 100 and is_numeric_dtype(df[column_name]):
            # Draw a scatterplot for each variable and target_name
            sns.scatterplot(data=df, y=target_name, x=column_name, ax = ax)
        else: 
            # Draw a vertical violinplot (or boxplot) grouped by a categorical variable:
            myorder = df.groupby(by=[column_name])[target_name].median().sort_values().index
            sns.boxplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
            #sns.violinplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
        ax.tick_params(axis="x", rotation=90, labelsize=10, length=0)
        ax.set_title(column_name)
    fig.tight_layout()

Again, for categorical variables, we want to see differences in the price distribution across the categories. Based on the medians and quantiles in the boxplots, we can see that prod_year, int_color, and condition show adequate variation. The scatterplot for the odometer value also looks good. So we want to keep these features. In contrast, the price differences across the categories of “state” and “ext_color” are rather weak. Therefore, we exclude these variables from our subset.

# drop columns with low variance
df.drop(columns=['state', 'ext_color'], inplace=True)

Finally, if you want to take a more detailed look at the numeric features, you can use jointplots. These are scatterplots with additional information about the distributions. The example below shows the jointplot for the odometer value vs price.

# detailed univariate and bivariate analysis of 'odometer' using a jointplot 
def make_jointplot(feature_name):
    p = sns.jointplot(data=df, y=feature_name, x=target_name, height=6, ratio=6, kind='reg', joint_kws={'line_kws':{'color':'coral'}})
    p.fig.suptitle(feature_name + ' Distribution')
    p.ax_joint.collections[0].set_alpha(0.3)
    p.ax_joint.set_ylim(df[feature_name].min(), df[feature_name].max())
    p.fig.tight_layout()
    p.fig.subplots_adjust(top=0.95)
make_jointplot('odometer')
# Alternatively you can use hex_binning
# def make_joint_hexplot(feature_name):
#     p = sns.jointplot(data=df, y=feature_name, x=target_name, height=10, ratio=1, kind="hex")
#     p.ax_joint.set_ylim(0, df[feature_name].quantile(0.999))
#     p.ax_joint.set_xlim(0, df[target_name].quantile(0.999))
#     p.fig.suptitle(feature_name + ' Distribution')

Here is another example of a jointplot for the variable ‘condition.’

# detailed univariate and bivariate analysis of 'condition' using a jointplot 
make_jointplot('condition')

The graphs show a linear relationship between the price and both the condition and the odometer value.

4.2 Correlation Matrix

Correlation analysis is a technique to quantify the dependency between numeric features and a target variable. Different ways exist to calculate the correlation coefficient. For example, we can use Pearson correlation (linear relation), Kendall correlation (ordinal association), or Spearman (monotonic dependence).

The example below uses Pearson correlation, which concentrates on the linear relationship between two variables. The Pearson correlation score lies between -1 and 1. General interpretations of the absolute value of the correlation coefficient are:

.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”

More information on the Pearson correlation can be found here and in this article on the correlation between covid-19 and the stock market.
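To illustrate how the choice of coefficient matters, here is a minimal sketch on toy data (not the car dataset). It compares the Pearson and Spearman coefficients for a relationship that is monotonic but not linear. The variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# toy data: y grows monotonically with x, but not linearly
x = np.arange(1, 21, dtype=float)
toy = pd.DataFrame({"x": x, "y": x ** 3})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Because Spearman works on ranks, it reports a perfect monotonic relationship here, while Pearson stays below 1 because the relationship is not a straight line.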

We will calculate a correlation matrix that provides the correlation coefficient for all features in our subset, incl. sale_price.

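# 4.2 Correlation Matrix
# correlation heatmap allows us to identify highly correlated explanatory variables and reduce collinearity
plt.figure(figsize = (9,8))
plt.yticks(rotation=0)
# numeric_only=True restricts the calculation to numeric columns (needs pandas >= 1.5; pandas >= 2.0 raises an error on string columns without it)
correlation = df.corr(numeric_only=True)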
ax =  sns.heatmap(correlation, cmap='GnBu',square=True, linewidths=.1, cbar_kws={"shrink": .82},annot=True,
            fmt='.1',annot_kws={"size":10})
sns.set(font_scale=0.8)
for f in ax.texts:
        f.set_text(f.get_text())

All our remaining numeric features strongly correlate with price (positive or negative). However, this is not all that matters. Ideally, we want to have features that have a low correlation with each other. We can see that prod_year and condition are moderately correlated (coefficient: 0.5). Because prod_year is more correlated with price (coefficient: 0.6) than condition (coefficient: 0.5), we drop the condition variable.

df.drop(columns='condition', inplace=True)
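The decision rule used above (of two correlated features, drop the one that is less correlated with the target) can also be written as code. The following is a minimal sketch on synthetic data; the function name weaker_of_correlated_pairs, the column names, and the threshold of 0.5 are assumptions for illustration and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def weaker_of_correlated_pairs(df, target, threshold=0.5):
    """For each feature pair with |corr| >= threshold, return the feature
    that is less correlated with the target (hypothetical helper)."""
    corr = df.corr(numeric_only=True).abs()
    features = [c for c in corr.columns if c != target]
    to_drop = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if corr.loc[a, b] >= threshold:
                to_drop.add(a if corr.loc[a, target] < corr.loc[b, target] else b)
    return sorted(to_drop)

# synthetic data: b is a noisy copy of a, c is independent of both
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = rng.normal(size=n)
price = 2.0 * a + c + 0.1 * rng.normal(size=n)
toy = pd.DataFrame({"a": a, "b": b, "c": c, "price": price})

drop_candidates = weaker_of_correlated_pairs(toy, target="price", threshold=0.5)
print(drop_candidates)
```

Such a rule only considers linear, pairwise relationships, so it is best treated as a starting point that we verify against the plots.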

Step #5 Data Preprocessing

Now our subset contains the following variables:

prod_year
maker
model
trim
body_type
odometer
int_color
sale_price
date

Next, we prepare the data for use as input to train a regression model. For example, we use a label encoder to replace the string values of the categorical variables with numeric values.

# encode categorical variables 
def encode_categorical_variables(df):
    # create a list of categorical variables that we want to encode
    categorical_list = [x for x in df.columns if is_string_dtype(df[x])]
    # apply the encoding to the categorical variables; fit_transform re-fits the encoder on each column
    # because the apply() function has no inplace argument, we assign the result back to the df
    df[categorical_list] = df[categorical_list].apply(LabelEncoder().fit_transform)
    return df
df_final_subset = encode_categorical_variables(df)
df_all_ = encode_categorical_variables(df_all)
# create a copy of the dataframe but without the target variable
df_without_target = df.drop(columns=[target_name])
df_final_subset.head()

	prod_year	maker	model	trim	body_type	odometer	int_color	sale_price	date
0	2015		23		594		794		31			16639.0		0			21500		8
1	2015		34		59		98		32			5554.0		0			10900		17
2	2014		2		46		180		32			14414.0		0			49750		8
3	2015		34		59		98		32			11398.0		0			14100		13
4	2015		7		325		789		32			14538.0		0			7200		158
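To see what the label encoder does to a single column, consider the following minimal sketch. The color values are toy data chosen for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# toy values for illustration only
colors = ["black", "white", "other", "black", "gray"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

print(list(encoder.classes_))                  # categories in sorted order
print(list(codes))                             # integer code for each value
print(list(encoder.inverse_transform(codes)))  # back to the original strings
```

The integer codes follow the alphabetical order of the categories, so they carry no real ranking. Tree-based models such as the Random Forest can usually cope with this, whereas linear models may misread the codes as an ordered scale. Note also that the scikit-learn documentation describes LabelEncoder as intended for target values; OrdinalEncoder is the counterpart designed for feature columns.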

Step #6 Splitting the Data and Training the Model

To ensure that our regression model does not know the target variable, we separate car price (y) from features (X). Last, we split the data into separate datasets for training and testing. The result is four different data sets: X_train, y_train, X_test, and y_test.

Once the split function has prepared the datasets, we train the regression model. Our model uses the Random Decision Forest algorithm from the scikit-learn package. As a so-called ensemble model, the Random Forest is a robust machine learning algorithm. It combines the predictions of multiple independent estimators (decision trees).

The Random Forest algorithm has a wide range of hyperparameters. While we could optimize our model further by testing various configurations (hyperparameter tuning), this is not the focus of this article. Therefore, we will use the default hyperparameters for our model as defined by scikit-learn. Please visit one of my recent articles on hyperparameter tuning, if you want to learn more about this topic.

For comparison, we train two models. The first model uses our subset of selected features. The second model uses all features, cleansed but without any further manipulations.

We use shuffled cross-validation (cv=5) to evaluate our model’s performance on different data folds.

def splitting(df, name):
    # separate labels from training data
    X = df.drop(columns=[target_name])
    y = df[target_name] #Prediction label
    # split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
    # print the shapes: X is (rows, features), y is (rows,)
    print(name)
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test
# train the model
def train_model(X, y, X_train, y_train):
    estimator = RandomForestRegressor() 
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    scores = cross_val_score(estimator, X, y, cv=cv)
    estimator.fit(X_train, y_train)
    return scores, estimator
# train the model with the subset of selected features
X_sub, y_sub, X_train_sub, X_test_sub, y_train_sub, y_test_sub = splitting(df_final_subset, 'subset')
scores_sub, estimator_sub = train_model(X_sub, y_sub, X_train_sub, y_train_sub)
    
# train the model with all features
X_all, y_all, X_train_all, X_test_all, y_train_all, y_test_all = splitting(df_all_, 'fullset')
scores_all, estimator_all = train_model(X_all, y_all, X_train_all, y_train_all)

subset
train:  (76592, 8) (76592,)
test:  (32826, 8) (32826,)
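As a side note on the shuffled cross-validation used above, the following minimal sketch shows how ShuffleSplit partitions a dataset. It uses ten toy samples instead of the car data, so the numbers are for illustration only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# ten toy samples with two features each
X_toy = np.arange(20).reshape(10, 2)

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
split_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X_toy)]
print(split_sizes)
```

Each of the five splits draws a fresh random 70/30 partition. Unlike k-fold cross-validation, the test sets of different splits may overlap.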

Step #7 Comparing Regression Models

Finally, we want to see how the model performs and how its performance compares against the model that uses all variables.

7.1 Model Scoring

We use different regression metrics to measure the performance. Then we create a barplot that compares the performance scores across the different validation folds (due to cross-validation).

# 7.1 Model Scoring 
def create_metrics(scores, estimator, X_test, y_test, col_name):
    scores_df = pd.DataFrame({col_name:scores})
    # predict on the test set
    y_pred = estimator.predict(X_test)
    y_df = pd.DataFrame(y_test)
    y_df['PredictedPrice']=y_pred
    # Mean Absolute Error (MAE)
    MAE = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))
    # Mean Absolute Percentage Error (MAPE)
    MAPE = mean_absolute_percentage_error(y_test, y_pred)
    print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')
    
    # calculate the feature importance scores
    r = permutation_importance(estimator, X_test, y_test, n_repeats=30, random_state=0)
    data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
    data_im['feature_names'] = X_test.columns
    data_im = data_im.sort_values('feature_permuation_score', ascending=False)
    
    return scores_df, data_im
scores_df_sub, data_im_sub = create_metrics(scores_sub, estimator_sub, X_test_sub, y_test_sub, 'subset')
scores_df_all, data_im_all = create_metrics(scores_all, estimator_all, X_test_all, y_test_all, 'fullset')
scores_df = pd.concat([scores_df_sub, scores_df_all],  axis=1)
# visualize how the two models have performed in each fold
fig, ax = plt.subplots(figsize=(10, 6))
scores_df.plot(y=["subset", "fullset"], kind="bar", ax=ax)
ax.set_title('Cross validation scores')
ax.set(ylim=(0, 1))
ax.tick_params(axis="x", rotation=0, labelsize=10, length=0)

Mean Absolute Error (MAE): 1643.39
Mean Absolute Percentage Error (MAPE): 24.36 %
Mean Absolute Error (MAE): 1813.78
Mean Absolute Percentage Error (MAPE): 25.23 %

The subset model achieves a mean absolute percentage error of around 24%, which is not so bad. More importantly, it performs better than the model that uses all features (MAPE of around 25%). In addition, the subset model is less complex, as it only uses eight features instead of 12. So it is easier to understand and less costly to train.
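For readers who want to see how the two metrics are calculated, here is a minimal worked example with three made-up prices. The values are not taken from the car dataset.

```python
import numpy as np

# made-up prices for illustration only
y_true = np.array([10000.0, 20000.0, 40000.0])
y_pred = np.array([11000.0, 18000.0, 40000.0])

abs_errors = np.abs(y_true - y_pred)          # 1000, 2000, 0
mae = abs_errors.mean()                       # average absolute deviation
mape = (abs_errors / np.abs(y_true)).mean()   # average deviation relative to the true price

print(f"MAE: {mae:.2f}")
print(f"MAPE: {mape * 100:.2f} %")
```

Because MAPE divides each error by the true price, the same absolute error weighs more heavily on cheap cars than on expensive ones.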

7.2 Feature Permutation Importance Scores

Next, we calculate feature importance scores. In this way, we can determine which features contribute the most to the predictive power of our model. Feature importance scores are a useful tool in the feature engineering process, as they provide insights into how the features in our subset contribute to the overall performance of our predictive model. Features with low importance scores can be eliminated from the subset or replaced with other features.
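The idea behind permutation importance is to shuffle one feature at a time and measure how much the model score drops. The following minimal sketch demonstrates this on synthetic data with one informative feature and one pure noise feature. The column names and data are assumptions for illustration, and for brevity the scores are computed on the training data rather than on a separate test set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# synthetic data: the target depends only on the 'informative' column
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"informative": rng.normal(size=300), "noise": rng.normal(size=300)})
y_toy = 3.0 * X_toy["informative"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
result = permutation_importance(model, X_toy, y_toy, n_repeats=5, random_state=0)

importances = pd.Series(result.importances_mean, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Shuffling the informative column destroys most of the model's score, while shuffling the noise column changes it very little.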

Again we will compare our subset model to the model that uses all available features from the initial dataset.

# compare the feature importance scores of the subset model to the fullset model
fig, axs = plt.subplots(1, 2, figsize=(20, 8))
sns.barplot(data=data_im_sub, y='feature_names', x="feature_permuation_score", ax=axs[0])
axs[0].set_title("Feature importance scores of the subset model")
sns.barplot(data=data_im_all, y='feature_names', x="feature_permuation_score", ax=axs[1])
axs[1].set_title("Feature importance scores of the fullset model")

In the subset model, most features are relevant to the model’s performance. Only date and int_color do not seem to have a significant impact. For the full set model, five out of 12 features hardly contribute to the model performance (date, int_color, ext_color, state, transmission_type).

Once you have a strong subset of features, you can automate the feature selection process using different techniques, e.g., forward or backward selection. Automated feature selection techniques will test different model variants with varying feature combinations to determine the best input dataset. This step is often done at the end of the feature engineering process. However, this is something for another article.
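As a brief illustration of forward selection, here is a minimal sketch that uses scikit-learn's SequentialFeatureSelector on synthetic data. The data, the linear model, and the choice of two features are assumptions for this example and not part of the car price analysis.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# synthetic data: only the first two of four columns influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# forward selection: start with no features and add the best one per round
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=3
)
selector.fit(X_toy, y_toy)

selected = selector.get_support().tolist()
print(selected)
```

Setting direction="backward" would instead start with all features and remove the least useful one in each round.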

Conclusions

That’s it for now! This tutorial has presented an exploratory approach to feature exploration, engineering, and selection. You have gained an overview of tools and graphs that are useful in identifying and preparing features. The second part was a Python hands-on tutorial. We followed an exploratory feature engineering process to build a regression model for car prices. We used various techniques to discover and sort features and to build a strong feature subset. These techniques include data cleansing, descriptive statistics, and univariate and bivariate analysis (incl. correlation). We also used some techniques for feature manipulation, including binning. Finally, we compared our subset model to one that uses all available data.

If you take away one learning from this article, remember that in machine learning, less is often more. Classic machine learning models trained on carefully curated feature subsets often outperform models that use all available information.

I hope this article was helpful. I am always trying to improve and learn from my audience. So, if you have any questions or suggestions, please write them in the comments.

Sources and Further Reading

Stock-market prediction is a typical regression problem. To learn more about feature engineering for stock-market prediction, check out this article on multivariate stock-market forecasting.

Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python

Florian Follonier — Thu, 07 Apr 2022 17:55:36 +0000

Perfecting your machine learning model’s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble model, and heavily influence its performance. Unlike model parameters, which are discovered during training by the machine learning algorithm, hyperparameters require pre-specification.

In this comprehensive Python tutorial, we’ll guide you on how to harness the power of Random Search to optimize a regression model’s hyperparameters. Our illustrative example utilizes a Random Decision Forest for predicting house prices. However, the fundamental principles you’ll learn can be seamlessly applied to any model. So why painstakingly fine-tune hyperparameters manually when Random Search can handle the task efficiently?

Here’s a preview of what this Python tutorial entails:

A brief overview of how Random Search operates and instances where it might be preferable to Grid Search.
A hands-on Python tutorial featuring a public house price dataset from Kaggle.com. The aim here is to train a regression model capable of predicting US house prices based on various properties.
Training a ‘best-guess’ model in Python, followed by using Random Search to discover a model with enhanced performance.
Finally, we’ll implement cross-validation to validate our models’ performance.

By the end of this tutorial, you’ll be well-equipped to let Random Search efficiently fine-tune your model’s hyperparameters, freeing up your time for other crucial tasks.

Hyperparameter Tuning

Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, we need to specify hyperparameters before the training. It is usually impossible to find the best configuration without testing different configurations.

Searching for a suitable model configuration is called “hyperparameter tuning” or “hyperparameter optimization.” Machine learning algorithms have varying hyperparameters and parameter values. For example, a random decision forest classifier allows us to configure parameters such as the number of trees, the maximum tree depth, and the minimum number of samples required to split a node.

The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it gets to find an optimal model. We can use random search to automate this process.

Random search can be an efficient way to tune the hyperparameters of a machine learning model. Image generated with Midjourney

Techniques for Tuning Hyperparameters

Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning algorithm to optimize its performance on a specific dataset or task. Several techniques can be used for hyperparameter tuning, including:

Grid Search: Grid search is a brute-force search algorithm that systematically evaluates a given set of hyperparameter values by training and evaluating a model for each combination of values. It is a simple and effective technique, but it can be computationally expensive, especially for large search spaces or large datasets.
Random Search: As mentioned, random search is an alternative to grid search that randomly samples combinations from a given set of hyperparameter values rather than evaluating all possible combinations. It can be more efficient than grid search, but it may not find the optimal set of hyperparameters.
Bayesian Optimization: Bayesian optimization is a probabilistic approach to hyperparameter tuning. It uses the results of previous trials to model which hyperparameter values are likely to produce a good performance and then picks promising values to try next. It can be more efficient and effective than grid search or random search, but it can be more challenging to implement and interpret.
Genetic Algorithms: Genetic algorithms are optimization algorithms inspired by the principles of natural selection and genetics. They use a population of candidate solutions, which are iteratively evolved and selected based on their fitness or performance, to find the optimal set of hyperparameters.

In this article, we specifically look at the Random Search technique.

You can spend much time tuning a machine learning model. Image generated with Midjourney.

What is Random Search?

The random search algorithm trains models on hyperparameter combinations that it samples at random from a grid of parameter values. The idea behind the randomized approach is that testing a limited number of random configurations is often enough to find a good model. We can use random search both for regression and classification models.
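The sampling step can be made visible with scikit-learn's ParameterSampler, which draws random configurations from a dictionary of candidate values. The following is a minimal sketch; the two-parameter grid is a shortened example chosen for illustration.

```python
from sklearn.model_selection import ParameterSampler

# shortened example grid with 4 * 3 = 12 possible combinations
param_grid = {"max_depth": [5, 10, 15, 20], "n_estimators": [50, 100, 200]}

# draw 5 random configurations from the grid
candidates = list(ParameterSampler(param_grid, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

Each printed dictionary is one configuration that a random search would train and evaluate.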

Random Search and Grid Search are the most popular techniques for hyperparameter tuning, and both methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible variants. It works well when testing a small number of configurations that are already known to perform well.

As long as both search space and training time are small, the grid search technique is excellent for finding the best model. However, the number of model variants grows exponentially with the number of hyperparameters. For large search spaces or complex models, it is often more efficient to use random search.
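A quick calculation shows how fast the grid grows. The sketch below counts the combinations for the parameter ranges used later in this tutorial and compares the number of model fits for an exhaustive grid search with a random search of 20 iterations, both with 5-fold cross-validation.

```python
from math import prod

# parameter ranges as used in the hands-on part of this tutorial
search_space = {
    "max_leaf_nodes": [2, 3, 4, 5, 6, 7],
    "min_samples_split": [5, 10, 20, 50],
    "max_depth": [5, 10, 15, 20],
    "max_features": [3, 4, 5],
    "n_estimators": [50, 100, 200],
}

n_grid = prod(len(values) for values in search_space.values())  # 6 * 4 * 4 * 3 * 3
n_folds = 5
n_random = 20

print(f"grid combinations: {n_grid}")                 # 864
print(f"grid search fits: {n_grid * n_folds}")        # 4320
print(f"random search fits: {n_random * n_folds}")    # 100
```

Adding just one more hyperparameter with four candidate values would multiply the grid size by four, while the cost of the random search would stay the same.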

Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. However, it is also much faster than grid search and efficient in delivering a suitable model in a short time.

Random Search vs. Exhaustive Grid Search

Tuning the Hyperparameters of a Random Decision Forest Regressor in Python using Random Search

In this tutorial, we delve into the use of the Random Search algorithm in Python, specifically for predicting house prices. We’ll be using a dataset rich in diverse house characteristics. Various elements, such as data quality and quantity, model intricacy, the selection of machine learning algorithms, and housing market stability, significantly influence the accuracy of house price predictions.

Our initial model employs a Random Decision Forest algorithm, which we’ll optimize using a random search approach for hyperparameter tuning. By identifying and implementing a more advantageous configuration, we aim to enhance our model’s performance significantly.

Here’s a concise outline of the steps we’ll undertake:

Loading the house price dataset
Exploring the dataset intricacies
Preparing the data for modeling
Training a baseline Random Decision Forest model
Implementing a random search approach for model optimization
Measuring and evaluating the performance of our optimized model

Through this step-by-step guide, you’ll learn to enhance model performance, further refining your understanding of Random Search algorithm implementation in Python.

The Python code is available in the relataly GitHub repository.

Once we have trained a house price prediction model, we can use it to assess the price of new houses. Image generated with Midjourney.

Prerequisites

Before starting the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, Matplotlib, and Seaborn.

In addition, we will be using the Python machine learning library Scikit-learn to implement the random forest and the random search technique.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

House Price Prediction: About the Use Case and the Data

House price prediction is the process of using statistical and machine learning techniques to predict the future value of a house. This can be useful for a variety of applications, such as helping homeowners and real estate professionals to make informed decisions about buying and selling properties. In order to make accurate predictions, it is important to have access to high-quality data about the housing market.

In this tutorial, we will work with a house price dataset from the house price regression challenge on Kaggle.com. The dataset is available via a GitHub repository. The training file contains information about 1,460 houses sold between 2006 and 2010 in Ames, Iowa (US). The data includes the sale price and 79 house characteristics, such as:

Year – The year of construction
SaleYear – The year in which the house was sold
Lot Area – The lot area of the house
Quality – The overall quality of the house from one (lowest) to ten (highest)
Road – The type of road, e.g., paved, etc.
Utility – The type of the utility
Park Lot Area – The parking space included with the property
Room number – The number of rooms

Predicting house prices with machine learning. Image generated with Midjourney.

Step #1 Load the Data

We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.

# A tutorial for this file is available at www.relataly.com

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.ensemble import RandomForestRegressor

# Source: 
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Load the training dataset
path = "https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv"
df = pd.read_csv(path)
print(df.columns)
df.head()

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60			RL			65.0		8450	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2008	WD			Normal			208500
1	2	20			RL			80.0		9600	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		5		2007	WD			Normal			181500
2	3	60			RL			68.0		11250	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		9		2008	WD			Normal			223500
3	4	70			RL			60.0		9550	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2006	WD			Abnorml			140000
4	5	60			RL			84.0		14260	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		12		2008	WD			Normal			250000
5 rows × 81 columns

Step #2 Explore the Data

Before jumping into preprocessing and model training, let’s quickly explore the data. A distribution plot shows how frequently the different values of the target variable (the sale price) occur in our dataset.

# Create a distribution plot of the sale price
ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2)
plt.title('Sale Price Distribution')

For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between the sale price, lot area, and the house’s overall quality.

# Create a scatterplot of sale price vs. lot area, colored by overall quality
plt.figure(figsize=(16,6))
df_features = df[['SalePrice', 'LotArea', 'OverallQual']]
sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual')
plt.title('Sale Price vs. Lot Area by Overall Quality')

As expected, the scatterplot shows that the sale price increases with the overall quality. On the other hand, the LotArea has only a minor effect on the sale price.

Step #3 Data Preprocessing

Next, we prepare the data for use as input to train a regression model. Because we want to keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables with integer dummy values.

To ensure that our regression model does not know the target variable, we separate house price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.

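def preprocessFeatures(df):   
    # Define a list of relevant features (incl. the target SalePrice)
    feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond']
    # One-hot encode the categorical column 'Utilities' as 0/1 integer dummies
    df_dummy = pd.get_dummies(df[feature_list], dtype=int)
    # Optionally cleanse records with na values
    #df_dummy = df_dummy.dropna()
    return df_dummy

df_base = preprocessFeatures(df)

# Separate the target (y) from the features (x) so that the model never sees the sale price as an input
x = df_base.drop(columns=['SalePrice'])
y = df_base['SalePrice']

# Split the data into training and test data sets
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
x_train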

		OverallQual	GarageArea	LotArea	OverallCond	Utilities_AllPub	Utilities_NoSeWa
682		6			431			2887	5			1					0
960		5			0			7207	7			1					0
1384	6			280			9060	5			1					0
1100	2			246			8400	5			1					0
416		6			440			7844	7			1					0

Step #4 Train Different Regression Models using Random Search

Now that the dataset is ready, we can train the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then selects n random combinations from the grid and uses them to train the model.

We use the RandomizedSearchCV algorithm from the scikit-learn package. The “CV” in the class name stands for cross-validation. Cross-validation involves splitting the data into subsets (folds) and rotating them between training and validation runs. This way, each model is trained and tested multiple times on different data partitions. When the search algorithm finally evaluates the model configuration, it summarizes these results into a test score.
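The rotation of folds can be illustrated with a minimal sketch based on scikit-learn's KFold, which is the splitter that an integer cv value such as cv=5 selects for regression models. The ten toy samples are an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# ten toy samples
X_toy = np.arange(10).reshape(-1, 1)

validation_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(X_toy)):
    # count how often each sample ends up in a validation fold
    validation_counts[val_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")

print(validation_counts.tolist())
```

Every sample is used for validation exactly once and for training in the remaining four runs.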

We use a Random Decision Forest – a robust machine learning algorithm that can handle classification and regression tasks. As a so-called ensemble model, the Random Forest combines the predictions of multiple independent estimators. The estimator is an important parameter to pass to RandomizedSearchCV. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:

max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

These parameter ranges define the search space from which the randomized search algorithm (RandomizedSearchCV) will select random configurations. Other parameters will use default values as defined by scikit-learn.

# Define the Estimator and the Parameter Ranges
dt = RandomForestRegressor()
number_of_iterations = 20
max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

# Define the param distribution dictionary
param_distributions = dict(max_leaf_nodes=max_leaf_nodes, 
                           min_samples_split=min_samples_split, 
                           max_depth=max_depth,
                           max_features=max_features,
                           n_estimators=n_estimators)

# Build the randomized search
grid = RandomizedSearchCV(estimator=dt, 
                          param_distributions=param_distributions, 
                          n_iter=number_of_iterations, 
                          cv = 5)

grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Mean test scores: {0}, best params: {1}".format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df.head()

Mean test scores: [0.68738293 0.49581669 0.52138751 0.61235299 0.65360944 0.61165147
 0.70392285 0.52278886 0.67687248 0.68219638 0.70031536 0.65842909
 0.51939338 0.70801017 0.70911805 0.69543885 0.67983801 0.60744371
 0.68270285 0.70741042], best params: {'n_estimators': 100, 'min_samples_split': 5, 'max_leaf_nodes': 7, 'max_features': 3, 'max_depth': 15}
	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_n_estimators	param_min_samples_split	param_max_leaf_nodes	param_max_features	param_max_depth	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.049196	0.002071	0.004074	0.000820	50	20	5	4	15	{'n_estimators': 50, 'min_samples_split': 20, ...	0.662973	0.705533	0.669520	0.702608	0.696280	0.687383	0.017637	7
1	0.041115	0.000554	0.003046	0.000094	50	50	2	3	10	{'n_estimators': 50, 'min_samples_split': 50, ...	0.490984	0.527231	0.426270	0.523086	0.511513	0.495817	0.036978	20
2	0.043325	0.000779	0.003486	0.000447	50	50	2	5	20	{'n_estimators': 50, 'min_samples_split': 50, ...	0.484524	0.559358	0.485459	0.517253	0.560343	0.521388	0.033545	18
3	0.162083	0.005665	0.012420	0.004788	200	5	3	3	20	{'n_estimators': 200, 'min_samples_split': 5, ...	0.586586	0.638341	0.573437	0.626793	0.636608	0.612353	0.027021	14
4	0.166659	0.003026	0.010958	0.000084	200	10	4	3	15	{'n_estimators': 200, 'min_samples_split': 10,...	0.633305	0.679161	0.623236	0.661864	0.670481	0.653609	0.021636	13

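The table shows the first five of the 20 tested configurations together with their cross-validation scores. The rank_test_score column indicates how each configuration ranks among all tested models.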

Step #5 Select the best Model and Measure Performance

Finally, we select the best model via the best_estimator_ attribute of the search results. We first print a comparison between actual and predicted sale prices. We then calculate the MAE and the MAPE to understand how the model performs on the overall test dataset.

# Select the best Model and Measure Performance
best_model = grid_results.best_estimator_
y_pred = best_model.predict(x_test)
y_df = pd.DataFrame(y_test)
y_df['PredictedPrice']=y_pred
y_df.head()

	SalePrice	PredictedPrice
529	200624		166037.831002
491	133000		135860.757958
459	110000		123030.336177
279	192000		206488.444327
655	88000		130453.604206

Next, let’s take a look at the prediction errors.

# Mean Absolute Error (MAE)
# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Mean Absolute Percentage Error (MAPE)
# The order matters here because the error is divided by the actual values
MAPE = mean_absolute_percentage_error(y_test, y_pred)
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')

Mean Absolute Error (MAE): 29591.56 
Mean Absolute Percentage Error (MAPE): 15.57 %

On average, the model deviates from the actual value by roughly 16 %. Considering we only used a fraction of the available features and defined a small search space, there is much room for improvement.
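To see what the two metrics measure, the following sketch recomputes them by hand for the five sample rows printed above. It only covers these five rows, so the numbers differ from the test-set results. The variable names are chosen for this illustration; the last line shows why the argument order (y_true, y_pred) matters for the MAPE.

```python
import numpy as np

# Actual and predicted prices of the five sample rows printed above
y_true = np.array([200624, 133000, 110000, 192000, 88000], dtype=float)
y_hat = np.array([166037.831002, 135860.757958, 123030.336177, 206488.444327, 130453.604206])

abs_errors = np.abs(y_true - y_hat)

mae = abs_errors.mean()                      # average error in price units
mape = (abs_errors / y_true).mean()          # average error relative to the actual price
mape_swapped = (abs_errors / y_hat).mean()   # dividing by the predictions gives a different number

print(f"MAE:  {mae:.2f}")
print(f"MAPE: {mape * 100:.2f} %")
print(f"MAPE with swapped arguments: {mape_swapped * 100:.2f} %")
```

The MAE is symmetric in its arguments, while the MAPE is not, because it divides by whichever array is treated as the actual values.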

Summary

This article has shown how we can use random search in Python to efficiently search for a well-performing hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how random search tries out a random sample of configurations from a predefined parameter grid instead of all permutations. The second part was a Python hands-on tutorial, in which you learned to use random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price for houses depending on several characteristics. Then we defined parameter ranges and tested random permutations. In this way, we quickly identified a configuration that outperforms our initial baseline model.

Remember that a random search efficiently identifies a good-performing model but does not necessarily return the best-performing one. Random search techniques can be used to tune the hyperparameters of both regression and classification models.


I hope this article was helpful. If you have any questions or suggestions, please write them in the comments.


The post Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python appeared first on relataly.com.

Forecasting Beer Sales with ARIMA in Python

Florian Follonier — Wed, 03 Feb 2021 22:23:08 +0000

Time series analysis and forecasting is a tough nut to crack, but the ARIMA model has been cracking it for decades. ARIMA, short for “Auto-Regressive Integrated Moving Average,” is a powerful statistical modeling technique for time series analysis. It’s particularly effective when the time series you’re analyzing follows a clear pattern, like seasonal changes in weather or sales. ARIMA has been used to forecast everything from beer sales to order quantities, and this tutorial will show you how to build your own ARIMA model in Python. You’ll be making predictions like a pro in no time!

This tutorial proceeds in two parts: The first part covers the concepts behind ARIMA. You will learn how ARIMA works, what Stationarity means, and when it is appropriate to use ARIMA. The second part is a Python hands-on tutorial that applies auto-ARIMA to the Sales Forecasting domain. We’ll be working with a time series of beer sales, and our goal is to predict how the beer sales quantities will evolve in the coming years. First, we check if the time series is stationary. Then we train an ARIMA forecasting model. Finally, we use the model to produce a sales forecast and measure the model’s performance.

About Sales Forecasting

Sales forecasting is a crucial business strategy that involves predicting future sales volumes for a product (for example, beer) or service. It leverages sophisticated statistical and analytical techniques, such as time series analysis or machine learning algorithms, to scrutinize historical sales data. By identifying trends and patterns within this data, businesses can make informed predictions about their future sales performance.

This strategic forecasting plays a pivotal role in business operations. It is instrumental in guiding key decisions surrounding production, inventory management, staffing, and various other operational elements. By honing in on accurate sales forecasting, businesses can strike the perfect balance – maintaining enough inventory to meet customer demand without overproducing or overstocking. This equilibrium ensures a smooth flow in the supply chain and avoids unnecessary costs tied to excess production or storage.

Furthermore, sales forecasting serves as a roadmap for business growth. It aids in identifying potential market opportunities and predicting future sales revenue. This valuable foresight enables businesses to strategically plan their expansion, ensuring resources are optimally utilized and future goals are met. With this in-depth understanding of sales forecasting, businesses can stay ahead of market trends, navigate through business challenges, and ultimately steer towards success.

Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects. Image created with Midjourney

Introduction to ARIMA Time Series Modelling

ARIMA models provide an alternative approach to time series forecasting that differs significantly from machine learning methods. Working with ARIMA requires a good understanding of Stationarity and knowledge of the transformations used to make time-series data stationary. The concept of Stationarity is, therefore, first on our schedule.

The Concept of Stationarity

Stationarity is an essential concept in stochastic processes that describes the nature of a time series. We consider a time series stationary if its statistical properties do not change over time. In practice, this usually means weak stationarity: summary statistics, such as the mean, the variance, and the autocorrelation, remain constant over time. However, the time-series data we encounter in the real world often show a trend, seasonality, or a changing variance, which makes them non-stationary.

So why is Stationarity such an essential concept for ARIMA? If a time series is stationary, we can assume that the past values of the time series are predictive of future development. In other words, a stationary time series exhibits consistent behavior that makes it predictable. On the other hand, a non-stationary time series is characterized by a kind of random behavior that will be difficult to capture in modeling. Namely, if random movements characterized the past, there is a high probability that the future will be no different.

Fortunately, in many cases, it is possible to transform a time series that is non-stationary into a stationary form and, in this way, build better prediction models.

A stationary vs. a non-stationary time series

How to Test Whether a Time Series is Stationary

The first step in the ARIMA modeling approach is determining whether a time series is stationary. There are different ways to determine whether a time series is stationary:

Plotting: We can plot the time series and visually check if it shows consistent behavior or changes over a more extended period.
Summary statistics: We can split the time series into different periods and calculate summary statistics, such as the mean and the variance, for each period. If these metrics change significantly between periods, the time series is non-stationary. However, the results depend on how we choose the periods, which can lead to false conclusions.
Statistical tests: There are various tests to determine the stationarity of a time series, such as Kwiatkowski–Phillips–Schmidt–Shin (KPSS), Augmented Dickey-Fuller (ADF), or Phillips–Perron. These tests systematically check a time series against a null hypothesis and return a p-value that indicates how trustworthy the result is.
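The summary-statistics check from the list above can be sketched as follows. Both series are synthetic and serve only as an illustration (they are not the beer sales data); the helper name half_stats is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two synthetic series (assumptions for illustration, not the beer data)
stationary = rng.normal(loc=10.0, scale=1.0, size=n)   # noise around a constant level
trending = stationary + 0.1 * np.arange(n)              # same noise plus a linear trend

def half_stats(series):
    """Return (mean, variance) for the first and the second half of a series."""
    first, second = np.array_split(series, 2)
    return (first.mean(), first.var()), (second.mean(), second.var())

for name, series in [("stationary", stationary), ("trending", trending)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")

# Absolute change of the mean between the two halves
mean_shift_stationary = abs(half_stats(stationary)[1][0] - half_stats(stationary)[0][0])
mean_shift_trending = abs(half_stats(trending)[1][0] - half_stats(trending)[0][0])
```

For the trending series, the mean of the second half is clearly higher than the mean of the first half, which is the kind of shift that indicates non-stationarity.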

What is an (S)ARIMA Model?

As the name implies, ARIMA uses autoregression (AR), integration (differencing), and moving averages (MA) to fit a linear regression model to a time series.

ARIMA Parameters

The standard notation is ARIMA(p, d, q), whereby each parameter takes a non-negative integer value:

d (differencing): In the case of a non-stationary time series, there is a chance to remove a trend from the data by differencing once or several times, thus bringing the data to a stationary state. The model parameter d determines the order of differencing. A value of d = 0 simplifies the ARIMA model to an ARMA model, lacking the integration aspect. In this case, we do not need to difference the series because it is already stationary.
p (order of the AR terms): The autoregressive process describes the dependent relationship between an observation and several lagged observations (lags). Predictions are then based on past data from the same time series using linear functions. p = 1 means the model uses values that lag by one period.
q (order of the MA terms): The parameter q determines the number of lagged forecast errors in the prediction equation. In contrast to the AR process, the MA process assumes that values at a future point in time depend on the errors made by predictions at current and past points in time. This means that it is not previous events that determine the predictions but rather the previous estimation or prediction errors used to calculate the following time series value.
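The effect of the differencing parameter d can be illustrated with two synthetic trends. This is a small sketch with made-up numbers that covers only the d parameter, not the AR and MA terms.

```python
import numpy as np

t = np.arange(10)
linear = 5.0 + 2.0 * t          # synthetic linear trend
quadratic = 1.0 + 0.5 * t ** 2  # synthetic quadratic trend

d1_linear = np.diff(linear, n=1)        # constant, so the trend is removed with d = 1
d1_quadratic = np.diff(quadratic, n=1)  # still increasing, so d = 1 is not enough
d2_quadratic = np.diff(quadratic, n=2)  # constant, so the trend is removed with d = 2

print("linear, d=1:   ", d1_linear)
print("quadratic, d=1:", d1_quadratic)
print("quadratic, d=2:", d2_quadratic)
```

Each differencing step shortens the series by one observation, which is one reason to use the smallest d that achieves stationarity.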

SARIMA

In the real world, many time series have seasonal effects. Examples are monthly retail sales figures, temperature reports, weekly airline passenger data, etc. To take this into account, we can specify a seasonal period (e.g., m=12 for monthly data) and additional seasonal AR or MA components for our model. Such a model is also called a SARIMA model, and we write it as SARIMA(p, d, q)(P, D, Q)[m].
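Seasonal differencing (the D parameter) subtracts the value from the same point in the previous season. The sketch below uses a synthetic monthly series with an assumed trend and a sine-shaped seasonal pattern; it is not the beer dataset.

```python
import numpy as np

m = 12                                   # seasonal period for monthly data
months = np.arange(4 * m)                # four synthetic years
seasonal_pattern = 10.0 * np.sin(2 * np.pi * months / m)
series = 100.0 + 1.5 * months + seasonal_pattern   # level + trend + seasonality

# Seasonal difference (D = 1): subtract the value from the same month one year earlier
seasonal_diff = series[m:] - series[:-m]

print(seasonal_diff[:5])
```

The repeating pattern cancels out, and only the yearly growth of the trend (1.5 per month times 12 months) remains.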

Auto-(S)ARIMA

When working with ARIMA, we can set the model parameters manually or use auto-ARIMA and let the algorithm search for the optimal parameters. Auto-ARIMA works by conducting differencing tests to determine the order of differencing d, and then fitting models with parameters in defined ranges, e.g., start_p, max_p as well as start_q, max_q. With the seasonal option enabled, the process also tries to identify the optimal parameters for the seasonal components of the model, for which we can define separate parameter ranges.
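The principle of fitting several candidate models and keeping the one with the lowest information criterion can be sketched with plain NumPy. This is a strong simplification and not how pmdarima works internally: it fits pure AR(p) models by least squares on a synthetic AR(2) series, and the function name ar_aic and all coefficients are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(2) process (synthetic data, coefficients chosen for illustration)
n = 500
y = np.zeros(n)
noise = rng.normal(size=n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + noise[t]

max_p = 3

def ar_aic(series, p, max_lag):
    """Fit AR(p) by least squares and return a Gaussian AIC (up to a constant)."""
    target = series[max_lag:]   # same target window for every p so the AICs are comparable
    columns = [np.ones_like(target)]
    for lag in range(1, p + 1):
        columns.append(series[max_lag - lag:len(series) - lag])
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    k = p + 1   # AR coefficients plus intercept
    return len(target) * np.log(rss / len(target)) + 2 * k

# Try every candidate order and keep the one with the lowest AIC
aic_by_p = {p: ar_aic(y, p, max_p) for p in range(max_p + 1)}
best_p = min(aic_by_p, key=aic_by_p.get)

for p, aic in aic_by_p.items():
    print(f"AR({p}): AIC = {aic:.1f}")
print("selected order:", best_p)
```

The AIC rewards a good fit but penalizes every additional parameter, so a more complex model is only selected if it improves the fit noticeably.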

Creating a Sales Forecast with ARIMA in Python

Having grasped the fundamental concepts behind ARIMA (AutoRegressive Integrated Moving Average), we’re now ready to dive into the practical aspect of crafting a sales forecasting model in Python. Utilizing ARIMA for forecasting sales data is an esteemed practice owing to the algorithm’s adeptness in modeling seasonal changes combined with long-term trends – a characteristic commonly exhibited by sales data.

In this tutorial, we’ll be employing a dataset representing the monthly beer sales across the United States from 1992 through 2018, recorded in millions of US dollars. Our objective is to construct a robust time series model using ARIMA to accurately predict future sales trends.

When it comes to the technological aspect, we’ll be using the Python-based ‘statsmodels’ and ‘pmdarima’ libraries to build our ARIMA sales forecasting model. So, if you’re ready to harness the power of Python and ARIMA for sales prediction, let’s get started!

The code is available on the GitHub repository.


A fluffy cat drinking beer after creating an ARIMA sales forecast. Image created with Midjourney

Prerequisites

Before we start coding, ensure you have set up your Python 3 environment and required packages. If you don’t have an environment, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, Matplotlib, and seaborn.

In addition, we will be using the statsmodels library and pmdarima.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Step #1 Load the Sales Data to Our Python Project

In the initial step of this tutorial, we commence by setting up the necessary Python environment. We import several packages that we’ll be using for data manipulation, visualization, and implementing machine learning models. We then fetch the dataset we’ll be working with – the monthly beer sales in the United States from 1992 through 2018. This data is sourced from a publicly accessible URL and loaded into a pandas DataFrame.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.8

# Setting up packages for data manipulation and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pmdarima as pm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose
import seaborn as sns
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# Link to the dataset: 
# https://www.kaggle.com/bulentsiyah/for-simple-exercises-time-series-forecasting

path = "https://raw.githubusercontent.com/flo7up/relataly_data/main/alcohol_sales/BeerWineLiquor.csv"
df = pd.read_csv(path)
df.head()

		date	beer
0	1/1/1992	1509
1	2/1/1992	1541
2	3/1/1992	1597
3	4/1/1992	1675
4	5/1/1992	1822

As shown above, each monthly sales figure in this dataset is dated to the first day of the month.

Step #2 Visualize the Time Series and Check it for Stationarity

Before modeling the sales data, we visualize the time series and test it for Stationarity. Visualization helps us choose the parameters for our ARIMA model, thus making it an essential step.

First, we will look at the different components of the time series. We do this by using the seasonal_decompose function of the statsmodels library.

# Decompose the time series
plt.rcParams["figure.figsize"] = (10,6)
result = seasonal_decompose(df['beer'], model='multiplicative', period = 12)
result.plot()
plt.show()

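To test for Stationarity, we use the Augmented Dickey-Fuller (ADF) test. It is common to run this test multiple times throughout a data science project. Therefore, we create a function that we can reuse later. The function plots the series and prints the p-value of the test together with a flag that indicates whether differencing is recommended.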

def check_stationarity(df_sales, title_string, labels):
    # Visualize the data
    fig, ax = plt.subplots(figsize=(16, 8))
    plt.title(title_string, fontsize=14)
    if df_sales.index.size > 12:
        # Add rolling means to make the trend visible
        df_sales['ma_12_month'] = df_sales['beer'].rolling(window=12).mean()
        df_sales['ma_25_month'] = df_sales['beer'].rolling(window=25).mean()
        # Plot the columns in the same order as the legend labels
        sns.lineplot(data=df_sales[['beer', 'ma_12_month', 'ma_25_month']], palette=sns.color_palette("mako_r", 3))
        plt.legend(title='Series', loc='upper left', labels=labels)
    else:
        sns.lineplot(data=df_sales[['beer']])
    
    plt.show()
    
    sales = df_sales['beer'].dropna()
    # Perform an Augmented Dickey-Fuller (ADF) test
    # alpha = .05 is the significance level (5 %)
    adf_test = pm.arima.ADFTest(alpha = 0.05) 
    # should_diff returns the p-value and whether differencing is recommended
    print(adf_test.should_diff(sales))
    
df_sales = pd.DataFrame(df['beer'], columns=['beer'])
df_sales.index = pd.to_datetime(df['date']) 
title = "Beer sales in the US between 1992 and 2018 in million US$/month"
labels = ['beer', 'ma_12_month', 'ma_25_month']
check_stationarity(df_sales, title, labels)

The data does not appear to be stationary. We can see that our time series is steadily increasing and shows annual seasonality. The steady increase indicates a continuous growth in beer consumption over the last decades. The seasonality in the sales data likely results from people drinking more beer in summer than in other seasons.

Step #3 Exemplary Differencing and Autocorrelation

The chart from the previous section shows that our time series is non-stationary. The reason is that it follows a clear upward trend. We also know that the time series has a seasonal component. Therefore, we need to define additional parameters and construct a SARIMA model.

Before we use auto-correlation to determine the optimal parameters, we will try manual differencing to make the time series stationary. There is no guarantee that differencing works. It is essential to remember that differencing can sometimes also worsen prediction performance. So be careful not to overdifference! We could also trust that the auto-ARIMA model chooses the best parameters for us. However, we should always validate the selected parameters.

The ideal differencing parameter is the least number of differencing steps to achieve a stationary time series. We will monitor the results with autocorrelation plots to check whether differencing was successful.

We plot the autocorrelation for the original time series and after first- and second-order differencing.

# 3.1 Non-seasonal part
def auto_correlation(df, prefix, lags):
    plt.rcParams.update({'figure.figsize':(7,7), 'figure.dpi':120})
    
    # Define the plot grid
    fig, axes = plt.subplots(3,2, sharex=False)

    # Original series
    axes[0, 0].plot(df)
    axes[0, 0].set_title('Original' + prefix)
    plot_acf(df, lags=lags, ax=axes[0, 1])

    # First Difference
    df_first_diff = df.diff().dropna()
    axes[1, 0].plot(df_first_diff)
    axes[1, 0].set_title('First Order Difference' + prefix)
    plot_acf(df_first_diff, lags=lags - 1, ax=axes[1, 1])

    # Second Difference
    df_second_diff = df.diff().diff().dropna()
    axes[2, 0].plot(df_second_diff)
    axes[2, 0].set_title('Second Order Difference' + prefix)
    plot_acf(df_second_diff, lags=lags - 2, ax=axes[2, 1])
    plt.tight_layout()
    plt.show()
    
auto_correlation(df_sales['beer'], '', 10)

(0.019143247561160443, False)

The charts above show that the time series becomes stationary after first-order differencing. However, we can see that the autocorrelation drops into negative territory very quickly, which indicates overdifferencing.
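A common rule of thumb is that a lag-1 autocorrelation of about -0.5 or lower hints at overdifferencing. The sketch below reproduces this effect on synthetic white noise, which is already stationary; the helper lag1_autocorr is written for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
white_noise = rng.normal(size=5000)   # already stationary, needs no differencing

def lag1_autocorr(series):
    """Sample autocorrelation at lag 1."""
    centered = series - series.mean()
    return np.sum(centered[1:] * centered[:-1]) / np.sum(centered ** 2)

r1_original = lag1_autocorr(white_noise)
r1_overdifferenced = lag1_autocorr(np.diff(white_noise))

print(f"lag-1 autocorrelation, original:    {r1_original:.2f}")
print(f"lag-1 autocorrelation, differenced: {r1_overdifferenced:.2f}")
```

Differencing a series that does not need it introduces an artificial negative correlation between neighboring values, which the model then has to compensate for.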

Next, we perform the same procedure for the seasonal part of our time series.

# 3.2 Seasonal part

# Reduce the timeframe to a single seasonal period
df_sales_s = df_sales['beer'][0:12]

# Autocorrelation for the seasonal part
auto_correlation(df_sales_s, '', 10)

# Check if the first difference of the seasonal period is stationary
df_diff = pd.DataFrame(df_sales_s.diff())
# Use the first date of the seasonal period as the start of the monthly index
df_diff.index = pd.date_range(df_sales_s.index[0], periods=12, freq='MS') 
check_stationarity(df_diff, "First Difference (Seasonal)", ['difference'])

(0.99, True)

Judging by the plots, the seasonal part of the time series is stationary after first-order differencing. The autocorrelation plot shows that the values go into the negative but remain within acceptable boundaries. Second-order differencing does not seem to improve these values. The ADF test still recommends differencing (p-value of 0.99), but with only eleven observations its result carries little weight. Consequently, we conclude that first-order differencing is a good choice for the D parameter.

Step #4 Finding an Optimal Model with Auto-ARIMA

Next, we auto-fit an ARIMA model to our time series. In preparation, we split our dataset into train and test sets. In this way, we ensure that we can later measure the performance of our model against a fresh set of data that the model has not seen so far.

Once we have created the train and test data sets, we can configure the parameters for the auto_arima stepwise optimization. By setting max_d = 3, we allow the search to test differencing orders from 0 to 3. Also, we set max_p and max_q to 3.

To deal with the seasonality in our time series, we set the “seasonal” parameter to True and the “m” parameter to 12 data points. This turns our model into a SARIMA model that allows us to configure additional D, P, and Q parameters. We define a max value for Q and P of 3. Previously, we have already seen that further seasonal differencing does not improve the Stationarity. Therefore, we limit D to a maximum of 2 (max_D = 2) and expect the search to settle on D = 1.

After configuring the parameters, we next fit the model to the time series. The model will try to find the optimal parameters and choose the model with the lowest AIC (Akaike information criterion).

# split into train and test
pred_periods = 30
split_number = df_sales['beer'].count() - pred_periods # corresponds to a prediction horizon of 2.5 years
df_train = pd.DataFrame(df_sales['beer'][:split_number]).rename(columns={'beer':'y_train'})
df_test = pd.DataFrame(df_sales['beer'][split_number:]).rename(columns={'beer':'y_test'})

# auto_arima
model_fit = pm.auto_arima(df_train, test='adf', 
                         max_p=3, max_d=3, max_q=3, 
                         seasonal=True, m=12,
                         max_P=3, max_D=2, max_Q=3,
                         trace=True,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

# summarize the model characteristics
print(model_fit.summary())

Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=inf, Time=3.89 sec
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=3383.210, Time=0.02 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=3351.655, Time=0.38 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=3364.350, Time=1.09 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=3604.145, Time=0.02 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=3349.908, Time=0.11 sec
 ARIMA(1,0,0)(0,1,1)[12] intercept   : AIC=3351.532, Time=0.29 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=3353.520, Time=1.24 sec
 ARIMA(2,0,0)(0,1,0)[12] intercept   : AIC=3312.656, Time=0.10 sec
 ARIMA(2,0,0)(1,1,0)[12] intercept   : AIC=3314.483, Time=0.57 sec
 ARIMA(2,0,0)(0,1,1)[12] intercept   : AIC=3314.378, Time=0.30 sec
 ARIMA(2,0,0)(1,1,1)[12] intercept   : AIC=3305.552, Time=3.02 sec
 ARIMA(2,0,0)(2,1,1)[12] intercept   : AIC=3291.425, Time=4.19 sec
 ARIMA(2,0,0)(2,1,0)[12] intercept   : AIC=3306.914, Time=3.06 sec
 ARIMA(2,0,0)(3,1,1)[12] intercept   : AIC=3276.501, Time=4.67 sec
 ARIMA(2,0,0)(3,1,0)[12] intercept   : AIC=3282.240, Time=5.24 sec
 ARIMA(2,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=7.39 sec
 ARIMA(2,0,0)(2,1,2)[12] intercept   : AIC=inf, Time=4.74 sec
 ARIMA(1,0,0)(3,1,1)[12] intercept   : AIC=3313.877, Time=5.17 sec
 ARIMA(3,0,0)(3,1,1)[12] intercept   : AIC=3246.820, Time=5.72 sec
 ARIMA(3,0,0)(2,1,1)[12] intercept   : AIC=3255.313, Time=5.33 sec
 ARIMA(3,0,0)(3,1,0)[12] intercept   : AIC=3249.998, Time=6.77 sec
 ARIMA(3,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=8.39 sec
 ARIMA(3,0,0)(2,1,0)[12] intercept   : AIC=3259.938, Time=3.55 sec
...
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Auto-ARIMA has determined that the best model is (3,0,0)(3,1,1)[12]. The seasonal differencing order D = 1 matches the result from Step #3, in which we manually performed differencing.

Step #5 Simulate the Time Series using in-sample Forecasting

Now that we have trained our model, we want to use it to simulate the entire time series. We will do this by calling the predict_in_sample function. The prediction covers the same period as the original time series with which we trained the model. Because the model only predicts one step ahead at a time, the prediction results will naturally be close to the actual values.

# Generate in-sample Predictions
# The parameter dynamic=False means that the model makes predictions upon the lagged values.
# This means that the model is trained until a point in the time-series and then tries to predict the next value.
pred = model_fit.predict_in_sample(dynamic=False) # works only with auto-arima
df_train['y_train_pred'] = pred

# Calculate the percentage difference
df_train['diff_percent'] = abs((df_train['y_train'] - pred) / df_train['y_train'])* 100

# Print the predicted time-series
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title("In Sample Sales Prediction", fontsize=14)
sns.lineplot(data=df_train[['y_train', 'y_train_pred']], linewidth=1.0)

# Print percentage prediction errors on a separate axis (ax2)
ax2 = ax1.twinx() 
ax2.set_ylabel('Prediction Errors in %', color='purple', fontsize=14)  
ax2.set_ylim([0, 50])
ax2.bar(height=df_train['diff_percent'][20:], x=df_train.index[20:], width=20, color='purple', label='absolute errors')
plt.legend()
plt.show()

The purple bars in the chart show the percentage prediction errors of the in-sample forecast.

Step #6 Generate and Visualize a Sales Forecast

Now that we have trained an optimal model, we are ready to generate a sales forecast. First, we specify the number of periods that we want to predict (pred_periods). We then store the forecast next to the actual test values and combine the training and test data into a single DataFrame (df_union) for plotting.

# Generate prediction for n periods, 
# Predictions start from the last date of the training data
test_pred = model_fit.predict(n_periods=pred_periods, dynamic=False)
df_test['y_test_pred'] = test_pred
df_union = pd.concat([df_train, df_test])
df_union.rename(columns={'beer':'y_test'}, inplace=True)

# Print the predicted time-series
fig, ax = plt.subplots(figsize=(16, 8))
plt.title("Test/Pred Comparison", fontsize=14)
sns.despine();
sns.lineplot(data=df_union[['y_train', 'y_train_pred', 'y_test', 'y_test_pred']], linewidth=1.0, dashes=False, palette='muted')
ax.set_xlim([df_union.index[150],df_union.index.max()])
plt.legend()
plt.show()

As shown above, our model’s forecast continues the seasonal pattern of the beer sales time series. On the one hand, this indicates that US beer sales will continue to rise and, on the other hand, that our model works just fine 🙂

Step #7 Measure the Performance of the Sales Forecasting Model

In this section, we will measure the performance of our ARIMA model. To learn more about this topic, check out the relataly article on measuring regression performance.

The previous section’s simulation chart shows a few outliers among the prediction errors. Therefore, we focus our analysis on the percentage errors. Two helpful metrics are the mean absolute percentage error (MAPE) and the median absolute percentage error (MDAPE).

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(df_test['y_test'], df_test['y_test_pred'])/ df_test['y_test']))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(df_test['y_test'], df_test['y_test_pred'])/ df_test['y_test'])) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')

Mean Absolute Percentage Error (MAPE): 3.94 %
Median Absolute Percentage Error (MDAPE): 3.49 %

The percent errors show that our ARIMA model achieves a decent predictive performance.
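The difference between the two metrics becomes visible as soon as a single forecast is far off. The following sketch uses hypothetical numbers, not the beer sales results, in which the last forecast misses badly.

```python
import numpy as np

# Hypothetical actuals and forecasts, chosen so that one forecast is far off
actual = np.array([100.0, 120.0, 110.0, 130.0, 125.0])
forecast = np.array([102.0, 117.0, 112.0, 128.0, 75.0])

ape = np.abs(actual - forecast) / actual * 100   # absolute percentage error per period

mape = ape.mean()        # pulled upwards by the single large error
mdape = np.median(ape)   # robust against the outlier

print(f"MAPE:  {mape:.2f} %")
print(f"MDAPE: {mdape:.2f} %")
```

When both values are close together, as in our beer sales forecast, the errors are spread evenly and not dominated by a few outliers.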

Summary

This Python tutorial has shown how to use SARIMA for sales forecasting. Sales forecasting is important for businesses because it can help them to make informed decisions about production, inventory management, and staffing, among other things. By accurately forecasting sales, businesses can ensure that they have the right amount of product available to meet customer demand, avoid overproduction and excess inventory, and plan for future growth. The use case presented was forecasting beer sales, and we used SARIMA to model seasonal sales data.

In the first part, we learned how ARIMA works, what Stationarity is, and how to check if a time series is stationary. In the second part, we developed an ARIMA model in Python to create a forecast for US beer sales. For this purpose, we used Auto-ARIMA to find the optimal parameters for our sales forecasting model and created an in-sample forecast as well as a forecast for the test period.

If you have any questions or suggestions, please let me know in the comments, and I will do my best to answer.

Now that you have learned to use ARIMA to forecast beer sales, you really earned yourself a beer. Cheers! Image created with Midjourney

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Forecasting Beer Sales with ARIMA in Python appeared first on relataly.com.

Classifying Purchase Intention of Online Shoppers with Python

Florian Follonier — Mon, 11 May 2020 21:42:35 +0000

Online shopping has become a part of our daily lives, and online stores are continually seeking to improve their sales. One way to achieve this is by using machine learning to predict customers’ purchase intentions. This innovative process can help businesses understand their customers’ behavior and tailor their marketing strategies accordingly.

In this article, we will explore the practical side of purchase intention prediction. Our focus is on developing a classification model that predicts whether a visitor will make a purchase or not. We’ll use Scikit-Learn’s machine learning library to train a Logistic Regression algorithm, and evaluate the model’s performance. Our ultimate goal is to provide insights into the circumstances under which customers make purchase decisions.

Predicting purchase intentions can offer significant benefits to online stores, such as identifying potential customers who are most likely to buy and targeting their marketing efforts accordingly. By understanding the practical application of machine learning for purchase intention prediction, online businesses can gain a competitive edge and increase their revenue.

Also: Sentiment Analysis with Naive Bayes and Logistic Regression in Python

About Modeling Customer Purchase Intentions

Customer purchase intention prediction is the process of using machine learning algorithms to predict the likelihood that a particular customer will make a purchase. This can be useful for various applications, such as identifying potential customers most likely interested in a particular product or service and targeting marketing and sales efforts accordingly.

To make accurate predictions about customer purchase intentions, it is important to have access to high-quality data about the customer, such as their demographic information, purchasing history, and other relevant factors. By analyzing this data and applying appropriate machine learning algorithms, it is possible to identify patterns and trends that can predict the likelihood that a particular customer will make a purchase.

There are many different approaches to customer purchase intention prediction, and the specific methods used can vary depending on the application and the data available. Some common techniques for predicting customer purchase intentions include using regression analysis to model the relationship between purchase intentions and other variables and using classification algorithms to classify customers as likely or unlikely to make a purchase. By using these techniques, it is possible to make more accurate and useful predictions about customer purchase intentions.

Also: Customer Churn Prediction – Understanding Models with Feature Permutation Importance

Customer purchase intentions sometimes follow patterns that can be used for predictive purposes. Image created with Midjourney.

How Modeling Purchase Intentions can Lead to a Better Customer Understanding

Predicting the purchase intentions of online shoppers can help online stores understand their customers better. Predictive models make it possible to draw conclusions about the factors that influence customers’ buying behavior. At what time of day are our customers most inclined to buy? For which products do customers often abandon the purchase process? Such questions are highly relevant for marketing departments. Once answered, they enable marketers to optimize their customers’ buying experience and achieve a higher conversion rate. In this way, intention prediction can help online stores target customers with the right products at the right time and thus take a step toward marketing automation.

Classifying Purchase Intentions of Online Shoppers with Python

Implementing a Prediction Model for Purchase Intentions with Python

Logistic regression is a widely-used algorithm in machine learning that is particularly useful for solving two-class classification problems. One of the primary benefits of using logistic regression models is that they can help us understand the factors that influence the predictions made by the model. This interpretability is a key advantage of logistic regression, making it a popular choice in many real-world applications.

In the next steps of our analysis, we will develop a two-class classification model that uses the logistic regression algorithm to predict the purchase intentions of online shoppers. By analyzing a set of features that are likely to influence a shopper’s decision to purchase, such as page values, bounce rates, and the visitor type, we can build a model that predicts the likelihood of a shopper completing a purchase. The logistic regression algorithm is particularly useful in this case, as it allows us to identify which features are the most significant predictors of purchase intention.

The code is available on the GitHub repository.


Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and the required packages. If you don’t have an environment yet, consider the Anaconda Python environment. To set it up, you can follow the steps in this tutorial. Please make sure that all packages imported in Step #1 are installed, including NumPy, pandas, and Matplotlib.

In addition, we will be using the machine learning library Scikit-learn and Seaborn for visualization. You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

About the Dataset

In this tutorial, we will be working with a public dataset from Kaggle.com. The data consists of 12,330 shopping sessions, each described by 18 attributes (17 features plus the class label). You can download the data via the link below:

online_shoppers_intention.csv Download

The data stems from a large shopping website that recorded sessions over one year. Each record belongs to a separate shopping session and user. This design is meant to avoid a bias toward a specific period, user, or day.

Below you will find an overview of the features contained in the data (Source: Kaggle.com):

“Administrative,” “Administrative Duration,” “Informational,” “Informational Duration,” “Product Related,” and “Product-Related Duration” represent the number of different types of pages visited by the visitor in that session and the total time spent in each of these page categories.
The “Bounce Rate,” “Exit Rate,” and “Page Value” features represent the metrics measured by “Google Analytics” for each page on the e-commerce site.
The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (e.g., Mother’s Day, Valentine’s Day)
The dataset also includes an operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.

The ‘Revenue’ attribute is the class label, called the “prediction label.”

Step #1 Load the Data

We begin by loading the shopping dataset into a Pandas DataFrame. Afterward, we will print a brief overview of the data.

import calendar
import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib import cm
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load train data
filepath = "data/classification-online-shopping/"
df_shopping_base = pd.read_csv(filepath + 'online_shoppers_intention.csv') 
df_shopping_base

	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType			Weekend	Revenue
0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			Feb		1					1		1		1			Returning_Visitor	False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			Feb		2					2		1		2			Returning_Visitor	False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			Feb		4					1		9		3			Returning_Visitor	False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			Feb		3					2		2		4			Returning_Visitor	False	False
4	0.0				0.0						0.0				0.0						10.0			627.500000				0.02		0.05		0.0			0.0			Feb		3					3		1		4			Returning_Visitor	True	False

Step #2 Cleaning the Data

Before we can start training our prediction model, we’ll do some cleanups (handling missing data, data type conversions, treating outliers, and so on).

# Encode the visitor type as an integer
print(df_shopping_base['VisitorType'].unique())
df_shop = df_shopping_base.replace({'VisitorType' : { 'New_Visitor' : 0, 'Returning_Visitor' : 1, 'Other' : 2 }})

# Convert the month column to numeric values
# (the dataset spells out 'June', all other months are abbreviated)
month_numbers = {abbr: num for num, abbr in enumerate(calendar.month_abbr) if abbr}
df_shop['Month'] = df_shop['Month'].replace('June', 'Jun').map(month_numbers)

# Delete records with NAs
df_shop.dropna(inplace=True)

df_shop.head()

['Returning_Visitor' 'New_Visitor' 'Other']
	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType	Weekend	Revenue
  0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			2		1					1		1		1			1			False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			2		2					2		1		2			1			False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			2		4					1		9		3			1			False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			2		3					2		2		4			1			False	False
4	0.0				0.0						0.0				0.0						10.0			627.50
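The preview above contains durations of -1.0 in row 2, which cannot be real time measurements. The tutorial does not treat these values further, so the following is only a sketch of how one could count such placeholders before deciding what to do with them. It runs on a small hand-made DataFrame, and the helper `count_negative_values` is a hypothetical name.

```python
import pandas as pd

def count_negative_values(df, columns):
    """Count the negative entries per column."""
    return (df[columns] < 0).sum()

# Hand-made sample that mimics three of the duration columns
toy = pd.DataFrame({
    'Administrative_Duration': [0.0, 0.0, -1.0, 0.0],
    'Informational_Duration': [0.0, 0.0, -1.0, 0.0],
    'ProductRelated_Duration': [0.0, 64.0, -1.0, 2.666667],
})

duration_columns = list(toy.columns)
negative_counts = count_negative_values(toy, duration_columns)
print(negative_counts)

# One possible treatment (an assumption, not the tutorial's approach):
# keep only the rows in which all durations are non-negative
toy_clean = toy[(toy[duration_columns] >= 0).all(axis=1)]
print(len(toy_clean))
```

Whether dropping, clipping, or keeping such rows is appropriate depends on how the values entered the data, which the dataset description does not state.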

Step #3 Exploring the Data

Next, we will familiarize ourselves with the data.

3.1 Class Labels

First, we take a look at the class labels to see how balanced they are. If class labels are balanced, it means that each class has an approximately equal number of examples in the training data. This is important because it helps ensure that the trained model will be able to make accurate predictions on new data. If the class labels are unbalanced, then the model is more likely to be biased towards the more common classes, which can lead to poor performance on less common classes. Additionally, unbalanced class labels can make it more difficult to evaluate the performance of a machine learning model, because the model’s accuracy may not be an accurate reflection of its ability to generalize to new data.

# Checking the balance of prediction labels
plt.figure(figsize=(16,2))
fig = sns.countplot(y="Revenue", data=df_shop, palette="muted")
plt.show()

Our class labels are imbalanced: there are many more sessions with the label “False” than with the label “True.” The reason is that most visitors don’t buy anything. Imbalanced data can affect the performance of classification models. But now that we are aware of the imbalance in our data, we can choose appropriate evaluation metrics later.
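The count plot gives a visual impression of the imbalance. To put a number on it, one can compute the share of buyers and the accuracy that a trivial model would reach by always predicting the majority class. The sketch below uses a small made-up label series rather than the shopping data, and all variable names are chosen for illustration.

```python
import pandas as pd

# Toy labels for illustration: 17 sessions without and 3 sessions with a purchase
labels = pd.Series([False] * 17 + [True] * 3, name='Revenue')

n_total = len(labels)
n_buyers = int(labels.sum())          # True counts as 1
buyer_share = n_buyers / n_total

# Accuracy of a trivial model that always predicts the majority class
majority_baseline_accuracy = max(buyer_share, 1 - buyer_share)

print(f'Share of buyers: {buyer_share:.2f}')
print(f'Majority-class baseline accuracy: {majority_baseline_accuracy:.2f}')
```

Any accuracy we report later should be read against this baseline: a model is only useful if it clearly beats the majority-class guess or finds buyers that the trivial model would miss.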

3.2 Feature Correlation

When developing classification models, not all features are usually equally useful. It is important that features are not correlated because correlated features can provide redundant information to a machine learning model. If two or more features are highly correlated, they may convey the same information to the model, which can make the model’s predictions less accurate. Additionally, having correlated features can make it more difficult to interpret the model’s predictions, because it is not clear which features are actually contributing to the model’s decision-making process.

Let’s check which of our features are correlated. First, we will create a series of whisker plots (box plots) for the features in our dataset. They help us identify potential outliers and get a better idea of how the data looks.

# Whisker plots for all input variables
df_shop.drop('Revenue', axis=1).plot(kind='box', 
                                subplots=True, layout=(4,4), 
                                sharex=False, sharey=False, 
                                figsize=(14,14), 
                                title='Whisker plots for input variables')
plt.show()

Feature whisker plots

The whisker plots show a few outliers in the data. However, they are not severe enough to require special treatment.
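Reading outliers off a plot is subjective. As a rough complement, the sketch below counts the points that fall outside the whiskers using the common 1.5 × IQR rule, which is also Matplotlib’s default whisker length. It runs on a tiny made-up DataFrame, and `iqr_outlier_counts` is a hypothetical helper name.

```python
import pandas as pd

def iqr_outlier_counts(df, factor=1.5):
    """Count values outside [Q1 - factor*IQR, Q3 + factor*IQR] per column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # Comparing a DataFrame with a Series aligns the Series with the columns
    outside = (df < lower) | (df > upper)
    return outside.sum()

# Made-up values for illustration
toy = pd.DataFrame({
    'with_outlier': [1, 2, 3, 4, 5, 100],
    'without_outlier': [10, 11, 12, 13, 14, 15],
})

outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

Applied to the numeric feature columns of the shopping data, such a count would show which features have the most points beyond the whiskers.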

Pair plots are another way of visualizing the data. The panels on the diagonal show the distribution of each feature, separated by the value of the prediction label, while the remaining panels show pairwise scatter plots. To create the pair plots, run the code below.

# Create pair plots for the feature columns, colored by the prediction label
df_plot = df_shop.copy()
sns.pairplot(df_plot, hue="Revenue", height=2.5)

Shopper-Buying-Intention pair plots with seaborn

Finally, we create a correlation matrix and visualize it as a heat map. The matrix provides a quick overview of which features are correlated and which are not.

# Feature correlation
plt.figure(figsize=(15,4))
f_cor = df_shop.corr()
sns.heatmap(f_cor, cmap="Blues_r")

The correlation plot shows that the following feature pairs are highly correlated:

ProductRelated and ProductRelated_Duration
BounceRates and ExitRates

plt.figure(figsize=(8,5))
sns.scatterplot(x= 'BounceRates',y='ExitRates',data=df_shop,hue='Revenue')
plt.title('Bounce Rate vs. Exit Rate', fontweight='bold', fontsize=15)
plt.show()

plt.figure(figsize=(8,5))
sns.scatterplot(x= 'ProductRelated',y='ProductRelated_Duration',data=df_shop,hue='Revenue')
plt.title('Product Related vs. Product Related Duration', fontweight='bold', fontsize=15)
plt.show()

When we start to train our model, we will only use one of the features from the two pairs.
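Instead of reading the pairs off the heat map, they can also be listed programmatically. The following sketch extracts all feature pairs whose absolute correlation exceeds a threshold. The threshold of 0.8, the synthetic data, and the helper name `highly_correlated_pairs` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (without the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data for illustration
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),  # almost a rescaled copy of 'a'
    'c': rng.normal(size=200),                      # independent noise
})

correlated = highly_correlated_pairs(toy)
print(correlated)
```

On the shopping data, calling the function on df_shop would be expected to list the two pairs named above, provided their correlation exceeds the chosen threshold.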

Step #4 Data Preprocessing

Now that we are familiar with the data, we can prepare the data to train the purchase intention classification model. First, we select a subset of the features from the original shopping dataset. Second, we split the data into two separate datasets: 70% for training and 30% for testing. X_train and X_test contain the features, while y_train and y_test contain the respective prediction labels. Third, we use the MinMaxScaler to scale the features to the range between 0 and 1. Scaling puts all features on a comparable scale, which makes it easier for the algorithm to work with the data and can improve classification performance.

# Separate labels from training data
features = ['Administrative', 'Administrative_Duration', 'Informational', 
            'Informational_Duration', 'ProductRelated', 'BounceRates', 'PageValues', 
            'Month', 'Region', 'TrafficType', 'VisitorType']
X = df_shop[features] #Training data
y = df_shop['Revenue'] #Prediction label

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Scale the numeric values
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
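Two details of this step are worth illustrating on toy data. The code above fits the scaler on the training set only and reuses it for the test set, which keeps information from the test set out of the training process. In addition, because our labels are imbalanced, one may want to pass the `stratify` argument to `train_test_split` so that both subsets contain the same share of buyers. The tutorial’s split does not do this, so treat it as an optional variation. All variable names below are made up for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration: 100 sessions, 20 % of them buyers
X_toy = np.arange(100, dtype=float).reshape(-1, 1)
y_toy = np.array([False] * 80 + [True] * 20)

# stratify=y_toy keeps the share of buyers identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0, stratify=y_toy)

# Fit the scaler on the training data only, then reuse it for the test data
toy_scaler = MinMaxScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(f'Buyer share in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}')
print(f'Scaled train range: {X_tr_scaled.min():.2f} to {X_tr_scaled.max():.2f}')
```

Note that scaled test values can fall slightly outside the range 0 to 1 whenever the test set contains values beyond the minimum or maximum seen during training. This is expected and not an error.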

Step #5 Train a Purchase Intention Classifier

Next, it is time to train our prediction model. Various classification algorithms could be used to solve this problem, for example, decision trees, random forests, neural networks, or support-vector machines. We will use the logistic regression algorithm, a common choice for simple two-class prediction problems.

We start the training process using the “fit” method of the logistic regression algorithm.

# Training a classification model using logistic regression 
logreg = LogisticRegression(solver='lbfgs')
score = logreg.fit(X_train, y_train).decision_function(X_test)

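The decision_function call returns a confidence score for each session in the test dataset. The score is the signed distance to the decision boundary: positive values point to a purchase, negative values to no purchase.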
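The relationship between the values returned by `decision_function`, the predicted probabilities, and the predicted labels can be checked on a small example. The sketch below uses synthetic data from `make_classification` instead of the shopping data, and the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

scores = demo_model.decision_function(X_demo)     # signed distance to the decision boundary
proba = demo_model.predict_proba(X_demo)[:, 1]    # probability of the positive class

# Applying the logistic (sigmoid) function to the scores yields the probabilities
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-scores))
print(np.allclose(sigmoid_of_scores, proba))

# A positive score corresponds to predicting the positive class
print(np.array_equal(demo_model.predict(X_demo), (scores > 0).astype(int)))
```

For a binary logistic regression, a score of zero therefore corresponds to a predicted probability of 0.5.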

Step #6 Evaluate Model Performance

Finally, we will evaluate the performance of our classification model. For this purpose, we first create a confusion matrix. Then we calculate and compare different error metrics.

6.1 Confusion Matrix

The confusion matrix is a holistic and clean way to illustrate the results of a classification model. It differentiates between predicted labels and actual labels. For a binary classification model, the matrix comprises 2×2 quadrants that show the number of cases in each quadrant.

# Create a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix as a heatmap
%matplotlib inline
class_names = [False, True] # names of the classes
fig, ax = plt.subplots(figsize=(7, 6))
# Label rows (actual) and columns (predicted) with the class names
cnf_df = pd.DataFrame(cnf_matrix, index=class_names, columns=class_names)
sns.heatmap(cnf_df, annot=True, cmap="YlGnBu", fmt='g', ax=ax)
ax.xaxis.set_label_position("top")
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.tight_layout()

In the upper left (0,0), we see that the model correctly predicted for 3102 online shopping sessions that these sessions will not lead to a purchase (True negatives). In 30 cases, the model was wrong and expected that there would be a purchase, but there wasn’t (False positives). For 412 buyers, the model predicted that they would not buy anything, even though they were buying something (False negatives). In the lower right corner, we see that only in 151 cases could buyers be correctly identified as such (True positives).
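The four quadrants can also be read from the matrix in code. In Scikit-learn, the rows of the confusion matrix are the actual labels and the columns are the predicted labels, so flattening a 2×2 matrix yields the counts in the order TN, FP, FN, TP. The labels in the sketch below are hand-made for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration (1 = purchase, 0 = no purchase)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual labels, columns are predicted labels
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
```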

6.2 Performance Metrics for Classification Models

Next, let’s take a brief look at the performance metrics. Four standard metrics that measure the performance of classification models are Accuracy, Precision, Recall, and f1_score.

print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))

Accuracy

The accuracy of the test set shows that 88% of the online shopper sessions were correctly classified. However, our data is imbalanced. That is to say, most labels have the value “False,” and only a few target labels are “True.” Consequently, we must ensure that our model does not classify all online shoppers as “non-buyers” (label: False) but also correctly predicts the buyers (label: True).

Precision

We calculate the precision as the number of True Positives divided by the sum of True Positives and False Positives. Precision tells us how reliable a predicted purchase is, but it ignores the False Negatives, that is, the buyers the model misses. Therefore, it does not say much about how well our model finds buyers. The precision score for our model is just a little lower than the accuracy (83%).

Recall

We calculate the Recall by dividing the number of True Positives by the sum of the True Positives and the False Negatives. The Recall of our model is 27%, which is significantly below accuracy and precision. In our case, Recall is more meaningful than Accuracy and Precision because it penalizes the low number of True Positives.

F1-Score

The formula for the F1-Score is 2*((precision*recall)/(precision+recall)). Because the formula includes the Recall, the F1-Score of our model is only 41%. If we want to optimize our classification model further, we should keep an eye on both the F1-Score and the Recall.
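As a cross-check, the four metrics can be recomputed by hand from the counts reported for the confusion matrix in section 6.1. The sketch below only uses those four numbers and plain arithmetic.

```python
# Counts taken from the confusion matrix in section 6.1
tn, fp, fn, tp = 3102, 30, 412, 151

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f'Accuracy: {accuracy:.2f}')    # 0.88
print(f'Precision: {precision:.2f}')  # 0.83
print(f'Recall: {recall:.2f}')        # 0.27
print(f'f1_score: {f1:.2f}')          # 0.41
```

The results match the values discussed above and show how the 412 missed buyers pull down Recall and, through it, the F1-Score.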

6.3 Interpretation

Metrics for classification models can be misleading. We should thus choose them carefully. Depending on which use case we are dealing with, False-negative and False-positive predictions can have different costs. Therefore, model evaluation is not always about exactness (precision and accuracy). Instead, the choice of performance metrics depends on what we want to achieve.

The challenge for our model is to correctly classify the smaller group of buyers (True Positives). So, optimizing our model is about striking a balance: improving the F1-Score and Recall without giving up too much accuracy.
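One common way to shift this balance, which the tutorial itself does not pursue, is to lower the probability threshold above which a session counts as a purchase. The sketch below demonstrates the effect on synthetic imbalanced data. The threshold of 0.3, the data, and all names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data for illustration (about 15 % positives)
X_syn, y_syn = make_classification(n_samples=2000, n_features=8,
                                   weights=[0.85, 0.15], random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, train_size=0.7,
                                      random_state=0, stratify=y_syn)

clf = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_a, y_a)
proba_b = clf.predict_proba(X_b)[:, 1]

def scores_at(threshold):
    """Precision and recall when sessions with probability >= threshold count as buyers."""
    pred = (proba_b >= threshold).astype(int)
    return precision_score(y_b, pred, zero_division=0), recall_score(y_b, pred)

precision_default, recall_default = scores_at(0.5)
precision_low, recall_low = scores_at(0.3)
print(f'Threshold 0.5: precision={precision_default:.2f}, recall={recall_default:.2f}')
print(f'Threshold 0.3: precision={precision_low:.2f}, recall={recall_low:.2f}')
```

Lowering the threshold can only add predicted buyers, so Recall never decreases, typically at the cost of more False Positives. Which trade-off is acceptable depends on the costs of the two error types in the given use case.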

Step #7 Insights on Customer Purchase Intentions

Finally, we will use permutation feature importance to gain additional insights into our prediction model’s features. Permutation feature importance is a technique that measures how much the model’s score drops when the values of a single feature are randomly shuffled. Features with a high score have a substantial impact on predicting the label. In contrast, features with scores close to zero play a lesser role in the predictions.

# Compute the permutation feature importance on the test set
from sklearn.inspection import permutation_importance
r = permutation_importance(logreg, X_test, y_test, n_repeats=30, random_state=0)

# Plot the bar chart
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permutation_score'])
data_im['feature_names'] = X.columns
data_im = data_im.sort_values('feature_permutation_score', ascending=False)

fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y=data_im['feature_names'], x="feature_permutation_score", data=data_im, palette='nipy_spectral')
ax.set_title("Logistic Regression Feature Importances")

We can see that the three features with the highest impact are PageValues, BounceRates, and Administrative_Duration.

The higher the page value of the visited pages, the higher the chance that the customer makes a purchase.
The higher the average bounce rate of the pages that the customer visits, the higher the chance that the customer makes a purchase.
In contrast, the more time a customer spends on administrative pages, the lower the chance that the customer completes the purchase.

These were just a few sample findings. There is much more to explore in the data, and deeper analysis can uncover much more about the customers’ buying decisions.
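When interpreting such findings, it helps to keep in mind that permutation importance measures how strongly the model relies on a feature, not in which direction the feature acts. For a logistic regression, the sign of the fitted coefficient indicates the direction. The following sketch shows both quantities side by side on synthetic data in which only the first feature determines the label. The data and the feature names are made up for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration: only the first feature determines the label
rng = np.random.default_rng(0)
X_perm = rng.normal(size=(500, 3))
y_perm = (X_perm[:, 0] > 0).astype(int)

perm_model = LogisticRegression(solver='lbfgs').fit(X_perm, y_perm)
result = permutation_importance(perm_model, X_perm, y_perm, n_repeats=10, random_state=0)

# Mean drop in accuracy per feature, with its spread over the repeats
for name, mean, std in zip(['informative', 'noise_1', 'noise_2'],
                           result.importances_mean, result.importances_std):
    print(f'{name}: {mean:.3f} +/- {std:.3f}')

# The sign of each coefficient indicates the direction of the effect
print(perm_model.coef_[0])
```

In this example, the informative feature receives by far the highest importance and a positive coefficient, while the two noise features stay close to zero.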

Summary

This article has presented customer purchase prediction as an interesting use case for machine learning in e-commerce. After discussing the use case, we have developed a classification model that predicts the purchase intentions of online shoppers. You have learned to preprocess the data, train a logistic regression model and evaluate the model’s performance. Classifying purchase intentions can help online shops understand their customers better and automate certain online marketing activities. The previous section showed how marketers could use this to gain further insights into their customers’ behavior.

Thanks for reading and if you have any questions, let me know in the comments.

The post Classifying Purchase Intention of Online Shoppers with Python appeared first on relataly.com.