Classification (two-class) Archives - relataly.com

Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python

Florian Follonier — Thu, 07 Apr 2022 17:55:36 +0000

Perfecting your machine learning model’s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble model, and heavily influence its performance. Unlike model parameters, which are discovered during training by the machine learning algorithm, hyperparameters require pre-specification.

In this comprehensive Python tutorial, we’ll guide you on how to harness the power of Random Search to optimize a regression model’s hyperparameters. Our illustrative example utilizes a Random Decision Forest for predicting house prices. However, the fundamental principles you’ll learn can be applied to any model. So why painstakingly fine-tune hyperparameters manually when Random Search can handle the task efficiently?

Here’s a preview of what this Python tutorial entails:

A brief overview of how Random Search operates and instances where it might be preferable to Grid Search.
A hands-on Python tutorial featuring a public house price dataset from Kaggle.com. The aim here is to train a regression model capable of predicting US house prices based on various properties.
Training a ‘best-guess’ model in Python, followed by using Random Search to discover a model with enhanced performance.
Finally, we’ll implement cross-validation to validate our models’ performance.

By the end of this tutorial, you’ll be well-equipped to let Random Search efficiently fine-tune your model’s hyperparameters, freeing up your time for other crucial tasks.

Hyperparameter Tuning

Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, we need to specify hyperparameters before the training. It is usually impossible to find the best configuration without testing different configurations.
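The difference between hyperparameters and learned parameters is easy to see in code. The following minimal sketch is not part of the house price tutorial; it assumes a Ridge regression on made-up toy data purely for illustration. We set alpha ourselves before training, while the slope and intercept come out of the training run.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data following y = 3*x + 1 (made up for illustration)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

# alpha is a hyperparameter: we choose it before training
model = Ridge(alpha=1.0)
model.fit(X, y)

# coef_ and intercept_ are parameters: the algorithm learns them from the data
hyperparameters = model.get_params()
learned_slope = float(model.coef_[0])
learned_intercept = float(model.intercept_)

print("alpha (set by us):", hyperparameters["alpha"])
print("slope and intercept (learned):", round(learned_slope, 3), round(learned_intercept, 3))
```

Changing alpha and refitting yields a different slope, which is exactly why the choice of hyperparameters matters.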

Searching for a suitable model configuration is called “hyperparameter tuning” or “hyperparameter optimization.” Machine learning algorithms have varying hyperparameters and parameter values. For example, a random decision forest allows us to configure hyperparameters such as the number of trees, the maximum tree depth, and the minimum number of samples required to split a node.

The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it gets to find an optimal model. We can use random search to automate this process.

Random search can be an efficient way to tune the hyperparameters of a machine learning model. Image generated with Midjourney

Techniques for Tuning Hyperparameters

Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning algorithm to optimize its performance on a specific dataset or task. Several techniques can be used for hyperparameter tuning, including:

Grid Search: Grid search is a brute-force search algorithm that systematically evaluates a given set of hyperparameter values by training and evaluating a model for each combination of values. It is a simple and effective technique, but it can be computationally expensive, especially for large search spaces or models that take long to train.
Random Search: Random search is an alternative to grid search that randomly samples combinations from a given set of hyperparameter values rather than evaluating all possible combinations. It can be more efficient than grid search, but it may not find the optimal set of hyperparameters.
Bayesian Optimization: Bayesian optimization is a probabilistic approach to hyperparameter tuning, which uses Bayesian inference to model which hyperparameter values are likely to produce a good performance. It can be more efficient and effective than grid search or random search, but it can be more challenging to implement and interpret.
Genetic Algorithms: Genetic algorithms are optimization algorithms inspired by the principles of natural selection and genetics. They use a population of candidate solutions, which are iteratively evolved and selected based on their fitness or performance, to find a well-performing set of hyperparameters.

In this article, we specifically look at the Random Search technique.

You can spend much time tuning a machine learning model. Image generated with Midjourney.

What is Random Search?

The random search algorithm generates models from hyperparameter permutations randomly selected from a grid of parameter values. The idea behind the randomized approach is that testing random configurations efficiently identifies a good model. We can use random search both for regression and classification models.

Random Search and Grid Search are the most popular techniques for hyperparameter tuning, and both methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible variants. Grid search works well for testing a small number of configurations already known to work well.

As long as both search space and training time are small, the grid search technique is excellent for finding the best model. However, the number of model variants increases exponentially with the size of the search space. It is often more efficient for large search spaces or complex models to use random search.

Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. However, it is also much faster than grid search and efficient in delivering a suitable model in a short time.
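To make the difference in effort concrete, the sketch below counts how many models each technique would train. It uses the parameter ranges that appear later in this tutorial together with scikit-learn's ParameterGrid and ParameterSampler helpers; the variable names and the random_state value are chosen only for this illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    "max_leaf_nodes": [2, 3, 4, 5, 6, 7],
    "min_samples_split": [5, 10, 20, 50],
    "max_depth": [5, 10, 15, 20],
    "max_features": [3, 4, 5],
    "n_estimators": [50, 100, 200],
}

# Grid search would train one model per combination
n_grid = len(ParameterGrid(param_grid))

# Random search only trains n_iter randomly drawn combinations
sampled = list(ParameterSampler(param_grid, n_iter=20, random_state=0))

print("Grid search candidates:", n_grid)
print("Random search candidates:", len(sampled))
print("First sampled configuration:", sampled[0])
```

For these ranges, the grid holds 6 × 4 × 4 × 3 × 3 = 864 combinations, so 20 random draws cover only about 2 % of them. With five-fold cross-validation, that is the difference between 4,320 and 100 training runs.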

Random Search vs. Exhaustive Grid Search

Tuning the Hyperparameters of a Random Decision Forest Regressor in Python using Random Search

In this tutorial, we delve into the use of the Random Search algorithm in Python, specifically for predicting house prices. We’ll be using a dataset rich in diverse house characteristics. Various elements, such as data quality and quantity, model intricacy, the selection of machine learning algorithms, and housing market stability, significantly influence the accuracy of house price predictions.

Our model employs a Random Decision Forest algorithm, which we’ll optimize using a random search approach for hyperparameter tuning. By identifying and implementing a more advantageous configuration, we aim to enhance our model’s performance.

Here’s a concise outline of the steps we’ll undertake:

Loading the house price dataset
Exploring the dataset intricacies
Preparing the data for modeling
Training a baseline Random Decision Forest model
Implementing a random search approach for model optimization
Measuring and evaluating the performance of our optimized model

Through this step-by-step guide, you’ll learn to enhance model performance, further refining your understanding of Random Search algorithm implementation in Python.

The Python code is available in the relataly GitHub repository.


Once we have trained a house price prediction model, we can use it to assess the price of new houses. Image generated with Midjourney.

Prerequisites

Before starting the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, Matplotlib, and Seaborn.

In addition, we will be using the Python Machine Learning library Scikit-learn to implement the random forest and the random search technique.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

House Price Prediction: About the Use Case and the Data

House price prediction is the process of using statistical and machine learning techniques to predict the future value of a house. This can be useful for a variety of applications, such as helping homeowners and real estate professionals to make informed decisions about buying and selling properties. In order to make accurate predictions, it is important to have access to high-quality data about the housing market.

In this tutorial, we will work with a house price dataset from the house price regression challenge on Kaggle.com. The dataset is available via a GitHub repository. The training file contains information about 1,460 houses sold between 2006 and 2010 in Ames, Iowa (US). Each record includes the sale price and 79 house characteristics, such as:

YearBuilt – The year of construction
YrSold – The year in which the house was sold
LotArea – The lot area of the house
OverallQual – The overall quality of the house from one (lowest) to ten (highest)
Street – The type of road access, e.g., paved
Utilities – The type of utilities available
GarageArea – The size of the garage included with the property
TotRmsAbvGrd – The number of rooms above ground

Predicting house prices with machine learning. Image generated with Midjourney.

Step #1 Load the Data

We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.

# A tutorial for this file is available at www.relataly.com

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.ensemble import RandomForestRegressor

# Source: 
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Load the training dataset
path = "https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv"
df = pd.read_csv(path)
print(df.columns)
df.head()

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature  MiscVal  MoSold  YrSold SaleType SaleCondition  SalePrice
0   1          60       RL         65.0     8450   Pave   NaN      Reg         Lvl    AllPub  ...         0    NaN   NaN         NaN        0       2    2008       WD        Normal     208500
1   2          20       RL         80.0     9600   Pave   NaN      Reg         Lvl    AllPub  ...         0    NaN   NaN         NaN        0       5    2007       WD        Normal     181500
2   3          60       RL         68.0    11250   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN   NaN         NaN        0       9    2008       WD        Normal     223500
3   4          70       RL         60.0     9550   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN   NaN         NaN        0       2    2006       WD       Abnorml     140000
4   5          60       RL         84.0    14260   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN   NaN         NaN        0      12    2008       WD        Normal     250000

5 rows × 81 columns

Step #2 Explore the Data

Before jumping into preprocessing and model training, let’s quickly explore the data. A distribution plot can help us understand our dataset’s frequency of regression values.

# Plot the distribution of the sale price
ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2)
plt.title('Sale Price Distribution')

For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between the sale price, lot area, and the house’s overall quality.

# Create a scatterplot of sale price vs. lot area, colored by overall quality
plt.figure(figsize=(16,6))
df_features = df[['SalePrice', 'LotArea', 'OverallQual']]
sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual')
plt.title('Sale Price vs. Lot Area by Overall Quality')

As expected, the scatterplot shows that the sale price increases with the overall quality. On the other hand, the LotArea has only a minor effect on the sale price.

Step #3 Data Preprocessing

Next, we prepare the data for use as input to train a regression model. Because we want to keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables with integer dummy values.

To ensure that our regression model does not know the target variable, we separate house price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.

def preprocessFeatures(df):
    # Define a list of relevant features plus the target column
    feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond']
    # Encode categorical columns (here: Utilities) as dummy variables
    df_dummy = pd.get_dummies(df[feature_list])
    return df_dummy

df_base = preprocessFeatures(df)

# Separate the target (y) from the features (x) so the model never sees the sale price
y = df_base['SalePrice'].copy()
x = df_base.drop(columns=['SalePrice'])

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
x_train

      OverallQual  GarageArea  LotArea  OverallCond  Utilities_AllPub  Utilities_NoSeWa
682             6         431     2887            5                 1                 0
960             5           0     7207            7                 1                 0
1384            6         280     9060            5                 1                 0
1100            2         246     8400            5                 1                 0
416             6         440     7844            7                 1                 0
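To see the dummy encoding and the split in isolation, here is a small self-contained sketch. The mini dataset is hypothetical and only borrows three column names from the house price data; the values are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical mini dataset with one categorical column
toy = pd.DataFrame({
    "OverallQual": [6, 5, 7, 4, 8, 6, 5, 7, 6, 3],
    "Utilities": ["AllPub", "AllPub", "NoSeWa", "AllPub", "AllPub",
                  "AllPub", "NoSeWa", "AllPub", "AllPub", "AllPub"],
    "SalePrice": [200, 150, 230, 120, 300, 210, 140, 250, 190, 100],
})

# get_dummies leaves numeric columns untouched and expands Utilities
encoded = pd.get_dummies(toy)

# Separate target and features before splitting
y = encoded["SalePrice"]
x = encoded.drop(columns=["SalePrice"])
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)

print(list(x.columns))
print(x_train.shape, x_test.shape)
```

The categorical Utilities column turns into one indicator column per category, while the numeric columns pass through unchanged.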

Step #4 Train Different Regression Models using Random Search

Now that the dataset is ready, we can train the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then selects n random permutations from the grid and uses them to train the model.

We use the RandomizedSearchCV algorithm from the scikit-learn package. The “CV” in the class name stands for cross-validation. Cross-validation involves splitting the data into subsets (folds) and rotating them between training and validation runs. This way, each model is trained and tested multiple times on different data partitions. When the search algorithm finally evaluates the model configuration, it averages these results into a mean test score.
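The rotation of folds can be sketched with scikit-learn's KFold splitter, which is also what an integer cv value resolves to for regressors. The ten dummy samples below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples, split into five folds
X = np.arange(10).reshape(-1, 1)
kfold = KFold(n_splits=5)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    # Each sample is used for validation exactly once across the five folds
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

With five folds, every candidate configuration is trained five times, each time on 80 % of the training data and validated on the remaining 20 %.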

We use a Random Decision Forest – a robust machine learning algorithm that can handle classification and regression tasks. As a so-called ensemble model, the Random Forest combines the predictions from a set of multiple independent estimators (decision trees). The estimator is an important parameter to pass to the RandomizedSearchCV class. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:

max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

These parameter ranges define the search space from which the randomized search algorithm (RandomizedSearchCV) will select random configurations. Other parameters will use default values as defined by scikit-learn.

# Define the Estimator and the Parameter Ranges
dt = RandomForestRegressor()
number_of_iterations = 20
max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

# Define the param distribution dictionary
param_distributions = dict(max_leaf_nodes=max_leaf_nodes, 
                           min_samples_split=min_samples_split, 
                           max_depth=max_depth,
                           max_features=max_features,
                           n_estimators=n_estimators)

# Build the randomized search
grid = RandomizedSearchCV(estimator=dt, 
                          param_distributions=param_distributions, 
                          n_iter=number_of_iterations, 
                          cv = 5)

grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Mean test scores: {0}\nBest params: {1}".format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df.head()

Mean test scores: [0.68738293 0.49581669 0.52138751 0.61235299 0.65360944 0.61165147
 0.70392285 0.52278886 0.67687248 0.68219638 0.70031536 0.65842909
 0.51939338 0.70801017 0.70911805 0.69543885 0.67983801 0.60744371
 0.68270285 0.70741042]
Best params: {'n_estimators': 100, 'min_samples_split': 5, 'max_leaf_nodes': 7, 'max_features': 3, 'max_depth': 15}
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_n_estimators  param_min_samples_split  param_max_leaf_nodes  param_max_features  param_max_depth                                              params  split0_test_score  split1_test_score  split2_test_score  split3_test_score  split4_test_score  mean_test_score  std_test_score  rank_test_score
0       0.049196      0.002071         0.004074        0.000820                  50                       20                     5                   4               15  {'n_estimators': 50, 'min_samples_split': 20, ...           0.662973           0.705533           0.669520           0.702608           0.696280         0.687383        0.017637                7
1       0.041115      0.000554         0.003046        0.000094                  50                       50                     2                   3               10  {'n_estimators': 50, 'min_samples_split': 50, ...           0.490984           0.527231           0.426270           0.523086           0.511513         0.495817        0.036978               20
2       0.043325      0.000779         0.003486        0.000447                  50                       50                     2                   5               20  {'n_estimators': 50, 'min_samples_split': 50, ...           0.484524           0.559358           0.485459           0.517253           0.560343         0.521388        0.033545               18
3       0.162083      0.005665         0.012420        0.004788                 200                        5                     3                   3               20  {'n_estimators': 200, 'min_samples_split': 5, ...           0.586586           0.638341           0.573437           0.626793           0.636608         0.612353        0.027021               14
4       0.166659      0.003026         0.010958        0.000084                 200                       10                     4                   3               15  {'n_estimators': 200, 'min_samples_split': 10,...           0.633305           0.679161           0.623236           0.661864           0.670481         0.653609        0.021636               13

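The table shows the first five of the 20 tested configurations together with their cross-validation scores. The rank_test_score column indicates how each configuration ranks among all tested models (1 is the best).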
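To list the best configurations rather than the first rows, we can sort the results by rank. The sketch below is self-contained and therefore runs on synthetic data from make_regression instead of the house price features; the reduced parameter ranges, n_iter, cv, and random_state values are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data as a stand-in for the house price features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={
        "max_leaf_nodes": [2, 4, 7],
        "max_depth": [5, 10],
        "n_estimators": [10, 20],
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# Sort all tested configurations by their rank (1 = best mean test score)
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
top = results[["params", "mean_test_score", "rank_test_score"]].head(3)
print(top)
```

Setting random_state on both the estimator and the search makes the run repeatable, which the tutorial code above does not do. Its scores will therefore differ slightly from run to run.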

Step #5 Select the best Model and Measure Performance

Finally, we will choose the best model from the search results using the “best_estimator_” attribute. We then calculate the MAE and the MAPE to understand how the model performs on the overall test dataset. First, we print a comparison between actual sale prices and predicted sale prices.

# Select the best Model and Measure Performance
best_model = grid_results.best_estimator_
y_pred = best_model.predict(x_test)
y_df = pd.DataFrame(y_test)
y_df['PredictedPrice']=y_pred
y_df.head()

	SalePrice	PredictedPrice
529	200624		166037.831002
491	133000		135860.757958
459	110000		123030.336177
279	192000		206488.444327
655	88000		130453.604206

Next, let’s take a look at the prediction errors.

# Mean Absolute Error (MAE)
# scikit-learn metrics expect the actual values first, then the predictions
MAE = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')

Mean Absolute Error (MAE): 29591.56 
Mean Absolute Percentage Error (MAPE): 15.57 %

On average, the model deviates from the actual value by 16 %. Considering we only used a fraction of the available features and defined a small search space, there is much room for improvement.
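Both metrics are simple averages, which the following sketch recomputes by hand for the five actual and predicted prices shown in the comparison table above (predictions rounded to two decimals). It is only an illustration of the formulas, not a replacement for evaluating the full test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# The five actual and predicted prices from the comparison table (rounded)
y_true = np.array([200624, 133000, 110000, 192000, 88000], dtype=float)
y_hat = np.array([166037.83, 135860.76, 123030.34, 206488.44, 130453.60])

# MAE: average absolute deviation in the unit of the target
mae_manual = np.mean(np.abs(y_true - y_hat))

# MAPE: average absolute deviation relative to the actual value
mape_manual = np.mean(np.abs(y_true - y_hat) / np.abs(y_true))

# scikit-learn expects the actual values first, then the predictions
mae_sklearn = mean_absolute_error(y_true, y_hat)
mape_sklearn = mean_absolute_percentage_error(y_true, y_hat)

print(round(mae_manual, 2), round(mape_manual * 100, 2))
```

The MAE does not care about the argument order, but the MAPE does, because it divides by whatever is passed as the first argument. Swapping the two arrays would express the error relative to the predictions instead of the actual prices.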

Summary

This article has shown how we can use random search in Python to efficiently search for a good hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how random search tries out random permutations of a predefined parameter grid instead of testing all of them. The second part was a Python hands-on tutorial, in which you learned to use random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price for houses depending on several characteristics. Then we defined parameter ranges and tested random permutations. In this way, we quickly identified a well-performing configuration without training every possible model variant.

Remember that a random search efficiently identifies a good-performing model but does not necessarily return the best-performing one. Random search can be used to tune the hyperparameters of both regression and classification models.


I hope this article was helpful. If you have any questions or suggestions, please write them in the comments.


The post Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python appeared first on relataly.com.

How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?

Florian Follonier — Fri, 31 Dec 2021 17:37:00 +0000

Have you ever received a spam email and wondered how your email provider was able to identify it as spam? Well, the answer is likely machine learning! One common type of machine learning problem is called classification. The goal is to predict the correct class labels for a given set of observations. For example, we could train a classifier to identify whether an email is spam or not or to classify images of animals into different species. But before we can use a classifier in a real-world setting, we need to evaluate its performance to understand how well it can correctly classify observations. There are several tools and techniques we can use to do this, including the confusion matrix, error metrics, and the ROC curve. In this article, we’ll dive into these evaluation methods and see how they can help us understand the capabilities of our classifier.

This tutorial is divided into two parts: a conceptual introduction to evaluating classification performance and a hands-on example using Python and Scikit-Learn. In the first part, we will discuss some of the common error metrics that are used to evaluate the performance of a classifier. This includes the confusion matrix, error metrics, and the ROC curve. The second part of the tutorial is hands-on. We use Python and Scikit-Learn to build a breast cancer detection model classifying tissue samples as benign or malignant. We then apply various techniques to evaluate the model’s performance.

Models can be wrong, but we should know how often they are. Image created with Midjourney.

Why even bother Measuring Classification Performance?

Measuring classification performance in machine learning is important because it allows us to evaluate how well a model is able to predict the class of a given input accurately. This is important because the ultimate goal of many machine learning models is to make accurate predictions in real-world applications.

There are several reasons why it is important to measure classification performance. First, by measuring performance, we can determine whether a model is able to make accurate predictions. If a model cannot make accurate predictions, it may not be useful for the task it was designed for. Second, by measuring performance, we can compare the performance of different models and choose the best one for a given task. This can be especially important when working with large, complex datasets where multiple models may be applicable.

In order to measure classification performance, we need to use a performance metric appropriate for the task at hand. Next, let’s understand what this means.

Example confusion matrix of a two-class classifier

Techniques for Measuring Classification Performance

This first part of the tutorial presents essential techniques for measuring the performance of classification models, including the confusion matrix, error metrics, and ROC curves. But why are there so many different techniques? Isn’t it enough to calculate the ratio of correct to false classifications?

The answer depends on the balance of the class labels and their importance. Let’s compare a simple two-class case vs. a more complex one. In the most simple case, the following applies:

The class labels in the sample are perfectly balanced (for example, 50 positives and 50 negatives).
Both class labels are equally important, so it does not matter if the model is better at predicting class one or two.

In this case, we can measure the model performance as the share of correctly predicted labels among all predictions. It is as simple as that. However, most classification problems are more complex:

The class labels are imbalanced, so the model encounters one class more often than the other.
One class is more important than the other. For example, consider a binary classification problem that aims to identify the few positive cases from a sample with many negative ones. Especially in disease detection, it is crucial that the model correctly identifies the few positive cases, even if some of the observations classified as positive are negative.

The confusion matrix and error metrics help us objectively evaluate models built for such more complex problems.

The Confusion Matrix

A confusion matrix is an essential tool for evaluating a classification model. The confusion matrix is a table with four combinations of predicted and actual values for a problem where the output may include two classes (negative and positive). As a result, each prediction falls into one of the following four squares:

True Positives (TP): The model predicted a positive value, and the actual value is also positive.
False Positives (FP): The model predicted a positive value, but the actual value is negative.
True Negatives (TN): The model predicted a negative value, and the actual value is also negative.
False Negatives (FN): The model predicted a negative value, but the actual value is positive.

We can assign each classification to a cell in the matrix. The diagonal contains the correctly classified cases whose actual class matches the predicted class. All other cells outside the diagonal represent possible errors. Using the confusion matrix, you can see at a glance how well the model works and what errors it makes.

The confusion matrix is the basis for calculating various error metrics, which we will look at in more detail in the following section.

Confusion matrix
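The four cells can be counted with scikit-learn's confusion_matrix function. The labels below are made up for illustration (1 = positive, 0 = negative).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For labels [0, 1] scikit-learn returns [[TN, FP], [FN, TP]]
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = (int(v) for v in matrix.ravel())

print(matrix)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```

Note that scikit-learn places the actual classes in the rows and the predicted classes in the columns, with the negative class first. Other tools and textbooks may arrange the matrix differently.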

Metrics for Measuring Classification Errors

To objectively measure the performance of a classifier, we can count up the cases in the different squares and use this information to calculate essential error metrics, including Accuracy, Precision, Recall, F1-Score, and Specificity.

Precision

Precision tells us what share of the samples predicted as positive is actually positive. Mathematically, it is the number of True Positives divided by the sum of False Positives and True Positives.

In other words, it measures the ability of a classification model to identify the relevant data points without misclassifying too many irrelevant cases.

\[Precision = {TP \over FP + TP}\]

Accuracy

Accuracy tells us the share of all samples that were classified correctly. It is calculated as the sum of all correct classifications (True Positives and True Negatives) divided by the total number of classifications.

The usefulness of Accuracy ends when the class labels are imbalanced so that one class is underrepresented. The Accuracy can be misleading as it can become nearly 100% even if the classification model has not identified any of the data points in the underrepresented class. If your data is imbalanced, you should combine accuracy with the Recall.

\[Accuracy= {TP + TN \over TP + FN + FP + TN}\]

\[= {Correct Classifications \over Total Sample Size}\]
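A short example shows how Accuracy can mislead on imbalanced data. The sample below is made up: 95 negatives, 5 positives, and a hypothetical "classifier" that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up imbalanced sample: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5

# A useless "classifier" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}, Recall: {recall:.2f}")
```

The Accuracy of 95 % looks good, yet the Recall of 0 % reveals that the model did not find a single positive case.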

F1-Score

The F1-Score combines Precision and Recall into a single metric. It is calculated as the harmonic mean of Precision and Recall.

Because the harmonic mean is dominated by the smaller of the two values, a model only achieves a high F1-Score if both Precision and Recall are high. We can use this metric to compare the performance of two classifiers with different Recall and Precision.

\[F1Score = {2 * TP \over 2 * TP + FP + FN}\]

\[= {2 * Precision * Recall\over Precision + Recall}\]

Recall (Sensitivity)

Recall, sometimes called “Sensitivity,” measures the percentage of correctly classified positives among all actual positives. We calculate it as the number of True Positives divided by the sum of False Negatives and True Positives.

The Recall is particularly helpful if we deal with an imbalanced dataset, for example, when the goal is to identify a few critical cases among a large sample.

\[Recall= {TP \over FN + TP}\]

Specificity

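Specificity measures the share of actual negatives that the model correctly classified as negative. It is also called the True-Negative Rate and plays a vital role in the ROC Curve, because the False Positive Rate equals one minus the Specificity. We will look at the ROC Curve in more detail in the following section.

\[Specificity= {TN \over FP + TN}\]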

None of the five metrics alone is sufficient to measure the model performance. We, therefore, use different metrics in combination. Note the following rules:

If the classes in the dataset are balanced, measure performance using Accuracy.
If the dataset is imbalanced or one class is more important than the other, look at Recall and Precision.
For classification problems where you want to compare different models with different Recall and Precision, use the F1-Score.
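The five formulas above can be checked against scikit-learn in a few lines. The labels are made up for illustration; scikit-learn has no dedicated specificity function, so the sketch assumes the common workaround of computing Recall for the negative class.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four cells of the confusion matrix
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

# Apply the formulas from this section
precision = tp / (fp + tp)
recall = tp / (fn + tp)
specificity = tn / (fp + tn)
accuracy = (tp + tn) / (tp + fn + fp + tn)
f1 = 2 * precision * recall / (precision + recall)

# Specificity is the recall of the negative class
specificity_sklearn = recall_score(y_true, y_pred, pos_label=0)

print(precision, recall, round(specificity, 3), accuracy, f1)
```

The hand-computed values match precision_score, recall_score, accuracy_score, and f1_score from scikit-learn, which confirms that the formulas and the library agree.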

Decision Boundary

A classifier determines class labels by calculating the probabilities of samples falling into a particular category. Since the probabilities are continuous values between 0.0 and 1.0, we use a decision boundary (threshold) to convert them to class labels. The default threshold for a binary classifier is 0.5. Samples whose predicted probability for the positive class lies above 0.5 are assigned to the positive class, and samples below 0.5 to the negative class.

In practice, we often encounter classification problems where the cost of an error varies between class labels. In such cases, we can alter the decision boundary to give one of the classes a higher priority. Consider the case of credit card fraud detection. In this case, it is critical for service providers to reliably detect the few fraud cases among the many legitimate credit card transactions. We can lower the decision threshold to increase the probability that the model detects fraud (high True Positive rate). The cost of detecting more fraud is a higher number of legitimate transactions that the model misclassifies as fraud. However, in this particular example, this is acceptable because the service provider can quickly resolve misunderstandings with the customer.

Comparison of different decision boundaries (0.5 vs. 0.25 vs. 0.9) and illustration of the effects on the classification error and confusion matrix
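The effect of moving the threshold can be reproduced with a few lines of NumPy. The probabilities and labels below are hypothetical, and classify is a helper defined only for this sketch; the three thresholds are the ones compared in the illustration.

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class
proba = np.array([0.05, 0.20, 0.30, 0.45, 0.55, 0.70, 0.95])
y_true = np.array([0, 0, 1, 0, 1, 1, 1])

def classify(probabilities, threshold):
    """Assign the positive class (1) when the probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)

results = {}
for threshold in (0.25, 0.5, 0.9):
    y_pred = classify(proba, threshold)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    results[threshold] = (tp, fp, fn)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn}")
```

Lowering the threshold to 0.25 finds every positive case at the price of one false positive, while raising it to 0.9 avoids false positives but misses three of the four positives.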

The ROC Curve

The ROC curve is another helpful tool to measure classification performance and is particularly useful for comparing different classification models’ performance. ROC stands for “Receiver Operating Characteristic.” The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve emerges when we plot the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

The more the ROC curve tends to the upper left corner, the better the performance of the classification model. A perfect classifier would show a point in the upper left corner or coordinate (0,1), which is the ideal point for a diagnostic test. This is because a point at (0,1) indicates that the classifier has a 100% true positive rate and a 0% false positive rate. A curve near the diagonal indicates that the True Positive Rate and False Positive Rate are equal, which corresponds to the expected prediction result of a random classifier with no predictive power. If the ROC curve remains significantly below the diagonal, this indicates a classifier with inverse prediction power.

The ROC of a classification model is not necessarily a smooth curve; it often runs as a jagged line with several plateaus. Plateaus are ranges in which changes to the threshold do not change the classification results. Curves with plateaus can be a sign of a small sample size, but they may also have other causes.
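
To make the construction of the curve concrete, here is a minimal sketch that sweeps the threshold over every distinct score and records the resulting (FPR, TPR) pair. The labels and scores are made up, and roc_points is a hypothetical helper, not a scikit-learn function.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over all distinct scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above the highest score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical labels and classifier scores for six samples
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.5, 0.9]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

With only six samples, each step changes either the TPR or the FPR by one third, which is why the line looks like a staircase rather than a curve.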

Example of an ROC curve

Interpretation of the ROC curve

Measuring Classification Performance in Python (Two-Class)

In this tutorial, we will show how to implement various techniques for evaluating classification models using a breast cancer dataset and a random forest classifier in Python with Scikit-Learn. Abnormal changes in the breast may be a sign of cancer and need to be investigated. However, changes are not necessarily malignant and, in many cases, are benign. We will work with a breast cancer dataset and train a machine learning classifier to make this distinction (benign/malignant). We will use the model to predict the type of breast cancer based on various characteristics and explore how machine learning can be applied in the life sciences to support medical diagnostics. After training the model, we will use the Confusion Matrix, Error Metrics, and the ROC Curve to measure its performance.

View on GitHub Relataly Github Repo

About the Breast Cancer Dataset

The breast cancer dataset contains 569 samples, with 30 features derived from digitized images of tissue samples. The features describe characteristics of the cell nuclei present in the image, including size, texture, and symmetry. In addition, the dataset includes a binary target variable that indicates whether the sample is benign or malignant. 212 samples are malignant, and 357 are benign.

You can find more information on the dataset on the UCI.edu webpage. The breast cancer dataset is included in the scikit-learn package, so there is no need to download the data upfront.

Exemplary images of benign and malignant samples. Source: kurzweilai

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy
math
matplotlib
seaborn
scikit-learn

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Step #1 Loading the Data

We begin by loading the cancer dataset from scikit-learn. Then we display a list of the features and plot the balance of our classification target, the two tissue types. “1” is type “benign,” and 0 corresponds to type “malignant.”

# A tutorial for this file is available at www.relataly.com

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, RocCurveDisplay
from sklearn.model_selection import cross_val_predict
from sklearn import datasets

df = datasets.load_breast_cancer(as_frame=True)

df_dia = df.data
df_dia['cancer_type'] = df.target

plt.figure(figsize=(16,2))
plt.title(f'labels')
fig = sns.countplot(y="cancer_type", data=df_dia)

df_dia.head()

The barplot shows more benign observations among the sample than malignant ones.

Step #2 Data Preparation and Model Training

Next, we will prepare the data and use it for training a random decision forest classifier. It is important to remember that the performance of a classifier is dependent on the specific data it is trained on. Therefore, it is crucial to evaluate the classifier using a separate, unseen test dataset to avoid overfitting and ensure that the classifier generalizes well to new data. The code below therefore splits the data into train and test datasets.

# Select a small number of features that we use as input to the classification model
# (the three columns below are an illustrative choice from the breast cancer dataset)
features = ['mean radius', 'mean texture', 'mean smoothness']
df_base = df_dia[features + ['cancer_type']]

# Separate labels from training data
X = df_base[features] #Training data
y = df_base['cancer_type'] #Prediction label

# Split the data into train and test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)

Now that we have prepared the data, it is time to train our classifier. We use a random forest algorithm from the Scikit-learn package. If you want to learn more about this topic, check out the relataly tutorials on random forests.

# Create the Random Forest Classifier
dfrst = RandomForestClassifier(n_estimators=3, max_depth=4, min_samples_split=6, class_weight='balanced')
ranfor = dfrst.fit(X_train, y_train)
y_pred = ranfor.predict(X_test)

After running the code, you have a trained classifier.

Step #3 Creating a Confusion Matrix

Next, we will create the confusion matrix and several standard error metrics. First, we create the matrix by running the code below. Remember that the matrix will contain only the tabular data without any visualization. To illustrate the results in a heatmap, we first need to plot the matrix. We will use the heatmap function from the seaborn package for this task.

# Create heatmap from the confusion matrix
def createConfMatrix(class_names, matrix):
    tick_marks = [0.5, 1.5]
    fig, ax = plt.subplots(figsize=(7, 6))
    sns.heatmap(pd.DataFrame(matrix), annot=True, cmap="Blues", fmt='g')
    ax.xaxis.set_label_position("top")
    plt.title('Confusion matrix')
    plt.ylabel('Actual label'); plt.xlabel('Predicted label')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
createConfMatrix(matrix=cnf_matrix, class_names=[0, 1])

The confusion matrix shows the following: In 93 samples, the model correctly predicted a malignant label, and in 181 cases it correctly predicted that the tissue sample was benign. In 3 cases, the model failed to recognize a malignant sample, and in 8 cases the model raised a false alarm.

Next, we calculate the error metrics (accuracy, precision, recall, f1-score). You can do this by using the separate functions from the Scikit-learn package. Alternatively, you can also use the classification report, which contains all these error metrics.

# Calculate Standard Error Metrics
print('accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))

# Classification Report (Alternative)
results_log = classification_report(y_test, y_pred, output_dict=True)
results_df_log = pd.DataFrame(results_log).transpose()
print(results_df_log)

accuracy: 0.94 
precision: 0.97 
recall: 0.94
f1_score: 0.95
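
To see how these four metrics follow from the cells of a confusion matrix, here is a minimal sketch in plain Python. The counts are hypothetical and are not the results of the model above; classification_metrics is an illustrative helper, not part of Scikit-learn.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the standard error metrics from the four cells of a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are truly positive
    recall = tp / (tp + fn)      # share of actual positives that the model found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, chosen only for illustration
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(f"accuracy: {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall: {recall:.2f}")
print(f"f1_score: {f1:.2f}")
```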

Step #4 ROC and AUC

Finally, let’s calculate the ROC and the Area under the Curve (AUC).

# Compute ROC curve
fig, ax = plt.subplots(figsize=(10, 6))
RocCurveDisplay.from_estimator(ranfor, X_test, y_test, ax=ax)
plt.title('ROC Curve for the Breast Cancer Classifier')
plt.show()

The ROC curve tells us that the model already performs quite well. However, we want to quantify this. By running the code below, you can calculate the AUC.

# Calculate probability scores 
y_scores = cross_val_predict(ranfor, X_test, y_test, cv=3, method='predict_proba')
# Each row of y_scores holds [P(class 0), P(class 1)]; here we convert the probabilities into hard labels (1 if P(class 0) < 0.5)
y_scores_binary = [1 if x[0] < 0.5 else 0 for x in y_scores]
# Now, we can calculate the area under the ROC curve
auc = roc_auc_score(y_test, y_scores_binary, average="macro")
auc # Be aware that the random forest is refit without a fixed random_state in each cross-validation fold, so the result will change when you run the code

0.9035191562634525
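
The AUC is usually computed on continuous scores rather than on hard labels, because thresholding discards the ranking information within each class. The following minimal sketch illustrates the difference with made-up values. It uses the pairwise interpretation of the AUC (the share of positive/negative pairs in which the positive sample receives the higher score); auc_pairwise is a hypothetical helper, not a Scikit-learn function.

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs in which the positive sample scores higher; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for class 1
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
hard_labels = [1 if p >= 0.5 else 0 for p in proba]

auc_proba = auc_pairwise(y_true, proba)        # uses the full ranking
auc_hard = auc_pairwise(y_true, hard_labels)   # many pairs become ties
print(round(auc_proba, 3), round(auc_hard, 3))
```

In the tutorial code, the probability-based variant would correspond to passing the second column of y_scores (the probability of class 1) to roc_auc_score instead of the binarized list.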

Summary

This tutorial has shown how to evaluate the performance of a two-label classification model. We started by introducing the concept of the confusion matrix and how it can be used to evaluate the performance of a classifier. We then discussed various error metrics, such as accuracy, precision, and recall, and how we can use them to gain a better understanding of the classifier’s performance. Next, we discussed the ROC curve and how it can be used to visualize the trade-off between the true positive rate and the false positive rate for different thresholds of the classifier. We also discussed how we could use the ROC curve to compare the performance of different classifiers. In the second part, we applied the different tools and techniques to the practical example of a breast cancer classifier. We used the confusion matrix and error metrics to evaluate the classifier and the ROC curve to assess its performance.

Overall, this tutorial has provided an overview of the tools and techniques that are commonly used to evaluate the performance of a classification model. By understanding and applying these tools and techniques, we can gain a better understanding of how well a classifier is performing and make informed decisions about whether it is ready for production.

I hope this article helped you understand how to measure the performance of classification models. If you have any questions or feedback, please let me know. And if you are looking for error metrics to measure regression performance, check out this tutorial on regression errors.

The post How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn? appeared first on relataly.com.

Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud

Florian Follonier — Wed, 16 Jun 2021 09:33:00 +0000

Credit card fraud has become one of the most common use cases for anomaly detection systems. The number of fraud attempts has risen sharply, resulting in billions of dollars in losses. Early detection of fraud attempts with machine learning is therefore becoming increasingly important. In this article, we take on the fight against international credit card fraud and develop a multivariate anomaly detection model in Python that spots fraudulent payment transactions. The model will use the Isolation Forest algorithm, one of the most effective techniques for detecting outliers. Isolation Forests are so-called ensemble models. They have various hyperparameters with which we can optimize model performance. However, we will not do this manually but instead, use grid search for hyperparameter tuning. To assess the performance of our model, we will also compare it with other models.

The remainder of this article is structured as follows: We start with a brief introduction to anomaly detection and look at the Isolation Forest algorithm. Equipped with these theoretical foundations, we then turn to the practical part, in which we train and validate an isolation forest that detects credit card fraud. We use an unsupervised learning approach, where the model learns to distinguish regular from suspicious card transactions. We will train our model on a public dataset from Kaggle that contains credit card transactions. Finally, we will compare the performance of our model against two nearest neighbor algorithms (LOF and KNN).

Financial crime is a real problem, although cats are relatively seldom involved. Image created with Midjourney

Multivariate Anomaly Detection

Before we take a closer look at the use case and our unsupervised approach, let’s briefly discuss anomaly detection. Anomaly detection deals with finding points that deviate from the bulk of the legitimate data, for example, with respect to the mean or median of their distribution. In machine learning, the term is often used synonymously with outlier detection.

Some anomaly detection models work with a single feature (univariate data), for example, in monitoring electronic signals. However, most anomaly detection models use multivariate data, which means they have two (bivariate) or more (multivariate) features. They find a wide range of applications, including the following:

Predictive Maintenance and Detection of Malfunctions and Decay
Detection of Retail Bank Credit Card Fraud
Detection of Pricing Errors
Cyber Security, for example, Network Intrusion Detection
Detecting Fraudulent Market Behavior in Investment Banking

Identifying anomalous data points in two dimensions in credit card fraud detection

Unsupervised Algorithms for Anomaly Detection

Outlier detection is a classification problem. However, the field is more diverse as outlier detection is a problem we can approach with supervised and unsupervised machine learning techniques. It would go beyond the scope of this article to explain the multitude of outlier detection techniques. Still, the following chart provides a good overview of standard algorithms that learn unsupervised.

Unsupervised Algorithms for Anomaly Detection.

A prerequisite for supervised learning is that we have information about which data points are outliers and which belong to regular data. In credit card fraud detection, this information is available because banks can validate with their customers whether a suspicious transaction is a fraud or not. In many other outlier detection cases, it remains unclear which outliers are legitimate and which are just noise or other uninteresting events in the data.

Whether we know which classes in our dataset are outliers and which are not affects the selection of possible algorithms we could use to solve the outlier detection problem. Unsupervised learning techniques are a natural choice if the class labels are unavailable. And if the class labels are available, we could use both unsupervised and supervised learning algorithms.

In the following, we will focus on Isolation Forests.

The Isolation Forest (“iForest”) Algorithm

Isolation forests (sometimes called iForests) are among the most powerful techniques for identifying anomalies in a dataset. They belong to the group of so-called ensemble models. The predictions of ensemble models do not rely on a single model. Instead, they combine the results of multiple independent models (decision trees). Nevertheless, isolation forests should not be confused with traditional random decision forests. While random forests predict given class labels (supervised learning), isolation forests learn to distinguish outliers from inliers (regular data) in an unsupervised learning process.

An Isolation Forest contains multiple independent isolation trees. The algorithm invokes a process that recursively divides the training data at random points to isolate data points from each other to build an Isolation Tree. The number of partitions required to isolate a point tells us whether it is an anomalous or regular point. The underlying assumption is that random splits can isolate an anomalous data point much sooner than nominal ones.

Isolation Tree and Isolation Forest (Tree Ensemble)

How the Isolation Forest Algorithm Works

The illustration below shows exemplary training of an Isolation Tree on univariate data, i.e., with only one feature. The algorithm has already split the data at five random points between the minimum and maximum values of a random sample. The isolated points are colored in purple. The example below has taken two partitions to isolate the point on the far left. The other purple points were separated after 4 and 5 splits.

The partitioning process ends when the algorithm has isolated all points from each other or when all remaining points have equal values. The algorithm has calculated and assigned an outlier score to each point at the end of the process, based on how many splits it took to isolate it.

When using an isolation forest model on unseen data to detect outliers, the algorithm will assign an anomaly score to the new data points. These scores will be calculated based on the ensemble trees we built during model training.
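
The step from "number of splits" to an anomaly score can be sketched as follows. The sketch assumes the scoring formula from the original Isolation Forest paper (Liu, Ting, and Zhou, 2008), in which the average path length of a point is normalized by c(n), the average path length of an unsuccessful search in a binary search tree with n points. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used to approximate harmonic numbers

def average_path_length(n):
    """c(n): normalization term for a sample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Score in (0, 1]: values near 1 indicate anomalies, values well below 0.5 indicate regular points."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # size of the subsample used to build each tree (assumed here)
print(anomaly_score(2.0, n))    # isolated after very few splits
print(anomaly_score(12.0, n))   # isolated only after many splits
```

Note that, as far as its documentation states, Scikit-learn reports scores with the opposite sign, so that lower values indicate more abnormal points.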

So how does this process work when our dataset involves multiple features? For multivariate anomaly detection, partitioning the data remains almost the same. The main difference is that, before each split, the algorithm randomly selects the feature along which the split will occur. Consequently, multivariate isolation forests split the data along multiple dimensions (features).

Exemplary partitioning process of an isolation tree (5 Steps)
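
The partitioning idea can be tried out in plain Python. The following is a simplified sketch under stated assumptions: it works on synthetic two-dimensional data, repeats the random partitioning separately for each queried point instead of building reusable trees, and skips the subsampling that a real Isolation Forest performs. All names are hypothetical.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))        # pick a random feature
    values = [row[feature] for row in data]
    low, high = min(values), max(values)
    if low == high:                            # simplification: stop if this feature cannot be split
        return depth
    split = rng.uniform(low, high)             # pick a random split value
    # Keep only the side of the split that contains the point
    if point[feature] < split:
        subset = [row for row in data if row[feature] < split]
    else:
        subset = [row for row in data if row[feature] >= split]
    return isolation_depth(point, subset, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Synthetic data: a cluster of regular points plus one far-away point
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point closest to the cluster center
data = cluster + [outlier]

print("mean depth of outlier:", mean_depth(outlier, data))
print("mean depth of inlier:", mean_depth(inlier, data))
```

The far-away point is separated after only a few random splits on average, while the point in the middle of the cluster needs considerably more.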

Credit Card Fraud Detection using Isolation Forests

Monitoring transactions has become a crucial task for financial institutions. In 2019 alone, more than 271,000 cases of credit card theft were reported in the U.S., causing billions of dollars in losses and making credit card fraud one of the most common types of identity theft. The vast majority of fraud cases are attributable to organized crime, which often specializes in this particular crime.

Anything that deviates from the customer’s normal payment behavior can make a transaction suspicious, including an unusual location, time, or country in which the customer conducted the transaction. Credit card providers use anomaly detection systems to monitor their customers’ transactions and look for potential fraud attempts. They can halt the transaction and inform their customer as soon as they detect a fraud attempt. In the following, we train an Isolation Forest for credit card fraud detection using Python.

Now that we have established the context for our machine learning problem, we can begin implementing an anomaly detection model in Python.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Each application of a credit card creates a new data point to review. Image created with Midjourney.

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, consider the Anaconda Python environment. To set it up, you can follow the steps in this tutorial.

Also, make sure you install all required packages. In this tutorial, we will be working with the standard packages pandas, NumPy, and Matplotlib.

In addition, we will be using the machine learning library Scikit-learn and Seaborn for visualization.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Dataset: Credit Card Transactions

In the following, we will work with a public dataset containing anonymized credit card transactions made by European cardholders in September 2013. You can download the dataset from Kaggle.com.

The dataset contains 28 features (V1-V28) obtained from the source data using Principal Component Analysis (PCA). In addition, the data includes the time of each transaction (in seconds elapsed since the first transaction in the dataset) and the transaction amount.

Transactions are labeled fraudulent or genuine, with 492 fraudulent cases out of 284,807 transactions. The positive class (frauds) accounts for only 0.172% of all credit card transactions, so the classes are highly unbalanced.
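
A quick calculation with the figures above shows why such an imbalance makes plain accuracy a poor yardstick. This is a small sketch; the "always genuine" model is a hypothetical baseline used only for illustration.

```python
total_transactions = 284_807
fraud_cases = 492

# Share of the positive class (fraud)
fraud_share = fraud_cases / total_transactions

# Accuracy of a trivial "model" that labels every transaction as genuine
naive_accuracy = (total_transactions - fraud_cases) / total_transactions

print(f"fraud share: {fraud_share:.4%}")
print(f"accuracy when always predicting 'genuine': {naive_accuracy:.4%}")
```

A model that never flags a single fraud case would still be right in more than 99.8% of all cases, which is why we later look at the confusion matrix rather than at accuracy alone.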

Step #1: Load the Data

In the following, we will go through several steps of training an anomaly detection model for credit card fraud. We will carry out several activities, such as:

Loading and preprocessing the data: this involves cleaning, transforming, and preparing the data for analysis, in order to make it suitable for use with the isolation forest algorithm.
Feature engineering: this involves extracting and selecting relevant features from the data, such as transaction amounts, merchant categories, and time of day, in order to create a set of inputs for the anomaly detection algorithm.
Model training: We will train several machine learning models on different algorithms (incl. the isolation forest) on the preprocessed and engineered data. The models will learn the normal patterns and behaviors in credit card transactions. This activity includes hyperparameter tuning.
Model evaluation and testing: this involves evaluating the performance of the trained model on a test dataset in order to assess its accuracy, precision, recall, and other metrics and to identify any potential issues or improvements. As part of this activity, we compare the performance of the isolation forest to other models.

We begin by setting up imports and loading the data into our Python project. Then we’ll quickly verify that the dataset looks as expected.

import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from datetime import date, timedelta, datetime
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor, KNeighborsClassifier
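# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay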

# The Data can be downloaded from Kaggle.com: https://www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv
path = 'data/credit-card-transactions/'
df = pd.read_csv(f'{path}creditcard.csv')
df_base = df.copy()  # working copy used in the following steps
df

		Time	V1			V2			V3			V4			V5			V6			V7			V8			V9			...	V21	V22	V23	V24	V25	V26	V27	V28	Amount	Class
0		0.0		-1.359807	-0.072781	2.536347	1.378155	-0.338321	0.462388	0.239599	0.098698	0.363787	...	-0.018307	0.277838	-0.110474	0.066928	0.128539	-0.189115	0.133558	-0.021053	149.62	0
1		0.0		1.191857	0.266151	0.166480	0.448154	0.060018	-0.082361	-0.078803	0.085102	-0.255425	...	-0.225775	-0.638672	0.101288	-0.339846	0.167170	0.125895	-0.008983	0.014724	2.69	0
2		1.0		-1.358354	-1.340163	1.773209	0.379780	-0.503198	1.800499	0.791461	0.247676	-1.514654	...	0.247998	0.771679	0.909412	-0.689281	-0.327642	-0.139097	-0.055353	-0.059752	378.66	0
3		1.0		-0.966272	-0.185226	1.792993	-0.863291	-0.010309	1.247203	0.237609	0.377436	-1.387024	...	-0.108300	0.005274	-0.190321	-1.175575	0.647376	-0.221929	0.062723	0.061458	123.50	0
4		2.0		-1.158233	0.877737	1.548718	0.403034	-0.407193	0.095921	0.592941	-0.270533	0.817739	...	-0.009431	0.798278	-0.137458	0.141267	-0.206010	0.502292	0.219422	0.215153	69.99	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
284802	172786.0	-11.881118	10.071785	-9.834783	-2.066656	-5.364473	-2.606837	-4.918215	7.305334	1.914428	...	0.213454	0.111864	1.014480	-0.509348	1.436807	0.250034	0.943651	0.823731	0.77	0
284803	172787.0	-0.732789	-0.055080	2.035030	-0.738589	0.868229	1.058415	0.024330	0.294869	0.584800	...	0.214205	0.924384	0.012463	-1.016226	-0.606624	-0.395255	0.068472	-0.053527	24.79	0
284804	172788.0	1.919565	-0.301254	-3.249640	-0.557828	2.630515	3.031260	-0.296827	0.708417	0.432454	...	0.232045	0.578229	-0.037501	0.640134	0.265745	-0.087371	0.004455	-0.026561	67.88	0
284805	172788.0	-0.240440	0.530483	0.702510	0.689799	-0.377961	0.623708	-0.686180	0.679145	0.392087	...	0.265245	0.800049	-0.163298	0.123205	-0.569159	0.546668	0.108821	0.104533	10.00	0
284806	172792.0	-0.533413	-0.189733	0.703337	-0.506271	-0.012546	-0.649617	1.577006	-0.414650	0.486180	...	0.261057	0.643078	0.376777	0.008797	-0.473649	-0.818267	-0.002415	0.013649	217.00	0

If the output looks like this, we can continue.

Step #2: Data Exploration

The purpose of data exploration in anomaly detection is to gain a better understanding of the data and the underlying patterns and trends that it contains. This can help to identify potential anomalies or outliers in the data and to determine the appropriate approaches and algorithms for detecting them.

In the following, we will create histograms that visualize the distribution of the different features.

2.1 Features

First, we will create a series of frequency histograms for our dataset’s features (V1 – V28). We will look at Class, Time, and Amount separately later, so we can drop them for the moment.

# create histograms on all features
df_hist = df_base.drop(['Time', 'Amount', 'Class'], axis=1)
df_hist.hist(figsize=(20,20), bins = 50, color = "c", edgecolor='black')
plt.show()

Next, we will look at the correlation between the 28 features. We expect the features to be uncorrelated due to the use of PCA. Let’s verify that by creating a heatmap on their correlation values.

# feature correlation
f_cor = df_hist.corr()
sns.heatmap(f_cor)

As we expected, our features are uncorrelated.

2.2 Class Labels

Next, let’s print an overview of the class labels to understand better how balanced the two classes are.

# Plot the balance of class labels
fig1, ax1 = plt.subplots(figsize=(14, 7))
plt.pie(df[['Class']].value_counts(), explode=[0,0.1], labels=[0,1], autopct='%1.2f%%', shadow=True, startangle=45)

We see that the data set is highly unbalanced. While this would constitute a problem for traditional classification techniques, it is an ideal use case for outlier detection algorithms like the Isolation Forest.

2.3 Time and Amount

Finally, we will create some plots to gain insights into time and amount. Let’s first have a look at the time variable.

# Plot distribution of the Time variable, which contains transaction data for two days
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(14, 7))
sns.histplot(data=df_base[df_base['Class'] == 0], x='Time', kde=True, ax=ax[0])
sns.histplot(data=df_base[df_base['Class'] == 1], x='Time', kde=True, ax=ax[1])
plt.show()

The time frame of our dataset covers two days, which the distribution graph reflects well. We can see that most transactions happen during the day, which is plausible.
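
The two-day span can be checked with simple arithmetic. The sketch below assumes, as the Kaggle dataset description states, that Time holds the seconds elapsed since the first transaction in the dataset. The helper hour_of_day is hypothetical, and because the clock time of the first transaction is not part of the data, the resulting hour is only relative to the start of the recording.

```python
SECONDS_PER_HOUR = 3600

def hour_of_day(seconds_since_start):
    """Hour (0-23) relative to the first transaction in the dataset."""
    return int(seconds_since_start // SECONDS_PER_HOUR) % 24

last_timestamp = 172792.0  # Time value of the last row shown in Step #1
hours_covered = last_timestamp / SECONDS_PER_HOUR

print(f"hours covered: {hours_covered:.1f}")
print(f"relative hour of the last transaction: {hour_of_day(last_timestamp)}")
```

Applied to the whole column, such a helper would let you plot transactions per hour instead of per second.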

Next, let’s examine the correlation between transaction size and fraud cases. To do this, we create a scatterplot that distinguishes between the two classes.

# Plot time against amount
fig, ax = plt.subplots(figsize=(20, 7))
ax.set(xlim=(0, 1500))
# Refer to the columns by name so that x and y always match the rows of the filtered data
sns.scatterplot(data=df_base[df_base['Class']==0][::15], x='Amount', y='Time', hue="Class", palette=["#BECEE9"], alpha=.5, ax=ax)
sns.scatterplot(data=df_base[df_base['Class']==1][::15], x='Amount', y='Time', hue="Class", palette=["#EF1B1B"], zorder=100, ax=ax)
plt.legend(['no fraud', 'fraud'], loc='lower right')
fig.suptitle('Transaction Amount over Time split by Class')

Transaction Amount over Time split by Class

The scatterplot provides the insight that fraudulent transactions tend to involve relatively low amounts. In other words, there is some inverse correlation between class and transaction amount.

Step #3: Preprocessing

Now that we have a rough idea of the data, we will prepare it for training the model. For the training of the isolation forest, we drop the class label from the base dataset and then divide the data into separate datasets for training (70%) and testing (30%). We do not have to normalize or standardize the data when using a decision tree-based algorithm.

We will use all features from the dataset. So our model will be a multivariate anomaly detection model.

# Separate the classes from the train set
df_classes = df_base['Class']
df_train = df_base.drop(['Class'], axis=1)

# split the data into train and test 
X_train, X_test, y_train, y_test = train_test_split(df_train, df_classes, test_size=0.30, random_state=42)

Step #4: Model Training

Once we have prepared the data, it’s time to start training the Isolation Forest. However, to compare the performance of our model with other algorithms, we will train several different models. In total, we will prepare and compare the following five outlier detection models:

Isolation Forest (default)
Isolation Forest (hypertuned)
Local Outlier Factor (default)
K Nearest Neighbour (default)
K Nearest Neighbour (hypertuned)

For hyperparameter tuning of the models, we use Grid Search.
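
Grid Search evaluates every combination of the candidate values we specify. The following minimal sketch shows how a grid expands into individual configurations. The parameter names exist on Scikit-learn's IsolationForest, but the candidate values are hypothetical and chosen only for illustration; expand_grid is an illustrative helper.

```python
from itertools import product

# Hypothetical candidate values for three IsolationForest hyperparameters
param_grid = {
    "n_estimators": [100, 200],
    "max_samples": [256, 512],
    "contamination": [0.001, 0.002, 0.005],
}

def expand_grid(grid):
    """Return one dictionary per combination of candidate values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = expand_grid(param_grid)
print(len(candidates))   # 2 * 2 * 3 combinations
print(candidates[0])
```

Since the number of combinations is the product of the list lengths, the training effort grows quickly with every additional hyperparameter or candidate value.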

4.1 Train an Isolation Forest

Next, we train our isolation forest algorithm. An isolation forest is a machine learning algorithm for anomaly detection. Like the random forest, a widely used ensemble learning method, it combines multiple decision trees to make predictions.

The isolation forest algorithm works by randomly selecting a feature and a split value for that feature and then using the split value to divide the data into two subsets. Each tree repeats this step recursively on the resulting subsets, and the results of all trees in the ensemble are combined to make a final prediction.

The isolation forest algorithm is designed to be efficient and effective for detecting anomalies in high-dimensional datasets. It has a number of advantages, such as its ability to handle large and complex datasets, and it often achieves high accuracy with a low false positive rate. It is widely used in a variety of applications, such as fraud detection, intrusion detection, and anomaly detection in manufacturing.
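To make the isolation idea more tangible, here is a minimal, dependency-free sketch. It is a simplified illustration and not scikit-learn's implementation: the helper names isolation_depth and mean_depth and the toy data are assumptions made for this example, and subsampling and score normalization are left out. It counts how many random splits are needed to separate a point from the rest of the data.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))          # pick a random feature
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:                                 # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                  # pick a random split value
    # keep only the rows that fall on the same side of the split as `point`
    if point[feature] < split:
        same_side = [row for row in data if row[feature] < split]
    else:
        same_side = [row for row in data if row[feature] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy data: one dense cluster plus a single far-away point
rng = random.Random(42)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
outlier = (8.0, 8.0)
data = normal + [outlier]
inlier = normal[0]

print("mean depth outlier:", mean_depth(outlier, data))
print("mean depth inlier: ", mean_depth(inlier, data))
```

The far-away point is separated after only a few splits, while a point inside the cluster needs many more. The isolation forest turns this difference in path length into an anomaly score.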

4.1.1 Isolation Forest (baseline)

First, we train a baseline model. A baseline model is a simple or reference model used as a starting point for evaluating the performance of more complex or sophisticated models in machine learning. It provides a baseline or benchmark for comparison, which allows us to assess the relative performance of different models and to identify which models are more accurate, effective, or efficient.

# train the model on the nominal train set
model_isf = IsolationForest().fit(X_train)

We create a function to measure the performance of our baseline model and illustrate the results in a confusion matrix. Later, when we go into hyperparameter tuning, we can use this function to objectively compare the performance of more sophisticated models.

def measure_performance(model, X_test, y_true, map_labels):
    # predict on the test set
    df_pred_test = X_test.copy()
    df_pred_test['Pred'] = model.predict(X_test)
    if map_labels:
        # outlier detectors return 1 (inlier) and -1 (outlier); map to 0 = regular, 1 = fraud
        df_pred_test['Pred'] = df_pred_test['Pred'].map({1: 0, -1: 1})

    # measure performance; scikit-learn expects the order (y_true, y_pred)
    y_pred = df_pred_test['Pred']
    matrix = confusion_matrix(y_true, y_pred)

    # rows of the matrix are the actual classes, columns the predicted classes
    sns.heatmap(matrix,
                xticklabels=['Regular [0]', 'Fraud [1]'], 
                yticklabels=['Regular [0]', 'Fraud [1]'], 
                annot=True, fmt="d", linewidths=.5, cmap="YlGnBu")
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    
    print(classification_report(y_true, y_pred))
    
    # score is assumed to be sklearn's precision_recall_fscore_support,
    # which returns (precision, recall, f1, support)
    model_score = score(y_true, y_pred, average='macro')
    print(f'f1_score: {np.round(model_score[2]*100, 2)}%')
    
    return model_score

model_name = 'Isolation Forest (baseline)'
print(f'{model_name} model')

map_labels = True
model_score = measure_performance(model_isf, X_test, y_test, map_labels)

# model_score holds (precision, recall, f1, support); DataFrame.append was removed in pandas 2.0
performance_df = pd.DataFrame([{'model_name': model_name, 
                                'f1_score': model_score[2], 
                                'precision': model_score[0], 
                                'recall': model_score[1]}])
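To make the three metrics stored in performance_df concrete, the following dependency-free sketch computes macro-averaged precision, recall, and F1 by hand. The function name macro_scores and the toy labels are assumptions made for this illustration. The sketch also shows the label mapping used for outlier detectors, where -1 (outlier) becomes 1 (fraud) and 1 (inlier) becomes 0 (regular).

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Return macro-averaged (precision, recall, f1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # macro averaging: every class counts equally, no matter how rare it is
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Raw predictions in the outlier-detector convention: 1 = inlier, -1 = outlier
raw_pred = [1, 1, -1, 1, -1, 1, 1, -1]
y_pred = [1 if p == -1 else 0 for p in raw_pred]   # map to 0 = regular, 1 = fraud
y_true = [0, 0, 1, 0, 0, 0, 1, 1]

precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the macro average gives the rare fraud class the same weight as the regular class, it is a stricter yardstick on imbalanced data than plain accuracy.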

4.1.2 Isolation Forest (Hypertuning)

Next, we will train another Isolation Forest Model using grid search hyperparameter tuning to test different parameter configurations.

The hyperparameters of an isolation forest include:

n_estimators: The number of decision trees in the forest.
max_samples: The number of samples to draw from the dataset to train each decision tree.
contamination: The expected proportion of anomalies in the dataset.
max_features: The number of features to draw from the dataset to train each decision tree.
bootstrap: Whether or not to use bootstrap sampling when drawing samples to train the decision trees.

These hyperparameters can be adjusted to improve the performance of the isolation forest. The optimal values for these hyperparameters will depend on the specific characteristics of the dataset and the task at hand, which is why we require several experiments.
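Grid search simply tries every combination of the candidate values. As a rough sketch of what that means for the search space used in this section, the snippet below enumerates the combinations with the standard library. The helper expand_grid is a hypothetical name used for this illustration only; GridSearchCV performs the enumeration internally.

```python
from itertools import product

# Same candidate values as in the parameter grid of this section
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [1.0, 5, 10],
    "bootstrap": [True],
}

def expand_grid(grid):
    """Yield one dict per combination of the candidate values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

candidates = list(expand_grid(param_grid))
n_folds = 3  # cv=3, as in the grid search of this section

print(f"{len(candidates)} configurations x {n_folds} folds = "
      f"{len(candidates) * n_folds} model fits")
for params in candidates:
    print(params)
```

The number of model fits grows multiplicatively with every additional hyperparameter, which is why the grid is kept small here.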

The code below will evaluate the different parameter configurations based on their f1_score and automatically choose the best-performing model.

# Define the parameter grid
n_estimators=[50, 100]
max_features=[1.0, 5, 10]
bootstrap=[True]
param_grid = dict(n_estimators=n_estimators, max_features=max_features, bootstrap=bootstrap)

# Build the gridsearch; the grid supplies n_estimators, max_features, and bootstrap
model_isf = IsolationForest(contamination=contamination_rate, 
                            n_jobs=-1)

# Define an f1_scorer that first maps the predictions (1 = inlier, -1 = outlier)
# to the class labels (0 = regular, 1 = fraud)
def f1_outlier(y_true, y_pred):
    return f1_score(y_true, np.where(y_pred == -1, 1, 0), average='macro')
f1sc = make_scorer(f1_outlier)

grid = GridSearchCV(estimator=model_isf, param_grid=param_grid, cv = 3, scoring=f1sc)
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'Isolation Forest (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = True # maps 1 to 0 and -1 to 1 - required for outlier detection models
model_score = measure_performance(best_model, X_test, y_test, map_labels)
# model_score holds (precision, recall, f1, support)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name': model_name, 
                                    'f1_score': model_score[2], 
                                    'precision': model_score[0], 
                                    'recall': model_score[1]}])], ignore_index=True)
results_df

Best: [0.61083219 0.55718259 0.55912644 0.52670328 0.5317127 ], using {'n_neighbors': 1}
KNN (tuned) model
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85385
           1       0.21      0.48      0.29        58

    accuracy                           1.00     85443
   macro avg       0.60      0.74      0.64     85443
weighted avg       1.00      1.00      1.00     85443

f1_score: 64.39%

4.2 LOF Model

We train the Local Outlier Factor Model using the same training data and evaluation procedure. The local outlier factor (LOF) is a measure of the local deviation of a data point with respect to its neighbors. It is used to identify points in a dataset that are significantly different from their surrounding points and that may therefore be considered outliers.

The LOF is a useful tool for detecting outliers in a dataset, as it considers the local context of each data point rather than the global distribution of the data. This makes it better suited to detecting outliers that stand out only within a specific region of the dataset. However, isolation forests can often outperform LOF models.

# Train a local outlier factor model (novelty=True allows predictions on new data)
model_lof = LocalOutlierFactor(n_neighbors=3, contamination=contamination_rate, novelty=True)
model_lof.fit(X_train)

# Evaluate model performance
model_name = 'LOF (baseline)'
print(f'{model_name} model')

map_labels = True 
model_score = measure_performance(model_lof, X_test, y_test, map_labels)
# model_score holds (precision, recall, f1, support)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name': model_name, 
                                    'f1_score': model_score[2], 
                                    'precision': model_score[0], 
                                    'recall': model_score[1]}])], ignore_index=True)

4.3 KNN Model

Next, we train the KNN models. KNN is a type of machine learning algorithm for classification and regression. It is a type of instance-based learning, which means that it stores and uses the training data instances themselves to make predictions, rather than building a model that summarizes or generalizes the data.

Below we add two K-Nearest Neighbor models to our list. We use the default hyperparameter configuration for the first model. The second model will most likely perform better because we optimize its hyperparameters using the grid search technique.
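The instance-based idea behind KNN fits into a few lines. The following sketch is a simplified illustration, not the scikit-learn implementation: the function knn_predict and the two-dimensional toy data are assumptions made for this example. It classifies a query point by a majority vote among its k nearest training points.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    # the "model" is just the stored training data: sort it by distance to the query
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: 0 = regular, 1 = fraud
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = [0, 0, 0, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))
print(knn_predict(train_X, train_y, (4.8, 5.1), k=3))
```

The only hyperparameter in this sketch is k, which corresponds to n_neighbors in scikit-learn and is the value tuned later in this section.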

4.3.1 KNN (default)

First, we train the default model using the same training data as before. The most important hyperparameter of the KNN algorithm is the number of neighbors (n_neighbors), which defaults to 5 in scikit-learn. By experimenting with different values of this parameter, you can try to identify the number of neighbors that maximizes the model’s performance on the given dataset. This approach can achieve better results than the default settings of the KNN algorithm, which may not be the most appropriate for the specific dataset we are working with.

# Train a KNN Model
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X=X_train, y=y_train)

# Evaluate model performance
model_name = 'KNN (baseline)'
print(f'{model_name} model')

map_labels = False # if True - maps 1 to 0 and -1 to 1 - set to False for classification models (e.g., KNN)
model_score = measure_performance(model_knn, X_test, y_test, map_labels)
# model_score holds (precision, recall, f1, support)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name': model_name, 
                                    'f1_score': model_score[2], 
                                    'precision': model_score[0], 
                                    'recall': model_score[1]}])], ignore_index=True)

4.3.2 KNN (hypertuned)

In the next step, we will train a second KNN model to improve its performance by fine-tuning its hyperparameters. Despite having only a few parameters, hyperparameter tuning can enhance the model’s ability to make accurate predictions. In this case, we will concentrate on optimizing the number of nearest neighbors considered in the KNN algorithm.

# Define hypertuning parameters
n_neighbors=[1, 2, 3, 4, 5]
param_grid = dict(n_neighbors=n_neighbors)

# Build the gridsearch; the grid supplies the candidate values for n_neighbors
model_knn = KNeighborsClassifier()
grid = GridSearchCV(estimator=model_knn, param_grid=param_grid, cv = 5)
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'KNN (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = False # if True - maps 1 to 0 and -1 to 1 - set to False for classification models (e.g., KNN)
model_score = measure_performance(best_model, X_test, y_test, map_labels)
# model_score holds (precision, recall, f1, support)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name': model_name, 
                                    'f1_score': model_score[2], 
                                    'precision': model_score[0], 
                                    'recall': model_score[1]}])], ignore_index=True)
results_df

Step #5: Measuring and Comparing Performance

Finally, we will compare the performance of our models with a bar chart that shows the f1_score, precision, and recall. If you want to learn more about classification performance, this tutorial discusses the different metrics in more detail.

print(performance_df)

performance_df = performance_df.sort_values('model_name')

fig, ax = plt.subplots(figsize=(12, 4))
tidy = performance_df.melt(id_vars='model_name').rename(columns=str.title)
sns.barplot(y='Model_Name', x='Value', hue='Variable', data=tidy, ax=ax, palette='nipy_spectral', linewidth=1, edgecolor="w")
plt.title('Model Outlier Detection Performance (Macro)')

All three metrics play an important role in evaluating performance because, on the one hand, we want to capture as many fraud cases as possible, but we also don’t want to raise false alarms too frequently.

As we can see, the optimized Isolation Forest delivers a particularly well-balanced performance.
The default Isolation Forest has a high f1_score and detects many fraud cases but frequently raises false alarms.
The opposite is true for the KNN model. Only a few fraud cases are detected here, but the model is often correct when noticing a fraud case.
The default LOF model performs slightly worse than the other models. Compared to the optimized Isolation Forest, it performs worse in all three metrics.

Summary

Credit card fraud detection is important because it helps to protect consumers and businesses, to maintain trust and confidence in the financial system, and to reduce financial losses. It is a critical part of ensuring the security and reliability of credit card transactions.

This article has shown how to use Python and the Isolation Forest Algorithm to implement a credit card fraud detection system. We developed a multivariate anomaly detection model to spot fraudulent credit card transactions. You learned how to prepare the data for testing and training an isolation forest model and how to validate this model. Finally, our comparison suggests that the Isolation Forest is a robust algorithm for anomaly detection that can outperform traditional techniques on this dataset.

I hope you enjoyed the article and can apply what you learned to your projects. Have a great day!

The post Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud appeared first on relataly.com.

Image Classification with Convolutional Neural Networks – Classifying Cats and Dogs in Python

Florian Follonier — Sun, 13 Dec 2020 14:09:31 +0000

This tutorial shows how to use Convolutional Neural Networks (CNNs) with Python for image classification. CNNs belong to the field of deep learning, a subarea of machine learning, and have become a cornerstone of many exciting innovations. There are endless applications, from self-driving cars and biometric security to automated tagging in social media. And the importance of CNNs grows steadily! So there are plenty of reasons to understand how this technology works and how we can implement it.

This article proceeds as follows: The first part introduces the core concepts behind CNNs and explains their use in image classification. The second part is a hands-on tutorial in which you will build your own CNN to distinguish images of cats and dogs. This tutorial develops a model that achieves around 82% validation accuracy. We will work with TensorFlow and Python to integrate different layers, such as Convolution Layers, Dense layers, and MaxPooling. Furthermore, we will prevent the network from overfitting the training data by using Dropout between the layers. We will also load the model and make predictions on a fresh set of images. Finally, we analyze and illustrate the performance of our image classifier.

Image Classification with Convolutional Neural Networks

The history of image recognition dates back to the mid-1960s when the first attempts were made to identify objects by coding their characteristic shapes and lines. However, this task turned out to be incredibly complex. Our human brain is trained so well to recognize things that one can easily forget how diverse the observation conditions can be. Here are some examples:

Photos can be taken from various viewpoints
Living things can have multiple forms and poses
Objects come in different forms, colors, and sizes
The picture may hide parts of the things in the picture
The light conditions vary from image to image
There may be one or multiple objects in the same image

At the beginning of the 1990s, the focus of research shifted to statistical approaches and learning algorithms.

The idea of computer vision is inspired by the fact that the visual cortex has cells activated by specific shapes and their orientation in the visual field.

The Emergence of CNNs

The basic concept of a neural network in computer vision has existed since the 1980s. It goes back to research by Hubel and Wiesel on the visual system of cats. They found that the visual cortex has cells activated by specific shapes and their orientation in the visual field. Some of their findings inspired the development of crucial computer vision technologies, such as hierarchical features with different levels of abstraction [1, 2]. However, it took another three decades of research and the availability of faster computers before the emergence of modern CNNs.

The year 2012 was a defining moment for the use of CNNs in image recognition. That year, a CNN won the ILSVRC computer vision competition for the first time. The challenge was to classify more than a hundred thousand images into 1,000 object categories. With a top-5 error rate of only 15.3%, the winning model was a CNN called “AlexNet.”

AlexNet was the first model to achieve more than 75% accuracy. In the same year, CNNs succeeded in several other competitions. Then, in 2015, the CNN ResNet exceeded human performance in the ILSVRC competition. Only a decade ago, this achievement was considered almost impossible. So how was this performance increase possible? To understand this surge in performance, let us first look at what a picture is.

Top-performing models in the ImageNet image classification challenge (Alyafeai & Ghouti, 2019)

What is an Image?

A digital image is a three-dimensional array of integer values. One dimension of this array represents the pixel width, and one dimension represents the height of the picture. The third dimension contains the color depth, defined by the image format. As shown below, we can thus represent the format of a digital image as “width x height x depth.” Next, let’s have a quick look at different image formats.

A digital image is a multidimensional integer array.

Overview of Different Image Formats

We can train CNNs with different image formats, but the input data are always multidimensional arrays of integer values. One of the most commonly used color formats in deep learning is “RGB.” RGB stands for the three color channels: “Red,” “Green,” and “Blue.” RGB images are divided into three layers of integer values, one layer for each color channel. In a standard 24-bit RGB image (8 bits per channel), the integer values in each layer range from 0 to 255. Together, the three layers can reproduce 16,777,216 (256 x 256 x 256) different colors.

In contrast to RGB images, grey-scale images only have a single color layer. This layer represents the brightness of each pixel in the image. Consequently, the format of a grey-scale image is width x height x 1. Using grey-scale images instead of RGB images can speed up the training process because less data needs to be processed. However, image data with multiple color channels provide the model with more information, which can lead to better predictions. The RGB format is often a good compromise between prediction quality and performance. Next, let’s look at how CNNs handle digital images in the learning process.
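As a small illustration of these formats, the sketch below builds a tiny RGB image as a nested Python list and derives a single-channel grey-scale version from it. This is a simplified, dependency-free example; in practice, images are held in NumPy arrays or tensors, and averaging the three channels is just one simple way to compute brightness.

```python
# A tiny 2 x 2 RGB image: image[row][column] = [red, green, blue], values 0-255
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

height = len(image)
width = len(image[0])
depth = len(image[0][0])
print(f"shape: {width} x {height} x {depth}")

# 8 bits per channel give 256 levels per channel, so 256^3 possible colors
print("possible colors:", 256 ** depth)

# Grey-scale version with a single channel: here simply the mean of R, G, and B
grey = [[[sum(pixel) // 3] for pixel in row] for row in image]
print("grey-scale depth:", len(grey[0][0]))
print(grey)
```

The grey-scale image holds a third of the values of the RGB image, which is where the savings in training time come from.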

Convolutional Neural Networks

As mentioned before, a CNN is a specific form of an artificial neural network. The main difference between the CNN and the standard multi-layer perceptron is their convolutional layers. CNNs can have other layers, but it is the convolutions that make a CNN so good at detecting objects. They allow the network to identify patterns based on features that work regardless of where in the image they occur. Let’s see how this works in more detail.

Convolutional Layers

Convolutional layers process an image in small groups of neighboring pixels. For this purpose, they use small weight matrices called filters (or kernels). Filters act as feature detectors on the original image. The primary purpose is to extract meaningful features from the input images.

During the training, the CNN slides each filter over the image and calculates the dot product between the filter and the pixels at each location. The results of these calculations are stored in a so-called feature map (sometimes called an activation map). A feature map represents where in the image a particular feature was identified. Subsequently, the values from the feature map are transformed with an activation function (usually ReLU), and the algorithm uses them as input to the next layer.
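The sliding-window computation can be sketched in plain Python. This is a minimal illustration under simplifying assumptions: a single-channel image, no padding, a stride of 1, and a hand-picked filter. In a real CNN, the filter weights are learned during training, and the function names convolve2d and relu are chosen for this example only.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # dot product between the kernel and the image patch it currently covers
            total = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(k) for b in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

def relu(feature_map):
    """Replace negative values with zero."""
    return [[max(0, value) for value in row] for row in feature_map]

# 4 x 4 grey-scale image with a vertical edge between the second and third column
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 2 x 2 filter that responds to dark-to-bright vertical edges
kernel = [
    [-1, 1],
    [-1, 1],
]

activation_map = relu(convolve2d(image, kernel))
for row in activation_map:
    print(row)
```

The feature map is non-zero only in the middle column, which is exactly where the edge sits in the image.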

Illustration of operations in the convolutional layers

Features become more complex with the increasing depth of the network. In the first layer of the network, convolutions will detect generic geometric forms and low-level features based on edges, corners, squares, or circles. The subsequent layers of the network will look at more sophisticated shapes and may, for example, include features that resemble the form of an eye of a cat or the nose of a dog. In this way, convolutions provide the network with features at different levels of detail that enable powerful detection patterns.

Exemplary convolutions of an image that contains the number “3.”

Pooling / Downsampling

A convolutional layer is usually followed by a pooling operation, which reduces the amount of data by filtering unnecessary information. This process is also called downsampling or subsampling. There are various forms of pooling. In the most common variant – max-pooling – only the highest value in a predefined grid (e.g., 2×2) is processed, and the remaining values are discarded. For example, imagine a 2×2 grid with values 0.1, 0.5, 0.4, and 0.8. The algorithm would only process the 0.8 further for this grid and use it as part of the input to the next layer. The advantages of pooling are reduced data and faster training times. Because pooling minimizes the complexity of the network, it allows for the construction of deeper architectures with more layers. In addition, pooling offers a certain protection against overfitting during training.
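The max-pooling step can be sketched as follows. This is a minimal pure-Python illustration with non-overlapping windows; the function name max_pool and the feature map values are assumptions made for this example, with the top-left window reusing the numbers from the text.

```python
def max_pool(feature_map, size=2):
    """Keep only the largest value of each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + a][j + b]
                for a in range(size) for b in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [0.1, 0.5, 0.2, 0.0],
    [0.4, 0.8, 0.3, 0.1],
    [0.6, 0.2, 0.9, 0.7],
    [0.0, 0.3, 0.5, 0.4],
]

pooled = max_pool(feature_map)
print(pooled)
```

A 4 x 4 feature map shrinks to 2 x 2, so the next layer only has to process a quarter of the values.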

Dropout

Dropout is another technique that helps prevent the network from overfitting the training data. When we activate Dropout for a layer, the algorithm will deactivate a random subset of the layer's neurons in each training step. As a result, the network needs to learn patterns that give less weight to individual neurons and thus generalize better. The dropout rate controls the percentage of switched-off neurons in each training iteration. We can configure Dropout for each layer separately.

CNNs with many layers and training epochs tend to overfit the training data. Especially here, Dropout is crucial to avoid overfitting and to achieve good prediction results with data that the network does not know yet. A typical value for the rate lies between 10% and 30%.
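The effect of the dropout rate can be illustrated with a few lines of Python. The sketch below uses the common "inverted dropout" formulation, in which the surviving activations are scaled up so that their expected sum stays the same. The function name and the toy activations are assumptions made for this illustration; in the tutorial itself, the Keras Dropout layer takes care of this.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training (inverted dropout)."""
    if not training or rate == 0:
        # at prediction time, all neurons stay active
        return list(activations)
    keep = 1.0 - rate
    # surviving activations are scaled by 1 / keep to preserve the expected sum
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000

dropped = dropout(activations, rate=0.2, rng=rng)
share_zero = dropped.count(0.0) / len(dropped)
print(f"share of switched-off neurons: {share_zero:.3f}")
print(f"mean activation after dropout: {sum(dropped) / len(dropped):.3f}")
```

With a rate of 0.2, roughly one in five neurons is switched off in each training step, while the mean activation stays close to its original value.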

Multi-Layer Perceptron (MLP)

The CNN architecture ends with multiple dense layers that are fully connected. The layers are part of a Multilayer Perceptron (MLP), which has the task of condensing the results from the previous convolutions and outputting one of the multiple classes. Consequently, the number of neurons in the final dense layer usually corresponds to the number of different classes to be predicted. It is also possible to use a single neuron in the final layer for two-class prediction problems. In this case, the last neuron typically outputs a value between 0 and 1 (for example, via a sigmoid activation), which is then rounded to a binary label of 0 or 1.
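For the two-class case with a single output neuron, the last step can be sketched as follows. The example assumes that the output neuron uses a sigmoid activation and that 0.5 serves as the decision threshold; both are common choices but assumptions here, and the input values are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn the raw output of the final neuron into a probability and a binary label."""
    probability = sigmoid(z)
    label = 1 if probability >= threshold else 0
    return label, probability

for z in (-2.0, 0.0, 3.0):
    label, probability = predict_label(z)
    print(f"raw output {z:+.1f} -> probability {probability:.3f} -> label {label}")
```

The probability also tells us how confident the network is, which is lost once we only look at the binary label.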

Building a CNN with Tensorflow that Classifies Cats and Dogs

Now that you are familiar with the basic concepts behind convolutional neural networks, we can commence with the practical part and build an image classifier. In the following, we will train a CNN to distinguish images of cats and dogs. We first define a CNN model and then feed it a few thousand photos from a public dataset with labeled images of cats and dogs.

Distinguishing cats and dogs may not sound difficult, but many challenges exist. Imagine the almost infinite circumstances in which animals can be photographed, not to mention the many forms a cat can take. These variations lead to the fact that even humans sometimes confuse a cat with a dog or vice versa. So don’t expect our model to be perfect right from the start. Our model will score around 82% accuracy on the validation dataset.

The code is available on the GitHub repository.

Cat or Dog? That’s what our CNN will predict.

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using Keras (2.0 or higher) with Tensorflow backend and the machine learning library Scikit-learn.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Download the Dataset

We will train our image classification model with a public dataset from Kaggle.com. The dataset contains more than 25,000 JPG pictures of cats and dogs. The images are uniformly named and numbered, for example, dog.1.jpg, dog.2.jpg, dog.3.jpg, cat.1.jpg, cat.2.jpg, and so on. You can download the picture set directly from Kaggle: cats-vs-dogs.

Set Up the Folder Structure

There are different ways data can be structured and loaded during model training. One approach (1) is to split the images into classes and create a separate folder for each class, class_a, class_b, etc. Another method (2) is to put all images into a single folder and define a DataFrame that splits the data into test and train. Because the cats and dogs dataset files already contain the classes in their name, I decided to go for the second approach.

Before we begin with the coding part, we create a folder structure that looks as follows:

The folder structure of our cats and dogs prediction project

If you want to use the standard paths given in this Python tutorial, make sure that your notebook resides in the parent folder of the “data” folder.

After you have created the folder structure, open the cats-vs-dogs zip file. The ZIP file contains the folders “train,” “test,” and “sample.” Unzip the JPG files from the “train” (20,000 images) and the “test” folder (5,000 pictures) to the “train” folder of your project. Afterward, the train folder should contain 25,000 images. The sample folder is intended to include your sample images, for example, of your pet. We will later use the images from the sample folder to test the model on new real-world data.

We have fulfilled all requirements and can start with the coding part.

Step #1 Make Imports and Check Training Device

We begin by setting up the imports for this project. I have put the package imports at the beginning to give you a quick overview of the packages you need to install.

Using the GPU instead of the CPU allows for faster training times. However, setting up TensorFlow to work with the GPUs can cause problems. Not everyone has a GPU; in this case, TensorFlow should usually automatically run all code on the CPU. However, should you for any reason prefer to manually switch to CPU training, set os.environ["CUDA_VISIBLE_DEVICES"] to "-1" before importing TensorFlow (the corresponding line is commented out in the code below). As a result, TensorFlow will run all code on the CPU and ignore all available GPUs.

import os
#os.environ["CUDA_VISIBLE_DEVICES"]="-1" 

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D
from tensorflow.keras.layers import Conv2D, Activation, Dropout, Flatten, Dense, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import Accuracy
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

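# let TensorFlow allocate GPU memory on demand instead of reserving it all upfront
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)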

from random import randint
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from PIL import Image
import random as rdn

Running the command below checks the TensorFlow version and the number of available GPUs in our system.

# check the tensorflow version
print('Tensorflow Version: ' + tf.__version__)

# check the number of available GPUs
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))

Tensorflow Version: 2.4.0-rc3
Num GPUs: 1

My GPU is an RTX 3080. When I wrote this article, the GPU was not yet supported by the standard TensorFlow release. I have therefore used the pre-release version of TensorFlow (2.4.0-rc3). I expect the next standard release (2.4) to work fine.

In my case, the GPU check returns one because I have a single GPU on my computer. If TensorFlow doesn’t recognize any GPU, this command will return 0. Tensorflow will then run on the CPU.

Step #2 Define the Prediction Classes

Next, we will define the path to the folders that contain our train and validation images. In addition, we will define a Dataframe “image_df,” which has all the pictures from the “train” folder. With the help of this Dataframe, we can later split the data simply by defining which images from the train folder contain the training dataset and which belong to the test dataset. Important note: the dataframe “image_df” only includes the names of the images and the classes, but not the photos themselves.

It’s good to check the distribution of classes in the training data set. For this purpose, we create a bar plot, which illustrates the number of images in both classes. And yes, I admit, I chose some custom colors to make it look fancy.

# set the directory for train and validation images
train_path = 'data/images/cats-and-dogs/train/'
#test_path = 'data/cats-and-dogs/test/'

# function to create a list of image labels 
def createImageDf(path):
    filenames = os.listdir(path)
    categories = []

    for fname in filenames:
        category = fname.split('.')[0]
        if category == 'dog':
            categories.append(1)
        else:
            categories.append(0)
    df = pd.DataFrame({
        'filename':filenames,
        'category':categories
    })
    return df

# display the first rows of the image_df DataFrame
image_df = createImageDf(train_path)
image_df.head(5)

sns.countplot(y='category', data=image_df, palette=['#2FE5C7',"#2F8AE5"], orient="h")

The number of images in the two classes is balanced, so we don’t need to rebalance the data. That’s nice!

Step #3 Plot Sample Images

I prefer not to jump directly into preprocessing but to check first that the data has been loaded correctly. We will do this by plotting some random images from the train folder. This step is not necessary, but it’s a best practice.

n_pictures = 16 # number of pictures to be shown
columns = int(n_pictures / 2)
rows = 2
plt.figure(figsize=(40, 12))
for i in range(n_pictures):
    num = i + 1
    ax = plt.subplot(rows, columns, i + 1)
    if i < columns:
        image_name = 'cat.' + str(rdn.randint(1, 1000)) + '.jpg'
    else: 
        image_name = 'dog.' + str(rdn.randint(1, 1000)) + '.jpg'
    plt.xlabel(image_name)    
    plt.imshow(load_img(train_path + image_name)) 

#if you get a deprecated warning, you can ignore it

I never expected to have so many pictures of cats and dogs one day, but I guess neither did you 🙂 Neural networks require a fixed input shape where each neuron corresponds to a pixel value.

As we can see from the sample images, the images in our dataset have different sizes and aspect ratios. For the images to fit into the input shape of our neural network, we need to put the images into a standard format. But before that, we split the data into two datasets for train and test.

Step #4 Split the Data

Image classification requires splitting the data into a train and a validation set. We define a split ratio of 1/5 so that 80% of the data goes into the training dataset and 20% goes into the validation dataset. We shuffle the data to create two DataFrames with a mix of random cat and dog pictures. In addition, we transform the classes of the images into categorical values 0->”cat” and 1->”dog”. The result is two new DataFrames: train_df (20,000 images) and validate_df (5,000 images).

image_df["category"] = image_df["category"].replace({0:'cat',1:'dog'})

train_df, validate_df = train_test_split(image_df, test_size=0.20, random_state=42)
train_df = train_df.reset_index(drop=True)
total_train = train_df.shape[0]

validate_df = validate_df.reset_index(drop=True)
total_validate = validate_df.shape[0]
train_df.head()

print(len(train_df), len(validate_df))

Output: 20000 5000

Step #5 Preprocess the Images

The next step is to define two data generators for these DataFrames, which use the names given in the train and validation DataFrames to feed the images from the “train” path into our neural network. The data generator has various configuration options. We will perform the following operations:

Rescale the images by dividing their RGB color values (0-255) by 255
Shuffle the images (again)
Bring the images into a uniform shape of 128 x 128 pixels
We define a batch size of 32, so the generator feeds the network 32 images at a time.
The class mode is “binary” so our two prediction labels are encoded as float32 scalars with values 0 or 1. As a result, we will only have a single end neuron in our network.
We perform some data augmentation techniques on the training data (incl. horizontal flip, shearing, and zoom). In this way, the model sees slightly different variants of the images in each epoch, which helps to prevent overfitting.

Some augmentation techniques

It is essential to mention that the input shape of the first layer of the neural network must correspond to the image shape of 128 x 128. The reason is that each pixel becomes an input to a neuron.
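
To make two of the operations listed above more tangible, here is a minimal sketch in plain Python that rescales pixel values and flips an image horizontally. The tiny 2 x 3 “image” and the helper names are made up for illustration; in the tutorial itself, the Keras data generator performs these steps for us.

```python
# A tiny 2 x 3 "image": each pixel is an (R, G, B) tuple with values 0-255.
# The pixel values are invented for this illustration.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(51, 102, 153), (0, 0, 0), (255, 255, 255)],
]

def rescale(img, factor=1.0 / 255):
    """Multiply every channel value by `factor` (here: map 0-255 to 0-1)."""
    return [[tuple(c * factor for c in px) for px in row] for row in img]

def horizontal_flip(img):
    """Mirror the image left to right by reversing each row of pixels."""
    return [row[::-1] for row in img]

scaled = rescale(image)
flipped = horizontal_flip(scaled)

print(scaled[1][0])   # the pixel (51, 102, 153) mapped into the 0-1 range
print(flipped[0][0])  # the former last pixel of the first row
```

Flipping changes where things appear in the image but not what the image shows, which is why it is a cheap way to present the network with additional plausible training examples.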

# set the dimensions to which we will convert the images
img_width, img_height = 128, 128
target_size = (img_width, img_height)
batch_size = 32
rescale=1.0/255

# configure the train data generator
# the augmentation options are arguments of the ImageDataGenerator itself
print('Train data:')
train_datagen = ImageDataGenerator(
    rescale=rescale,
    shear_range=0.2, # random shearing
    zoom_range=0.2, # random zoom
    horizontal_flip=True) # random horizontal flips
train_generator = train_datagen.flow_from_dataframe(
    train_df, 
    train_path,
    shuffle=True, # shuffle the image data
    x_col='filename', y_col='category',
    classes=['cat', 'dog'], # cat -> 0, dog -> 1
    target_size=target_size,
    batch_size=batch_size,
    color_mode="rgb",
    class_mode='binary')

# configure the validation data generator
# only rescaling, no augmentation
print('Test data:')
validation_datagen = ImageDataGenerator(rescale=rescale)
validation_generator = validation_datagen.flow_from_dataframe(
    validate_df, 
    train_path,    
    shuffle=False, # keep the order of validate_df so that predictions can be matched to the labels later
    x_col='filename', y_col='category',
    classes=['cat', 'dog'], # cat -> 0, dog -> 1
    target_size=target_size,
    batch_size=batch_size,
    color_mode="rgb",
    class_mode='binary')

Train data:
Found 20000 validated image filenames belonging to 2 classes.
Test data:
Found 5000 validated image filenames belonging to 2 classes.

At this point, we have already completed the data preprocessing part. The next step is to define and compile the convolutional neural network.

Step #6 Define and Compile the Convolutional Neural Network

The architecture of our image classification CNN is inspired by the famous VGGNet. However, to lower the amount of time needed to train the network, I reduced the number of layers. In this section, we define the CNN by stacking multiple layers on top of each other and then compile the model.

The first layer of our network is the input layer, which receives the preprocessed images. As already noted, the shape of the input layer needs to match the shape of our images. Considering how we have defined the format of the images in our data generators, the input shape is 128 x 128 x 3.

The subsequent layers are four convolutional layers, each followed by a pooling layer. In addition, we apply dropout with a rate of 20% after each pooling layer.

Finally, a fully connected layer with 128 neurons (followed by 50% dropout) and a single sigmoid neuron for the binary output complete the structure of the CNN.

3-dimensional Input Shape of our Neural Network

Additional Info

Loss function: measures how far the model’s predictions deviate from the true labels during training. We try to minimize this function to “steer” the model in the right direction. We use binary_crossentropy.
Optimizer: defines how the model weights are updated based on the data it sees and its loss function.
Metrics: used to monitor the training and testing steps. The following example uses accuracy, which is the fraction of correctly classified images.
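
To illustrate what the loss and the metric measure, here is a small sketch that computes binary cross-entropy and accuracy by hand. The labels and probabilities are invented, and the function names are my own; Keras computes both quantities internally during training.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels (0 = cat, 1 = dog) and predicted dog probabilities
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.4, 0.3]

print(round(binary_crossentropy(y_true, y_prob), 4))
print(accuracy(y_true, y_prob))  # 0.75, since the third sample is misclassified
```

Note that the loss reacts to how confident each prediction is, whereas accuracy only counts whether the thresholded prediction was right. This is why the loss can keep improving even when the accuracy stays flat.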

# define the input format of the model
input_shape = (img_width, img_height, 3)
print(input_shape)

# define  model
model = Sequential()
model.add(Conv2D(32, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=input_shape))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(128, (3, 3),  strides=(1, 1),activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Flatten())
model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# compile the model and print its architecture
# (older Keras versions call the learning rate argument lr)
opt = SGD(learning_rate=0.001, momentum=0.9)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())

input_shape: (100, 100, 3)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 100, 100, 32)      896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 50, 50, 32)        0         
_________________________________________________________________
dropout (Dropout)            (None, 50, 50, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 50, 50, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 25, 25, 64)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 25, 25, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 25, 25, 64)        36928     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 64)        0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 12, 12, 64)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 12, 12, 128)       73856     
_________________________________________________________________
...
Trainable params: 720,257
Non-trainable params: 0
_________________________________________________________________
None
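
The parameter counts in the summary can be checked by hand. The following sketch recomputes them for the architecture defined above (3 x 3 kernels, “same” padding, 2 x 2 max pooling). Note that the summary shown above lists 100 x 100 feature maps, so it appears to stem from a run with 100 x 100 input images; the helper therefore takes the image size as an argument. The function names are my own and not part of Keras.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights of all kernels plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

def count_params(image_size, conv_filters=(32, 64, 64, 128), dense_units=128):
    size, channels, total = image_size, 3, 0
    for filters in conv_filters:
        total += conv2d_params(3, channels, filters)
        channels = filters
        size //= 2  # 2 x 2 max pooling halves width and height (rounded down)
    flat = size * size * channels           # output size of the Flatten layer
    total += dense_params(flat, dense_units)
    total += dense_params(dense_units, 1)   # single sigmoid output neuron
    return total

print(count_params(100))  # 720257, the total reported in the summary above
print(count_params(128))  # 1179009 for 128 x 128 input images
```

Most of the parameters sit in the first fully connected layer, so the input size has a large effect on the model size even though the convolutional layers themselves do not depend on it.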

At this point, we have defined and assembled our convolutional neural network. Next, it is time to train the model.

Step #7 Train the Model

Before we train the image classifier, we still have to choose the number of epochs. More epochs can improve the model performance, but they also lead to longer training times and increase the risk that the model overfits. Finding the optimal number of epochs is difficult and often requires a trial-and-error approach. I typically start with a small number of around 5 epochs and then increase this number until further increases no longer lead to significant improvements. The code below sets an upper limit for the number of epochs and additionally uses early stopping, which ends the training once the training loss has not improved for six consecutive epochs.
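
The idea behind stopping early with a patience value can be sketched in a few lines of plain Python. This is a simplified illustration with invented loss values, not the actual Keras implementation; it ignores options such as a minimum improvement or restoring the best weights.

```python
def early_stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss   # new best value, reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical loss values, one per epoch
losses = [0.70, 0.66, 0.63, 0.64, 0.65, 0.63, 0.66]

print(early_stopping_epoch(losses, patience=3))  # 6
print(early_stopping_epoch(losses, patience=6))  # None, training runs to the end
```

A larger patience value tolerates longer plateaus but also wastes more epochs when the model has really stopped improving.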

# train the model
epochs = 40
early_stop = EarlyStopping(monitor='loss', patience=6, verbose=1)

history = model.fit(
    train_generator,
    epochs=epochs,
    callbacks=[early_stop],
    steps_per_epoch=len(train_generator),
    verbose=1,
    validation_data=validation_generator,
    validation_steps=len(validation_generator))

Epoch 1/35
625/625 [==============================] - 121s 194ms/step - loss: 0.7050 - accuracy: 0.5282 - val_loss: 0.6902 - val_accuracy: 0.5824
Epoch 2/35
625/625 [==============================] - 115s 183ms/step - loss: 0.6853 - accuracy: 0.5469 - val_loss: 0.6856 - val_accuracy: 0.5806
Epoch 3/35
625/625 [==============================] - 115s 184ms/step - loss: 0.6744 - accuracy: 0.5752 - val_loss: 0.6746 - val_accuracy: 0.5806
Epoch 4/35
625/625 [==============================] - 112s 180ms/step - loss: 0.6569 - accuracy: 0.5987 - val_loss: 0.6593 - val_accuracy: 0.6110
Epoch 5/35
625/625 [==============================] - 115s 185ms/step - loss: 0.6423 - accuracy: 0.6194 - val_loss: 0.6474 - val_accuracy: 0.6134
Epoch 6/35
625/625 [==============================] - 116s 185ms/step - loss: 0.6309 - accuracy: 0.6370 - val_loss: 0.6386 - val_accuracy: 0.6260
Epoch 7/35
625/625 [==============================] - 115s 183ms/step - loss: 0.6139 - accuracy: 0.6539 - val_loss: 0.6082 - val_accuracy: 0.6682

A quick comment on the time required to train the model: although the model is not overly complex and the size of the data is still moderate, training can take some time. I made two training runs, the first on my GPU (Nvidia GeForce RTX 3080) and the second on my CPU (AMD Ryzen 3700X). On the GPU, training took approximately 10 minutes. On the CPU, it took about 30 minutes, roughly three times as long.

After training, you may want to save the classification model and load it at a later time. You can do this with the code below. Note, however, that before loading the weights, we need to define the model with exactly the same architecture as during training.

# Save the weights
model.save_weights('cats-and-dogs-weights-v1.h5')

# Define model as during training
# model architecture

# Loads the weights
model.load_weights('cats-and-dogs-weights-v1.h5')

Step #8 Visualize Model Performance

After training the model, we want to check the performance of our image classification model. For this purpose, we can apply the same performance measures as in traditional classification projects. The code below illustrates the performance of our image classifier on the validation dataset.

To learn more about measuring model performance, check out my previous post on Measuring Model Performance.

def plot_loss(history, value1, value2, title, ylabel="Loss"):
    fig, ax = plt.subplots(figsize=(15, 5), sharex=True)
    plt.plot(history.history[value1], 'b')
    plt.plot(history.history[value2], 'r')
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel("Epoch")
    ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
    plt.legend(["Train", "Validation"], loc="upper left")
    plt.grid()
    plt.show()

# plot training & validation loss values
plot_loss(history, "loss", "val_loss", "Model loss")
# plot training & validation accuracy values
plot_loss(history, "accuracy", "val_accuracy", "Model accuracy", ylabel="Accuracy")

Next, let’s print the accuracy and a confusion matrix on the predictions from the validation dataset.

# function that returns the label for a given probability
def getLabel(prob):
    if prob > .5:
        return 'dog'
    else:
        return 'cat'

# get the predictions for the validation data
# note: the predictions only line up with validate_df if the generator does not shuffle
val_df = validate_df.copy()
val_pred_prob = model.predict(validation_generator)
val_df['pred'] = [getLabel(prob[0]) for prob in val_pred_prob]

# create a confusion matrix
y_val = val_df['category']
y_pred = val_df['pred']

print('Accuracy: {:.2f}'.format(accuracy_score(y_val, y_pred)))
cnf_matrix = confusion_matrix(y_val, y_pred)

# plot the confusion matrix in form of a heatmap

%matplotlib inline
class_names = ['cat', 'dog'] # class names in the (sorted) order used by the confusion matrix
fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names), annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Accuracy: 0.82

Step #9 Image Classification on Sample Images

Now that we have trained the model, I bet you can’t wait to test the image classifier on some sample data. For this purpose, ensure that you have some sample images in the “sample” folder. The code below feeds these images to the classifier, which predicts a label for each of them. Finally, the code plots the images in a grid together with the predicted labels.

# set the path to the sample images
sample_path = "data/images/cats-and-dogs/sample/"
sample_df = createImageDf(sample_path)
sample_df['category'] = sample_df['category'].replace({0:'cat',1:'dog'})
sample_df['pred'] = ""

# create an image data generator for the sample images - we will only rescale the images
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_dataframe(
    sample_df, 
    sample_path,    
    shuffle=False,
    x_col='filename', y_col='category',
    target_size=target_size)

# make the predictions 
pred_prob = model.predict(test_generator)
image_number = pred_prob.shape[0]

# convert the predicted probabilities into labels
sample_df['pred'] = [getLabel(prob[0]) for prob in pred_prob]
    
print('Accuracy: {:.2f}'.format(accuracy_score(sample_df['category'], sample_df['pred'])))

# define the plot size so that the grid has room for all images
nrows = 6
ncols = int(np.ceil(image_number / nrows))
fig, axs = plt.subplots(nrows, ncols, figsize=(15, 15))
for i, ax in enumerate(fig.axes):
    if i < sample_df.shape[0]:
        filepath = sample_path + sample_df.at[i ,'filename']
        img = Image.open(filepath).resize(target_size)
        ax.imshow(img)
        ax.set_title(sample_df.at[i ,'filename'] + '\n' + ' predicted: '  + str(sample_df.at[i ,'pred']))
        # True if the prediction matches the actual category
        result = sample_df.at[i ,'pred'] == sample_df.at[i ,'category']
        ax.set_xlabel(str(result))
        ax.set_xticks([]); ax.set_yticks([])
    else:
        ax.axis('off') # hide unused grid cells

Our image classifier achieves an accuracy of around 82% on the validation set. The model is not perfect, but it should have labeled most of the sample images correctly. With deeper architectures, more data, and longer training, you can create classification models that achieve accuracies above 95%.

Summary

In this tutorial, you learned how to train an image classification model. We have prepared a dataset and performed several transformations to bring the data into shape for training. Finally, we have trained a convolutional neural network to distinguish between dogs and cats. You can now use this knowledge to train image classification models that recognize other objects.

There are many other cool things that you can do with CNNs, for example, object localization in images and videos, and even stock market prediction. But these are topics for further articles.

I am always happy to receive feedback. I hope you enjoyed the article and would be happy if you left a comment. Cheers

Sources and Further Reading

Andriy Burkov Machine Learning Engineering
Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction
Charu C. Aggarwal (2018) Neural Networks and Deep Learning
Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
David Forsyth (2019) Applied Machine Learning Springer
[1] D. H. Hubel and T. N. Wiesel – Receptive Fields of Neurons in the Cat’s Striate Cortex, The Journal of physiology (1959)

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Image Classification with Convolutional Neural Networks – Classifying Cats and Dogs in Python appeared first on relataly.com.

Customer Churn Prediction – Understanding Models with Feature Permutation Importance using Python

Florian Follonier — Sun, 02 Aug 2020 13:24:28 +0000

Customer retention is a prime objective for service companies, and understanding the patterns that lead to customer churn can be the key to maintaining long-lasting client relationships. Businesses incur significant costs when customers discontinue their services, hence it’s vital to identify potential churn risks and take preemptive actions to retain these customers. Machine Learning models can be instrumental in identifying these patterns and providing valuable insights into customer behavior.

An intriguing technique, Permutation Feature Importance, allows us to discern the significance of different features of our machine learning model, thereby shedding light on their influence on customer churn. This tutorial guides you through the intricacies of this technique and its implementation.

The structure of this tutorial is as follows:

We begin by discussing the business problem of customer churn and its implications.
We introduce the concept of Permutation Feature Importance, a powerful tool to identify essential features in our machine learning model.
We transition into the hands-on coding segment, where we build a churn prediction model using Python.
Our model undergoes a classification process and hyperparameter tuning to select the most effective parameters.
Utilizing the trained model, we predict the churn probabilities for a test set of customers.
Finally, we create a feature ranking based on their impact on the model’s performance.

By employing permutation feature importance, this tutorial offers a deep-dive into the correlation between input variables and model predictions, providing actionable insights for effective customer churn management.

Customer churn prediction is a compelling use case for machine learning. It is particularly effective when combined with feature permutation importance.

What is Churn Prediction?

The effort required to persuade a new customer to sign a contract is many times higher than the cost of retaining an existing customer. According to industry experts, winning a new customer is four times more expensive than keeping an existing one. Providers that can identify churn candidates and manage to retain them can significantly reduce costs.

A crucial point is whether the provider succeeds in getting the churn candidates to stay. Sometimes it may be enough to contact the churn candidate and inquire about customer satisfaction. In other cases, this may not be enough, and the provider needs to increase the service value, for example, by offering free services or a discount. However, actions should be well thought out, as they can also have negative effects. For instance, if a customer hardly ever uses their contract, a call from the provider may even increase the desire to cancel it. Machine learning can help assess cases individually and identify the optimal anti-churn action.

About Permutation Feature Importance

Feature importance is a helpful technique for understanding the contribution of input variables (features) to a predictive model. The results from this technique can be as valuable as the predictions themselves, as they can help us understand the business context better. For example, let’s say we have trained a model that predicts which of our customers will likely churn. Wouldn’t it be interesting to know why specific customers are more likely to churn than others? Permutation feature importance can help us answer this question by providing us with a ranking of the input variables in our model by their usefulness. The idea is simple: we randomly shuffle the values of one feature at a time and measure how much the model’s performance drops. The larger the drop, the more the model relies on that feature. The resulting ranking can validate assumptions about the business context and point to relationships in the data that are worth investigating further.

Compared to neural networks, one of the most significant advantages of traditional prediction models, such as a decision tree, is their interpretability. Neural networks are black boxes because it is tough to understand the relationship between input and model predictions. In traditional models, on the other hand, we can calculate the importance of the features and use it to interpret the model and optimize its performance, for example, by removing features from the model that are not important. We, therefore, start with a simple model first and move on to more complex models once we understand the data.
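
To show the mechanics of permutation feature importance, here is a minimal from-scratch sketch in plain Python. The toy data, the rule-based “model”, and the function names are invented for illustration; later in this tutorial, we use the ready-made permutation_importance function from scikit-learn instead.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when a single column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # rebuild the data with only this one column permuted
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: [customer_service_calls, unrelated_noise]
X = [[0, 5], [1, 3], [5, 4], [6, 1], [2, 2], [7, 5], [4, 0], [1, 1]]
y = [int(calls >= 4) for calls, _ in X]

# Toy "model" that only looks at the first feature
model = lambda row: int(row[0] >= 4)

print(permutation_importance_sketch(model, X, y))
```

Shuffling the first column hurts the accuracy, while shuffling the second column changes nothing because the toy model never looks at it. Its importance is therefore exactly zero.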

Implementing a Customer Churn Prediction Model in Python

In the following, we will implement a customer churn prediction model. We will train a decision forest model on a dataset from Kaggle and optimize it using grid search. The data contains customer-level information for a telecom provider and a binary prediction label that states which customers canceled their contracts and which did not. Finally, we will calculate the feature importance to understand how the model works.

The code is available on the GitHub repository.


Prerequisites

Make sure you install all required packages. In this tutorial, we will be working with the following packages:

Pandas
NumPy
Matplotlib
Seaborn

In addition, we will be using the machine learning library Scikit-learn.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Step #1 Loading the Customer Churn Data

We begin by loading a customer churn dataset from Kaggle. If you work with the Kaggle Python environment, you can directly save the dataset into your Kaggle project. After completing the download, put the dataset under the file path of your choice, but don’t forget to adjust the file path variable in the code.

The dataset contains 3333 records and the following attributes.

Churn: The prediction label: 1 if the customer canceled service, 0 if not.
AccountWeeks: number of weeks the customer has had an active account
ContractRenewal: 1 if customer recently renewed contract, 0 if not
DataPlan: 1 if the customer has a data plan, 0 if not
DataUsage: gigabytes of monthly data usage
CustServCalls: number of calls into customer service
DayMins: average daytime minutes per month
DayCalls: average number of daytime calls
MonthlyCharge: average monthly bill
OverageFee: largest overage fee in the last 12 months
RoamMins: average number of roaming minutes

The following code will load the data from your local folder into your anaconda Python project:

import numpy as np 
import pandas as pd 
import math
from pandas.plotting import register_matplotlib_converters
import matplotlib.pyplot as plt 
import matplotlib.colors as mcolors
import matplotlib.dates as mdates 

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.inspection import permutation_importance
import seaborn as sns


# set file path
filepath = "data/Churn-prediction/"

# Load train and test datasets
train_df = pd.read_csv(filepath + 'telecom_churn.csv')
train_df.head()

   Churn  AccountWeeks  ContractRenewal  DataPlan  DataUsage  CustServCalls  DayMins  DayCalls  MonthlyCharge  OverageFee  RoamMins
0      0           128                1         1        2.7              1    265.1       110           89.0        9.87      10.0
1      0           107                1         1        3.7              1    161.6       123           82.0        9.78      13.7
2      0           137                1         0        0.0              0    243.4       114           52.0        6.06      12.2
3      0            84                0         0        0.0              2    299.4        71           57.0        3.10       6.6
4      0            75                0         0        0.0              3    166.7       113           41.0        7.42      10.1

Step #2 Exploring the Data

Before we begin with the preprocessing, we will quickly explore the data. For this purpose, we create a pair plot that shows the distribution of each attribute as well as the pairwise relationships between the attributes, separated by prediction label.

# Create a pair plot of the feature columns, colored by prediction label value
df_plot = train_df.copy()

sns.pairplot(df_plot, hue="Churn", height=2.5, palette='muted')

Histograms of the churn prediction dataset separated by prediction label (red=churn, blue= no churn)

We can see that several attributes, for example, OverageFee, DayMins, and DayCalls, are roughly normally distributed. However, the prediction label is unbalanced: far more customers stay with their contract (prediction label class = 0) than cancel it (prediction label class = 1).

Step #3 Data Preprocessing

The next step is to preprocess the data. I have reduced this part to a minimum to keep this tutorial simple. For example, I do not treat the unbalanced label classes. However, this would be appropriate to improve the model performance in a real business context. The imbalanced data is also why I chose a decision forest as a model type. Decision forests can handle unbalanced data relatively well compared to traditional models such as logistic regression.
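
One lightweight way to account for class imbalance is to weight the classes inversely to their frequency. This is what the class_weight='balanced' option does that is passed to the random forest later in this tutorial. The sketch below reproduces the heuristic documented by scikit-learn, n_samples / (n_classes * class count), in plain Python. The label counts are hypothetical and not taken from the churn dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical label vector: 85 customers who stay, 15 who churn
labels = [0] * 85 + [1] * 15
weights = balanced_class_weights(labels)

print(weights)  # the rare churn class receives the larger weight
```

With these weights, both classes contribute the same total weight to the training, so the model cannot reach a good score simply by always predicting the majority class.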

The following code splits the data into training (x_train) and test (x_test) datasets and creates the corresponding label datasets (y_train, y_test). We use 70% of the data for training, resulting in 2333 records in the training dataset and 1000 in the test dataset.

# Create Training Dataset
# select the feature columns (DayMins is not part of the feature set)
feature_columns = ['AccountWeeks', 'ContractRenewal', 'DataPlan', 'DataUsage', 'CustServCalls', 'DayCalls', 'MonthlyCharge', 'OverageFee', 'RoamMins']
x_df = train_df[feature_columns].copy()
y_df = train_df['Churn'].copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train

      AccountWeeks  ContractRenewal  DataPlan  DataUsage  CustServCalls  DayCalls  MonthlyCharge  OverageFee  RoamMins
2918            58                1         0       0.00              4       112           53.0       13.29       0.0
1884            51                0         1       3.32              2        60           74.2       10.03      12.3
2823            87                1         0       0.00              2        80           50.0        9.35      16.6
2319            83                1         1       2.35              3       105           91.5       12.65       8.7
2980            84                1         0       0.00              3        86           62.0       13.78      14.3
...
 835            27                1         0       0.00              1        75           31.0       10.43       9.9
3264            89                1         1       1.59              0        98           50.9       10.36       5.9
1653            93                0         0       0.00              1        78           42.0       10.99      11.1
2607            91                1         0       0.00              3       100           53.0       11.97       9.9
2732           130                0         0       0.00              5       106           68.0       18.19      16.9

Step #4 Fit an Optimized Decision Forest Model for Churn Prediction using Grid Search

Now comes the exciting part. We will train a series of 36 decision forests and then choose the best-performing model. The technique used in this process is called hyperparameter tuning (more specifically, grid search), and I have recently published a separate article on this topic.

The following code defines the parameters the grid search will test (max_depth, n_estimators, and min_samples_split). Then the code runs the grid search and trains the decision forests. Finally, we print out the model ranking along with model parameters.
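
To see where the 36 models come from, we can enumerate the parameter combinations ourselves. The sketch below uses only the Python standard library and sorts the parameter names alphabetically, which, to my knowledge, mirrors how scikit-learn orders the combinations internally. Treat the index mapping as an assumption and verify it against your own results table.

```python
from itertools import product

param_grid = {
    "max_depth": [2, 4, 8, 16],
    "n_estimators": [64, 128, 256],
    "min_samples_split": [5, 20, 30],
}

# Build every combination of the three value lists
names = sorted(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

print(len(combinations))      # 36 candidate models
print(len(combinations) * 5)  # 180 model fits with 5-fold cross-validation
print(combinations[28])       # {'max_depth': 16, 'min_samples_split': 5, 'n_estimators': 128}
```

The number of combinations grows multiplicatively with every additional parameter or value, which is the main reason why grid search becomes expensive quickly.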

# Define parameters
max_depth=[2, 4, 8, 16]
n_estimators = [64, 128, 256]
min_samples_split = [5, 20, 30]

param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, min_samples_split=min_samples_split)

# Build the gridsearch
# the grid search sets the tuned hyperparameters, so we only pass the fixed ones here
dfrst = RandomForestClassifier(class_weight='balanced')
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv = 5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
results_df = pd.DataFrame(grid_results.cv_results_)
results_df.sort_values(by=['rank_test_score'], ascending=True, inplace=True)

# Reduce the results to selected columns
results_filtered = results_df[results_df.columns[results_df.columns.isin(['param_max_depth', 'param_min_samples_split', 'param_n_estimators','std_fit_time', 'rank_test_score', 'std_test_score', 'mean_test_score'])]].copy()
results_filtered

    std_fit_time  param_max_depth  param_min_samples_split  param_n_estimators  mean_test_score  std_test_score  rank_test_score
28      0.004742               16                        5                 128         0.931415        0.006950                1
27      0.002620               16                        5                  64         0.925848        0.008177                2
29      0.015711               16                        5                 256         0.925846        0.006156                3
20      0.006258                8                        5                 256         0.923704        0.007961                4
19      0.001816                8                        5                 128         0.921988        0.006458                5
18      0.002161                8                        5                  64         0.919847        0.007716                6
31      0.003728               16                       20                 128         0.902690        0.011642                7
30      0.002057               16                       20                  64         0.901836        0.009789                8
32      0.004940               16                       20                 256         0.899691        0.009813                9
21      0.001994                8                       20                  64         0.898408        0.008710               10
22      0.003761                8                       20                 128         0.897121        0.007529               11
23      0.003828                8                       20                 256         0.895833        0.009159               12
33      0.003798               16                       30                  64         0.885546        0.010394               13
26      0.005560                8                       30                 256         0.885541        0.014937               14
...

The best-performing model in the table above is model number 28, which reaches a mean cross-validation accuracy of 93.1%. Its hyperparameters are as follows:

max_depth = 16
min_samples_split = 5
n_estimators = 128

We will proceed with this model. So what does this model tell us?

We can gain an overview of how our customers are distributed according to their churn probability. Just use the following code:

# Extract the best decision forest found by the grid search
best_clf = grid_results.best_estimator_

# Predicting Probabilities
y_pred_prob = best_clf.predict_proba(x_test) 
churnproba = y_pred_prob[:,1]

# Plot a histogram of the predicted churn probabilities
sns.histplot(data=churnproba)

Customer Base According to their Churn Rate

Customers who tend to churn have a churn probability greater than 0.5. They are further to the right in the diagram. So, we don’t have to worry about the customers on the far left (<0.5).
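
The 0.5 threshold is a choice rather than a fixed rule. The following sketch shows how the set of churn candidates changes with the threshold. The probabilities are invented, and the helper function is hypothetical.

```python
def churn_candidates(probabilities, threshold=0.5):
    """Return the positions of all customers above the churn threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]

# Hypothetical churn probabilities for five customers
probs = [0.05, 0.62, 0.48, 0.91, 0.30]

print(churn_candidates(probs))                 # [1, 3]
print(churn_candidates(probs, threshold=0.4))  # [1, 2, 3]
```

A lower threshold flags more customers and thus misses fewer churners, at the price of more false alarms. Which trade-off is appropriate depends on how costly a retention action is compared to a lost customer.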

Step #5 Best Model Performance Insights

Let’s take a more detailed look at the performance of the best model. We do this by calculating the confusion matrix.

If you want to learn more about measuring the performance of classification models, check out this tutorial.

# Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
class_names=[False, True] 
tick_marks = [0.5, 1.5]
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="Blues", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label'); plt.xlabel('Predicted label')
plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)

Of the 1000 customers in the test dataset, our model correctly classified 100 as churn candidates. For 832 customers, the model correctly predicted that they are unlikely to churn. In 30 cases, the model falsely classified customers as churn candidates, and 38 churners were missed and falsely classified as non-churn candidates. The result is a model accuracy of 93.2% (based on a 0.5 threshold).
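
Because the classes are unbalanced, it is worth looking beyond accuracy. The sketch below derives accuracy, precision, and recall directly from the four counts reported above; only the variable names are my own.

```python
# Counts taken from the confusion matrix described above
tp, tn, fp, fn = 100, 832, 30, 38

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted churners who really churn
recall = tp / (tp + fn)     # share of actual churners the model finds
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")   # 0.932
print(f"precision: {precision:.3f}")  # 0.769
print(f"recall:    {recall:.3f}")     # 0.725
print(f"f1 score:  {f1:.3f}")         # 0.746
```

The high accuracy is driven largely by the many customers who do not churn. The recall shows that the model still misses more than a quarter of the actual churners.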

Step #6 Permutation Feature Importance

Now that we have trained a model that gives good results, we want to understand the importance of the model’s features. With the following code, we calculate the Feature Importance score. Then we visualize the results in a barplot.

# Calculate the permutation feature importance on the test data
r = permutation_importance(best_clf, x_test, y_test, n_repeats=30, random_state=0)

# Collect the mean importance scores in a DataFrame and sort them
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permutation_score'])
data_im['feature_names'] = x_test.columns
data_im = data_im.sort_values('feature_permutation_score', ascending=False)

# Plot the barchart
fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y='feature_names', x='feature_permutation_score', data=data_im, palette='nipy_spectral')
ax.set_title("Random Forest Feature Importances")

The feature ranking can provide the starting point for deeper analysis. As we can see, the most important features are the monthly charge (MonthlyCharge), data usage (DataUsage), and customer service calls (CustServCalls). Of particular interest is the importance of customer service calls, as this could indicate that customers who have to contact customer service have had negative experiences.

Summary

This article has shown how to implement a churn prediction model using Python and scikit-learn. We have calculated the permutation feature importance to analyze which features contribute to the performance of our model. You have learned that permutation feature importance can provide data scientists with new insights into the context of a prediction model. Therefore, the technique is often a good starting point for further investigations.

I am always interested in improving my articles and learning from my audience. If you liked this article, show your appreciation by leaving a comment. And if you didn’t, let me know too. Cheers

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

And if you are interested in text mining and customer satisfaction, consider taking a look at my recent blog about sentiment analysis:

Sentiment Analysis with Naive Bayes and Logistic Regression in Python

The post Customer Churn Prediction – Understanding Models with Feature Permutation Importance using Python appeared first on relataly.com.

Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python

Florian Follonier — Mon, 06 Jul 2020 21:16:52 +0000

Are you struggling to find the best hyperparameters for your machine learning model? With Python’s Scikit-learn library, you can use grid search to fine-tune your model and improve its performance. In this article, we’ll guide you through the process of hyperparameter tuning for a classification model, using a random decision forest that predicts the survival of Titanic passengers as an example.

We’ll start by explaining the concept of grid search and how it works. Then, we’ll dive into the development and optimization of the random decision forest using Python. By defining a parameter grid and feeding it to the grid search algorithm, we can explore all possible hyperparameter combinations and find the optimal configuration for our model.

Finally, we’ll compare the performance of different model configurations to determine the best one for our classification task. Whether you’re new to machine learning or looking to boost the performance of an existing model, this step-by-step guide to hyperparameter tuning with grid search will help you achieve better results. Let’s get started!

Also: Multivariate Anomaly Detection on Time-Series Data in Python

Exemplary parameter grid for the tuning of a random decision forest with four hyperparameters

What are Hyperparameters?

Hyperparameters play a crucial role in the performance of a machine learning model. They are adjustable parameters that influence the model training process and control how a machine learning algorithm learns and how it behaves.

Unlike the internal parameters (coefficients, etc.) that the algorithm automatically optimizes during model training, hyperparameters are model characteristics (e.g., the number of estimators for an ensemble model) that we must set in advance.

Which hyperparameters are available depends on the algorithm. For example, a random decision forest model may have hyperparameters such as the number of trees and tree depth, while a neural network model may have hyperparameters such as the number of hidden layers and nodes in each layer. Finding the optimal configuration of hyperparameters can be a challenging task, as there is often no way to know in advance what the ideal values should be.

This requires experimentation with different hyperparameter settings, which can be time-consuming if done manually. Grid search is a useful tool for automating this process and efficiently finding the best hyperparameter configuration for a given model.

Hyperparameters are the little screws that we can adjust to tune a predictive model.

Efficient Hyperparameter Tuning with Exhaustive Grid Search

When we train a machine learning model, it is usually unclear which hyperparameters lead to good results. While there are estimates and rules of thumb, there is often no way to avoid trying out hyperparameters in experiments. However, machine learning models often have several hyperparameters that affect the model’s performance in a nonlinear way.

We can use grid search to automate the search for optimal model hyperparameters. The grid search algorithm exhaustively trains a model for every combination of values in a grid of parameter values. Let’s take a look at how this works.

Hyperparameter Tuning with Grid Search: How it Works

The idea behind the grid search technique is quite simple. We have a model with hyperparameters, and the challenge is to test various configurations until we are satisfied with the result. Grid search is exhaustive in that it tests all combinations of the values in a parameter grid. The number of model variants is the product of the number of values specified for each hyperparameter.

The grid search algorithm requires us to provide the following information:

The hyperparameters that we want to configure (e.g., tree depth)
For each hyperparameter, a range of values (e.g., [50, 100, 150])
A performance metric so that the algorithm knows how to measure performance (e.g., accuracy for a classification model)

For example, imagine we have the values [16, 32, 64] for n_estimators and the values [8, 16, 32] for max_depth. Then, the grid search will test 3 × 3 = 9 different parameter configurations.
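The following minimal sketch enumerates these nine configurations with the Python standard library. It only illustrates how a grid expands into individual configurations; it is not how scikit-learn implements grid search internally.

```python
from itertools import product

# The example grid from the text above
param_grid = {
    "n_estimators": [16, 32, 64],
    "max_depth": [8, 16, 32],
}

# Build one dictionary per combination of values
names = list(param_grid)
configurations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(configurations))  # 9
for config in configurations[:3]:
    print(config)
```

Adding a third hyperparameter with three values would already triple the number of configurations to 27.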

Early Stopping

Running parameter optimization against an entire grid can be time-consuming, but there are ways to shorten the process. Depending on how much time you want to invest in the search process, you can test all combinations exhaustively or shorten the process with an early stopping logic. A stopping logic defines that the search ends early when a specific criterion is met. Such a criterion could be, for example, that newly trained models underperform the average performance of previously trained models by a certain value. In this case, the search stops and returns the best models found up to that point. When you define a large search grid with many parameters, defining an early stopping logic is recommended.
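As a rough illustration of such a stopping rule, here is a hand-rolled sketch in plain Python. The function name, the tolerance, the minimum number of trials, and the toy scores are all hypothetical; in practice, evaluate would train and score a real model.

```python
from statistics import mean

def search_with_early_stopping(configurations, evaluate, tolerance=0.05, min_trials=3):
    """Evaluate configurations in order and stop as soon as a new score falls
    more than `tolerance` below the mean of all previously kept scores."""
    history = []  # (score, configuration) pairs of the models kept so far
    for config in configurations:
        score = evaluate(config)
        if len(history) >= min_trials and score < mean(s for s, _ in history) - tolerance:
            break  # stopping criterion met: discard this model and end the search
        history.append((score, config))
    best_score, best_config = max(history, key=lambda item: item[0])
    return best_config, best_score, len(history)

# Hypothetical validation scores per max_depth (illustrative values only)
toy_scores = {2: 0.80, 4: 0.82, 8: 0.83, 16: 0.70, 32: 0.85}
configs = [{"max_depth": depth} for depth in toy_scores]

best_config, best_score, n_kept = search_with_early_stopping(
    configs, lambda config: toy_scores[config["max_depth"]]
)
print(best_config, best_score, n_kept)  # {'max_depth': 8} 0.83 3
```

Note that the search stops before it reaches max_depth=32, which would have scored higher in this toy example. That is the trade-off of early stopping: it saves time but can miss better configurations later in the grid.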

Strengths and Weaknesses of Grid Search

The advantage of grid search is that the algorithm automatically identifies the best parameter configuration within the parameter grid. However, the number of possible configurations is the product of the number of values per hyperparameter, so it grows exponentially with the number of hyperparameters in the grid. So, in practice, defining a sparse parameter grid or defining stopping criteria is essential.

Grid Search is only one of several techniques that can be used to tune the hyperparameters of a predictive model. Alternative techniques include Random Search. In contrast to Grid Search, Random Search is a non-exhaustive hyperparameter-tuning technique, which randomly selects and tests specific configurations from a predefined search space. Further optimization techniques are Bayesian Search and Gradient Descent.
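To illustrate the difference, the sketch below draws a random subset of configurations from a grid instead of testing all of them. It is a simplified stand-in written with the standard library; scikit-learn provides RandomizedSearchCV for real use. The grid values and the sample size of four are arbitrary choices.

```python
import random
from itertools import product

param_grid = {"n_estimators": [64, 128, 256], "max_depth": [2, 8, 16]}

# Grid search would test all nine configurations ...
all_configs = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# ... whereas random search only tests a random sample of them
rng = random.Random(0)  # fixed seed so that the sample is reproducible
sampled_configs = rng.sample(all_configs, k=4)

print(f"{len(sampled_configs)} of {len(all_configs)} configurations selected")
for config in sampled_configs:
    print(config)
```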

A parameter grid with two hyperparameters and three values each

Evaluation Metrics

The question of which metric to optimize against inevitably arises when we talk about optimization. Generally, all common metrics available for classification or regression come into question.

Metrics for regression (more detailed description)

Mean Absolute Error (MAE)
Root Mean Squared Error (RMSE)
Relative Squared Error (RSE)

Metrics for classification (more detailed description)

Accuracy
Precision
F-1 Score
Recall
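The regression metrics listed above can be computed in a few lines of NumPy. The sketch below uses made-up target values and predictions; the numbers carry no meaning beyond illustrating the formulas.

```python
import numpy as np

# Made-up target values and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))         # Mean Absolute Error
rmse = np.sqrt(np.mean(errors ** 2))  # Root Mean Squared Error
# Relative Squared Error: squared error relative to always predicting the mean
rse = np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae:.4f}")   # 1.0000
print(f"RMSE: {rmse:.4f}")  # 1.2247
print(f"RSE:  {rse:.4f}")   # 0.4068
```

RMSE penalizes large errors more strongly than MAE, which is why the two values differ even on this small example.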

Tuning the Hyperparameters of a Random Decision Forest Classifier in Python using Grid Search

Now that we have familiarized ourselves with the basic concept of hyperparameter tuning, let’s move on to the Python hands-on part! In this part, we will work with the Titanic dataset and apply the grid search optimization technique to a classification model.

The sinking of the Titanic was one of the most catastrophic ship disasters, leading to more than 1500 casualties (The exact number is unknown due to several passengers being unregistered). The Titanic dataset contains a list of passengers with passenger information such as age, gender, cabin, ticket cost, etc., and whether they survived the Titanic sinking. The information about the passengers shows certain patterns that allow conclusions about the likelihood of the passengers surviving the accident. These data can be used to train a predictive model.

In the following, we will use the survival flag as a label and passenger information as input for a classification model. The goal is to predict whether a passenger will survive the Titanic sinking or not. The algorithm will be a random decision forest algorithm that classifies the passengers into two groups, survivors and non-survivors. Once we have trained a baseline model, we will apply grid search to optimize the hyperparameters of this model and select the best model.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Source: The National Archives/Heritage-Images/Imagestate

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have a Python environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, Matplotlib, and Seaborn.

In addition, we will be using the Python Machine Learning library Scikit-learn to implement the random forest and the grid search technique.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

About the Titanic Dataset

In this article, we will be working with the popular Titanic dataset for classification. The Titanic dataset is a well-known dataset that contains information about the passengers on the Titanic, a British passenger liner that sank in the North Atlantic Ocean in 1912 after colliding with an iceberg. The dataset includes variables such as the passenger’s name, age, fare, and class, as well as whether or not the passenger survived.

The Titanic dataset contains the following information on passengers of the Titanic:

Survived: Survival 0 = No, 1 = Yes (Prediction Label)
Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
Sex: Sex
Age: Age in years
SibSp: # of siblings/spouses aboard the Titanic
Parch: # of parents/children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

The Survived column contains the prediction label, which states whether a passenger survived the sinking of the Titanic or not.

You can download the Titanic dataset from the Kaggle website. Once you have completed the download, you can place the dataset in the file path of your choice. If you use the Kaggle Python environment, you can save the dataset directly into your Kaggle project.

We can assume that the cabin location of the passengers had an impact on their chance of surviving the sinking.

Step #1 Load the Titanic Data

The following code will load the Titanic data into our Python project. If you have placed the data outside the path shown below, don’t forget to adjust the file path in the code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier

# set file path
filepath = "data/titanic-grid-search/"

# Load train and test datasets
titanic_train_df = pd.read_csv(filepath + 'titanic-train.csv')
titanic_test_df = pd.read_csv(filepath + 'titanic-test.csv')
titanic_train_df.head()

	PassengerId	Survived	Pclass	Name							Sex	Age	SibSp	Parch	Ticket				Fare	Cabin	Embarked
0	1			0			3		Braund, Mr. Owen Harris			male	22.0	1	0	A/5 21171			7.2500	NaN		S
1	2			1			1		Cumings, Mrs. John Bradley ...	female	38.0	1	0	PC 17599			71.2833	C85		C
2	3			1			3		Heikkinen, Miss. Laina			female	26.0	0	0	STON/O2. 3101282	7.9250	NaN		S
3	4			1			1		Futrelle, Mrs. Jacques ...		female	35.0	1	0	113803				53.1000	C123	S
4	5			0			3		Allen, Mr. William Henry		male	35.0	0	0	373450				8.0500	NaN		S

Step #2 Preprocessing and Exploring the Data

Before we can train a model, we preprocess the data:

First, we replace missing values in the numeric columns with the column mean.
Second, we transform categorical features (Embarked and Sex) into numeric values. In addition, we will delete some columns to reduce model complexity.
Finally, we delete the prediction label from the training dataset and place it into a separate dataset named y_df.

# Define a function for preprocessing the data
def preprocess(df):

    # Delete some columns that we will not use
    new_df = df[df.columns[~df.columns.isin(['Cabin', 'PassengerId', 'Name', 'Ticket'])]].copy()

    # Replace missing numeric values with the column mean
    for col in new_df.select_dtypes(include='number').columns:
        new_df[col] = new_df[col].fillna(new_df[col].mean())
    # Replace missing ports of embarkation with 'C'
    new_df['Embarked'] = new_df['Embarked'].fillna('C')

    # Encode categorical values as integer values
    new_df['Sex'] = np.where(new_df['Sex'] == 'male', 0, 1)
    new_df['Embarked'] = new_df['Embarked'].map({'S': 1, 'Q': 2, 'C': 3})

    # Separate the features from the prediction label
    x = new_df.drop(columns=['Survived'])
    y = new_df['Survived']

    return x, y

# Create the feature dataset x_df and the label dataset y_df
x_df, y_df = preprocess(titanic_train_df)
x_df.head()

		Pclass	Sex	Age		SibSp	Parch	Fare	Embarked
0		3		0	22.0	1		0		7.2500	1
1		1		1	38.0	1		0		71.2833	3
2		3		1	26.0	0		0		7.9250	1
3		1		1	35.0	1		0		53.1000	1
4		3		0	35.0	0		0		8.0500	1

Let’s take a quick look at the data by creating pair plots for the columns of our dataset. Pair plots help us understand the relationships between pairs of variables in a dataset.

# Create pair plots for the feature columns, colored by the prediction label
df_plot = titanic_train_df.copy()
sns.pairplot(df_plot, hue="Survived", height=2.5, palette='muted')

The distribution plots on the diagonal of the pair plot tell us various things. For example, most passengers were between 25 and 35 years old. In addition, we can see that most passengers had low-fare tickets, while some passengers had significantly more expensive tickets.

Step #3 Splitting the Data

Next, we will split the data set into training data (x_train, y_train) and test data (x_test, y_test) using a split ratio of 70/30.

# Split the data into training and test datasets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)

Step #4 Building a Single Random Forest Model

After completing the preprocessing, we can train the first model. The model uses a random forest algorithm. The random forest algorithm has a large number of hyperparameters.

4.1 About the Random Forest Algorithm

A random forest is a robust predictive algorithm that can handle classification and regression tasks. As a so-called ensemble model, the random forest considers predictions from a group of several independent estimators.

Random decision forests have several hyperparameters that we can use to influence their behavior. However, not all of these hyperparameters have the same influence on model performance. Limiting the number of models by defining a sparse parameter grid is essential to reduce the amount of time needed to test the hyperparameters.

Therefore, we restrict the hyperparameters optimized by the grid search approach to the following two:

n_estimators determines the number of decision trees in the forest
max_depth defines the maximum depth of each decision tree, i.e., the maximum number of splits between the root and a leaf

In the scikit-learn documentation, you also find a full list of available hyperparameters. For the rest of these hyperparameters, we will use the default value defined by scikit-learn.
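If you prefer to inspect the hyperparameters and their defaults directly in code, every scikit-learn estimator exposes them via get_params(). The short sketch below prints the two hyperparameters we tune; the exact default values and the total number of hyperparameters depend on your scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns all hyperparameters of the estimator with their current values
defaults = RandomForestClassifier().get_params()

for name in ("n_estimators", "max_depth"):
    print(f"{name} = {defaults[name]}")

print(f"{len(defaults)} hyperparameters in total")
```

A max_depth of None means that the trees are grown until the leaves are pure or contain too few samples to split further.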

4.2 Implementing a Random Forest Model

We train a simple baseline model and make a test prediction with the x_test dataset. Then we visualize the performance of the baseline model in a confusion matrix:

# Train a single random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators = 100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

Confusion matrix of the best-guess random forest model

Our best-guess model accurately predicted that 151 passengers would not survive. The dark-blue number in the top-left is the group of titanic passengers that did not survive the sinking, and our model classified them correctly as non-survivors. The green area below shows the passengers who survived the sinking and were correctly classified. The other sections show the number of times our model was wrong.

In total, these results correspond to a model accuracy of 80%. Considering that this was a best-guess model, these results are pretty good. However, we can further optimize these results by using the grid search approach for hyperparameter tuning.

Step #5 Hyperparameter Tuning a Classification Model using the Grid Search Technique

By comparing the performance of different model configurations, we can find the best set of hyperparameters that yields the highest accuracy. This approach is a powerful tool for fine-tuning machine learning models and improving their performance. So let’s get started and see if we can beat the results of our best-guess model using the grid search technique!

Training and Tuning the Model

Next, we will use the grid search technique to optimize a random decision forest model that predicts the survival of Titanic passengers. We’ll define a grid of hyperparameter values in Python and then use the Scikit-learn library to train and test the model with different hyperparameter configurations. First, we will define a parameter range:

max_depth = [2, 8, 16]
n_estimators = [64, 128, 256]

We leave the other parameters at their default values. In addition, we need to define the metric against which the grid search algorithm evaluates model performance. Since we have no particular preference and our dataset is reasonably balanced, we use accuracy, which is the default score for scikit-learn classifiers. The grid search reports it as mean_test_score, the accuracy averaged over the five cross-validation folds. Then we run the grid search algorithm.

# Define the parameter grid
max_depth = [2, 8, 16]
n_estimators = [64, 128, 256]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# Build the grid search (it sets max_depth and n_estimators for each candidate model)
dfrst = RandomForestClassifier()
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, scoring='accuracy', cv=5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best: {0:.4f}, using {1}".format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df

Best: 0.8219, using {'max_depth': 8, 'n_estimators': 128}
 
 	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_max_depth	param_n_estimators	params	split0_test_score				split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.057045		0.001108		0.005001		0.000001		2				64					{'max_depth': 2, 'n_estimators': 64}	0.824				0.800				0.784				0.774194			0.798387		0.796116	0.016883	4
1	0.112051		0.002088		0.009490		0.000775		2				128					{'max_depth': 2, 'n_estimators': 128}	0.760				0.824				0.784				0.750000			0.782258		0.780052	0.025523	9
2	0.221600		0.003740		0.016487		0.000448		2				256					{'max_depth': 2, 'n_estimators': 256}	0.792				0.824				0.784				0.774194			0.790323		0.792903	0.016756	5
3	0.061998		0.001410		0.005801		0.000400		8				64					{'max_depth': 8, 'n_estimators': 64}	0.784				0.824				0.792				0.806452			0.862903		0.813871	0.028044	3
4	0.122886		0.002652		0.009587		0.000480		8				128					{'max_depth': 8, 'n_estimators': 128}	0.784				0.848				0.808				0.806452			0.862903		0.821871	0.029089	1
5	0.250295		0.007654		0.018557		0.000836		8				256					{'max_depth': 8, 'n_estimators': 256}	0.800				0.824				0.800				0.806452			0.862903		0.818671	0.023797	2
6	0.065602		0.000505		0.005800		0.000399		16				64					{'max_depth': 16, 'n_estimators': 64}	0.736				0.808				0.784				0.766129			0.846774		0.788181	0.037557	6
7	0.127662		0.003297		0.008600		0.004080		16				128					{'max_depth': 16, 'n_estimators': 128}	0.752				0.800				0.784				0.758065			0.846774		0.788168	0.034078	7
8	0.259617		0.003121		0.018873		0.000537		16				256					{'max_depth': 16, 'n_estimators': 256}	0.752				0.784				0.776				0.766129			0.846774		0.784981	0.032690	8

The table above gives an overview of the tested model configurations; the rank_test_score column ranks them by their mean test score. The fifth model (index 4) achieved the best results. Its parameters are a maximum depth of 8 and 128 estimators.
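To read the ranking more easily, we can sort the configurations by their score. The sketch below rebuilds a small DataFrame from the mean test scores shown in the table above (rounded to six decimals) instead of rerunning the grid search; with a fitted grid search you would sort results_df in the same way.

```python
import pandas as pd

# Mean test scores copied from the results table above
results = pd.DataFrame({
    "max_depth":       [2, 2, 2, 8, 8, 8, 16, 16, 16],
    "n_estimators":    [64, 128, 256, 64, 128, 256, 64, 128, 256],
    "mean_test_score": [0.796116, 0.780052, 0.792903, 0.813871, 0.821871,
                        0.818671, 0.788181, 0.788168, 0.784981],
})

# Rank 1 is the configuration with the highest mean test score
results["rank"] = results["mean_test_score"].rank(ascending=False).astype(int)
ranked = results.sort_values("rank").reset_index(drop=True)

print(ranked.head(3))
```

The three best configurations all use max_depth=8, which suggests that the tree depth matters more than the number of estimators in this grid.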

Model Evaluation

We select the best model and use it to predict the test data set. We visualize the results in another confusion matrix.

# Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

Confusion matrix of the best grid search model

The confusion matrix shows the results of the best model found by the grid search. It correctly classified 148 passengers as non-survivors and 76 passengers as survivors; in 44 cases, the model was wrong. This corresponds to an overall accuracy of roughly 83.6% (224 of 268 test passengers), which means the best grid search model outperforms our initial best-guess model.

Summary

In the hands-on part of this article, we developed a random decision forest that predicts the survival of Titanic passengers using Python and scikit-learn. First, we developed a baseline model with best-guess parameters. Subsequently, we defined a parameter grid and used the grid search technique to tune the hyperparameters of the random decision forest. In this way, we quickly identified a configuration that outperforms our initial baseline model and demonstrated how Grid Search can help optimize the hyperparameters of a classification model. The technique is not limited to classification; it can also be used to optimize the performance of regression models.

I hope this article was helpful. I am always interested to learn and improve. So, if you have any questions or suggestions, please write them in the comments.


The post Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python appeared first on relataly.com.

Classifying Purchase Intention of Online Shoppers with Python

Florian Follonier — Mon, 11 May 2020 21:42:35 +0000

Online shopping has become a part of our daily lives, and online stores are continually seeking to improve their sales. One way to achieve this is by using machine learning to predict customers’ purchase intentions. This innovative process can help businesses understand their customers’ behavior and tailor their marketing strategies accordingly.

In this article, we will explore the practical side of purchase intention prediction. Our focus is on developing a classification model that predicts whether a visitor will make a purchase or not. We’ll use Scikit-Learn’s machine learning library to train a Logistic Regression algorithm, and evaluate the model’s performance. Our ultimate goal is to provide insights into the circumstances under which customers make purchase decisions.

Predicting purchase intentions can offer significant benefits to online stores, such as identifying potential customers who are most likely to buy and targeting their marketing efforts accordingly. By understanding the practical application of machine learning for purchase intention prediction, online businesses can gain a competitive edge and increase their revenue.

Also: Sentiment Analysis with Naive Bayes and Logistic Regression in Python

About Modeling Customer Purchase Intentions

Customer purchase intention prediction is the process of using machine learning algorithms to predict the likelihood that a particular customer will make a purchase. This can be useful for various applications, such as identifying potential customers most likely interested in a particular product or service and targeting marketing and sales efforts accordingly.

To make accurate predictions about customer purchase intentions, it is important to have access to high-quality data about the customer, such as their demographic information, purchasing history, and other relevant factors. By analyzing this data and applying appropriate machine learning algorithms, it is possible to identify patterns and trends that can predict the likelihood that a particular customer will make a purchase.

There are many different approaches to customer purchase intention prediction, and the specific methods used can vary depending on the application and the data available. Some common techniques for predicting customer purchase intentions include using regression analysis to model the relationship between purchase intentions and other variables and using classification algorithms to classify customers as likely or unlikely to make a purchase. By using these techniques, it is possible to make more accurate and useful predictions about customer purchase intentions.

Also: Customer Churn Prediction – Understanding Models with Feature Permutation Importance

Customer purchase intentions sometimes follow patterns that can be used for predictive purposes. Image created with Midjourney.

How Modeling Purchase Intentions can Lead to a Better Customer Understanding

Predicting the purchase intentions of online shoppers can be a step for online stores to understand their customers better. Creating predictive models makes it possible to draw conclusions about the factors that influence customers’ buying behavior. At what time of day are our customers most inclined to buy? For which products do customers often abandon the purchase process? Such questions are fascinating for marketing departments. Once understood, the answers can enable marketers to optimize their customers’ buying experience and achieve a higher conversion rate. In this way, intention prediction can help online stores target customers with the right products at the right time and thus take a step toward marketing automation.

A classification model that predicts the buying intention of online shoppers

Implementing a Prediction Model for Purchase Intentions with Python

Logistic regression is a widely-used algorithm in machine learning that is particularly useful for solving two-class classification problems. One of the primary benefits of using logistic regression models is that they can help us understand the factors that influence the predictions made by the model. This interpretability is a key advantage of logistic regression, making it a popular choice in many real-world applications.

In the next steps of our analysis, we will develop a two-class classification model that utilizes the logistic regression algorithm to predict the purchase intentions of online shoppers. By analyzing a set of features that are likely to influence a shopper’s decision to purchase, such as the pages visited, bounce and exit rates, and the visitor type, we can build a model that predicts the likelihood of a shopper completing a purchase. The logistic regression algorithm will be particularly useful in this case, as it allows us to identify which features are the most significant predictors of purchase intention.
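To give a feeling for this interpretability, the following sketch fits a logistic regression to synthetic data and prints the learned coefficients. The data and the feature names are placeholders, not the shopping dataset used below. Scaling the features to a common range first makes the coefficient magnitudes roughly comparable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]

# Scale all features to [0, 1] so that the coefficients are on a comparable scale
X_scaled = MinMaxScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# A positive coefficient raises the predicted probability of class 1, a negative one lowers it
ranked = sorted(zip(feature_names, model.coef_[0]), key=lambda pair: abs(pair[1]), reverse=True)
for name, coefficient in ranked:
    print(f"{name}: {coefficient:+.3f}")
```

Features with coefficients close to zero contribute little to the prediction, while large absolute values mark the strongest predictors.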

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Prerequisites

In this tutorial, we will be using the machine learning library Scikit-learn and Seaborn for visualization, in addition to standard packages such as pandas, NumPy, and Matplotlib. You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

About the Dataset

In this tutorial, we will be working with a public dataset from Kaggle.com. The data consists of 18 attributes (17 features plus the Revenue label) recorded for 12,330 shopping sessions. You can download the data via the link below:

online_shoppers_intention.csv Download

The data stems from a large shopping website that recorded its sessions over one year. Each record belongs to a separate shopping session and user, which avoids a bias toward a specific period, user, or day.

Below you will find an overview of the features contained in the data (Source: Kaggle.com):

“Administrative,” “Administrative Duration,” “Informational,” “Informational Duration,” “Product Related,” and “Product-Related Duration” represent the number of different types of pages visited by the visitor in that session and the total time spent in each of these page categories.
The “Bounce Rate,” “Exit Rate,” and “Page Value” features represent the metrics measured by “Google Analytics” for each page on the e-commerce site.
The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (e.g., Mother’s Day, Valentine’s Day)
The dataset also includes an operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.

The ‘Revenue’ attribute is the class label, which we will use as the prediction label.

Step #1 Load the Data

We begin by loading the shopping dataset into a Pandas DataFrame. Afterward, we will print a brief overview of the data.

import calendar
import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib import cm
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load train data
filepath = "data/classification-online-shopping/"
df_shopping_base = pd.read_csv(filepath + 'online_shoppers_intention.csv') 
df_shopping_base

	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType			Weekend	Revenue
0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			Feb		1					1		1		1			Returning_Visitor	False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			Feb		2					2		1		2			Returning_Visitor	False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			Feb		4					1		9		3			Returning_Visitor	False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			Feb		3					2		2		4			Returning_Visitor	False	False
4	0.0				0.0						0.0				0.0						10.0			627.500000				0.02		0.05		0.0			0.0			Feb		3					3		1		4			Returning_Visitor	True	False

Step #2 Cleaning the Data

Before we can start training our prediction model, we’ll do some cleanups (handling missing data, data type conversions, treating outliers, and so on).

# Map the visitor type to integer codes
print(df_shopping_base['VisitorType'].unique())
df_shop = df_shopping_base.replace({'VisitorType' : { 'New_Visitor' : 0, 'Returning_Visitor' : 1, 'Other' : 2 }})

# Convert the month column to numeric values (1 = Jan, ..., 12 = Dec).
# The dataset spells June as 'June', while calendar.month_abbr expects 'Jun'.
month_abbr = list(calendar.month_abbr)
df_shop['Month'] = [month_abbr.index(m) for m in df_shop['Month'].replace('June', 'Jun')]

# Delete records with NAs
df_shop.dropna(inplace=True)

df_shop.head()

['Returning_Visitor' 'New_Visitor' 'Other']
	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType	Weekend	Revenue
  0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			2		1					1		1		1			1			False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			2		2					2		1		2			1			False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			2		4					1		9		3			1			False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			2		3					2		2		4			1			False	False
4	0.0				0.0						0.0				0.0						10.0			627.50

Step #3 Exploring the Data

Next, we will familiarize ourselves with the data.

3.1 Class Labels

First, we take a look at the class labels to see how balanced they are. If class labels are balanced, it means that each class has an approximately equal number of examples in the training data. This is important because it helps ensure that the trained model will be able to make accurate predictions on new data. If the class labels are unbalanced, then the model is more likely to be biased towards the more common classes, which can lead to poor performance on less common classes. Additionally, unbalanced class labels can make it more difficult to evaluate the performance of a machine learning model, because the model’s accuracy may not be an accurate reflection of its ability to generalize to new data.

# Checking the balance of prediction labels
plt.figure(figsize=(16,2))
fig = sns.countplot(y="Revenue", data=df_shop, palette="muted")
plt.show()

Our class labels are imbalanced, as there are many more sessions in the data with the label “False.” The reason is that most visitors do not buy anything. Imbalanced data can affect the performance of classification models. But now that we are aware of the imbalance in our data, we can choose appropriate evaluation metrics later.
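To put a number on the imbalance, the class shares can be computed directly. The following is a minimal sketch that uses a small made-up label column instead of df_shop['Revenue']; the counts are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for df_shop['Revenue']; the counts are made up for illustration
revenue = pd.Series([False] * 85 + [True] * 15, name="Revenue")

# Relative frequency of each class label
class_share = revenue.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

The majority-class share is a useful reference point: any accuracy we report later should be compared against it.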

3.2 Feature Correlation

When developing classification models, not all features are usually equally useful. Ideally, features are not strongly correlated with each other, because correlated features provide redundant information to a machine learning model. If two or more features are highly correlated, they convey largely the same information, which can make the model’s coefficients unstable and its predictions less reliable. Additionally, correlated features make it more difficult to interpret the model’s predictions, because it is not clear which of the features is actually contributing to the model’s decision-making process.

Let’s check which of our features are correlated. First, we will create a series of Whiskerplots for the features in our dataset. They help us identify potential outliers and get a better idea of how the data looks.

# Whiskerplots
df_shop.drop('Revenue', axis=1).plot(kind='box', 
                                subplots=True, layout=(4,4), 
                                sharex=False, sharey=False, 
                                figsize=(14,14), 
                                title='Whisker plot for input variables')
plt.show()

Feature Whiskerplots

The Whiskerplots show that there are a couple of outliers in the data. However, the outliers are not significant enough to worry about.
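The points that a box plot draws beyond the whiskers follow a simple rule: with Matplotlib's default settings, a value is flagged when it lies more than 1.5 times the interquartile range (IQR) below the first or above the third quartile. The sketch below applies that rule to count outliers. The helper name count_iqr_outliers and the sample durations are hypothetical and not part of the original tutorial.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Hypothetical session durations in seconds; the last value is an obvious outlier
durations = pd.Series([10, 12, 11, 13, 12, 14, 11, 500], name="duration")
print(count_iqr_outliers(durations))
```

Applied per column of df_shop, a count like this would give a rough sense of how many sessions sit outside the whiskers.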

Pair plots are another way of visualizing the data. The panels on the diagonal show the distribution of each feature, separated by prediction label, and give a rough sense of its density. The remaining panels show scatter plots for each pair of features. To create the pair plots, run the code below.

# Create pairplots for feature columns separated by prediction label value
df_plot = df_shop.copy()

# The prediction label 'Revenue' determines the color of the data points
sns.pairplot(df_plot, hue="Revenue", height=2.5)

Shopper-Buying-Intention pair plots with seaborn

Finally, we create a correlation matrix and visualize it as a heat map. The matrix provides a quick overview of which features are correlated and which are not.

# Feature correlation
plt.figure(figsize=(15,4))
f_cor = df_shop.corr()
sns.heatmap(f_cor, cmap="Blues_r")

The correlation plot shows that the following feature pairs are highly correlated:

ProductRelated and ProductRelated_Duration
BounceRates and ExitRates

plt.figure(figsize=(8,5))
sns.scatterplot(x= 'BounceRates',y='ExitRates',data=df_shop,hue='Revenue')
plt.title('Bounce Rate vs. Exit Rate', fontweight='bold', fontsize=15)
plt.show()

plt.figure(figsize=(8,5))
sns.scatterplot(x= 'ProductRelated',y='ProductRelated_Duration',data=df_shop,hue='Revenue')
plt.title('Product Related vs. Product Related Duration', fontweight='bold', fontsize=15)
plt.show()

When we start to train our model, we will only use one of the features from the two pairs.
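Instead of reading correlated pairs off the heat map by eye, they can also be listed programmatically. The sketch below does this on synthetic data. The helper highly_correlated_pairs and the threshold of 0.8 are assumptions for illustration, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, correlation) for pairs with |r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic demo data (not the shopping dataset): 'b' is almost a copy of 'a'
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = highly_correlated_pairs(demo)
print(pairs)
```

Calling the same helper on df_shop should surface the pairs seen in the heat map, depending on the threshold chosen.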

Step #4 Data Preprocessing

Now that we are familiar with the data, we can prepare it to train the purchase intention classification model. First, we select a subset of features from the shopping dataset, leaving out, among others, ProductRelated_Duration and ExitRates, each of which is highly correlated with a feature we keep. Second, we split the data into a train and a test set, with 70% of the records going into the train set. X_train and X_test contain the features, while y_train and y_test contain the respective prediction labels. Third, we use the MinMaxScaler to scale the features to a range between 0 and 1. The scaler is fitted on the training data only, and the same transformation is then applied to the test data. Scaling puts all features on a comparable range, which helps the solver converge and can improve classification performance.

# Separate labels from training data
features = ['Administrative', 'Administrative_Duration', 'Informational', 
            'Informational_Duration', 'ProductRelated', 'BounceRates', 'PageValues', 
            'Month', 'Region', 'TrafficType', 'VisitorType']
X = df_shop[features] # Feature columns
y = df_shop['Revenue'] # Prediction label

# Split the data into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Scale the numeric values
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
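To make the scaling step concrete, here is a small sketch with toy numbers (not the shopping data). It shows that the MinMaxScaler learns the minimum and maximum from the training data and reuses them for the test data, which matches the formula (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values for illustration only
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10 from the training data
test_scaled = scaler.transform(test)        # reuses the training min and max

# Same result computed by hand: (x - min) / (max - min)
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))
print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```

Note that test values outside the training range end up outside the interval from 0 to 1 (here, 12.0 is scaled to 1.2). This is expected when the scaler is fitted on the training data only.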

Step #5 Train a Purchase Intention Classifier

Next, it is time to train our prediction model. Various classification algorithms could be used to solve this problem, for example, decision trees, random forests, neural networks, or support-vector machines. We will use the logistic regression algorithm, a common choice for simple two-class prediction problems.

We start the training process using the “fit” method of the logistic regression algorithm.

# Training a classification model using logistic regression 
logreg = LogisticRegression(solver='lbfgs')
score = logreg.fit(X_train, y_train).decision_function(X_test)

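The call to “decision_function” returns a confidence score for each session in the test dataset. The score is the signed distance to the decision boundary: positive values mean that the model predicts a purchase (True), and negative values mean that it predicts no purchase (False).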
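The relationship between these scores, the predicted probabilities, and the predicted labels can be checked on a small example. The sketch below uses synthetic two-class data from make_classification as a stand-in for the shopping data; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data as a stand-in for the shopping data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

scores = model.decision_function(X_demo)   # signed distance to the decision boundary
proba = model.predict_proba(X_demo)[:, 1]  # predicted probability of the positive class

# For a binary logistic regression, the probability is the sigmoid of the score
sigmoid = 1.0 / (1.0 + np.exp(-scores))
print(np.allclose(sigmoid, proba))

# predict() corresponds to checking the sign of the score
labels_from_scores = (scores > 0).astype(int)
print(np.array_equal(model.predict(X_demo), labels_from_scores))
```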

Step #6 Evaluate Model Performance

Finally, we will evaluate the performance of our classification model. For this purpose, we first create a confusion matrix. Then we calculate and compare different error metrics.

6.1 Confusion Matrix

The confusion matrix is a holistic and clean way to illustrate the results of a classification model. It differentiates between predicted labels and actual labels. For a binary classification model, the matrix comprises four quadrants (2×2), each showing the number of cases that fall into it.

# create a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

# create heatmap (rows: actual labels, columns: predicted labels)
# In a Jupyter notebook, inline plots can be enabled with: %matplotlib inline
class_names = [False, True] # names of the classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names),
            annot=True, cmap="YlGnBu", fmt='g', ax=ax)
ax.xaxis.set_label_position("top")
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()

In the upper left (0,0), we see that the model correctly predicted for 3102 online shopping sessions that these sessions will not lead to a purchase (True negatives). In 30 cases, the model was wrong and expected that there would be a purchase, but there wasn’t (False positives). For 412 buyers, the model predicted that they would not buy anything, even though they were buying something (False negatives). In the lower right corner, we see that only in 151 cases could buyers be correctly identified as such (True positives).
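The four counts can also be read directly from the confusion matrix array. For a binary problem, flattening the matrix with ravel() yields the counts in the order true negatives, false positives, false negatives, true positives. The sketch below uses a tiny hand-made example rather than the shopping data.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (not the shopping data)
y_true = [False, False, False, True, True, False]
y_hat = [False, True, False, True, False, False]

# Rows are actual labels, columns are predicted labels, both ordered [False, True]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```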

6.2 Performance Metrics for Classification Models

Next, let’s take a brief look at the performance metrics. Four standard metrics that measure the performance of classification models are Accuracy, Precision, Recall, and f1_score.

print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))

Accuracy

The accuracy on the test set shows that 88% of the online shopper sessions were correctly classified. However, our data is imbalanced. That is to say, most labels have the value “False,” and only a few target labels are “True.” For comparison, a model that labels every session as “False” would already reach about 85% accuracy on this test set, because 3,132 of the 3,695 sessions did not lead to a purchase. Consequently, we must ensure that our model does not classify all online shoppers as “non-buyers” (label: False) but also correctly predicts the buyers (label: True).

Precision

We calculate the precision as the number of True Positives divided by the sum of True Positives and False Positives. Precision only looks at the sessions that the model predicted as purchases, so it does not reveal how many actual buyers the model missed. Therefore, it does not say much about our model on its own. The precision score for our model is 83%, a little lower than the accuracy.

Recall

We calculate the Recall by dividing the number of True Positives by the sum of the True Positives and the False Negatives. The Recall of our model is 27%, which is significantly below accuracy and precision. In our case, the Recall is more meaningful than accuracy and precision because it penalizes the high number of buyers that the model failed to identify (False Negatives).

F1-Score

The formula for the F1-Score is 2*((precision*recall)/(precision+recall)). The F1-Score is the harmonic mean of precision and Recall, so it is pulled toward the lower of the two values. Because our Recall is low, the F1-Score of our model is only 41%. Imagine we want to optimize our classification model further. In this case, we should look out for both F1-Score and Recall.
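As a cross-check, the four metrics can be recomputed by hand from the counts reported in section 6.1 (3102 true negatives, 30 false positives, 412 false negatives, 151 true positives). This is a plain arithmetic sketch that does not depend on scikit-learn.

```python
# Counts taken from the confusion matrix in section 6.1
tn, fp, fn, tp = 3102, 30, 412, 151

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"f1_score:  {f1:.2f}")
```

The rounded results (0.88, 0.83, 0.27, and 0.41) agree with the values discussed in this section.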

6.3 Interpretation

Metrics for classification models can be misleading. We should thus choose them carefully. Depending on which use case we are dealing with, False-negative and False-positive predictions can have different costs. Therefore, model evaluation is not always about exactness (precision and accuracy). Instead, the choice of performance metrics depends on what we want to achieve.

The challenge for our model is to correctly classify the smaller group of buyers (True positives). So, optimizing our model would mean improving the F1-Score and Recall without significantly lowering the accuracy.
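One common way to trade precision for Recall, which this tutorial does not pursue further, is to lower the probability threshold above which a session is labeled as a purchase. The sketch below illustrates the effect on synthetic imbalanced data. The data, the variable names, and the threshold of 0.3 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 85% negatives) as a stand-in for the shopping data
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)
model = LogisticRegression(solver='lbfgs').fit(X_tr, y_tr)

# Predicted probability of the positive class
proba = model.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):  # 0.5 is the default; 0.3 is an arbitrary illustrative value
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only add predicted positives, so Recall never decreases, while precision typically drops. Which balance is acceptable depends on the cost of False Positives and False Negatives in the use case.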

Step #7 Insights on Customer Purchase Intentions

Finally, we will use permutation feature importance to gain additional insights into our prediction model’s features. Permutation feature importance measures how much the model’s score drops when the values of a single feature are randomly shuffled. Features with a high score have a substantial impact on the predictions. In contrast, features with scores close to zero (or below zero) play a lesser role in the predictions.

# Compute the permutation feature importance of the trained model on the test set
from sklearn.inspection import permutation_importance
r = permutation_importance(logreg, X_test, y_test, n_repeats=30, random_state=0)

# Plot the barchart
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permutation_score'])
data_im['feature_names'] = X.columns
data_im = data_im.sort_values('feature_permutation_score', ascending=False)

fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y=data_im['feature_names'], x="feature_permutation_score", data=data_im, palette='nipy_spectral')
ax.set_title("Logistic Regression Feature Importances")

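We can see that the three features with the highest impact are PageValues, BounceRates and Administrative_Duration. Note that permutation importance shows how strongly a feature influences the predictions, not in which direction, so the directions stated here should be read together with the data and the model coefficients.

The higher the page’s value, the higher the customer’s chance to make a purchase.
The higher the average bounce rate of the pages that the customer visits, the higher the chance the customer makes a purchase.
In contrast, the more time a customer spends on administrative settings, the lower the chance the customer completes the purchase.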
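Permutation importance by itself does not indicate the direction of an effect. For a logistic regression, one way to check the direction is the sign of the fitted coefficients: a positive coefficient raises the predicted probability of the positive class as the feature grows, and a negative one lowers it. The sketch below demonstrates this on synthetic data with hypothetical feature names. For the tutorial's model, the coefficients would be found in logreg.coef_.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: feature_0 raises the chance of the positive class, feature_1 lowers it
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
noise = rng.normal(scale=0.5, size=500)
y_demo = (X_demo[:, 0] - X_demo[:, 1] + noise > 0).astype(int)

model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

for name, coef in zip(["feature_0", "feature_1"], model.coef_[0]):
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name}: coefficient {coef:+.2f} ({direction} the predicted probability)")
```

Because the tutorial scales all features to the same range, the coefficient signs are directly comparable across features, although correlated features can still distort individual coefficients.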

These were just a few sample findings. There is much more to explore in the data, and deeper analysis can uncover much more about the customers’ buying decisions.

Summary

This article has presented customer purchase prediction as an interesting use case for machine learning in e-commerce. After discussing the use case, we have developed a classification model that predicts the purchase intentions of online shoppers. You have learned to preprocess the data, train a logistic regression model and evaluate the model’s performance. Classifying purchase intentions can help online shops understand their customers better and automate certain online marketing activities. The previous section showed how marketers could use this to gain further insights into their customers’ behavior.

Thanks for reading and if you have any questions, let me know in the comments.

The post Classifying Purchase Intention of Online Shoppers with Python appeared first on relataly.com.