
Will they Buy or just Browse? Predicting Purchase Intentions of Online Shoppers with Python

Most online stores welcome countless visitors every day, but only a fraction of those visitors will make a purchase. Machine learning can predict whether online shoppers will buy something or browse around. In this article, we will develop a classification model that predicts purchase intentions. We assume a two-class prediction problem, where the goal is to predict the labels “buys” and “doesn’t buy” for a group of visitors. Our model uses a Logistic Regression algorithm from the Scikit-Learn machine learning library.

The remainder of this article proceeds as follows: We begin by briefly discussing why it is rewarding for online shops to use models that predict purchase intentions. Then we turn to the practical part and train a two-class logistic regression model that predicts purchase intentions. Finally, we evaluate the prediction performance of our model and draw conclusions about the circumstances under which customers make purchase decisions.

Modelling Purchase Intentions Can Lead to a Better Customer Understanding

Predicting the purchase intentions of online shoppers can be a step for online stores to understand their customers better. Creating predictive models makes it possible to identify the factors that influence customers’ buying behavior. At what time of day are our customers most inclined to buy? For which products do customers often abandon the purchase process? Such questions are fascinating for marketing departments. Once understood, they can enable marketers to optimize their customers’ buying experience and achieve a higher conversion rate. In this way, intention prediction can help online stores target customers with the right products at the right time and thus take a step toward marketing automation.

Predicting purchase intentions - a two-class classification model for the buying intentions of online shoppers

Implementing a Prediction Model for Purchase Intentions with Python

In the following, we will develop a two-class classification model that uses the logistic regression algorithm to predict the purchase intentions of online shoppers. Logistic regression is a simple algorithm that is commonly used to solve two-class classification problems. One advantage of logistic regression models is that they allow us to understand the factors influencing the predictions.

As usual, you can find the entire code sample in the relataly repository on GitHub.

About the Dataset

In this tutorial, we will be working with a public dataset from Kaggle.com. The data comprises 12,330 shopping sessions, each described by 18 attributes. You can download the data via the link below:

The sessions were recorded over a one-year period, and each record belongs to a separate shopping session and user. The data therefore avoids bias toward a specific period, user, or day.

Below you find an overview of the features contained in the data (Source: Kaggle.com):

  • “Administrative”, “Administrative Duration”, “Informational”, “Informational Duration”, “Product Related” and “Product-Related Duration” represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories. 
  • The “Bounce Rate”, “Exit Rate” and “Page Value” features represent the metrics measured by “Google Analytics” for each page in the e-commerce site.
  • The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine’s Day)
  • The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit falls on a weekend, and the month of the year.

The ‘Revenue’ attribute is the class label, also called the “prediction label.”

Step #1 Load the Data

We begin by loading the shopping dataset into a Pandas DataFrame. Afterward, we will print a brief overview of the data.

import calendar
import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib import cm
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.inspection import permutation_importance
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load train data
filepath = "data/classification-online-shopping/"
df_shopping_base = pd.read_csv(filepath + 'online_shoppers_intention.csv') 
df_shopping_base

Step #2 Cleaning the Data

Before we can start training our prediction model, we need to do some cleanup: handling missing data, converting data types, treating outliers, and so on.

# Replace the VisitorType strings with integer codes
print(df_shopping_base['VisitorType'].unique())
df_shop = df_shopping_base.replace({'VisitorType' : { 'New_Visitor' : 0, 'Returning_Visitor' : 1, 'Other' : 2 }})

# Convert the Month column from month names to numeric values (1-12)
monthlist = df_shop['Month'].replace('June', 'Jun')  # normalize 'June' to the standard abbreviation
mlist = []
m = np.array(monthlist)
for mi in m:
    a = list(calendar.month_abbr).index(mi)  # map the month abbreviation to its number
    mlist.append(a)
df_shop['Month'] = mlist

# Delete records with NAs
df_shop.dropna(inplace=True)

df_shop.head()

Step #3 Exploring the Data

3.1 Class Labels

Next, we will familiarize ourselves with the data.

# Checking the balance of prediction labels
plt.figure(figsize=(16,2))
fig = sns.countplot(y="Revenue", data=df_shop, color='b')
plt.show()

Our class labels are somewhat imbalanced: there are many more cases with the label “False,” simply because most visitors do not buy anything. Imbalanced data can affect the performance of classification models. Now that we are aware of the imbalance in our data, we can later choose appropriate evaluation metrics.
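To put a number on this imbalance, we can count the labels directly. This is a quick optional check on the df_shop DataFrame from the cleaning step; the exact counts depend on your copy of the data.

# Count the class labels to quantify the imbalance
label_counts = df_shop['Revenue'].value_counts()
label_shares = df_shop['Revenue'].value_counts(normalize=True).round(3)
print(pd.concat([label_counts, label_shares], axis=1, keys=['count', 'share']))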

3.2 Feature Correlation

When developing classification models, not all features are equally useful. In particular, when features are strongly correlated with each other, it can help to limit the number of features. We will therefore inspect the value distributions and the pairwise correlations of the features.

First, we will create a series of box-and-whisker plots for the features in our dataset. They help us identify potential outliers and get a better idea of how the data is distributed.

# Box-and-whisker plots for the input variables
df_shop.drop('Revenue', axis=1).plot(kind='box', 
                                subplots=True, layout=(4,4), 
                                sharex=False, sharey=False, 
                                figsize=(14,14), 
                                title='Whisker plots for input variables')
plt.show()
Whisker plots of the input features

The whisker plots show that there are a couple of outliers in the data. However, they are not severe enough to worry about.

Histograms are another way of visualizing the distribution of numerical or categorical variables. They give a rough sense of the density of the distribution. To create the histograms, run the code below.

# Create histograms for the feature columns, separated by prediction label value
class_columnname = 'Revenue'

df_plot = df_shop.copy().drop(['Weekend'], axis=1)

# Determine the grid size needed to fit one histogram per feature
list_length = df_plot.shape[1]
ncols = 4
nrows = int(round(list_length / ncols, 0))
if ncols * nrows < list_length:
    nrows += 1

fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=False, figsize=(15, 12))
fig.subplots_adjust(hspace=0.5, wspace=0.5)
for i in range(0, list_length):
    featurename = df_plot.columns[i]
    ax = plt.subplot(nrows, ncols, i+1)
    # Sessions that ended with a purchase (Revenue == True)
    y0 = df_plot[df_plot[class_columnname]==True][featurename]
    ax.hist(y0, color='red', alpha=0.8, label=featurename + f'-{class_columnname}', bins='auto', edgecolor='w')
    # Sessions without a purchase (Revenue == False)
    y1 = df_plot[df_plot[class_columnname]==False][featurename]
    ax.hist(y1, color='blue', label=featurename + f'-No{class_columnname}', bins='auto', edgecolor='w')
    ax.set_title(featurename)
    ax.tick_params(axis="x", rotation=30, labelsize=10, length=0)
    plt.grid()
fig.tight_layout()
plt.show()

Finally, we create a correlation matrix and visualize it as a heat map. The matrix provides a quick overview of which features are correlated and which are not.

# Feature correlation
plt.figure(figsize=(15,4))
f_cor = df_shop.corr()
sns.heatmap(f_cor, cmap="Blues_r")

The correlation matrix shows that some of the features are strongly correlated with each other, in particular:

  • ProductRelated and ProductRelated_Duration
  • BounceRates and ExitRates

# Scatterplots of the two highly correlated feature pairs
plt.figure(figsize=(8,5))
sns.scatterplot(x='BounceRates', y='ExitRates', data=df_shop, hue='Revenue')
plt.title('Bounce Rate vs. Exit Rate', fontweight='bold', fontsize=15)
plt.show()

plt.figure(figsize=(8,5))
sns.scatterplot(x='ProductRelated', y='ProductRelated_Duration', data=df_shop, hue='Revenue')
plt.title('Product Related vs. Product Related Duration', fontweight='bold', fontsize=15)
plt.show()

When we train our model, we will use only one feature from each of these two pairs.

Step #4 Data Preprocessing

Now that we are familiar with the data, we can begin to prepare it for training the purchase intention classification model. First, we select only a subset of the features from the original shopping dataset. Second, we split the data into separate train and test sets with a 70/30 ratio: X_train and X_test contain the features, while y_train and y_test contain the respective prediction labels. Third, we use the MinMaxScaler to scale the numeric features to a range between 0 and 1, which puts all features on a comparable scale and can improve classification performance.

# Separate labels from training data
features = ['Administrative', 'Administrative_Duration', 'Informational', 
            'Informational_Duration', 'ProductRelated', 'BounceRates', 'PageValues', 
            'Month', 'Region', 'TrafficType', 'VisitorType']
X = df_shop[features] #Training data
y = df_shop['Revenue'] #Prediction label

# Split the data into train and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Scale the numeric values
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step #5 Train a Purchase Intention Classifier

Next, it is time to train our prediction model. Various classification algorithms could be used to solve this problem, for example, decision trees, random forests, neural networks, or support-vector machines. We will be using the logistic regression algorithm, which is a common choice for simple two-class prediction problems.
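Because all scikit-learn classifiers share the same fit/predict interface, it is easy to benchmark logistic regression against one of the alternatives mentioned above. The sketch below swaps in a RandomForestClassifier purely as an optional comparison; the hyperparameters are illustrative assumptions, and this model is not used in the remaining steps.

# Optional: benchmark an alternative classifier with the same interface
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)  # illustrative settings
rf.fit(X_train, y_train)
print('Random forest test accuracy: {:.2f}'.format(rf.score(X_test, y_test)))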

We start the training process using the “fit” method of the logistic regression algorithm.

# Train a classification model using logistic regression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
# Decision scores: signed distance of each test sample from the decision boundary
score = logreg.decision_function(X_test)

Rather than a single performance number, decision_function returns a decision score for each test sample: the further a score lies above zero, the more confidently the model predicts a purchase.
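Because the decision scores rank the test sessions by how strongly the model leans toward a purchase, we can also feed them into the roc_auc_score function imported earlier. This is an optional, threshold-independent check; the exact value depends on the train/test split.

# Optional: threshold-independent evaluation using the decision scores
auc_value = roc_auc_score(y_test, score)
print('ROC AUC: {:.2f}'.format(auc_value))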

Step #6 Evaluate Model Performance

Finally, we will evaluate the performance of our classification model. For this purpose, we first create a confusion matrix. Then we calculate and compare different error metrics.

6.1 Confusion Matrix

The confusion matrix is a compact and intuitive way to illustrate the results of a classification model. It contrasts predicted labels with actual labels. For a binary classification model, the matrix has four quadrants (2×2), each showing the number of cases that fall into the respective combination of predicted and actual label.

# Create a confusion matrix from the test set predictions
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix as a heatmap
%matplotlib inline
class_names = [False, True]  # names of the classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
Confusion matrix of the classification model that predicts purchase intentions

In the upper left (0,0), we see that 3102 online shopping sessions were correctly predicted not to lead to a purchase (True Negatives). In 30 cases, the model predicted a purchase where there was none (False Positives). For 412 buyers, the model predicted that they would not buy anything, even though they actually made a purchase (False Negatives). In the lower right corner, we see that only 151 buyers were correctly identified as such (True Positives).

6.2 Performance Metrics for Classification Models

Next, let’s take a brief look at the performance metrics. Four standard metrics that measure the performance of classification models are Accuracy, Precision, Recall, and f1_score.

print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))
Accuracy

The accuracy on the test set shows that 88% of the online shopping sessions were correctly classified. However, our data is imbalanced: most labels have the value “False,” and only a few are “True.” Consequently, we must ensure that our model does not simply classify all online shoppers as “non-buyers” (label: False) but also correctly identifies the buyers (label: True).
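To see why plain accuracy can be misleading here, it helps to compare it against a naive baseline that always predicts the majority class “False.” The snippet below is a quick sanity check using the y_test labels from our split; the exact baseline value depends on the split.

# Baseline: accuracy of always predicting 'no purchase' (the majority class)
baseline_accuracy = (y_test == False).mean()
print('Majority-class baseline accuracy: {:.2f}'.format(baseline_accuracy))

If this baseline is not far below the model’s 88%, accuracy alone tells us little about how well buyers are identified.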

Precision

Precision is calculated as the number of True Positives divided by the sum of True Positives and False Positives. The precision score of our model (83%) is only a little lower than the accuracy. However, precision only looks at the sessions the model predicted as purchases; it does not penalize the many buyers the model misses (False Negatives). Therefore, it does not tell us much about our model on its own.

Recall

We calculate the recall by dividing the number of True Positives by the sum of True Positives and False Negatives. The recall of our model is 27%, which is significantly below accuracy and precision. In our case, recall is more meaningful than accuracy and precision because it directly penalizes the large number of buyers that the model fails to identify.

F1-Score

The formula for the F1-score is 2*((precision*recall)/(precision+recall)). Because the low recall enters the formula, the F1-score of our model is only 41%. If we want to optimize our classification model further, we should therefore keep an eye on both the F1-score and recall.
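To make the relationship between these metrics concrete, we can recompute them by hand from the confusion-matrix counts reported above (3102, 30, 412, 151). The printed values are approximate because the counts are specific to our split.

# Recompute the metrics from the confusion-matrix counts reported above
tn, fp, fn, tp = 3102, 30, 412, 151
accuracy = (tp + tn) / (tp + tn + fp + fn)            # ~0.88
precision = tp / (tp + fp)                            # ~0.83
recall = tp / (tp + fn)                               # ~0.27
f1 = 2 * (precision * recall) / (precision + recall)  # ~0.41
print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}'.format(accuracy, precision, recall, f1))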

6.3 Interpretation

Metrics for classification models can be misleading. We should thus choose them carefully. Depending on which use case we are dealing with, False-negative and False-positive predictions can have different costs. Therefore, model evaluation is not always about exactness (precision and accuracy). Instead, the choice of performance metrics depends on what we want to achieve.

In classifying online shoppers, the challenge for our model is to correctly identify the smaller group of buyers (True Positives). Optimizing our model is therefore about maintaining good accuracy without lowering the F1-score and recall too much.
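One simple lever for shifting this balance is the class_weight parameter of LogisticRegression. Setting it to 'balanced' weights the rare “True” class more heavily during training, which typically raises recall at the cost of some precision. The sketch below is an optional experiment, not part of the evaluated model; results will vary with the split.

# Optional experiment: reweight the rare class to trade some precision for recall
logreg_balanced = LogisticRegression(solver='lbfgs', class_weight='balanced')
logreg_balanced.fit(X_train, y_train)
y_pred_balanced = logreg_balanced.predict(X_test)
print('Recall:   {:.2f}'.format(recall_score(y_test, y_pred_balanced)))
print('F1-score: {:.2f}'.format(f1_score(y_test, y_pred_balanced)))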

Step #7 Insights on Customer Purchase Intentions

Finally, we will use permutation feature importance to gain additional insights into our prediction model’s features. Permutation feature importance is a technique that measures the influence of features on the predictions of our model. Features with a high score have a substantial impact on the model’s predictions, while features with scores close to zero play a lesser role.

# Compute permutation feature importance on the test set
r = permutation_importance(logreg, X_test, y_test, n_repeats=30, random_state=0)

# Plot the importance scores as a bar chart
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permutation_score'])
data_im['feature_names'] = X.columns
data_im = data_im.sort_values('feature_permutation_score', ascending=False)

fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y='feature_names', x='feature_permutation_score', data=data_im, palette='nipy_spectral')
ax.set_title("Logistic Regression Feature Importances")
plt.show()
Results of the permutation feature importance technique for the purchase intention model

We can see that the three features with the highest impact are PageValues, BounceRates, and Administrative_Duration. Permutation importance only measures the strength of an effect, not its direction; the coefficient check sketched after the list below is one way to verify the directions.

  • The higher the value of the page, the higher the chance that the customer makes a purchase.
  • The higher the average bounce rate of the pages the customer visits, the higher the chance that the customer makes a purchase.
  • In contrast, the more time a customer spends on administrative pages, the lower the chance that the customer completes the purchase.
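Because logistic regression is a linear model, we can sketch such a direction check by inspecting the fitted coefficients: on the scaled features, positive coefficients push the prediction toward “Revenue = True.” This assumes the features list and the trained logreg model from the previous steps.

# Check the direction of each feature's effect via the logistic regression coefficients
coef_df = pd.DataFrame({'feature': features, 'coefficient': logreg.coef_[0]})
print(coef_df.sort_values('coefficient', ascending=False))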

These were just a few sample findings. There is much more to explore in the data, and deeper analysis can uncover much more about the customers’ buying decisions.

Summary

In this article, we have developed a classification model that predicts the purchase intentions of online shoppers. You have learned to preprocess the data, train a logistic regression model and evaluate the model’s performance. Classifying purchase intentions can help online shops understand their customers better and automate certain online marketing activities. The previous section showed how marketers could use this to gain further insights into their customers’ behavior.

I hope this article was helpful. If you have any remarks or questions, feel free to write them in the comments.
