Will They Buy or Just Browse? Predicting Purchase Intentions of Online Shoppers with Python

Many online shops welcome countless visitors each day, but only a fraction of these visitors actually make a purchase; the rest just browse the site without completing a transaction. This blog post shows how machine learning can be used to predict whether or not visitors of an online shop will actually make a purchase. Predicting purchase intentions is a relevant use case in online marketing and sales: the differences and similarities between buyers and non-buyers can reveal patterns in customer buying behavior and thus contribute to a better understanding of the buying decision. Online shops can use this knowledge to optimize marketing activities and the customer experience in order to achieve higher conversion rates.

Predicting Purchase Intentions
A classification model that predicts the buying intentions of online shoppers

This blog post guides you through the process of building a classification model with Python and scikit-learn that can predict the purchase intentions of online shoppers. We treat this as a two-class prediction problem in which the goal is to assign the labels “buys” and “does not buy” to a group of visitors. Such two-class problems can be approached with different machine learning algorithms; in this blog post, we will use logistic regression. Following common practice, the two categories “buys” and “does not buy” are represented as boolean values, so each visitor is labelled with either True or False in the training data.
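This mapping from category names to boolean values is a one-liner in pandas. As a minimal sketch with a hypothetical label column (the dataset used below already stores its label as a boolean), it could look like this:

import pandas as pd

# Hypothetical example: convert string labels into boolean values
labels = pd.Series(['buys', 'does not buy', 'buys'])
labels_bool = labels.map({'buys': True, 'does not buy': False})
print(labels_bool)  # True, False, True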

Overview

The example covers the following steps:

  • 1) Download the data
  • 2) Load the data into Python
  • 3) Explore the data
  • 4) Train a logistic regression model that classifies online shoppers' buying intentions
  • 5) Evaluate model performance

Python Environment

This tutorial assumes that you have set up your Python environment. I recommend using the Anaconda environment. If you have not yet set it up, you can follow this tutorial.

It is also assumed that you have the following packages installed: numpy, pandas, matplotlib, seaborn, and scikit-learn. The packages can be installed using the console command:

pip install <packagename>
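For example, the following single command installs everything used in this tutorial (assuming a standard pip setup):

pip install numpy pandas matplotlib seaborn scikit-learn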

1) Download the data

The first step is to download the data. In this tutorial we will work with a public dataset from Kaggle. You can download the data via the link below:

Copy the csv file into the following path, starting from the folder that contains your Python notebook: data/classification-online-shopping/
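If this folder does not exist on your machine yet, you can create it directly from Python (a small convenience helper, not part of the original tutorial):

import os

# Create the target folder for the csv file if it does not exist yet
os.makedirs("data/classification-online-shopping", exist_ok=True)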

2) Load the data into Python

The dataset consists of feature vectors belonging to 12,330 sessions from a public online shop. Each session in the dataset belongs to a different user within a one-year period, which ensures that the data shows no tendency towards a specific period, user, or day. The dataset comprises 10 numerical and 8 categorical attributes. The ‘Revenue’ attribute will be used as the class label, also called the prediction label.

In addition, the dataset contains the following features (information taken from the dataset site on kaggle.com):

  • “Administrative”, “Administrative Duration”, “Informational”, “Informational Duration”, “Product Related” and “Product Related Duration” represent the number of different types of pages visited by the visitor in that session and the total time spent in each of these page categories.
  • The “Bounce Rate”, “Exit Rate” and “Page Value” features represent the metrics measured by “Google Analytics” for each page in the e-commerce site.
  • The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine’s Day).
  • The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.

Run the following code to load the csv file into a dataframe named “dfshopping”.

# Setting up packages for data manipulation and machine learning
import calendar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, roc_auc_score

# Load train data
# label: Revenue
filepath = "data/classification-online-shopping/online_shoppers_intention.csv"
dfshopping = pd.read_csv(filepath) 

3) Exploring the data

Next, we’ll explore the dataset and familiarize ourselves with the features. Usually not all features help in the development of classification models. It is often helpful to make a preselection. The distribution of the values in the individual features can give an indication of which features correlate with each other. Furthermore, outliers can influence the training process. Getting familiar with the features is therefore an important step to optimize machine learning models.

# Exploring the data
print(dfshopping.shape)
dfshopping.head(5)
# Checking the balance of labels
dfshopping['Revenue'].value_counts()

As we can see, there are many more cases in the data with the label False. This is plausible: as mentioned, most visitors do not actually buy anything. Such imbalanced data can lead to a misinterpretation of the performance of the classification model that we are about to build. But now that we are aware that we are working with imbalanced data, we can later choose appropriate evaluation metrics.
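If you prefer relative numbers over absolute counts, the class balance can also be printed as proportions (a small optional addition to the exploration step):

# Share of non-buyers (False) and buyers (True) in percent
print(dfshopping['Revenue'].value_counts(normalize=True) * 100)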

We proceed by making some value and type conversions.

# Replacing VisitorType categories with integer codes
print(dfshopping['VisitorType'].unique())
dfshopping = dfshopping.replace({'VisitorType' : { 'New_Visitor' : 0, 'Returning_Visitor' : 1, 'Other' : 2 }})

# Converting the month column to numeric values
df = dfshopping.copy()
monthlist = dfshopping['Month'].replace('June', 'Jun')
mlist = []
m = np.array(monthlist)
for mi in m:
    a = list(calendar.month_abbr).index(mi)
    mlist.append(a)
df['Month'] =  mlist
df
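The loop above works fine, but the same conversion can also be written as a single vectorized mapping, which is usually considered more idiomatic pandas. A sketch of this alternative, producing the same result as the loop:

# Alternative: build a month-name -> month-number dictionary once and map it in one step
month_map = {name: number for number, name in enumerate(calendar.month_abbr) if name}
df['Month'] = dfshopping['Month'].replace('June', 'Jun').map(month_map)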

In an additional step, we will remove all records with ‘na’ values.

# Check for missing values
print(df.isnull().sum())

# Delete records with NAs
df.dropna(inplace=True)

Running the following code creates whisker plots (box plots) for the numeric input variables in the dataset.

# Whisker plots (box plots) for the numeric input variables
df.drop(['Revenue', 'Weekend'], axis=1).plot(kind='box', subplots=True, layout=(4,4), sharex=False, sharey=False, figsize=(14,14), 
                                        title='Whisker plots for input variables')
plt.savefig('shopping_box')
plt.show()
Feature boxplots

Next, we take a look at the histograms. Histograms are a way to visualize the distribution of numerical or categorical variables in a dataset. They are useful when familiarizing yourself with the data, as they give a rough sense of the shape of each distribution. To create the histograms, run the code below.

# Histograms
df.drop(['Revenue', 'Weekend'], axis=1).hist(bins=30, figsize=(14, 14), color='blue')
plt.suptitle("Histogram for each numeric input variable", fontsize=10)
plt.savefig('shopping_hist')
plt.show()
Feature histograms

4) Training a logistic regression model that classifies online shoppers' buying intentions

Now that we are familiar with the dataset, we can begin to prepare the data and train a classification model. First, we need to split the data into separate train and test datasets. The training set (X_train) will contain 70% of the data and the test set (X_test) the remaining 30%. The datasets y_train and y_test contain the respective prediction labels.

After splitting the data, we will use the MinMaxScaler to scale the numeric features to a range between 0 and 1, so that features with large value ranges do not dominate the model. Note that the scaler is fitted on the training data only and then applied to the test data.

# Separate labels from training data
features = ['Administrative', 'Administrative_Duration', 'Informational', 
            'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 
            'BounceRates', 'ExitRates', 'PageValues', 
            'Month', 'Region', 'TrafficType', 'VisitorType']
X = df[features] #Training data
y = df['Revenue'] #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Now that we have prepared the data, we can train the logistic regression model. This can be done with the “fit” method of the LogisticRegression model.

# Training a classification model using logistic regression 
logreg = LogisticRegression(solver='lbfgs')
score = logreg.fit(X_train, y_train).decision_function(X_test)

The decision_function call returns a decision score for every sample in the test set, i.e. the signed distance of each sample from the decision boundary. These scores indicate how confident the model is in its predictions.
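Because roc_auc_score is already imported above, these decision scores can be used directly to compute the area under the ROC curve, a metric that is fairly robust to class imbalance. This is an optional check on top of the original walkthrough:

# Compute the ROC AUC from the decision scores of the test set
print('ROC AUC: {:.2f}'.format(roc_auc_score(y_test, score)))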

5) Evaluating model performance

The evaluation of model performance is an essential step in model development. The metrics tell us how good our model is at predicting the purchase intentions of online shoppers. As we will see, the distribution of classes in the data plays a major role in the evaluation of our model. In particular, the question is whether the model is able to correctly classify the smaller group of buyers from the large number of visitors.

To evaluate the performance of the classification model, we first create a confusion matrix. Then we calculate and compare different error metrics.

Confusion Matrix

A clear and compact way to illustrate the results of a classification model is the confusion matrix. It differentiates between predicted labels and actual labels. For a binary classification model, the matrix has 2×2 = 4 quadrants, and each quadrant shows the number of cases that fall into it. The code below creates the confusion matrix for our prediction model and converts it into a heat map.

# create a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

# create heatmap
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Confusion matrix

Now let’s go through the matrix. In the upper left quadrant (0,0) we see that 3,102 visitors were correctly predicted to buy nothing (True Negatives). In 30 cases the model predicted a purchase although the visitor did not buy anything (False Positives). For 412 visitors, the model predicted that they would not buy anything even though they actually made a purchase (False Negatives). In the lower right quadrant we see that only 151 buyers were correctly identified as such (True Positives).
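If you want to read these four numbers programmatically instead of from the heat map, the matrix can be unpacked into its quadrants (for binary labels, scikit-learn orders them as TN, FP, FN, TP):

# Unpack the confusion matrix into its four quadrants
tn, fp, fn, tp = cnf_matrix.ravel()
print('TN:', tn, '| FP:', fp, '| FN:', fn, '| TP:', tp)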

Performance Metrics for Classification Models

Four common metrics that measure the performance of classification models are:

  • Accuracy
  • Precision
  • Recall
  • f1_score

Run the code below to calculate these four metrics.

from sklearn.metrics import precision_score, accuracy_score, f1_score,  recall_score 
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))

Accuracy

The model accuracy on the test set is 88%, which means that 88% of the predictions were correct. That doesn’t sound bad, does it? But is this sufficient to say whether our model performs well or poorly? The answer is no: measuring the accuracy alone is not enough. As we recall, we are working with imbalanced data, where most labels have the value False and only a few are True. Consequently, we must ensure that our model does not simply classify all online shoppers as non-buyers (label: False), but also correctly predicts the buyers (label: True). For this reason, we take a more detailed look at the confusion matrix.

Precision

Precision is calculated as the number of True Positives divided by the sum of True Positives and False Positives. The precision score for our model is just a little lower than the accuracy (0.83). However, precision alone is not very meaningful for our model either, because it ignores the many buyers that the model misses (the False Negatives).

Recall

Recall is calculated as the number of True Positives divided by the sum of True Positives and False Negatives. The recall of our model is only 27%, which is significantly below accuracy and precision. In our case, recall is more meaningful than accuracy and precision, because it penalizes the model for the large number of buyers it fails to identify (False Negatives).

F1-Score

The formula for the F1-Score is 2*((precision*recall)/(precision+recall)). Because the low recall enters this formula, the F1-Score of our model is only 41%.
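To double-check the reported values, all four metrics can also be recomputed by hand from the confusion matrix counts unpacked earlier (tn, fp, fn, tp):

# Recompute the metrics directly from the confusion matrix counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, f1_score: {:.2f}'.format(accuracy, precision, recall, f1))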

Imagine we want to further optimize our classification model. In this case, both F1-Score and Recall are the metrics we should look out for.
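A simple starting point for such an optimization, which scikit-learn supports out of the box, is to give the rare “buyer” class more weight during training. This is only a sketch, not a tuned solution, and the exact numbers will depend on your data split:

# Re-train the model with balanced class weights to emphasize the minority class
logreg_balanced = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000)
logreg_balanced.fit(X_train, y_train)
y_pred_balanced = logreg_balanced.predict(X_test)
print('Recall (balanced): {:.2f}'.format(recall_score(y_test, y_pred_balanced)))
print('f1_score (balanced): {:.2f}'.format(f1_score(y_test, y_pred_balanced)))

This typically raises recall at the cost of some precision, which is often an acceptable trade-off when missing a potential buyer is more costly than targeting a non-buyer.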

Summary

In this blog post you learned how to develop a classification model for predicting the purchase intentions of online shoppers. We trained and tested a logistic regression model and evaluated its performance using several classification metrics.

The example has shown that error metrics for classification models can be misleading and that False Negative and False Positive predictions may involve different costs. You should therefore remember not only to evaluate a model on exactness (precision and accuracy), but also to ensure that its predictions are balanced (F1-Score and Recall).

