Will They Buy or Just Browse? Predicting Purchase Intentions of Online Shoppers with Python


Many online stores welcome countless visitors every day, but only a fraction of those visitors actually make a purchase. Wouldn’t it be interesting to know in advance which visitors will buy something and which won’t? This Python tutorial describes how to do exactly that with machine learning. You will learn to train and test an intention classifier and use it to generate predictions.

Predicting purchase intentions allows marketing professionals to draw far-reaching conclusions about customer behavior and to understand the circumstances under which certain types of customers make their purchase decisions. In this way, intention prediction can help online shops target customers with the right products at the right time and thereby take a step toward marketing automation.

Predicting purchase intentions can be an important step toward marketing automation
A classification model that predicts the buying intention of online shoppers

In the following, this Python tutorial guides you through the process of building a classification model for predicting purchase intentions using Python and Scikit-Learn. We will assume a two-class prediction problem in which the goal is to predict the labels “buys” and “does not buy” for a group of visitors. Such two-class problems can be approached with different machine learning algorithms; in this blog post, we will use Logistic Regression.
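To illustrate the idea before we turn to the real data, here is a minimal sketch of two-class classification with logistic regression on synthetic data, where the labels 1 and 0 stand in for “buys” and “does not buy”:

# Minimal two-class example on synthetic data
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))  # predicted labels: 1 ("buys") or 0 ("does not buy")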

Implementing a Prediction Model for Purchase Intentions with Python

In the following, we will develop an intention classification model using Python and the machine learning library Scikit-Learn. As usual, you will find the entire code sample in the relataly repository on GitHub.

Prerequisites

Before we start with the coding part, make sure that you have set up your Python 3 environment and the required packages. If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment.

Also make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, and Matplotlib.

In addition, we will be using the machine learning library scikit-learn and seaborn for visualization.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)

Download the Data

The first step is to download the data. In this tutorial, we will work with a public dataset from Kaggle. You can download the data via the link below:

Copy the CSV file into the following path, relative to the folder that contains your Python notebook: data/classification-online-shopping/

1 Load the Data

The dataset consists of feature vectors belonging to 12,330 sessions from a public online shop. Each session in the dataset belongs to a different user in a one-year period, which ensures that the data has no tendency toward a specific period, user, or day. The dataset consists of 10 numerical and 8 categorical attributes. The ‘Revenue’ attribute will be used as the class label, also called the “prediction label”.

In addition, the dataset contains the following features (information taken from the dataset site on kaggle.com):

  • “Administrative”, “Administrative Duration”, “Informational”, “Informational Duration”, “Product Related” and “Product Related Duration” represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories. 
  • The “Bounce Rate”, “Exit Rate” and “Page Value” features represent the metrics measured by “Google Analytics” for each page in the e-commerce site.
  • The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine’s Day)
  • The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.

Run the following code to load the CSV file into a DataFrame named “dfshopping”.

# Setting up packages for data manipulation and machine learning
import calendar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Load train data
# label: Revenue
filepath = "data/classification-online-shopping/online_shoppers_intention.csv"
dfshopping = pd.read_csv(filepath) 

2 Exploring the Data

Next, we’ll explore the dataset and familiarize ourselves with the features. Usually, not all features help in the development of a classification model, so it is often useful to make a preselection. The distribution of values in the individual features can indicate which features correlate with each other. Furthermore, outliers can influence the training process. Getting familiar with the features is therefore an important step in optimizing machine learning models.

# Exploring the data
print(dfshopping.shape)
dfshopping.head(5)
# Checking the balance of labels
dfshopping['Revenue'].value_counts()

As we can see, there are many more cases in the data with the label “False”. This is plausible because, as mentioned, most visitors won’t actually buy anything. Imbalanced data can lead to a misinterpretation of the performance of the classification model that we are about to build. But now that we are aware that we are working with imbalanced data, we can later choose appropriate evaluation metrics.
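Since the exploration paragraph above mentioned feature correlations, here is an optional sketch that visualizes the pairwise correlations of the numeric columns with a seaborn heatmap:

# A quick look at pairwise correlations between the numeric features
plt.figure(figsize=(12, 10))
sns.heatmap(dfshopping.select_dtypes('number').corr(), cmap='coolwarm')
plt.title('Correlation matrix of numeric input features')
plt.show()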

We proceed by making some value and type conversions.

# Replacing visitor_type to int
print(dfshopping['VisitorType'].unique())
dfshopping = dfshopping.replace({'VisitorType' : { 'New_Visitor' : 0, 'Returning_Visitor' : 1, 'Other' : 2 }})

# Converting the month column to numeric values
df = dfshopping.copy()
monthlist = dfshopping['Month'].replace('June', 'Jun')
mlist = []
m = np.array(monthlist)
for mi in m:
    a = list(calendar.month_abbr).index(mi)
    mlist.append(a)
df['Month'] =  mlist
df
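As a side note, the same conversion can be written more compactly with a dictionary lookup built from calendar.month_abbr; this sketch produces the same result as the loop above:

# Alternative: map month abbreviations directly to month numbers
month_map = {abbr: i for i, abbr in enumerate(calendar.month_abbr) if abbr}
df['Month'] = dfshopping['Month'].replace('June', 'Jun').map(month_map)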

In an additional step, we will remove all records with missing (NA) values.

# Count missing values per column
print(df.isnull().sum())
# Delete records with NAs
df.dropna(inplace=True)

Next, we will create whisker plots for all features in the dataset.

# Whisker plots
df.drop('Revenue', axis=1).plot(kind='box', subplots=True, layout=(4,4), sharex=False, sharey=False,
                                figsize=(14,14), title='Whisker plots for input variables')
plt.savefig('shopping_box')
plt.show()
Feature boxplots

Next, we take a look at the histograms. Histograms are a way to visualize the distribution of numerical or categorical variables in a dataset. They are useful when familiarizing yourself with the data, as they give a rough sense of the density of each distribution. To create the histograms, run the code below.

# Histograms
df.drop(['Revenue', 'Weekend'], axis=1).hist(bins=30, figsize=(14, 14), color='blue')
plt.suptitle("Histogram for each numeric input variable", fontsize=10)
plt.savefig('shopping_hist')
plt.show()
Feature histograms

3 Data Preprocessing

Once we are familiar with the dataset, we can begin to prepare the data and train a classification model. First, we need to split the data into two separate datasets: train and test. We will use a split ratio of 70/30. The datasets y_train and y_test contain the respective prediction labels. Second, we will use the MinMaxScaler to scale the numeric features to a range between 0 and 1.

# Separate labels from training data
features = ['Administrative', 'Administrative_Duration', 'Informational', 
            'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 
            'BounceRates', 'ExitRates', 'PageValues', 
            'Month', 'Region', 'TrafficType', 'VisitorType']
X = df[features] #Training data
y = df['Revenue'] #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

4 Train a Purchase Intention Classifier

Now that we have prepared the data, we can train the purchase intention prediction model. We will be using a logistic regression model, but we could also use other algorithms such as random decision forests or a simple decision tree. We start the training process by calling the “fit” method of the logistic regression algorithm.

# Training a classification model using logistic regression 
logreg = LogisticRegression(solver='lbfgs')
score = logreg.fit(X_train, y_train).decision_function(X_test)

The decision_function method returns a decision score for each sample in the test set: the signed distance of the sample to the decision boundary. These raw scores are not class predictions themselves, but they will come in handy later for threshold-based evaluation.
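If you prefer class probabilities over raw decision scores, logistic regression also offers the predict_proba method; a short sketch:

# predict_proba returns one column per class; the second column holds the
# estimated probability that a session ends with a purchase (label True)
probabilities = logreg.predict_proba(X_test)
print(probabilities[:5, 1])  # purchase probabilities for the first five test sessions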

5 Evaluate Model Performance

In the following, we will evaluate the performance of our intention classification model. The evaluation of model performance is an essential step in model development. The metrics tell us how good our model is at predicting purchase intentions. As we will see, the distribution of classes in the data plays a major role in the evaluation of our model. In particular, the question is whether the model is able to correctly identify the small group of buyers among the large number of visitors.

To evaluate the performance of the classification model, we first create a confusion matrix. Then we calculate and compare different error metrics.

5.1 Confusion Matrix

The Confusion Matrix is a holistic and clear way to illustrate the results of a classification model. It differentiates between predicted labels and actual labels. For a binary classification model, the matrix comprises 2×2 quadrants, and it shows the number of cases in each quadrant.

# create a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

# create heatmap
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Confusion matrix

5.2 Interpretation

Once we have created the matrix, we can interpret its results. In the upper left quadrant (0,0), we see that 3102 visitors were correctly predicted to buy nothing (True Negatives). In 30 cases, the model wrongly predicted that visitors would buy, but they did not (False Positives). For 412 buyers, the model predicted that they would not buy anything, even though they actually made a purchase (False Negatives). In the lower right corner, we see that only 151 buyers were correctly identified as such (True Positives).
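For reference, the four quadrant counts can also be unpacked programmatically; scikit-learn arranges the binary confusion matrix as [[TN, FP], [FN, TP]]:

# Unpack the binary confusion matrix into its four quadrants
tn, fp, fn, tp = cnf_matrix.ravel()
print(f'TN: {tn}, FP: {fp}, FN: {fn}, TP: {tp}')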

5.3 Performance Metrics for Classification Models

Four common metrics that measure the performance of classification models are Accuracy, Precision, Recall, and F1-Score. We will calculate all four metrics using the code below:

from sklearn.metrics import precision_score, accuracy_score, f1_score, recall_score 
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))

Accuracy

The model accuracy on the test set is 88%, which means that 88% of the predictions were correct. That doesn’t sound bad, does it? But is this sufficient to say whether our model performs well or poorly? The answer is no; measuring accuracy alone is not sufficient. The reason is that our data is imbalanced: most labels have the value “False”, and only a few labels are “True”. Consequently, we must ensure that our model does not simply classify all online shoppers as non-buyers (label: False), but also correctly predicts the buyers (label: True). For this reason, we will take a more detailed look at the other metrics.
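To see how little the accuracy figure tells us here, consider a naive baseline that always predicts “does not buy”; a short sketch of this comparison:

# A majority-class baseline: always predict "does not buy" (label False)
baseline_accuracy = (y_test == False).mean()
print('Baseline accuracy: {:.2f}'.format(baseline_accuracy))

Because roughly 85% of the test sessions end without a purchase (3132 of 3695, judging from the confusion matrix), this trivial baseline already comes close to our model’s 88% accuracy.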

Precision

Precision is calculated as the number of True Positives divided by the sum of True Positives and False Positives. The precision score of our model (0.83) is just a little lower than the accuracy. Similar to accuracy, precision is not very meaningful for our model, because it ignores the large number of False Negatives, i.e., the buyers our model misses.

Recall

Recall is calculated as the number of True Positives divided by the sum of True Positives and False Negatives. The recall of our model is 27%, which is significantly below accuracy and precision. In our case, recall is more meaningful than accuracy and precision, because it penalizes the model for the many buyers it misses (False Negatives).

F1-Score

The formula for the F1-Score is 2*((precision*recall)/(precision+recall)). Because recall enters the formula, the F1-Score of our model is only 41%. Imagine we want to further optimize our classification model; in this case, both F1-Score and Recall are the metrics we should watch.
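Using the tn, fp, fn, and tp values unpacked from the confusion matrix above, we can verify all three metrics by hand:

# Recomputing the metrics directly from the confusion matrix counts
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print('Precision: {:.2f}, Recall: {:.2f}, F1-Score: {:.2f}'.format(precision, recall, f1))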

The evaluation of our metrics shows that error metrics for classification models can be misleading and that False Negative and False Positive predictions may involve different costs. Therefore, it is important not only to evaluate a model on exactness (precision and accuracy), but also to ensure that its predictions are balanced (F1-Score and Recall).
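Finally, the decision scores we computed during training, together with the roc_curve and auc functions imported at the beginning, let us plot a ROC curve, which shows the trade-off between the True Positive Rate and the False Positive Rate across all decision thresholds; a sketch:

# Plot the ROC curve from the decision scores of the test set
fpr, tpr, thresholds = roc_curve(y_test, score)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, label='ROC curve (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of the intention classifier')
plt.legend()
plt.show()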

Summary

This blog post has showcased the development of an intention classification model that predicts the purchase intentions of online shoppers. You have learned to preprocess the data, train a logistic regression model that predicts purchase intentions, and evaluate the prediction performance of this model.

I hope you found this post useful. Please leave a comment if you have any remarks or questions.

Author

  • Hi, I am a Zurich-based Data Scientist with a passion for Machine Learning and Investing. After completing my Ph.D. in Business Informatics at the University of Bremen, I started working as a Machine Learning Consultant for the Swiss consulting firm ipt. When I'm not working on use cases for our clients, I work on my own analytics projects and report on them in this blog.
