Anyone About to Leave? Predicting Customer Churn of a Telecommunications Provider

Telecommunications service providers face considerable pressure to expand and retain their subscriber base. One of their biggest cost factors is customers cancelling their contracts. Innovative service providers have therefore learned to use machine learning to predict which of their customers are likely to cancel. Providers who understand which customers tend to churn can take appropriate countermeasures early on to retain them. The prerequisite is that the provider is able to identify churn candidates in its customer base. In this tutorial, I will show how to implement such a churn prediction model in Python using scikit-learn. Furthermore, I will use this example to introduce permutation feature importance, a useful technique that allows us to draw conclusions about the relationship between input variables and model predictions.

Predicted churn probabilities

What’s the Business Case?

The effort that a company has to make to persuade a new customer to sign a contract is many times higher than the costs incurred in retaining existing customers. According to industry experts, it is at least four times more expensive to win a new customer than to retain an existing customer. Providers that are able to identify churn candidates in advance and manage to retain them can significantly reduce costs.

A crucial point is whether the provider succeeds in getting the churn candidates to stay. Sometimes it may be enough to contact the churn candidate and inquire about customer satisfaction. In other cases, this may not be enough and the provider needs to increase the service value, for example, by offering free services or granting a discount. However, actions should be well thought out, as they can also have a negative effect. For example, if a customer hardly uses his contract at all, a call from the provider may make him aware of this again and increase the desire to cancel the contract. Here again, machine learning can help to assess cases individually and identify the optimal anti-churn action.

About Permutation Feature Importance

In this tutorial, we will apply a technique called permutation feature importance, which calculates the importance of the input variables of a prediction model. It does so by randomly shuffling the values of one input variable at a time and measuring how much the model's score drops as a result. The results from this technique can be as valuable as the predictions themselves, because they help us to better understand the business context. For example, let’s say we have trained a model that predicts which of our customers are likely to churn. Wouldn’t it be important to know why certain customers are more likely to churn than others? Permutation feature importance can help us to answer this question by providing us with a ranking of the input variables in our model by their usefulness to the model. The ranking can be used to validate assumptions about the business context and to form hypotheses about possible causal relations in the data, which then need to be verified separately.
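
The idea behind the technique can be sketched from scratch in a few lines. The following is only a minimal illustration on synthetic data (an assumption, not the churn dataset): it shuffles one feature column at a time and records the average drop in test accuracy. All names and parameter values in it are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the churn dataset). With shuffle=False,
# the 2 informative features come first and the 3 noise features last.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
baseline = model.score(X_test, y_test)  # accuracy on the untouched test data

rng = np.random.default_rng(0)
n_repeats = 10
importances = []
for col in range(X_test.shape[1]):
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        # Shuffle a single column, which breaks its link to the label
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - model.score(X_perm, y_test))
    importances.append(float(np.mean(drops)))

for col, imp in enumerate(importances):
    print(f"feature {col}: mean accuracy drop = {imp:.3f}")
```

Features whose shuffling barely changes the score contribute little to the model. In practice, scikit-learn's own permutation_importance function, which is used later in this tutorial, does the same job with less code.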

One of the biggest advantages of traditional prediction models, such as a decision tree, compared to neural networks is their interpretability. Neural networks are often described as black boxes, because it is very difficult to understand the relation between input variables and model predictions. In traditional models, on the other hand, we can calculate the importance of the features and use it to interpret the model and optimize its performance, for example by removing features that are not important. This is one of the reasons why it is a good idea to start with a simple model and move on to more complex models once you understand the data.

Python implementation

In the following, this tutorial shows how to implement a customer churn prediction model. We will train a decision forest (random forest) model on a dataset from Kaggle and optimize it using grid search. The data contains customer-level information for a telecom provider and a binary prediction label that indicates which customers canceled their contracts and which did not. Finally, we will calculate the feature importance to understand how the model works.

  1. Loading the Customer Churn Data
  2. Exploring the Data
  3. Preprocessing
  4. Fit an Optimized Decision Forest Model for Churn Prediction using Grid Search
  5. Best Model Performance Insights
  6. Calculating Feature Importance

Setting up the Environment

This tutorial assumes that you have set up your Python environment. I recommend using Anaconda. If you have not yet set it up, you can follow this tutorial. It is also assumed that you have the following packages installed: numpy, pandas, matplotlib, seaborn, and scikit-learn (imported as sklearn). The packages can be installed using the console command:

pip install <package name> 
conda install <package name> (if you are using the anaconda environment)

1) Loading the Customer Churn Data

We begin by loading a customer churn dataset from Kaggle. After you have completed the download, put the dataset under the filepath of your choice, but don’t forget to adjust the file path variable in the code. If you are working with the Kaggle Python environment, you can also directly save the dataset into your Kaggle project.

The dataset contains 3333 records and the following attributes.

  • Churn: 1 if customer cancelled service, 0 if not. This will be the prediction label.
  • AccountWeeks: number of weeks customer has had active account
  • ContractRenewal: 1 if customer recently renewed contract, 0 if not
  • DataPlan: 1 if customer has data plan, 0 if not
  • DataUsage: gigabytes of monthly data usage
  • CustServCalls: number of calls into customer service
  • DayMins: average daytime minutes per month
  • DayCalls: average number of daytime calls
  • MonthlyCharge: average monthly bill
  • OverageFee: largest overage fee in last 12 months
  • RoamMins: average number of roaming minutes

The following code will load the data from your local folder into your anaconda Python project:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Set the file path (adjust it to where you saved the dataset)
filepath = "data/Churn-prediction/"

# Load the dataset into a single DataFrame (it is split into train and test data later)
train_df = pd.read_csv(filepath + 'telecom_churn.csv')
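
Before exploring the data, it can help to confirm that the file loaded as expected. The following is a small sketch of such a check. The helper quick_summary and the tiny demo frame are hypothetical and only stand in for train_df.

```python
import pandas as pd

def quick_summary(df):
    """Return basic facts about a DataFrame: size, missing values, non-numeric columns."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }

# Tiny made-up frame standing in for train_df (values are illustrative only)
demo_df = pd.DataFrame({
    "Churn": [0, 1, 0],
    "AccountWeeks": [128, 107, 137],
    "DataUsage": [2.7, None, 0.0],
})

summary = quick_summary(demo_df)
print(summary)
```

Calling quick_summary(train_df) on the real data should report 3333 rows if the download is complete.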

2) Exploring the Data

Before we begin with the preprocessing, we will quickly explore the data. For this purpose, we will create histograms for the different attributes in our data.

# Create histograms for all columns, separated by prediction label value
nrows = 3
ncols = int(np.ceil(train_df.shape[1] / nrows))  # enough columns to fit every attribute
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=False, figsize=(16, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.3)
columns = train_df.columns
f = 0
for i in range(nrows):
    for j in range(ncols):
        if f <= train_df.shape[1]-1:
            assetname = columns[f]
            # Customers who stayed (Churn == 0) are drawn in blue
            y0 = train_df[train_df['Churn']==0][assetname]
            ax[i, j].hist(y0, color='blue', label=assetname + '-NoChurn', bins='auto')
            # Customers who cancelled (Churn == 1) are drawn in red
            y1 = train_df[train_df['Churn']==1][assetname]
            ax[i, j].hist(y1, color='red', alpha=0.8, label=assetname + '-Churn', bins='auto')
            f += 1
            ax[i, j].set_title(assetname)
            ax[i, j].grid()
Histograms of the churn prediction dataset separated by prediction label (red = churn, blue = no churn)

We can see that several attributes, for example OverageFee, DayMins, and DayCalls, roughly follow a normal distribution. However, the distribution of the prediction label is unbalanced. This is to be expected, because more customers remain with their contract (prediction label class = 0) than cancel it (prediction label class = 1).
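
The imbalance can also be quantified directly. The sketch below uses a short made-up label series in place of train_df['Churn'], so the numbers are purely illustrative. It also shows why accuracy alone can be misleading on unbalanced data: a model that always predicts the majority class already reaches the majority share.

```python
import pandas as pd

# Hypothetical labels standing in for train_df['Churn'] (0 = stays, 1 = churns)
churn = pd.Series([0, 0, 0, 0, 0, 0, 1, 0, 0, 1], name="Churn")

counts = churn.value_counts()                # absolute number of customers per class
shares = churn.value_counts(normalize=True)  # relative share per class
print(pd.DataFrame({"count": counts, "share": shares}))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any accuracy reported for a churn model is best read against this baseline.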

3) Preprocessing

The next step is to preprocess the data. For the sake of keeping this tutorial simple, I have reduced this part to a minimum. For example, apart from setting class_weight='balanced' on the classifier later on, I do not treat the unbalanced label classes, which in a real business context would certainly be appropriate to improve the model performance. The imbalanced data is also a reason why I chose a decision forest as the model type. Compared to other traditional models such as logistic regression, decision forests can handle unbalanced data relatively well, although performance will not be great.
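
The classifier in this tutorial is created with class_weight='balanced'. As a rough sketch of what that option does, the following reproduces scikit-learn's "balanced" weighting on a made-up label vector (an assumption for illustration): each class is weighted by n_samples / (n_classes * class_count), so the rare class counts for more.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 8 customers who stay (0) and 2 who churn (1)
y = np.array([0] * 8 + [1] * 2)
classes = np.unique(y)

# Weights as computed by scikit-learn for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights computed by hand
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
```

With these labels, the churn class receives four times the weight of the non-churn class, which compensates for it being four times rarer.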

The following code splits the data into training data (x_train) and test data (x_test) and creates the corresponding datasets that contain only the label class (y_train, y_test). The training share is 0.7, resulting in 2333 records in the training dataset and 1000 records in the test dataset.

# Select the feature columns (x) and the prediction label (y)
# Note: DayMins is not part of the feature list
feature_columns = ['AccountWeeks', 'ContractRenewal', 'DataPlan', 'DataUsage', 'CustServCalls', 'DayCalls', 'MonthlyCharge', 'OverageFee', 'RoamMins']
x_df = train_df[feature_columns].copy()
y_df = train_df['Churn'].copy()

# Split the data into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
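
With unbalanced labels, one option worth considering is a stratified split, which keeps the churn share the same in both parts. The tutorial code above does not do this. The following is only a sketch on synthetic data with the same number of rows as the dataset. The features, the labels, and the 15% churn share are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 3333 rows (features and labels are made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(3333, 3))
y = (rng.random(3333) < 0.15).astype(int)

# stratify=y keeps the class proportions equal in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y)

print(len(X_train), len(X_test))
print(f"churn share train: {y_train.mean():.3f}, test: {y_test.mean():.3f}")
```

The split sizes match the 2333 and 1000 records mentioned above, because train_size=0.7 is applied to 3333 rows in both cases.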

4) Fit an Optimized Decision Forest Model for Churn Prediction using Grid Search

Now comes the interesting part. We will train a decision forest that predicts customer churn. Actually, that is not quite true: we will evaluate 36 different decision forest configurations, each with 5-fold cross-validation, and then choose the best-performing model. The technique used in this process is called hyperparameter tuning (more specifically grid search), and I have recently published a separate article on this topic.
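
The number 36 follows from the size of the parameter grid used in this tutorial: 4 values for max_depth, 3 for n_estimators, and 3 for min_samples_split. As a quick sketch, scikit-learn's ParameterGrid can enumerate the combinations without training anything:

```python
from sklearn.model_selection import ParameterGrid

# Same parameter values as in the grid search of this tutorial
param_grid = {
    "max_depth": [2, 4, 8, 16],
    "n_estimators": [64, 128, 256],
    "min_samples_split": [5, 20, 30],
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 3 * 3 combinations
cv_folds = 5

print("candidate configurations:", n_candidates)
print("model fits during cross-validation:", n_candidates * cv_folds)
```

Since every configuration is fitted once per fold, the grid search trains considerably more forests than the number of configurations suggests, which is worth keeping in mind when extending the grid.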

The following code defines the parameters that the grid search will test (max_depth, n_estimators, and min_samples_split). Then the code runs the grid search and trains the decision forests. Finally, we print out the model ranking along with model parameters.

# Define parameters
max_depth = [2, 4, 8, 16]
n_estimators = [64, 128, 256]
min_samples_split = [5, 20, 30]

param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, min_samples_split=min_samples_split)

# Build the grid search; the grid supplies the hyperparameter values for each candidate model
dfrst = RandomForestClassifier(class_weight='balanced')
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv=5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df.sort_values(by=['rank_test_score'], ascending=True, inplace=True)
print(results_df[['rank_test_score', 'mean_test_score', 'params']])
Model ranking created with Grid Search

The best performing model is model number 29, which achieves a mean cross-validation accuracy of 92.7% and has been configured as follows:

  • max_depth = 16
  • min_samples_split = 5
  • n_estimators = 256

We will proceed with this model. So what does this model tell us?

Well, we can gain an overview of how our customers are distributed according to their predicted churn probability. Just use the following code:

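# Extract the best decision forest found by the grid search
best_clf = grid_results.best_estimator_

# Predict churn probabilities for the test data
y_pred_prob = best_clf.predict_proba(x_test)
churnproba = y_pred_prob[:, 1]  # second column = probability of class 1 (churn)

# Plot a histogram of the predicted churn probabilities
plt.hist(churnproba, color='blue', label='Churn', bins='auto')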

Customer Base According to their Churn Rate

No need to worry about the customers on the left (churn probability below 0.5). The customers who tend to churn have a predicted churn probability greater than 0.5 and are further to the right in the diagram.
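
To turn the probabilities into a list of customers to contact, one can apply the threshold and sort the remaining customers by probability. This is a minimal sketch with made-up probabilities standing in for churnproba. The threshold of 0.5 is the one mentioned above, and a lower value would flag more customers at the cost of more false alarms.

```python
import numpy as np

# Hypothetical churn probabilities standing in for y_pred_prob[:, 1]
churnproba = np.array([0.05, 0.62, 0.48, 0.91, 0.12, 0.55])
threshold = 0.5

# Positions of the customers whose churn probability exceeds the threshold
candidates = np.flatnonzero(churnproba > threshold)

# Sort the candidates so that the most likely churners come first
ranked = candidates[np.argsort(churnproba[candidates])[::-1]]

for idx in ranked:
    print(f"customer at position {idx}: churn probability {churnproba[idx]:.2f}")
```

The positions refer to rows of the test data, so with a DataFrame they can be passed to x_test.iloc to look up the corresponding customers.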

5) Best Model Performance Insights

Let’s take a more detailed look at the performance of the best model, by calculating the confusion matrix:

# Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
class_names=[False, True] 
tick_marks = [0.5, 1.5]
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="Blues", fmt='g')
plt.title('Confusion matrix')
plt.ylabel('Actual label'); plt.xlabel('Predicted label')
plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)

Of the 1000 customers in the test dataset, our model correctly classified 100 as churn candidates. For 832 customers, the model correctly predicted that they are unlikely to churn. In 30 cases, the model falsely classified customers as churn candidates, and 38 churn candidates were missed and falsely classified as non-churn candidates. This leads to a model accuracy of 93.2% (based on a 0.5 threshold).
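
Because the classes are unbalanced, accuracy tells only part of the story. As a short worked example, the following derives precision and recall for the churn class from the four counts reported above. The variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix described above
tp = 100  # churners correctly identified
tn = 832  # non-churners correctly identified
fp = 30   # non-churners falsely flagged as churn candidates
fn = 38   # churners that were missed

total = tp + tn + fp + fn
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of flagged customers who really churn
recall = tp / (tp + fn)        # share of actual churners that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1 score:  {f1:.3f}")
```

So while 93.2% of all predictions are correct, the model finds roughly 72% of the actual churners, and about 77% of the customers it flags really do churn.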

6) Permutation Feature Importance

Now that we have trained a model that gives good results, we want to understand the importance of the features for the model. With the following code we calculate the Feature Importance. Then we visualize the results in a barplot.

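# Calculate the Feature Importance
r = permutation_importance(best_clf, x_test, y_test, n_repeats=30, random_state=0)

# Collect the mean importance per feature, sorted so that the most important feature is plotted at the top
df = pd.DataFrame({"Feature Names": x_test.columns, "Importance": r.importances_mean})
df = df.sort_values("Importance", ascending=True)

# Set the color range
clist = [(0, "purple"), (1, "blue")]
rvb = mcolors.LinearSegmentedColormap.from_list("", clist)

# Plot the barchart
x = np.arange(0, len(df))
y = df["Importance"]
N = y.size
fig, ax = plt.subplots(figsize=(16, 5))
ax.barh(x, y, color=rvb(x/N))
ax.set_yticks(x)
ax.set_yticklabels(df["Feature Names"])
ax.set_title("Random Forest Feature Importance")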
Feature Importance of the Best Random Forest Model

As we can see, the most important features are the monthly charge (MonthlyCharge), data usage (DataUsage), and customer service calls (CustServCalls). Of particular interest is the importance of customer service calls, as this could indicate that customers who come into contact with customer service have negative experiences. This shows how feature importance can provide the starting point for deeper analysis.
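
One possible next step for such a deeper analysis is to compare the churn rate across the number of customer service calls. The sketch below uses a small made-up frame in place of train_df, so the pattern it prints is purely illustrative and says nothing about the real data.

```python
import pandas as pd

# Made-up records standing in for train_df (illustrative values only)
demo_df = pd.DataFrame({
    "CustServCalls": [0, 0, 1, 1, 1, 2, 4, 4, 5, 5],
    "Churn":         [0, 0, 0, 0, 1, 0, 1, 0, 1, 1],
})

# Number of customers and share of churners for each number of service calls
churn_by_calls = demo_df.groupby("CustServCalls")["Churn"].agg(
    customers="count", churn_rate="mean")

print(churn_by_calls)
```

Run on the real dataset, a table like this shows whether the churn rate actually rises with the number of calls, which the importance score alone cannot tell us.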


In this tutorial, we have taken a deep dive into churn prediction. You have learned how to implement a churn prediction model for a telecommunications provider using Python and scikit-learn. In addition, we have calculated the permutation feature importance to analyse which features contribute to the performance of our model. You have seen that feature importance can lead to new insights into the business context and become a starting point for further investigations.

If you liked this article, let me know. I highly appreciate your feedback!

And if you want to learn more about text mining and customer satisfaction, you might want to take a look at my recent blog about sentiment analysis:


  • Hi, my name is Florian! I am a Zurich-based Data Scientist with a passion for Artificial Intelligence and Machine Learning. After completing my PhD in Business Informatics at the University of Bremen, I started working as a Machine Learning Consultant for the Swiss consulting firm ipt. When I'm not working on use cases for our clients, I work on my own analytics projects and report on them in this blog.

Follow Florian Müller:

Data Scientist & Machine Learning Consultant
