Hyperparameter Tuning for Machine Learning - relataly.com

Using Fairlearn to Build Fair Machine Machine Learning Models with Python: Step-by-Step Towards More Responsible AI

Florian Follonier — Thu, 09 Mar 2023 23:45:16 +0000

As we enter an era where intelligent systems are increasingly relied upon to make key decisions, responsible AI has become more critical than ever before. It’s not enough to simply rely on data-driven decision-making; we must also ensure that these systems are fair and just. But how can we do this when the same bias and unfairness that exists in our society is reflected in these very algorithms?

The consequences of biased algorithms are far-reaching and can be devastating. They can perpetuate discriminatory practices, deny individuals opportunities or services, and lead to a range of negative outcomes. It’s no longer just a moral obligation to avoid perpetuating inequalities; it’s a necessity to prevent negative consequences such as reputational damage and legal repercussions.

This article aims to drive home the importance of fairness in AI and provide practical tips for ensuring that algorithms are unbiased and free from discrimination. We will delve into the fairlearn library, a powerful open-source toolkit, to assess and mitigate bias in machine learning. Using fairlearn, we will implement a classification model in Python to predict income on the adult consensus set and evaluate its fairness.

Think about it: just as we strive to ensure fairness in our daily lives, we must also do the same with AI. We must do our best to eliminate bias and discrimination from our algorithms, just as we work to eliminate these issues from our society. By using fairlearn, we can ensure that our models are as fair and unbiased as possible, minimizing the risk of perpetuating discrimination and inequality. We can take steps to ensure that our algorithms are not influenced by factors such as race, gender, or age, and that they are making predictions based on merit alone.

What is Responsible AI?

Responsible AI is an approach that emphasizes fairness, transparency, accountability, and ethical considerations in the development and deployment of AI systems. It involves identifying potential sources of bias in AI and taking steps to address them proactively. To ensure responsible AI, organizations must develop policies and processes that prioritize fairness, transparency, and accountability throughout the AI development lifecycle.

How Real-World Bias Manifests in Biased Data

As we continue to develop increasingly sophisticated AI models like GPT-4, it is crucial to acknowledge that real-world data is not always objective and unbiased. Training models on biased data can lead to the perpetuation and even amplification of existing biases and disparities. Failing to address these issues can cement and exacerbate existing inequalities, with far-reaching consequences for society and individuals alike.

Bias in AI models often stems from the data they are trained on. Historical data, social norms, and cultural practices can all introduce bias into datasets, which AI models may then unwittingly learn and propagate. This can lead to AI systems making biased decisions, producing biased content, or perpetuating stereotypes and prejudices.

Ways in which AI systems can be biased

There are many ways in which AI systems can be biased and thus behave unfairly, including the following:

Gender Bias in Hiring: The data used by companies to train their hiring algorithms may be biased against certain genders or demographics. As a result, these algorithms may perpetuate existing gender inequalities by disproportionately rejecting female candidates.
Racial Bias in Healthcare: Medical data may be biased against certain racial or ethnic groups, which can lead to misdiagnosis, mistreatment, or under-treatment of certain populations. For instance, studies have shown that Black patients are less likely to receive appropriate pain management than white patients.
Economic Bias in Credit Scoring: Financial data used to determine credit scores may be biased against low-income individuals or marginalized communities, making it harder for them to access credit or loans. The result is a cycle of poverty, where individuals cannot access credit and are thus unable to build wealth.
Age Bias in Predictive Policing: Predictive policing algorithms may be biased against certain age groups, such as younger people, who may be more likely to be falsely flagged as potential criminals. This can lead to increased surveillance and potential harassment of certain populations.

These examples highlight the importance of identifying and addressing bias in real-world data. By doing so, we can ensure that our models and algorithms are fair, just, and equitable for all individuals, regardless of their background or demographic.

One of many biases is that in many industries, Women still earn less than their male colleagues. Image created with Midjourney.

Model Fairness and Performance

In machine learning models, there is often a delicate balance between fairness and performance. For instance, optimizing a model for high accuracy might inadvertently lead to discrimination against certain groups of people. To avoid these adverse effects, we must actively work to reduce the influence of unfairness and bias in our models.

Achieving this balance involves multiple strategies, such as gathering diverse and representative data, identifying and addressing sources of bias within the data, and devising techniques to counteract the effects of bias. Moreover, it’s essential to consistently monitor and evaluate our models to ensure they maintain fairness and impartiality over time.

Conversely, when prioritizing fairness, some degree of accuracy may be sacrificed. Striking the right balance between fairness and performance depends on the specific use case and consideration of the potential consequences of any decisions made. The ultimate goal is to create a model that is both fair and effective in performing the task it was designed for.

In summary, to create a well-balanced machine learning model, it’s crucial to acknowledge the trade-off between fairness and performance. By employing various strategies, such as using diverse data, addressing bias, and continuously monitoring the models, we can work towards developing models that are not only accurate but also fair and equitable for all users.

Organizations must take the necessary steps to ensure their algorithms are unbiased and not perpetuate the same inequalities still plaguing our society. Image created with Midjourney.

Assessing Model Fairness with the Fairlearn Library

Fairlearn is an open-source Python package, developed by Microsoft, that offers tools for assessing and mitigating unfairness in machine learning models. The library empowers developers and data scientists to pinpoint and address potential sources of bias in their data and models, ensuring that predictions and decisions are made fairly and transparently.

Fairlearn encompasses a variety of algorithms and metrics for evaluating and mitigating unfairness, including:

Group Fairness: This metric measures the difference in a model’s performance across various subgroups of the population, such as race or gender.
Demographic Parity: This metric works to ensure that the model’s predictions are distributed fairly across different subgroups of the population.
Equalized Odds: This metric guarantees that the model’s predictions maintain equal accuracy across different subgroups of the population.

In addition to these metrics, Fairlearn also offers various techniques for mitigating unfairness. These include reweighing data to balance different subgroups, adjusting the model’s predictions to meet specific fairness criteria, and enforcing fairness constraints during the training process. By utilizing Fairlearn, developers and data scientists can be confident that their machine learning models are more equitable and unbiased.

Assessing Model Fairness with the Fairlearn Library. Image created with Midjourney.

Mitigating Unfairness in Machine Learning

There are two options to address unfairness in machine learning models: consider fairness during training or add a post-processing step to mitigate unfairness as an additional layer. Both methods have their merits and drawbacks. The ideal approach depends on the context and objectives of the problem you’re tackling.

Considering fairness during training involves incorporating fairness constraints into the learning process. This approach can lead to more interpretable models, as fairness is embedded from the start. However, it may require specialized algorithms and could result in reduced performance.

On the other hand, adding a post-processing step to mitigate unfairness involves adjusting the model’s predictions after training. This method offers flexibility, as it can be applied to any pre-trained model. It may also preserve performance better. However, it might not provide as much interpretability, as fairness is addressed separately from the original model.

Ultimately, the choice depends on your specific goals, the model’s purpose, and the importance of interpretability versus performance.

Considering Fairness during Training with Fairness Constraints

Considering fairness during training involves incorporating fairness constraints or objectives into the model’s training process. This can be done in different ways:

by balancing the training dataset to ensure equal representation of different groups.
by adjusting the model’s loss function to penalize unfair predictions,
or by adding fairness constraints to the optimization problem.

The main advantage of this approach is that it can lead to models that are inherently fair without requiring any additional post-processing steps. However, it can be challenging to define and operationalize fairness. In addition, the fairness objectives may conflict with other objectives, such as accuracy or generalization.

There are several fairness constraints that can be used to ensure that machine learning models and decision-making processes do not discriminate against certain groups. Some of the most commonly used constraints include:

Demographic Parity: This constraint ensures that the proportion of positive outcomes is equal across different demographic groups.
Equalized Odds: This constraint ensures that the true positive rate and false positive rate are equal across different demographic groups.
Disparate Impact: This constraint ensures that the ratio of positive outcomes to the total number of outcomes is the same across different demographic groups.

For a full list of constraints, please view the library documentation.

The choice of fairness constraint depends on the specific context and goals of the decision-making process, and the constraints may need to be customized or combined to achieve the desired level of fairness.

Fairness Mitigation as a Postprocessing Layer

Another approach is to add a post-processing step to mitigate the unfairness of existing models. This method involves applying a fairness algorithm to the model’s outputs after training. Examples include reweighting predictions to ensure equal treatment of different groups, calibrating the model’s scores to remove bias, or using a fairness metric to adjust predictions.

The primary advantage of this approach is its applicability to existing models without the need for retraining. However, it might not tackle the root causes of unfairness in the model, and the fairness algorithm could introduce its own biases or trade-offs.

We can consider fairness towards sensitive features during the training of a machine learning model by using the fairness constraints. Image created with Midjourney.

Building A More Inclusive Machine Learning Model using FairLearn and Python

Let’s explore building a fair machine learning model using the census dataset, which includes demographic attributes to predict if a person earns more or less than $50k per year. As income prediction is sensitive, we’ll ensure fairness in our model.

We’ll create an initial fairness-unaware model, evaluate its fairness, and then use grid search to identify alternative models with varying performance and unfairness. This process consists of:

Load the data: Load the census dataset containing demographic attributes like age, education level, race, and gender.
Initial preprocessing and visualization: Preprocess the data by removing missing values, encoding categorical variables, scaling numerical features, and visualizing the data for insights.
Splitting and scaling the data: Split the data into training and testing sets and scale the features to make them comparable.
Training a fairness-unaware model and assessing fairness: Train a baseline model unaware of unfairness and evaluate its fairness using metrics like disparate impact and statistical parity difference.
Mitigating bias by training fairness-aware models with fairness constraints: Train models using demographic parity as a fairness constraint to mitigate bias and improve fairness.
Comparing the performance of the models: Compare the models’ performance in accuracy, fairness, and other metrics like false positives and false negatives.

By following these steps, we’ll build a fair machine learning model that is accurate, equitable, and just.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Responsible AI, and fairness in machine learning will gain further importance as the role of this exciting technology continues to become more influential in society.

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. This will ensure a seamless learning experience and prevent any potential roadblocks or issues that may arise due to an improperly configured environment.

If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Make sure you install all required packages. In this tutorial, we will be working with the following packages:

Pandas
NumPy
Matplotlib
Seaborn
Plotly
Fairlearn

You can install the Fairlearn library and other packages using console commands:

pip install 
conda install  (if you are using the anaconda packet manager)

About the Adult Consensus Dataset

The Adult Consensus Dataset is a highly regarded and extensively used resource in machine learning and statistics, offering a comprehensive view of over 32,000 individuals across the United States. This invaluable dataset features 14 unique attributes, such as age, education, occupation, work class, marital status, and gender, among others. It also includes a binary target variable, identifying individuals earning over $50,000 annually.

However, the dataset presents challenges, as it contains sensitive demographic features like race and gender. If not handled properly, these features may lead to biased predictions and unfair outcomes. As a result, models trained on this dataset require careful design and evaluation to prevent perpetuating or amplifying existing biases and discrimination.

Despite its complexities, the Adult Consensus Dataset remains a popular choice for training and testing binary classification models. Its value extends further, serving as an excellent benchmark for testing and showcasing fair AI models. The dataset raises important questions about fairness, bias, and discrimination in machine learning, making it an essential resource for creating ethical and unbiased AI solutions.

Step #1 Load Preprocess the Data

We begin by loading the data using the fetch_openml function, making the necessary imports, and performing some initial preprocessing. The dataset is loaded into two data frames, X_df and y_df, where X_df contains the features and y_df contains the binary target variable, which is income with classes “>50k” and “<=50k”. The code below encodes these classes with 0 and 1.

Assessing model fairness is always done with respect to one or several sensitive variables. Although there are several features in the dataset that could be considered sensitive, optimizing the model to consider all of these features simultaneously would lead to significant complexity and increase training times. Therefore, in this tutorial, we will focus specifically on assessing model fairness with respect to sex and race. The sex feature contains two classes, “male” and “female,” while the race feature contains classes for “Black,” “White,” and several further classes.

# A tutorial for this file will be shortly available at www.relataly.com
# Tested with python 3.9.13, matplotlib 3.6.2, numpy 1.23.3, seaborn=0.12.1, fairlearn=0.8.0, plotly 5.11.0

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics as skm
from fairlearn.metrics import MetricFrame, selection_rate, selection_rate_difference, count, plot_model_comparison
from fairlearn.reductions import DemographicParity, ErrorRate, GridSearch
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# read the adult data
# predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.
# https://archive.ics.uci.edu/ml/datasets/adult

from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1590, parser='auto', as_frame=True)

X_df = data.data.copy()
data.target = data.target.replace({' <=50K': 0, ' >50K': 1})
y_df = data.target

# Declare sensitive features
sensitive_variables = ["sex", 'race']

# join X_df and y_df and name y_df as 'income'
X_df['salary >50k'] = y_df

# show df that contains the target column
X_df.head()

	age	workclass	fnlwgt	education	race	sex		...		hours-per-week	salary >50k
0	25	4			226802	1			Black	Male	...		40				0
1	38	4			89814	11			White	Male	...		50				0
2	28	2			336951	7			White	Male	...		40				1
3	44	4			160323	15			Black	Male	...		40				1
4	18	0			103497	15			White	Female	...		30				0

Step #2 Initial Preprocessing and Visualization

Let’s continue with some initial preprocessing and data visualization. First, we encode the categorical features in the X_df data frame using the LabelEncoder function from scikit-learn. This is important, as the fairlearn gridsearch technique that we will use in this tutorial only works with numeric features.

Next, to simplify the tutorial and reduce model training time, we will filter the dataset to only include people of races “Black” and “White.” This will result in four sensitive feature permutations:

Black, Woman
Black, Man
White, Woman
White, Man

Finally, we create two data frames. The X_df data frame contains the features we will be using to train our model, but it does not contain the target label anymore. The other data frame, df_visu, does contain the target label and will be used for visualization purposes. Finally, we use df_visu to visualize how the target variable is distributed with respect to our sensitive features.

# encode the categorical features
X_df['workclass'] = X_df['workclass'].cat.codes
X_df['education'] = X_df['education'].cat.codes
X_df['marital-status'] = X_df['marital-status'].cat.codes
X_df['occupation'] = X_df['occupation'].cat.codes
X_df['relationship'] = X_df['relationship'].cat.codes

# filter some values to reduce tutorial complexity and make it easier to visualize
X_df = X_df.loc[X_df["race"].astype(str).isin([" White", " Black"])]
X_df.race = X_df.race.cat.remove_unused_categories()

# ensure that the target column has the same filter applied
y_df = X_df['salary >50k']

# create a copy of the df for visualization
df_visu = X_df.copy()

# drop the target column from X_df
X_df.drop(['salary >50k'], axis=1, inplace=True)

# visualize the distributions of the cateogical variables
fig, ax = plt.subplots(1, 3, figsize=(16, 5))
sns.countplot(x='sex', hue='salary >50k', data=df_visu, ax=ax[0])
sns.countplot(x='race', hue='salary >50k', data=df_visu, ax=ax[1])
sns.kdeplot(x='age', hue='salary >50k', data=df_visu, ax=ax[2])

# rotate x_labels
plt.sca(ax[1])
plt.xticks(rotation=30)

All three plots reflect inequality in our society:

The percentage of women with a yearly income above 50k is much lower than that of men.
With respect to race, this bias is even more extreme, with way more white people earning more than 50k than black people.
Although we won’t use age as a sensitive label, it is noteworthy that the percentage of people earning more than 50k declines with growing age.

Step #3 Splitting and Scaling the Data

The code performs several preprocessing steps. First, the data is split into the features (X_df) and the target variable (y_df). Next, the categorical features in X_df are encoded using the LabelEncoder function from scikit-learn. The data is then split into training and test sets. Records are stratified based on the target variable to ensure that the proportion of each class is maintained in both sets.

It is important to note that the sensitive variables, sex and race, are included in the splitting process to ensure that the data is divided in a fair manner. Specifically, we create a new data frame A containing only the sensitive variables. We then split the data into training and test sets using the stratify parameter based on Y_encoded (the encoded version of y_df) and A.

Finally, the features in X_df are scaled using the StandardScaler function from scikit-learn to ensure that all features are on the same scale. The resulting scaled features are then converted back into a pandas data frame, X_scaled.

# create a dataframe with sensitive features
A = X_df[sensitive_variables]
X = X_df.drop(labels=sensitive_variables, axis=1)
X = pd.get_dummies(X)

# Scale the data
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Encode the target variable
le = LabelEncoder()
Y_encoded = le.fit_transform(y_df)

# We split the data into training and test sets:
X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(
    X_scaled, Y_encoded, A, test_size=0.4, random_state=0, stratify=Y_encoded
)

# Work around indexing bug
X_train = X_train.reset_index(drop=True)
A_train = A_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
A_test = A_test.reset_index(drop=True)

Step #4 Train a Fairness-Unaware Model and Assess its Fairness

Now it’s time to train our first model. This initial model will be entirely fairness unaware, meaning its predictions will reflect any patterns of unfairness present in the data.

Model training

We’ll begin by training a baseline fairness unaware decision tree classifier with default settings. The model will be trained on the training set (X_train and Y_train) and then tested on the test set (X_test and Y_test).

# Define evaluation metrics for performance and fairness
performance_metric = skm.accuracy_score
fairness_metric = selection_rate_difference

unmitigated_predictor = DecisionTreeClassifier(random_state=0)
unmitigated_predictor.fit(X_train, Y_train)

Evaluation and Visualization

To evaluate the model’s performance, we’ll use two crucial metrics: the accuracy score and the “selection rate difference”. The latter is a vital fairness metric that measures the difference between selection rates of two distinct groups for a specific outcome or prediction.

In machine learning, this metric often helps determine the fairness of a model, especially with sensitive features like race, gender, or age. A selection rate difference of zero indicates that the model is making predictions fairly and without bias, as both groups have equal selection rates.

However, we must pay attention to a positive or negative selection rate difference. It implies that the model is favoring one group over the other, which could signal bias or discrimination. Carefully monitoring this metric is critical to ensure the model is performing fairly and justly.

Lastly, we’ll visualize the results in a bar chart using the accuracy and selection rate by group. This visualization will provide a comprehensive view of the model’s performance and fairness, allowing us to make informed decisions about potential improvements or adjustments needed to strike the right balance between accuracy and fairness.

metric_frame = MetricFrame(
    metrics={
        "accuracy": performance_metric,
        "selection_rate": fairness_metric,
        "count": count,
    },
    sensitive_features=A_test,
    y_true=Y_test,
    y_pred=unmitigated_predictor.predict(X_test),
)

# plot the metrics
print(metric_frame.overall)
print(metric_frame.by_group)
metric_frame.by_group.plot.bar(
    subplots=True,
    layout=[1, 3],
    legend=False,
    figsize=[12, 4],
    title="Accuracy and selection rate by group",
)

accuracy              0.854136
selection_rate        0.197373
count             18579.000000
dtype: float64
                accuracy  selection_rate    count
sex     race                                     
 Female  Black  0.948665        0.053388    974.0
         White  0.921083        0.083873   5246.0
 Male    Black  0.877378        0.161734    946.0
         White  0.813371        0.264786  11413.0
array([[,
        ,
        ]],
      dtype=object)

When we look at the selection rate, we can see that the model is intrinsically biased. To be explicit, if this model would go into production, a Black Woman would have a significantly lower chance (5%) to be attributed a +50k income than a White Man (above 25%), even if all other attributes (age, education, and so on) are the same.

Step #5 Train Fairness-Aware Models with Gridsearch

We perform a grid search with demographic parity as a fairness constraint to find the best decision tree classifier model that is both accurate and fair. We evaluate different models with different hyperparameters for the decision tree classifier. The best predictors with the lowest error rates and lowest disparities (in terms of demographic parity) are selected as the dominant models.

We call these models “dominant” because they stand out as superior solutions in multi-objective optimization. “Dominant” implies that these models outshine others across multiple evaluation metrics without falling short in any of the metrics. Dominant models are preferred for their optimal balance of performance and fairness.

During model training, we’ll incorporate demographic parity as a fairness constraint. Demographic parity seeks to ensure equal proportions of positive outcomes (e.g., being hired, approved for a loan, etc.) across various demographic groups (e.g., gender, race, or age) in a population. In simpler terms, it requires the decision-making process not to discriminate against any group based on their demographic characteristics. Mathematically, it’s expressed as P(prediction=1|sensitive_feature=0) = P(prediction=1|sensitive_feature=1), where sensitive_feature is a binary variable indicating membership in a specific demographic group. Achieving demographic parity helps reduce discrimination and promote fairness in decision-making processes. However, depending on the context and goals of the decision-making process, demographic parity might not always be the most appropriate fairness constraint.

Depending on the hardware you are using, training can take several minutes.

# Define and run GridSearch with DemographicParity 
sweep = GridSearch(
    DecisionTreeClassifier(random_state=0),
    constraints=DemographicParity(), # DemographicParity() is the default
    grid_size=100, # number of models to evaluate
)
sweep.fit(X_train, Y_train, sensitive_features=A_train)

# The best predictor is the one with the lowest error rate
predictors = sweep.predictors_

errors, disparities = [], []
for p in predictors:
    
    def classifier(X):
        return p.predict(X)

    # Calculate the error rate of the classifier
    error = ErrorRate()
    error.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train)
    errors.append(error.gamma(classifier)[0])

    # Calculate the disparity in the predictions
    disparity = DemographicParity()
    disparity.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train)
    disparities.append(disparity.gamma(classifier).max())

# Consolidate the results into a single dataframe
grid_results = pd.DataFrame(
    {"predictor": predictors, 
     "error": errors, 
     "disparity": disparities}
)

# Loop through the grid and select the relevant models 
non_dominated_models = []
for model in grid_results.itertuples():
    # only select models with lower disparity compared to the other models
    if model.error < grid_results["error"][grid_results["disparity"] < model.disparity].min():
        non_dominated_models.append(model.predictor)

That’s it; now we have trained several fairness-aware models. Next, let’s look at the performance and fairness of these models.

Step #6 Comparing Dominant and Unmitigated Models

Finally, we’ll evaluate and compare our models’ performance. We’ll assess the dominant models and the unmitigated model using evaluation metrics for performance and fairness. The performance metric we’ll use is the accuracy score, and for fairness, we’ll use the selection rate difference. To do this, we’ll predict on the test data using the unmitigated model, then add predictions for all relevant dominant models.

Next, we’ll create a scatterplot for the models based on their accuracy and selection rate. We’ll use the x-axis for performance metrics and the y-axis for fairness metrics. We’ll also include point labels and display the plot. This visualization will help us understand the trade-offs between performance and fairness in our models.

# We can evaluate the dominant models along with the unmitigated model.

# Use the unmitigated model to predict on the test data
predictions = {"unmitigated_model": unmitigated_predictor.predict(X_test)}

# Add predictions for all relevant dominant models
metric_frames = {"unmitigated_model": metric_frame}
for i in range(len(non_dominated_models)):
    key = "dominant_model_{0}".format(i)
    predictions[key] = non_dominated_models[i].predict(X_test)

    metric_frames[key] = MetricFrame(
        metrics={
            "accuracy": performance_metric,
            "selection_rate": selection_rate,
            "count": count,
        },
        sensitive_features=A_test,
        y_true=Y_test,
        y_pred=predictions[key],
    )

# scatterplot for the models along their accuracy and selection rate
plot_model_comparison(
    x_axis_metric=performance_metric,
    y_axis_metric=fairness_metric,
    y_true=Y_test,
    y_preds=predictions,
    sensitive_features=A_test,
    point_labels=True,
    show_plot=True,
)

Upon examination, it becomes apparent that we are presented with multiple dominant models that have a marginal difference in their selection rates as well as accuracy levels. These models, while exhibiting slightly lower accuracy, are less influenced by race or gender than the fairness-unaware model. As such, these models may be considered viable candidates for deployment in the model selection process.

It is worth noting, however, that while these models appear to be less prone to bias based on sensitive features, they may still exhibit unfairness towards other features that have not been declared sensitive.

Summary

The significance of responsible AI and fairness in machine learning cannot be overstated. With AI systems becoming increasingly integrated into our lives, it’s essential that we proactively address issues of bias and discrimination.

In this article, we’ve taken a small but vital step towards a more equitable world by developing machine learning models that are fair with respect to sensitive features. Our efforts have been focused on ensuring that our models are less biased and more equitable, using techniques such as reweighting and threshold optimization.

The negative consequences of biased models cannot be ignored. They can lead to discrimination and reputational damage, making it even more important to ensure that our AI systems are fair and just. As AI technology continues to advance, there is an increased need for responsible AI and fairness in AI systems, particularly with the development of more powerful generative models that have the potential to revolutionize various industries.

By taking steps to address issues of bias and discrimination in AI systems, we can ensure that these systems promote positive outcomes and benefit everyone equally. We must strive towards a more equitable and just world through responsible AI and fairness in machine learning.

When used wisely, AI and machine learning can help create a fairer and more equitable world. Image created with Midjourney.

Sources and Further Reading

Books on Responsible AI and Applied Machine Learning

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Using Fairlearn to Build Fair Machine Machine Learning Models with Python: Step-by-Step Towards More Responsible AI appeared first on relataly.com.

Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python

Florian Follonier — Thu, 07 Apr 2022 17:55:36 +0000

Perfecting your machine learning model’s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble model, and heavily influence its performance. Unlike model parameters, which are discovered during training by the machine learning algorithm, hyperparameters require pre-specification.

In this comprehensive Python tutorial, we’ll guide you on how to harness the power of Random Search to optimize a regression model’s hyperparameters. Our illustrative example utilizes a Support Vector Machine (SVM) for predicting house prices. However, the fundamental principles you’ll learn can be seamlessly applied to any model. So why painstakingly fine-tune hyperparameters manually when Random Search can handle the task efficiently?

Here’s a preview of what this Python tutorial entails:

A brief overview of how Random Search operates and instances where it might be preferable to Grid Search.
A hands-on Python tutorial featuring a public house price dataset from Kaggle.com. The aim here is to train a regression model capable of predicting US house prices based on various properties.
Training a ‘best-guess’ model in Python, followed by using Random Search to discover a model with enhanced performance.
Finally, we’ll implement cross-validation to validate our models’ performance.

By the end of this tutorial, you’ll be well-equipped to let Random Search efficiently fine-tune your model’s hyperparameters, freeing up your time for other crucial tasks.

Hyperparameter Tuning

Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, we need to specify hyperparameters before the training. It is usually impossible to find the best configuration without testing different configurations.

Searching for a suitable model configuration is called “hyperparameter tuning” or “hyperparameter optimization.” Machine learning algorithms have varying hyperparameters and parameter values. For example, a random decision forest classifier allows us to configure varying parameters such as the number of trees, the maximum tree depth, and the minimum number of nodes required for a new branch.

The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it gets to find an optimal model. We can use random search to automatize this process.

Random search can be an efficient way to tune the hyperparameters of a machine learning model. Image generated with Midjourney

Techniques for Tuning Hyperparameters

Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning algorithm to optimize its performance on a specific dataset or task. Several techniques can be used for hyperparameter tuning, including:

Grid Search: grid search is a brute-force search algorithm that systematically evaluates a given set of hyperparameter values by training and evaluating a model for each combination of values. It is a simple and effective technique, but it can be computationally expensive, especially for large or complex datasets.
Random Search: As mentioned, random search is an alternative to grid search that randomly samples a given set of hyperparameter values rather than evaluating all possible combinations. It can be more efficient than grid search, but it may not find the optimal set of hyperparameters.
Bayesian Optimization: A bayesian optimization is a probabilistic approach to hyperparameter tuning, which uses Bayesian inference to model the distribution of hyperparameter values that are likely to produce a good performance. It can be more efficient and effective than grid search or random search, but it can be more challenging to implement and interpret.
Genetic Algorithms: genetic algorithms are optimization algorithms inspired by the principles of natural selection and genetics. They use a population of candidate solutions, which are iteratively evolved and selected based on their fitness or performance, to find the optimal set of hyperparameters.

In this article, we specifically look at the Random Search technique.

You can spend much time tuning a machine learning model. Image generated with Midjourney.

What is Random Search?

The random search algorithm generates models from hyperparameter permutations randomly selected from a grid of parameter values. The idea behind the randomized approach is that testing random configurations efficiently identifies a good model. We can use random search both for regression and classification models.

Random Search and Grid Search are the most popular techniques for hyperparametric tuning, and both methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible variants. The technique works well for testing a small number of configurations already known to work well.

As long as both search space and training time are small, the grid search technique is excellent for finding the best model. However, the number of model variants increases exponentially with the size of the search space. It is often more efficient for large search spaces or complex models to use random search.

Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. However, it is also much faster than grid search and efficient in delivering a suitable model in a short time.

Random Search vs. Exhaustive Grid Search

Tuning the Hyperparameters of a Random Decision Forest Regressor in Python using Random Search

In this tutorial, we delve into the use of the Random Search algorithm in Python, specifically for predicting house prices. We’ll be using a dataset rich in diverse house characteristics. Various elements, such as data quality and quantity, model intricacy, the selection of machine learning algorithms, and housing market stability, significantly influence the accuracy of house price predictions.

Our initial model employs a Random Decision Forest algorithm, which we’ll optimize using a random search approach for hyperparameters tuning. By identifying and implementing a more advantageous configuration, we aim to enhance our model’s performance significantly.

Here’s a concise outline of the steps we’ll undertake:

Loading the house price dataset
Exploring the dataset intricacies
Preparing the data for modeling
Training a baseline Random Decision Forest model
Implementing a random search approach for model optimization
Measuring and evaluating the performance of our optimized model

Through this step-by-step guide, you’ll learn to enhance model performance, further refining your understanding of Random Search algorithm implementation in Python.

The Python code is available in the relataly GitHub repository.

View on GitHub Relataly Github Repo

Once we have trained a house price prediction model, we can use it to asses the price of new houses. Image generated with Midjourney.

Prerequisites

Before starting the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using the Python Machine Learning library Scikit-learn to implement the random forest and the grid search technique.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

House Price Prediction: About the Use Case and the Data

House price prediction is the process of using statistical and machine learning techniques to predict the future value of a house. This can be useful for a variety of applications, such as helping homeowners and real estate professionals to make informed decisions about buying and selling properties. In order to make accurate predictions, it is important to have access to high-quality data about the housing market.

In this tutorial, we will work with a house price dataset from the house price regression challenge on Kaggle.com. The dataset is available via a git hub repository. It contains information about 4800 houses sold between 2016 and 2020 in the US. The data includes the sale price and a list of 48 house characteristics, such as:

Year – The year of construction,
SaleYear – The year in which the house was sold
Lot Area – The lot area of the house
Quality – The overall quality of the house from one (lowest) to ten (highest)
Road – The type of road, e.g., paved, etc.
Utility – The type of the utility
Park Lot Area – The parking space included with the property
Room number – The number of rooms

Predicting house prices with machine learning. Image generated with Midjourney.

Step #1 Load the Data

We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.

# A tutorial for this file is available at www.relataly.com

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn import svm

# Source: 
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Load train and test datasets
path = "https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv"
df = pd.read_csv(path)
print(df.columns)
df.head()

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60			RL			65.0		8450	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2008	WD			Normal			208500
1	2	20			RL			80.0		9600	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		5		2007	WD			Normal			181500
2	3	60			RL			68.0		11250	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		9		2008	WD			Normal			223500
3	4	70			RL			60.0		9550	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2006	WD			Abnorml			140000
4	5	60			RL			84.0		14260	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		12		2008	WD			Normal			250000
5 rows × 81 columns

Step #2 Explore the Data

Before jumping into preprocessing and model training, let’s quickly explore the data. A distribution plot can help us understand our dataset’s frequency of regression values.

# Create histograms for feature columns separated by prediction label value
ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2)
plt.title('Sale Price Distribution')

For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between the sale price, lot area, and the house’s overall quality.

# Create histograms for feature columns separated by prediction label value
plt.figure(figsize=(16,6))
df_features = df[['SalePrice', 'LotArea', 'OverallQual']]
sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual')
plt.title('Sale Price Distribution')

As expected, the scatterplot shows that the sale price increases with the overall quality. On the other hand, the LotArea has only a minor effect on the sale price.

Step #3 Data Preprocessing

Next, we prepare the data for use as input to train a regression model. Because we want to keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables with integer dummy values.

To ensure that our regression model does not know the target variable, we separate house price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.

def preprocessFeatures(df):   
    # Define a list of relevant features
    feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond']
    df_dummy = pd.get_dummies(df[feature_list])
    # Cleanse records with na values
    #df_prep = df_prep.dropna()
    return df_dummy

df_base = preprocessFeatures(df)

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split( df_base.copy(), df_base['SalePrice'].copy(), train_size=0.7, random_state=0)
x_train

		OverallQual	GarageArea	LotArea	OverallCond	Utilities_AllPub	Utilities_NoSeWa
682		6			431			2887	5			1					0
960		5			0			7207	7			1					0
1384	6			280			9060	5			1					0
1100	2			246			8400	5			1					0
416		6			440			7844	7			1					0

Step #4 Train Different Regression Models using Random Search

Now that the dataset is ready, we can train the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then selects n random permutations from the grid and uses them to train the model.

We use the RandomSearchCV algorithm from the scikit-learn package. The “CV” in the function name stands for cross-validation. Cross-validation involves splitting the data into subsets (folds) and rotating them between training and validation runs. This way, each model is trained and tested multiple times on different data partitions. When the search algorithm finally evaluates the model configuration, it summarizes these results into a test score.

We use a Random Decision Forest – a robust machine learning algorithm that can handle classification and regression tasks. As a so-called ensemble model, the Random Forest considers predictions from a set of multiple independent estimators. The estimator is an important parameter to pass to the RandomSearchCV function. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:

max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

These parameter ranges define the search space from which the randomized search algorithm (RandomSearchCV) will select random configurations. Other parameters will use default values as defined by scikit-learn.

# Define the Estimator and the Parameter Ranges
dt = RandomForestRegressor()
number_of_iterations = 20
max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

# Define the param distribution dictionary
param_distributions = dict(max_leaf_nodes=max_leaf_nodes, 
                           min_samples_split=min_samples_split, 
                           max_depth=max_depth,
                           max_features=max_features,
                           n_estimators=n_estimators)

# Build the gridsearch
grid = RandomizedSearchCV(estimator=dt, 
                          param_distributions=param_distributions, 
                          n_iter=number_of_iterations, 
                          cv = 5)

grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best params: {0}, using {1}".format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

Best params: [0.68738293 0.49581669 0.52138751 0.61235299 0.65360944 0.61165147
 0.70392285 0.52278886 0.67687248 0.68219638 0.70031536 0.65842909
 0.51939338 0.70801017 0.70911805 0.69543885 0.67983801 0.60744371
 0.68270285 0.70741042], using {'n_estimators': 100, 'min_samples_split': 5, 'max_leaf_nodes': 7, 'max_features': 3, 'max_depth': 15}
	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_n_estimators	param_min_samples_split	param_max_leaf_nodes	param_max_features	param_max_depth	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.049196		0.002071		0.004074		0.000820		50					20						5	4	15	{'n_estimators': 50, 'min_samples_split': 20, ...	0.662973	0.705533	0.669520	0.702608	0.696280	0.687383	0.017637	7
1	0.041115		0.000554		0.003046		0.000094		50					50						2	3	10	{'n_estimators': 50, 'min_samples_split': 50, ...	0.490984	0.527231	0.426270	0.523086	0.511513	0.495817	0.036978	20
2	0.043325		0.000779		0.003486		0.000447		50					50						2	5	20	{'n_estimators': 50, 'min_samples_split': 50, ...	0.484524	0.559358	0.485459	0.517253	0.560343	0.521388	0.033545	18
3	0.162083		0.005665		0.012420		0.004788		200					5						3	3	20	{'n_estimators': 200, 'min_samples_split': 5, ...	0.586586	0.638341	0.573437	0.626793	0.636608	0.612353	0.027021	14
4	0.166659		0.003026		0.010958		0.000084		200					10						4	3	15	{'n_estimators': 200, 'min_samples_split': 10,...	0.633305	0.679161	0.623236	0.661864	0.670481	0.653609	0.021636	13

These are the five best models and their respective hyperparameter configurations.

Step #5 Select the best Model and Measure Performance

Finally, we will choose the best model from the list using the “best_model” function. We then calculate the MAE and the MAPE to understand how the model performs on the overall test dataset. We then print a comparison between actual sale prices and predicted sale prices.

# Select the best Model and Measure Performance
best_model = grid_results.best_estimator_
y_pred = best_model.predict(x_test)
y_df = pd.DataFrame(y_test)
y_df['PredictedPrice']=y_pred
y_df.head()

	SalePrice	PredictedPrice
529	200624		166037.831002
491	133000		135860.757958
459	110000		123030.336177
279	192000		206488.444327
655	88000		130453.604206

Next, let’s take a look at the classification errors.

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_pred, y_test)
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = mean_absolute_percentage_error(y_pred, y_test)
print('Median Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')

Mean Absolute Error (MAE): 29591.56 
Median Absolute Percentage Error (MAPE): 15.57 %

On average, the model deviates from the actual value by 16 %. Considering we only used a fraction of the available features and defined a small search space, there is much room for improvement.

Summary

This article has shown how we can use grid Search in Python to efficiently search for the optimal hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how to use random search to try out all permutations of a predefined parameter grid. The second part was a Python hands-on tutorial, in which you learned to use random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price for houses depending on several characteristics. Then we defined parameter ranges and tested random permutations. In this way, we quickly identified a configuration that outperforms our initial baseline model.

Remember that a random search efficiently identifies a good-performing model but does not necessarily return the best-performing one. Tech random search techniques can be used to tune the hyperparameters of both regression and classification models.

Sources and Further Reading

I hope this article was helpful. If you have any questions or suggestions, please write them in the comments.

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python appeared first on relataly.com.

Customer Churn Prediction – Understanding Models with Feature Permutation Importance using Python

Florian Follonier — Sun, 02 Aug 2020 13:24:28 +0000

Customer retention is a prime objective for service companies, and understanding the patterns that lead to customer churn can be the key to maintaining long-lasting client relationships. Businesses incur significant costs when customers discontinue their services, hence it’s vital to identify potential churn risks and take preemptive actions to retain these customers. Machine Learning models can be instrumental in identifying these patterns and providing valuable insights into customer behavior.

An intriguing technique, Permutation Feature Importance, allows us to discern the significance of different features of our machine learning model, thereby shedding light on their influence on customer churn. This tutorial guides you through the intricacies of this technique and its implementation.

The structure of this tutorial is as follows:

We begin by discussing the business problem of customer churn and its implications.
We introduce the concept of Permutation Feature Importance, a powerful tool to identify essential features in our machine learning model.
We transition into the hands-on coding segment, where we build a churn prediction model using Python.
Our model undergoes a classification process and hyperparameter tuning to select the most effective parameters.
Utilizing the trained model, we predict the churn probabilities for a test set of customers.
Finally, we create a feature ranking based on their impact on the model’s performance.

By employing permutation feature importance, this tutorial offers a deep-dive into the correlation between input variables and model predictions, providing actionable insights for effective customer churn management.

Also:

Customer churn prediction is a compelling use case for machine learning. It is particularly effective when combined with feature permutation importance.

What is Churn Prediction?

A company’s effort to persuade a new customer to sign a contract is many times higher than the costs incurred in retaining existing customers. According to industry experts, winning a new customer is four times more expensive than keeping an existing one. Providers that can identify churn candidates and manage to retain them can significantly reduce costs.

A crucial point is whether the provider succeeds in getting the churn candidates to stay. Sometimes it may be enough to contact the churn candidate and inquire about customer satisfaction. In other cases, this may not be enough, and the provider needs to increase the service value, for example, by offering free services or a discount. However, actions should be well thought out, as they can also negatively affect. For instance, if a customer hardly ever uses his contract, a call from the provider may even increase the desire to cancel the contract. Machine learning can help assess cases individually and identify the optimal anti-churn action.

About Permutation Feature Importance

Feature importance is a helpful technique for understanding the contribution of input variables (features) to a predictive model. The results from this technique can be as valuable as the predictions themselves, as they can help us understand the business context better. For example, let’s say we have trained a model that predicts which of our customers will likely churn. Wouldn’t it be interesting to know why specific customers are more likely to churn than others? Permutation feature importance can help us answer this question by providing us with a ranking of the input variables in our model by their usefulness. The order can validate assumptions about the business context and uncover causal relations in the data.

Compared to neural networks, one of the most significant advantages of traditional prediction models, such as a decision tree, is their interpretability. Neural networks are black boxes because it is tough to understand the relationship between input and model predictions. In traditional models, on the other hand, we can calculate the meaning of the features and use it to interpret the model and optimize its performance, for example, by removing features from the model that are not important. We, therefore, start with a simple model first and move on to more complex models once we understand the data.

Implementing a Customer Churn Prediction Model in Python

In the following, we will implement a customer churn prediction model. We will train a decision forest model on a data set from Kaggle and optimize it using grid search. The data contains customer-level information for a telecom provider and a binary prediction label of which customers canceled their contracts and did not. Finally, we will calculate the feature importance to understand how the model works.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, you can follow this tutorial to set up the Anaconda environment.

Make sure you install all required packages. In this tutorial, we will be working with the following packages:

Pandas
NumPy
Matplotlib
Seaborn

In addition, we will be using Keras (2.0 or higher) with Tensorflow backend and the machine learning library Scikit-learn.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Loading the Customer Churn Data

We begin by loading a customer churn dataset from Kaggle. If you work with the Kaggle Python environment, you can directly save the dataset into your Kaggle project. After completing the download, put the dataset under the file path of your choice, but don’t forget to adjust the file path variable in the code.

The dataset contains 3333 records and the following attributes.

Churn: The prediction label: 1 if the customer canceled service, 0 if not.
AccountWeeks: number of weeks the customer has had an active account
ContractRenewal: 1 if customer recently renewed contract, 0 if not
DataPlan: 1 if the customer has a data plan, 0 if not
DataUsage: gigabytes of monthly data usage
CustServCalls: number of calls into customer service
DayMins: average daytime minutes per month
DayCalls: average number of daytime calls
MonthlyCharge: average monthly bill
OverageFee: The most considerable overage fee in the last 12 months

The following code will load the data from your local folder into your anaconda Python project:

import numpy as np 
import pandas as pd 
import math
from pandas.plotting import register_matplotlib_converters
import matplotlib.pyplot as plt 
import matplotlib.colors as mcolors
import matplotlib.dates as mdates 

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.inspection import permutation_importance
import seaborn as sns


# set file path
filepath = "data/Churn-prediction/"

# Load train and test datasets
train_df = pd.read_csv(filepath + 'telecom_churn.csv')
train_df.head()

	Churn	AccountWeeks	ContractRenewal	DataPlan	DataUsage	CustServCalls	DayMins	DayCalls	MonthlyCharge	OverageFee	RoamMins
0	0		128				1				1			2.7			1				265.1		110		89.0			9.87		10.0
1	0		107				1				1			3.7			1				161.6		123		82.0			9.78		13.7
2	0		137				1				0			0.0			0				243.4		114		52.0			6.06		12.2
3	0		84				0				0			0.0			2				299.4		71		57.0			3.10		6.6
4	0		75				0				0			0.0			3				166.7		113		41.0			7.42		10.1

Step #2 Exploring the Data

Before we begin with the preprocessing, we will quickly explore the data. For this purpose, we will create histograms for the different attributes in our data.

# # Create histograms for feature columns separated by prediction label value
df_plot = train_df.copy()

# class_columnname = 'Churn'
sns.pairplot(df_plot, hue="Churn", height=2.5, palette='muted')

Histograms of the churn prediction dataset separated by prediction label (red=churn, blue= no churn)

We can see that the data distribution for several attributes looks quite good and resembles a normal distribution, for example, for OverageFeed, DayMins, and DayCalls. However, the distribution for the prediction label is unbalanced. This is because more customers remain with their contract (prediction label class = 0) than those that cancel their contract (prediction label class = 1).

Step #3 Data Preprocessing

The next step is to preprocess the data. I have reduced this part to a minimum to keep this tutorial simple. For example, I do not treat the unbalanced label classes. However, this would be appropriate to improve the model performance in a real business context. The imbalanced data is also why I chose a decision forest as a model type. Decision forests can handle unbalanced data relatively well compared to traditional models such as logistic regression.

The following code splits the data into the train (x_train) and test data (x_test) and creates the respective datasets, which only contain the label class (y_train, y_test). The ratio is 0.7, resulting in 2333 records in the training dataset and 1000 in the test dataset.

# Create Training Dataset
x_df = train_df[train_df.columns[train_df.columns.isin(['AccountWeeks', 'ContractRenewal', 'DataPlan','DataUsage', 'CustServCalls', 'DayCalls', 'MonthlyCharge', 'OverageFee', 'RoamMins'])]].copy()
y_df = train_df['Churn'].copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train

		AccountWeeks	ContractRenewal	DataPlan	DataUsage	CustServCalls	DayCalls	MonthlyCharge	OverageFee	RoamMins
2918	58				1				0			0.00		4				112			53.0			13.29		0.0
1884	51				0				1			3.32		2				60			74.2			10.03		12.3
2823	87				1				0			0.00		2				80			50.0			9.35		16.6
2319	83				1				1			2.35		3				105			91.5			12.65		8.7
2980	84				1				0			0.00		3				86			62.0			13.78		14.3
...		...				...				...			...			...				...			...				...			...
835	27	1				0				0.00		1			75				31.0		10.43			9.9
3264	89				1				1			1.59		0				98			50.9			10.36		5.9
1653	93				0				0			0.00		1				78			42.0			10.99		11.1
2607	91				1				0			0.00		3				100			53.0			11.97		9.9
2732	130				0				0			0.00		5				106			68.0			18.19		16.9

Step #4 Fit an Optimized Decision Forest Model for Churn Prediction using Grid Search

Now comes the exciting part. We will train a series of 36 decision forests and then choose the best-performing model. The technique used in this process is called hyperparameter tuning (more specifically, grid search), and I have recently published a separate article on this topic.

The following code defines the parameters the grid search will test (max_depth, n_estimators, and min_samples_split). Then the code runs the grid search and trains the decision forests. Finally, we print out the model ranking along with model parameters.

# Define parameters
max_depth=[2, 4, 8, 16]
n_estimators = [64, 128, 256]
min_samples_split = [5, 20, 30]

param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, min_samples_split=min_samples_split)

# Build the gridsearch
dfrst = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, class_weight='balanced')
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv = 5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
results_df = pd.DataFrame(grid_results.cv_results_)
results_df.sort_values(by=['rank_test_score'], ascending=True, inplace=True)

# Reduce the results to selected columns
results_filtered = results_df[results_df.columns[results_df.columns.isin(['param_max_depth', 'param_min_samples_split', 'param_n_estimators','std_fit_time', 'rank_test_score', 'std_test_score', 'mean_test_score'])]].copy()
results_filtered

std_fit_time	param_max_depth	param_min_samples_split	param_n_estimators	mean_test_score	std_test_score	rank_test_score
28				0.004742		16						5					128	0.931415	0.006950		1
27				0.002620		16						5					64	0.925848	0.008177		2
29				0.015711		16						5					256	0.925846	0.006156		3
20				0.006258		8						5					256	0.923704	0.007961		4
19				0.001816		8						5					128	0.921988	0.006458		5
18				0.002161		8						5					64	0.919847	0.007716		6
31				0.003728		16						20					128	0.902690	0.011642		7
30				0.002057		16						20					64	0.901836	0.009789		8
32				0.004940		16						20					256	0.899691	0.009813		9
21				0.001994		8						20					64	0.898408	0.008710		10
22				0.003761		8						20					128	0.897121	0.007529		11
23				0.003828		8						20					256	0.895833	0.009159		12
33				0.003798		16						30					64	0.885546	0.010394		13
26				0.005560		8						30					256	0.885541	0.014937		14
...

The best-performing model is model number 29, which scores 92,7 %. Its hyperparameters are as follows:

max_depth = 16
min_samples_split = 5
n_estimators 256

We will proceed with this model. So what does this model tell us?

We can gain an overview of the distributions of our customers according to their churn probability. Just use the following code:

# Predicting Probabilities
y_pred_prob = best_clf.predict_proba(x_test) 
churnproba = y_pred_prob[:,1]

# Create histograms for feature columns separated by prediction label value
sns.histplot(data=churnproba)

Customer Base According to their Churn Rate

Customers who tend to churn have a churn probability greater than 0.5. They are further to the right in the diagram. So, we don’t have to worry about the customers on the far left (<0.5).

Step #5 Best Model Performance Insights

Let’s take a more detailed look at the performance of the best model. We do this by calculating the confusion matrix.

If you want to learn more about measuring the performance of classification models, check out this tutorial.

# Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
class_names=[False, True] 
tick_marks = [0.5, 1.5]
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="Blues", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label'); plt.xlabel('Predicted label')
plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)

From 1000 customers in the test dataset, our model correctly classified 100 customers as churn candidates. For 832 customers, the model accurately predicted that these customers are unlikely to churn. In 30 cases, the model falsely classified customers as churn candidates, and 38 were missed and falsely classified as non-churn candidates. The result is a model accuracy of 93,2 % (based on a 0.5 threshold).

Step #6 Permutation Feature Importance

Now that we have trained a model that gives good results, we want to understand the importance of the model’s features. With the following code, we calculate the Feature Importance score. Then we visualize the results in a barplot.

# Load the data
r = permutation_importance(best_clf, x_test, y_test, n_repeats=30, random_state=0)

# Set the color range
clist = [(0, "purple"), (1, "blue")]
rvb = mcolors.LinearSegmentedColormap.from_list("", clist)
colors = rvb(data_im['feature_permuation_score']/len(x_test.columns))

# Plot the barchart
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
data_im['feature_names'] = x_test.columns
data_im = data_im.sort_values('feature_permuation_score', ascending=False)

fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y=data_im['feature_names'], x="feature_permuation_score", data=data_im, palette='nipy_spectral')
ax.set_title("Random Forest Feature Importances")

The feature ranking can provide the starting point for deeper analysis. As we can see, the most important features are the monthly fee, data usage, and customer service calls (CustServCalls). Of particular interest is the importance of customer service calls, as this could indicate that customers who encounter customer service have negative experiences.

Summary

This article has shown how to implement a churn prediction model using Python and scikit-learn Machine Learning. We have calculated the permutation feature importance to analyze which features contribute to the performance of our model. You have learned that permutation feature importance can provide data scientists with new insights into the context of a prediction model. Therefore, the technique is often a good starting point for forthleading investigations.

I am always interested in improving my articles and learning from my audience. If you liked this article, show your appreciation by leaving a comment. And if you didn’t, let me know too. Cheers

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

And if you are interested in text mining and customer satisfaction, consider taking a look at my recent blog about sentiment analysis:

Sentiment Analysis with Naive Bayes and Logistic Regression in Python

The post Customer Churn Prediction – Understanding Models with Feature Permutation Importance using Python appeared first on relataly.com.

Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python

Florian Follonier — Mon, 06 Jul 2020 21:16:52 +0000

Are you struggling to find the best hyperparameters for your machine learning model? With Python’s Scikit-learn library, you can use grid search to fine-tune your model and improve its performance. In this article, we’ll guide you through the process of hyperparameter tuning for a classification model, using a random decision forest that predicts the survival of Titanic passengers as an example.

We’ll start by explaining the concept of grid search and how it works. Then, we’ll dive into the development and optimization of the random decision forest using Python. By defining a parameter grid and feeding it to the grid search algorithm, we can explore all possible hyperparameter combinations and find the optimal configuration for our model.

Finally, we’ll compare the performance of different model configurations to determine the best one for our classification task. Whether you’re new to machine learning or looking to boost the performance of an existing model, this step-by-step guide to hyperparameter tuning with grid search will help you achieve better results. Let’s get started!

Also: Multivariate Anomaly Detection on Time-Series Data in Python

Exemplary parameter grid for the tuning of a random decision forest with four hyperparameters

What are Hyperparameters?

Hyperparameters play a crucial role in the performance of a machine learning model. They are adjustable parameters that influence the model training process and control how a machine learning algorithm learns and how it behaves.

Unlike the internal parameters (coefficients, etc.) that the algorithm automatically optimizes during model training, hyperparameters are model characteristics (e.g., the number of estimators for an ensemble model) that we must set in advance.

Which hyperparameters are available, depends on the algorithm. For example, a random decision forest model may have hyperparameters such as the number of trees and tree depth, while a neural network model may have hyperparameters such as the number of hidden layers and nodes in each layer. Finding the optimal configuration of hyperparameters can be a challenging task, as there is often no way to know in advance what the ideal values should be.

This requires experimentation with different hyperparameter settings, which can be time-consuming if done manually. Grid search is a useful tool for automating this process and efficiently finding the best hyperparameter configuration for a given model.

Hyperparameters are the little screws that we can adjust to tune a predictive model.

Efficient Hyperparameter Tuning with Exhaustive Grid Search

When we train a machine learning model, it is usually unclear which hyperparameters lead to good results. While there are estimates and rules of thumb, there is often no way to avoid trying out hyperparameters in experiments. However, machine learning models often have several hyperparameters that affect the model’s performance in a nonlinear way.

We can use grid search to automate searching for optimal model hyperparameters. The search grid algorithm exhaustively generates models from parameter permutations of a grid of parameter values. Let’s take a look at how this works.

Hyperparameter Tuning with Grid Search: How it Works

The idea behind the grid search technique is quite simple. We have a model with parameters, and the challenge is to test various configurations until we are satisfied with the result. Grid search is exhaustive in that it tests all permutations of a parameter grid. The number of model variants results from the parameter grid and the specified parameters.

The grid search algorithm requires us to provide the following information:

The hyperparameters that we want to configure (e.g., tree depth)
For each hyperparameter, a range of values (e.g., [50, 100, 150])
A performance metric so that the algorithm knows how to measure performance (e.g., accuracy for a classification model)

For example, imagine we have a range of [16, 32, and 64] for n_estimators and a range of [8, 16, and 32] for max_depth. Then, the search grid will test 9 different parameter configurations.

Early Stopping

Running parameter optimization against an entire grid can be time-consuming, but there are ways to shorten the process. Depending on how much time you want to invest in the search process, you can test all combinations exhaustively or shorten the process with an early stopping logic. A stopping logic defines that the search ends early when a specific criterion is met. Such a criterion could be, for example, that newly trained models underperform the average performance of previously trained models by a certain value. In this case, the search stops and returns the best models found up to that point. When you define a large search grid with many parameters, defining an early stopping logic is recommended.

Strengths and Weaknesses of Grid Search

The advantage of the grid search is that the algorithm automatically identifies the optimal parameter configuration from the parameter grid. However, the number of possible configurations increases exponentially with the number of values in the parameter grid. So, in practice, defining a sparse parameter grid or defining stopping criteria is essential.

Grid Search is only one of several techniques that can be used to tune the hyperparameters of a predictive model. Alternative techniques include Random Search. In contrast to Grid Search, Random Search is a none exhaustive hyperparameter-tuning technique, which randomly selects and tests specific configurations from a predefined search space. Further optimization techniques are Bayesian Search and Gradient Descent.

A parameter grid with two hyperparameters and respectively three hyperparameter values

Evaluation Metrics

The question of which metric to optimize against inevitably arises when we talk about optimization. Generally, all common metrics available for classification or regression come into question.

Metrics for regression (more detailed description)

Mean Absolute Error (MAE)
Root Mean Squared Absolute Error (RMSAE)
Relative Squared Error (RSE).

Metrics for classification (more detailed description)

Accuracy
Precision
F-1 Score
Recall

Tuning the Hyperparameters of a Random Decision Forest Classifier in Python using Grid Search

Now that we have familiarized ourselves with the basic concept of hyperparameter tuning, let’s move on to the Python hands-on part! In this part, we will work with the Titanic dataset. We will apply the grid search optimization technique to a classification model. We will develop our Machine Learning model based on the Titanic dataset.

The sinking of the Titanic was one of the most catastrophic ship disasters, leading to more than 1500 casualties (The exact number is unknown due to several passengers being unregistered). The Titanic dataset contains a list of passengers with passenger information such as age, gender, cabin, ticket cost, etc., and whether they survived the Titanic sinking. The information about the passengers shows certain patterns that allow conclusions about the likelihood of the passengers surviving the accident. These data can be used to train a predictive model.

In the following, we will use the survival flag as a label and passenger information as input for a classification model. The goal is to predict whether a passenger will survive the Titanic sinking or not. The algorithm will be a random decision forest algorithm that classifies the passengers into two groups, survivors and non-survivors. Once we have trained a baseline model, we will apply grid search to optimize the hyperparameters of this model and select the best model.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Source: The National Archives/Heritage-Images/Imagestate

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have a Python environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using the Python Machine Learning library Scikit-learn to implement the random forest and the grid search technique.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

About the Titanic Dataset

In this article, we will be working with the popular titanic dataset for classification. The Titanic dataset is a well-known dataset that contains information about the passengers on the Titanic, a British passenger liner that sank in the North Atlantic Ocean in 1912 after colliding with an iceberg. The dataset includes variables such as the passenger’s name, age, fare, and class, as well as whether or not the passenger survived.

The titanic dataset contains the following information on passengers of the titanic:

Survival: Survival 0 = No, 1 = Yes (Prediction Label)
Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
Sex: Sex
Age: Age in years
SibSp: # of siblings/spouses aboard the Titanic
Parch: # of parents/children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

The Survival column contains the prediction label, which states whether a passenger survived the sinking of the Titanic or not.

You can download the titanic dataset from the Kaggle website. Once you have completed the download, you can place the dataset in the file path of your choice. Using the Kaggle Python environment, you can directly save the dataset into your Kaggle project.

We can assume that the cabin location of the passengers had an impact on their chance of surviving the sinking.

Step #1 Load the Titanic Data

The following code will load the titanic data into our python project. If you have placed the data outside the path shown below, don’t forget to adjust the file path in the code.

import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

# set file path
filepath = "data/titanic-grid-search/"

# Load train and test datasets
titanic_train_df = pd.read_csv(filepath + 'titanic-train.csv')
titanic_test_df = pd.read_csv(filepath + 'titanic-test.csv')
titanic_train_df.head()

	PassengerId	Survived	Pclass	Name							Sex	Age	SibSp	Parch	Ticket				Fare	Cabin	Embarked
0	1			0			3		Braund, Mr. Owen Harris			male	22.0	1	0	A/5 21171			7.2500	NaN		S
1	2			1			1		Cumings, Mrs. John Bradley ...	female	38.0	1	0	PC 17599			71.2833	C85		C
2	3			1			3		Heikkinen, Miss. Laina			female	26.0	0	0	STON/O2. 3101282	7.9250	NaN		S
3	4			1			1		Futrelle, Mrs. Jacques ...		female	35.0	1	0	113803				53.1000	C123	S
4	5			0			3		Allen, Mr. William Henry		male	35.0	0	0	373450				8.0500	NaN		S

Step #2 Preprocessing and Exploring the Data

Before we can train a model, we preprocess the data:

Firstly, we clean the missing values in the data and replace them with the mean.
Second, we transform categorical features (Embarked and Sex) into numeric values. In addition, we will delete some columns to reduce model complexity.
Finally, we delete the prediction label from the training dataset and place it into a separate dataset named y_df.

# Define a function for preprocessing the train and test data 
def preprocess(df):
    
    # Delete some columns that we will not use
    new_df = df[df.columns[~df.columns.isin(['Cabin', 'PassengerId', 'Name', 'Ticket'])]].copy()
    
    # Replace missing values
    for i in new_df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns:
        new_df[i].fillna(new_df[i].mean(), inplace=True)
    new_df['Embarked'].fillna('C', inplace=True)
    
    # Decode categorical values as integer values
    new_df_b = new_df.copy()
    new_df_b['Sex'] = np.where(new_df_b['Sex']=='male', 0, 1) 
    
    cleanups = {"Sex":     {"m": 0, "f": 1},
                "Embarked": {"S": 1, "Q": 2, "C": 3}}
    new_df_b.replace(cleanups, inplace=True)
    x = new_df_b.drop(columns=['Survived'])
    y = new_df_b['Survived']  
    
    return x, y

# Create the training dataset train_df and the label dataset
x_df, y_df = preprocess(train_df)
x_df.head()

		Pclass	Sex	Age		SibSp	Parch	Fare	Embarked
0		3		0	22.0	1		0		7.2500	1
1		1		1	38.0	1		0		71.2833	3
2		3		1	26.0	0		0		7.9250	1
3		1		1	35.0	1		0		53.1000	1
4		3		0	35.0	0		0		8.0500	1

Let’s take a quick look at the data by creating paired plots for the columns of our data set. Pair plots help us to understand the relationships between pairs of variables in a dataset.

# # Create histograms for feature columns separated by prediction label value
df_plot = titanic_train_df.copy()

# class_columnname = 'Churn'
sns.pairplot(df_plot, hue="Survived", height=2.5, palette='muted')

The histograms tell us various things. For example, most passengers were between 25 and 35 years old. In addition, we can see that most passengers had low-fare tickets, while some passengers had significantly more expensive tickets.

Step #3 Splitting the Data

Next, we will split the data set into training data (x_train, y_train) and test data (x_test, y_test) using a split ratio of 70/30.

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)

Step #4 Building a Single Random Forest Model

After completing the preprocessing, we can train the first model. The model uses a random forest algorithm. The random forest algorithm has a large number of hyperparameters.

4.1 About the Random Forest Algorithm

A random forest is a robust predictive algorithm that can handle classification and regression tasks. As a so-called ensemble model, the random forest considers predictions from a group of several independent estimators.

Random decision forests have several hyperparameters that we can use to influence their behavior. However, not all of these hyperparameters have the same influence on model performance. Limiting the number of models by defining a sparse parameter grid is essential to reduce the amount of time needed to test the hyperparameters.

Therefore, we restrict the hyperparameters optimized by the grid search approach to the following two:

n_estimators determine the number of decision trees in the forest
max_depth defines the maximum number of branches in each decision tree

In the scikit-learn documentation, you also find a full list of available hyperparameters. For the rest of these hyperparameters, we will use the default value defined by scikit-learn.

4.2 Implementing a Random Forest Model

We train a simple baseline model and make a test prediction with the x_test dataset. Then we visualize the performance of the baseline model in a confusion matrix:

# Train a single random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators = 100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

Confusion matrix of the best-guess random forest model

Our best-guess model accurately predicted that 151 passengers would not survive. The dark-blue number in the top-left is the group of titanic passengers that did not survive the sinking, and our model classified them correctly as non-survivors. The green area below shows the passengers who survived the sinking and were correctly classified. The other sections show the number of times our model was wrong.

In total, these results correspond to a model accuracy of 80%. Considering that this was a best-guess model, these results are pretty good. However, we can further optimize these results by using the grid search approach for hyperparameter tuning.

Step #5 Hyperparameter Tuning a Classification Model using the Grid Search Technique

By comparing the performance of different model configurations, we can find the best set of hyperparameters that yields the highest accuracy. This approach is a powerful tool for fine-tuning machine learning models and improving their performance. So let’s get started and see if we can beat the results of our best-guess model using the grid search technique!

Training and Tuning the Model

Next, we will use the grid search technique to optimize a random decision forest model that predicts the survival of Titanic passengers. We’ll define a grid of hyperparameter values in Python and then use the Scikit-learn library to train and test the model with different hyperparameter configurations. First, we will define a parameter range:

max_depth = [2, 8, 16]
n_estimators = [64, 128, 256]

We leave the other parameters at their default value. In addition, we need to define against which metric we want the grid search algorithm to evaluate the model performance. Since we have no personal preference and our dataset is well-balanced, we choose the mean test score as the evaluation metric. Then we run the grid search algorithm.

# Define Parameters
max_depth=[2, 8, 16]
n_estimators = [64, 128, 256]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# Build the grid search
dfrst = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv = 5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df

Best: [0.79611613 0.78005161 0.79290323 0.81387097 0.82187097 0.81867097 
 0.78818065 0.78816774 0.78498065], using {'max_depth': 8, 'n_estimators': 128}
 
 	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_max_depth	param_n_estimators	params	split0_test_score				split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.057045		0.001108		0.005001		0.000001		2				64					{'max_depth': 2, 'n_estimators': 64}	0.824				0.800				0.784				0.774194			0.798387		0.796116	0.016883	4
1	0.112051		0.002088		0.009490		0.000775		2				128					{'max_depth': 2, 'n_estimators': 128}	0.760				0.824				0.784				0.750000			0.782258		0.780052	0.025523	9
2	0.221600		0.003740		0.016487		0.000448		2				256					{'max_depth': 2, 'n_estimators': 256}	0.792				0.824				0.784				0.774194			0.790323		0.792903	0.016756	5
3	0.061998		0.001410		0.005801		0.000400		8				64					{'max_depth': 8, 'n_estimators': 64}	0.784				0.824				0.792				0.806452			0.862903		0.813871	0.028044	3
4	0.122886		0.002652		0.009587		0.000480		8				128					{'max_depth': 8, 'n_estimators': 128}	0.784				0.848				0.808				0.806452			0.862903		0.821871	0.029089	1
5	0.250295		0.007654		0.018557		0.000836		8				256					{'max_depth': 8, 'n_estimators': 256}	0.800				0.824				0.800				0.806452			0.862903		0.818671	0.023797	2
6	0.065602		0.000505		0.005800		0.000399		16				64					{'max_depth': 16, 'n_estimators': 64}	0.736				0.808				0.784				0.766129			0.846774		0.788181	0.037557	6
7	0.127662		0.003297		0.008600		0.004080		16				128					{'max_depth': 16, 'n_estimators': 128}	0.752				0.800				0.784				0.758065			0.846774		0.788168	0.034078	7
8	0.259617		0.003121		0.018873		0.000537		16				256					{'max_depth': 16, 'n_estimators': 256}	0.752				0.784				0.776				0.766129			0.846774		0.784981	0.032690	8

The list above is an overview of the tested model configurations, ranked by their prediction scores. Model number five achieved the best results. The parameters of this model are a maximum depth of 8 and several estimators of 256.

Model Evaluation

We select the best model and use it to predict the test data set. We visualize the results in another confusion matrix.

# Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

Confusion matrix of the best grid search model

The confusion matrix shows the best model results from the grid search technique. The result is an overall model accuracy of 83,5 %, which shows that the best grid search model outperforms our initial best guess model. This optimal model has correctly classified that 148 passengers would not survive and 76 passengers would survive. In 44 cases, the model was wrong.

Summary

In the hands-on part of this article, we developed a random decision forest that predicts the survival of Titanic passengers using Python and scikit-learn. The grid search technique applies not only to classification models but can also be used to optimize the performance of regression models. First, we developed a baseline model with best-guess parameters. Subsequently, we defined a parameter grid and used the grid search technique to tune the hyperparameters of the random decision forest. In this way, we quickly identified a configuration that outperforms our initial baseline model. In this way, we have demonstrated how Gid Search can help optimize the classification model parameters.

I hope this article was helpful. I am always interested to learn and improve. So, if you have any questions or suggestions, please write them in the comments.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python appeared first on relataly.com.