Hyperparameter Tuning Classification Models using Grid Search in Python

This article shows how to use the grid search technique in Python to optimize the hyperparameters of a machine learning model. Hyperparameters control how a machine learning algorithm learns and how it behaves. Unlike the internal parameters (coefficients, etc.) that the algorithm optimizes automatically during model training, hyperparameters are model characteristics (e.g., the number of estimators in an ensemble model) that we must set in advance. A common approach to tuning them is to conduct several experiments with different parameter configurations. This manual tuning, however, is often time-consuming. Grid search automates the process: the method uses a grid of predefined parameters to test all possible combinations and returns the variant that leads to the best results. In the following, we will look at how this works. For this purpose, we will develop and optimize a random decision forest in Python that predicts Titanic survivors.

The rest of this article is structured as follows: We begin by introducing the basic concept behind grid search. Then we will develop and optimize a random decision forest that predicts the survival of Titanic passengers. Finally, we will define a parameter grid in Python and feed it to the grid search algorithm. The algorithm then tests all possible permutations and finds an optimal configuration.

Operated by the White Star Line, RMS Titanic was the largest and most luxurious ocean liner of her time.
Source: The National Archives/Heritage-Images/Imagestate

When we train a machine learning model, it is usually unclear which hyperparameters lead to good results. While there are estimates and rules of thumb, there is often no way to avoid trying out hyperparameters in experiments. However, machine learning models often have several hyperparameters that affect the model’s performance in a nonlinear way.

We can use grid search to automate the process of searching for optimal model hyperparameters. The grid search algorithm exhaustively generates models from parameter permutations of a grid of parameter values. Let’s take a look at how this works.

Hyperparameter Tuning with Grid Search: How it Works

The concept behind the grid search technique is quite simple. As mentioned earlier, grid search is exhaustive in that it tests all permutations of a parameter grid. The number of model variants results from the number of hyperparameters in the grid and the values specified for each of them.

The grid search algorithm requires us to provide the following information:

  • The hyperparameters that we want to configure (e.g., tree depth)
  • For each hyperparameter a range of values (e.g., [50, 100, 150])
  • A performance metric so that the algorithm knows how to measure performance (e.g., accuracy for a classification model)

A sample parameter grid is shown below:

parameter grid for hyperparameter tuning
Exemplary parameter grid for the tuning of a random decision forest with four hyperparameters

For example, let’s say we specify a range of [16, 32, 64] for n_estimators and a range of [8, 16, 32] for max_depth. Then grid search will test a total of 3 × 3 = 9 different parameter configurations.

A search grid with two parameters and three parameter values
A parameter grid with two hyperparameters
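Scikit-learn makes this enumeration explicit through its ParameterGrid helper. A minimal sketch, assuming scikit-learn is installed, that reproduces the 3 × 3 = 9 configurations from the example above:

```python
from sklearn.model_selection import ParameterGrid

# Two hyperparameters with three values each -> 3 x 3 = 9 configurations
param_grid = {
    "n_estimators": [16, 32, 64],
    "max_depth": [8, 16, 32],
}

configurations = list(ParameterGrid(param_grid))
print(len(configurations))  # 9
for config in configurations[:3]:
    print(config)
```

GridSearchCV iterates over exactly this set of configurations internally, so counting them first is a quick sanity check before launching a long search.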

The advantage of grid search is that the algorithm automatically identifies the optimal parameter configuration from the parameter grid. However, the number of possible configurations grows multiplicatively with the values per parameter and exponentially with the number of parameters. So in practice, it is essential to define a sparse parameter grid or run the algorithm several times with different parameter ranges.

An alternative to exhaustive hyperparameter-tuning is random search, which randomly tests a predefined number of configurations.
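Scikit-learn implements this alternative as RandomizedSearchCV. The following sketch is illustrative only (the parameter ranges are assumptions, not the grid used later in this article) and shows how n_iter caps the number of sampled configurations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative ranges: 5 x 5 = 25 possible configurations in total
param_distributions = {
    "n_estimators": [16, 32, 64, 128, 256],
    "max_depth": [2, 4, 8, 16, 32],
}

# Sample only 5 of the 25 configurations instead of testing all of them
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    random_state=0,
)
# Calling random_search.fit(x_train, y_train) would then evaluate
# only the 5 sampled configurations
```

Random search trades exhaustiveness for speed, which is often a good deal when the grid is large and only a few hyperparameters really matter.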

Let’s move on to the practical part in Python! In the following, we will focus on the Titanic dataset. The dataset contains information about Titanic passengers such as age, gender, cabin, ticket cost, etc. This data provides a fantastic basis for predicting whether a passenger survived the Titanic sinking or not. We will use the dataset to train a random decision forest algorithm to classify the passengers into two groups, survivors and non-survivors. Then we will use grid search to optimize the hyperparameters of this model and select the best model.

Prerequisites

Before we start the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the standard packages NumPy, pandas, Matplotlib, and Seaborn.

In addition, we will be using the Python machine learning library Scikit-learn to implement the random forest and the grid search technique.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)

About the Data

We begin by loading the Titanic dataset from the Kaggle website – one of the best-known datasets for demonstrating classification. Once you have completed the download, you can place the dataset in a file path of your choice. If you use the Kaggle Python environment, you can also save the dataset directly into your Kaggle project.

The Titanic dataset contains the following information on the passengers:

  • Survival: Survival 0 = No, 1 = Yes (Prediction Label)
  • Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
  • Sex: Sex
  • Age: Age in years
  • SibSp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

The column “Survival” contains the prediction label, which says whether a passenger survived the sinking of the Titanic or not.

Step #1 Load the Data

The following code will load the Titanic data into our Python project. If you have placed the data outside the path shown below, don’t forget to adjust the file path in the code.

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier

# set file path
filepath = "data/titanic-grid-search/"

# Load train and test datasets
titanic_train_df = pd.read_csv(filepath + 'titanic-train.csv')
titanic_test_df = pd.read_csv(filepath + 'titanic-test.csv')
titanic_train_df.head()
Head of the full titanic dataset

Step #2 Preprocessing and Exploring the Data

Before we can train a model, we preprocess the data:

  • First, we clean the missing values in the data and replace them with the column mean.
  • Second, we transform the categorical features (Embarked and Sex) into numeric values. In addition, we delete some columns to reduce model complexity.
  • Finally, we remove the prediction label from the training dataset and place it in a separate dataset named y_df.

# Define a preprocessing function
def preprocess(dfx):
    df = dfx.copy()
    
    # Deleting some columns from the data
    new_df = df[df.columns[~df.columns.isin(['Survived', 'Cabin', 'PassengerId', 'Name', 'Ticket'])]]
    
    # Replace missing values (column means for numeric columns)
    new_df = new_df.fillna(df.mean(numeric_only=True))
    new_df['Embarked'] = new_df['Embarked'].fillna('C')
    
    # Convert categorical values to integer
    new_df_b = new_df.copy()
    new_df_b['Sex'] = np.where(new_df_b['Sex']=='male', 0, 1) 
    new_df_b['Embarked'].mask(new_df_b['Embarked']=='S', '1', inplace=True)
    new_df_b['Embarked'].mask(new_df_b['Embarked']=='Q', '2', inplace=True)
    new_df_b['Embarked'].mask(new_df_b['Embarked']=='C', '3', inplace=True)
    
    return new_df_b

# Create train_df & test_df
x_df = preprocess(titanic_train_df).copy()
y_df = titanic_train_df['Survived'].copy()
x_df.head()
Our selection of features from the titanic dataset

Let’s take a quick look at the data by creating histograms for the columns of our data set.

# Create histograms for feature columns
x_df.hist(bins=30, figsize=(12, 8), color='blue', alpha=0.4)
plt.suptitle("Histogram for each numeric input variable", fontsize=10)
plt.show()
plot of the selected features of the titanic dataset created in Python

The histograms tell us various things. For example, we see that most passengers were between 25 and 35 years old. In addition, we can see that most passengers had low-fare tickets, while some passengers had significantly more expensive tickets. This seems logical since most passengers had 3rd-class tickets.

Step #3 Splitting the Data

Next, we will split the data set into training data (x_train, y_train) and test data (x_test, y_test) using a split ratio of 70/30.

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)

Step #4 Building a Single Random Forest Model

Now that we have completed the pre-processing, we can train a first model. The model uses a random forest algorithm.

4.1 About the Random Forest Algorithm

A random forest is a powerful algorithm that can handle both classification and regression tasks. As a so-called ensemble model, the random forest considers predictions from a group of several independent estimators.

Random decision forests have several hyperparameters, which we can use to influence their behavior. As mentioned before, it is essential to limit the number of models by defining a sparse parameter grid. Therefore, we restrict the hyperparameters optimized by the grid search approach to the following two:

  • n_estimators determines the number of decision trees in the forest
  • max_depth defines the maximum number of branches in each decision tree

For the rest of the parameters, we will use the default value as defined by scikit-learn.
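If you want to see which defaults those are, every scikit-learn estimator exposes them via get_params(). A quick sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# Inspect the scikit-learn defaults that we leave untouched
defaults = RandomForestClassifier().get_params()
print(defaults["criterion"])          # 'gini'
print(defaults["min_samples_split"])  # 2
```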

4.2 Implementing a Random Forest Model

We train a simple baseline model and make a test prediction with the x_test dataset. Then we visualize the performance of the baseline model in a confusion matrix:

# Train a single random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators = 100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names = [False, True]  # names of the classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
Confusion matrix of the best-guess random forest model Python
Confusion matrix of the best-guess random forest model

The dark-blue number in the top left is the group of Titanic passengers that did not survive the sinking and that our model correctly classified as non-survivors. Our best-guess model accurately predicted that 151 passengers would not survive. The field at the bottom right shows the passengers that survived the sinking and were correctly classified. The other two fields show the number of times our model was wrong.

In total, these results correspond to a model accuracy of 80%. Considering that this was a best-guess model, these results are pretty good. However, we can further optimize these results by using the grid search approach for hyperparameter tuning.
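For reference, accuracy is simply the share of correct predictions – the diagonal of the confusion matrix divided by the total number of samples. A tiny illustration with hypothetical labels (not the Titanic predictions from above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1])

# 4 of 5 predictions are correct -> accuracy 0.8
print(accuracy_score(y_true, y_pred))  # 0.8
```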

Step #5 Hyperparameter Tuning using the Grid Search Technique

Let’s find out if we can beat the results of our best-guess model using grid search. First, we will define a parameter range:

  • max_depth=[2, 8, 16]
  • n_estimators = [64, 128, 256]

We leave the other parameters at their default values. In addition, we need to define against which metric the grid search algorithm should evaluate model performance. Since we have no particular preference and our dataset is reasonably balanced, we keep accuracy, scikit-learn’s default scoring for classifiers, and compare the models by their mean cross-validation test score. Then we run the grid search algorithm.

# Define Parameters
max_depth=[2, 8, 16]
n_estimators = [64, 128, 256]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# Build the grid search
dfrst = RandomForestClassifier(random_state=0)
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv=5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df
results of the random decision forests generated through the grid search approach in Python

The table above is an overview of the tested model configurations, ranked by their prediction scores. Model number five achieved the best results. The parameters of this model are a maximum depth of 8 and 256 estimators.

We select the best model and use it to predict the test data set. We visualize the results in another confusion matrix.

# Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names = [False, True]  # names of the classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
confusion matrix on the best model returned by the grid search approach in python
Confusion matrix of the best grid search model

The confusion matrix shows the results of the best model as returned by the grid search technique. This optimal model correctly classified that 148 passengers would not survive and that 76 passengers would survive. In 44 cases, the model was wrong. The result is an overall model accuracy of 83.5%, which shows that the best grid search model outperforms our initial best-guess model.

Summary

This article has shown how we can use grid search in Python to efficiently search for the optimal hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how to use grid search to try out all permutations of a predefined parameter grid. In the practical part of this article, we focused on developing a classification model for the Titanic dataset using Python and scikit-learn. We created a random forest that predicts the survival of Titanic passengers. The first model was a best-guess model. Subsequently, we defined a parameter grid and used the grid search technique to tune the hyperparameters of the random decision forest. In this way, we were quickly able to identify a configuration that outperforms our initial baseline model.

I hope this article was helpful. If you have any questions or suggestions, let me know in the comments.

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.
