Hyperparameter Tuning a Random Forest Classifier using Grid Search in Python

Hyperparameters control how a machine learning algorithm learns and how it behaves. Unlike the internal parameters (coefficients, etc.) that the algorithm automatically optimizes during model training, hyperparameters are model characteristics (e.g., the number of estimators for an ensemble model) that we must set in advance. Finding the optimal hyperparameter configuration is challenging. Often there is no way of knowing the ideal hyperparameters in advance. It is, therefore, that finding an excellent model requires conducting several experiments with different parameters. If done manually, this can be time-consuming. This article presents the grid search technique, a thorough approach to identifying the optimal hyperparameters of a machine learning model.

Grid search uses a grid of predefined hyperparameters (the search space) to test all possible permutations and return the model variant that leads to the best results. In the following, we will look at how this works in the example of hyperparameter tuning a classification model. For this purpose, we will develop and optimize a random decision forest in Python that classifies Titanic passengers as survivors and non-survivors.

The rest of this article proceeds as follows: We begin by introducing the basic concept behind grid search. Then we will develop and optimize a random decision forest that predicts the survival of Titanic passengers. Finally, we will define a parameter grid in Python and feed it to the grid search algorithm. The algorithm then tests all possible permutations and finds an optimal configuration. Finally, we compare the performance of different model configurations.

Operated by the White Star Line, RMS Titanic was the largest and most luxurious ocean liner of her time.
Source: The National Archives/Heritage-Images/Imagestate

When we train a machine learning model, it is usually unclear which hyperparameters lead to good results. While there are estimates and rules of thumb, there is often no way to avoid trying out hyperparameters in experiments. However, machine learning models often have several hyperparameters that affect the model’s performance in a nonlinear way.

We can use grid search to automate searching for optimal model hyperparameters. The grid search algorithm exhaustively generates models from parameter permutations of a grid of parameter values. Let’s take a look at how this works.

Hyperparameter Tuning with Grid Search: How it Works

The idea behind the grid search technique is quite simple. We have a model with parameters, and the challenge is to test various configurations until we are satisfied with the result. Grid search is exhaustive in that it tests all permutations of a parameter grid. The number of model variants results from the parameter grid and the specified parameters.

The grid search algorithm requires us to provide the following information:

  • The hyperparameters that we want to configure (e.g., tree depth)
  • For each hyperparameter a range of values (e.g., [50, 100, 150])
  • A performance metric so that the algorithm knows how to measure performance (e.g., accuracy for a classification model)

For example, imagine we have a range of [16, 32, and 64] for n_estimators and a range of [8, 16, and 32] for max_depth. Then, the search grid will test 9 different parameter configurations.

Grid Search - A search grid with two parameters and three parameter values
A parameter grid with two hyperparameters

The illustration below shows a sample parameter grid:

Grid Search - parameter grid for hyperparameter tuning
Exemplary parameter grid for the tuning of a random decision forest with four hyperparameters

The advantage of the grid search is that the algorithm automatically identifies the optimal parameter configuration from the parameter grid. However, the number of possible configurations increases exponentially with the number of values in the parameter grid. So in practice, it is essential to define a sparse parameter grid or run the algorithm several times with different parameter ranges.

An alternative to exhaustive hyperparameter-tuning is random search, which randomly selects and tests specific configurations from a parameter grid.

Let’s move on to the practical part in Python! We will apply the grid search optimization technique to a classification model. We will develop our machine learning model based on the Titanic dataset. The dataset contains a list of passengers with passenger information such as age, gender, cabin, ticket cost, etc., and whether they survived the Titanic sinking.

In the following, we will use the survival flag as a label and passenger information as input for a classification model. The goal is to predict whether a passenger will survive the Titanic sinking or not. The algorithm will be a random decision forest algorithm that classifies the passengers into two groups, survivors and non-survivors. Once we have trained a baseline model, we will apply grid search to optimize the hyperparameters of this model and select the best model.

The code is available on the GitHub repository.

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: 

In addition, we will be using the Python machine learning library Scikit-learn to implement the random forest and the grid search technique.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the anaconda packet manager)

About the Data

We begin by loading the titanic dataset from the Kaggle website – a popular dataset to demonstrate classification. Once you have completed the download, you can place the dataset in the file path of your choice. Using the Kaggle Python environment, you can also directly save the dataset into your Kaggle project.

The titanic dataset contains the following information on passengers of the titanic:

  • Survival: Survival 0 = No, 1 = Yes (Prediction Label)
  • Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
  • Sex: Sex
  • Age: Age in years
  • SibSp: # of siblings/spouses aboard the Titanic
  • Parch: # of parents/children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
We can assume that the cabin location of the passengers had an impact on their chance to survive the sinking. Developing a machine learning model for prediction of titanic passenger survival and optimizing its hyperparameters using grid search
We can assume that the cabin location of the passengers had an impact on their chance of surviving the sinking.

The Survival column contains the prediction label, which states whether a passenger survived the sinking of the Titanic or not.

Step #1 Load the Data

The following code will load the titanic data into our python project. If you have placed the data outside the path shown below, don’t forget to adjust the file path in the code.

import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

# set file path
filepath = "data/titanic-grid-search/"

# Load train and test datasets
titanic_train_df = pd.read_csv(filepath + 'titanic-train.csv')
titanic_test_df = pd.read_csv(filepath + 'titanic-test.csv')
titanic_train_df.head()
titanic dataset used in this grid search tutorial
Head of the full titanic dataset

Step #2 Preprocessing and Exploring the Data

Before we can train a model, we preprocess the data:

  • Firstly, we clean the missing values in the data and replace them with the mean.
  • Second, we transform categorical features (Embarked and Sex) into numeric values. In addition, we will delete some columns to reduce model complexity.
  • Finally, we delete the prediction label from the training dataset and place it into a separate dataset named y_df.
# Define a function for preprocessing the train and test data 
def preprocess(df):
    
    # Delete some columns that we will not use
    new_df = df[df.columns[~df.columns.isin(['Cabin', 'PassengerId', 'Name', 'Ticket'])]].copy()
    
    # Replace missing values
    for i in new_df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns:
        new_df[i].fillna(new_df[i].mean(), inplace=True)
    new_df['Embarked'].fillna('C', inplace=True)
    
    # Decode categorical values as integer values
    new_df_b = new_df.copy()
    new_df_b['Sex'] = np.where(new_df_b['Sex']=='male', 0, 1) 
    
    cleanups = {"Sex":     {"m": 0, "f": 1},
                "Embarked": {"S": 1, "Q": 2, "C": 3}}
    new_df_b.replace(cleanups, inplace=True)
    x = new_df_b.drop(columns=['Survived'])
    y = new_df_b['Survived']  
    
    return x, y

# Create the training dataset train_df and the label dataset
x_df, y_df = preprocess(train_df)
x_df.head()
Our selection of features from the titanic dataset

Let’s take a quick look at the data by creating paired plots for the columns of our data set.

# # Create histograms for feature columns separated by prediction label value
df_plot = titanic_train_df.copy()

# class_columnname = 'Churn'
sns.pairplot(df_plot, hue="Survived", height=2.5, palette='muted')
paired plot created with seaborn

The histograms tell us various things. For example, most passengers were between 25 and 35 years old. In addition, we can see that most passengers had low-fare tickets, while some passengers had significantly more expensive tickets.

Step #3 Splitting the Data

Next, we will split the data set into training data (x_train, y_train) and test data (x_test, y_test) using a split ratio of 70/30.

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)

Step #4 Building a Single Random Forest Model

Now that we have completed the preprocessing, we can train a first model. The model uses a random forest algorithm.

4.1 About the Random Forest Algorithm

A random forest is a robust machine learning algorithm that can handle classification and regression tasks. As a so-called ensemble model, the random forest considers predictions from a group of several independent estimators.

Random decision forests have several hyperparameters that we can use to influence their behavior. It is essential to limit the number of models by defining a sparse parameter grid. Therefore, we restrict the hyperparameters optimized by the grid search approach to the following two:

  • n_estimators determine the number of decision trees in the forest
  • max_depth defines the maximum number of branches in each decision tree

For the rest of the parameters, we will use the default value as defined by scikit-learn.

4.2 Implementing a Random Forest Model

We train a simple baseline model and make a test prediction with the x_test dataset. Then we visualize the performance of the baseline model in a confusion matrix:

# Train a single random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators = 100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
Confusion matrix of the best-guess random forest model

Our best-guess model accurately predicted that 151 passengers would not survive. The dark-blue number in the top-left is the group of titanic passengers that did not survive the sinking, and our model classified them correctly as non-survivors. The green area below shows the passengers who survived the sinking and were correctly classified. The other sections show the number of times our model was wrong.

In total, these results correspond to a model accuracy of 80%. Considering that this was a best-guess model, these results are pretty good. However, we can further optimize these results by using the grid search approach for hyperparameter tuning.

Step #5 Hyperparameter Tuning a Classification Model using the Grid Search Technique

Let’s find out if we can beat the results of our best-guess model using the grid search technique. First, we will define a parameter range:

  • max_depth=[2, 8, 16]
  • n_estimators = [64, 128, 256]

We leave the other parameters at their default value. In addition, we need to define against which metric we want the grid search algorithm to evaluate the model performance. Since we have no personal preference and our dataset is well-balanced, we choose the mean test score as the evaluation metric. Then we run the grid search algorithm.

# Define Parameters
max_depth=[2, 8, 16]
n_estimators = [64, 128, 256]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# Build the grid search
dfrst = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv = 5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df
results of the random decision forests generated through the grid search approach in Python

The list above is an overview of the tested model configurations, ranked by their prediction scores. Model number five achieved the best results. The parameters of this model are a maximum depth of 8 and several estimators of 256.

We select the best model and use it to predict the test data set. We visualize the results in another confusion matrix.

# Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
confusion matrix on the best model returned by the grid search approach in python
Confusion matrix of the best grid search model

The confusion matrix shows the best model results from the grid search technique. The result is an overall model accuracy of 83,5 %, which shows that the best grid search model outperforms our initial best guess model. This optimal model has correctly classified that 148 passengers would not survive and 76 passengers would survive. In 44 cases, the model was wrong.

Summary

This article has shown how we can use grid Search in Python to efficiently search for the optimal hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how to use grid Search to try out all permutations of a predefined parameter grid.

In the practical part of this article, we developed a random decision forest that predicts the survival of Titanic passengers using Python and scikit-learn. The grid search technique does not only apply to classification models but can also be used to optimize the performance of regression models. First, we developed a baseline model with best-guess parameters. Subsequently, we defined a parameter grid and used the grid search technique to tune the hyperparameters of the random decision forest. In this way, we quickly identified a configuration that outperforms our initial baseline model. In this way, we have demonstrated how Gid Search can help optimize the classification model parameters.

I hope this article was helpful. If you have any questions or suggestions, please write them in the comments.

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.

Leave a Comment