Hyperparameter Tuning with Grid Search: A Random Forest Classifier Example in Python

The functionality of machine learning models can be controlled with their hyperparameters. The choice of these parameters often has a significant impact on model performance and, in practice, can make the difference between sufficient and outstanding performance. Data scientists therefore spend a large part of their time adjusting the various parameters of a machine learning model with the aim of finding the optimal set of parameters. This process is called hyperparameter tuning (also referred to as model tuning).

Manual hyperparameter tuning can be time-consuming, since each model configuration needs to be defined, trained, and evaluated. Especially for models where every percentage point counts, it can easily take several hundred experiments to find an optimal configuration. To reduce the manual effort, it is generally a good idea to automate hyperparameter tuning, and in this blog post I will show how this works. The post demonstrates the technique by testing several configurations of random forest models that predict the survival of Titanic passengers. We will use grid search to automatically and exhaustively test a set of parameter values and identify the model that delivers the best performance.

Operated by the White Star Line, RMS Titanic was the largest and most luxurious ocean liner of her time.
The National Archives/Heritage-Images/Imagestate

Automated Hyperparameter Tuning using Grid Search

A common way of automatically searching for an optimal parameter configuration is grid search. Grid search is an exhaustive search technique in which all possible combinations of the values in a parameter grid are tried out step by step. We can apply grid search to machine learning models and automatically test different model configurations. As a result, we get a ranking of the models based on their performance. For this to work, we need to provide the grid search with the following information:

  • The hyperparameters that we want to test
  • For each hyperparameter a range of values
  • A performance metric so that the algorithm knows how to measure performance
Exemplary search grid

The number of models created by the search grid can be calculated by multiplying the number of defined values in each parameter range. Later in this tutorial we will define the search grid to tune the hyperparameters of a random decision forest:

  • n_estimators, which determines the number of decision trees,
  • and max_depth, which defines the maximum depth (number of levels) of each decision tree

As an illustration, suppose we specify the values [16, 32, 64] for n_estimators and [8, 16, 32] for max_depth, with accuracy as the performance metric. The search grid will then train and test a total of nine (3 x 3 = 9) different random forest models. Once we start the grid search, we can sit back while it does all the work and determines the best configuration. (The code in step 5 uses a slightly different grid.)

A search grid with two parameters and three parameter values
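
The count of nine can be checked in code. The following minimal sketch uses scikit-learn's ParameterGrid helper, which is not part of the tutorial code itself, to enumerate the combinations of the illustrative ranges above.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative ranges from the text above
param_grid = {
    "n_estimators": [16, 32, 64],
    "max_depth": [8, 16, 32],
}

# ParameterGrid enumerates every combination of the listed values
combinations = list(ParameterGrid(param_grid))
n_models = len(combinations)
print(n_models)  # 3 x 3 = 9

# Each entry is a dict that could be passed to a model as keyword arguments
for params in combinations:
    print(params)
```

Adding a third hyperparameter with, say, four values would multiply the count again (3 x 3 x 4 = 36), which is why grids are usually kept small.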

Python implementation

In the following, this tutorial shows how to optimize a random forest with the help of a search grid using the Titanic dataset as an example.

The example covers the following steps:

  1. Loading the Titanic Data
  2. Preprocessing and Exploring the Data
  3. Splitting the Data
  4. Training a single Random Forest Model
  5. Model Tuning using Grid Search

Python Environment

This tutorial assumes that you have set up your Python environment. I recommend using Anaconda. If you have not yet set it up, you can follow this tutorial. It is also assumed that you have the following packages installed: numpy, pandas, matplotlib, seaborn, and scikit-learn (imported as sklearn). The packages can be installed using the console command:

pip install <package name> 
conda install <package name> (if you are using the anaconda environment)

1) Loading the Titanic Data

We begin by loading the Titanic dataset from the Kaggle website, one of the best-known datasets for demonstrating classification. After you have completed the download, put the dataset under the file path of your choice, and don’t forget to adjust the file path in the code. If you are working with the Kaggle Python environment, you can also save the dataset directly into your Kaggle project.

The Titanic dataset contains the following information on the passengers of the Titanic:

  • Survived: Survival 0 = No, 1 = Yes (Prediction Label)
  • Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
  • Sex: Sex
  • Age: Age in years
  • SibSp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

The column “Survived” contains the prediction label, which says whether a passenger survived the sinking or not. We can use this data to train a classifier that predicts which passengers survive the sinking and which do not. The following code will load the data into our Python project.

import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

# set file path
filepath = "data/titanic-grid-search/"

# Load train and test datasets
titanic_train_df = pd.read_csv(filepath + 'titanic-train.csv')
titanic_test_df = pd.read_csv(filepath + 'titanic-test.csv')
titanic_train_df.head()
Head of the full titanic dataset

2) Preprocessing and Exploring the Data

Before we can train a model, we first need to preprocess the data. This requires several steps: First, we will delete some columns to reduce model complexity. Second, we will replace missing numeric values with the column mean and missing ports of embarkation with a fixed value. Third, we will transform the categorical features (Embarked and Sex) into numeric values. Finally, we will separate the prediction label from the training features and place it into a separate dataset named y_df.

# Define a preprocessing function
def preprocess(dfx):
    # Drop the label and the columns that we do not use as features
    # (errors='ignore' lets the function also handle data without 'Survived')
    drop_cols = ['Survived', 'Cabin', 'PassengerId', 'Name', 'Ticket']
    df = dfx.drop(columns=drop_cols, errors='ignore').copy()

    # Replace missing numeric values with the column mean
    df = df.fillna(df.mean(numeric_only=True))
    # Replace missing ports of embarkation with 'C'
    df['Embarked'] = df['Embarked'].fillna('C')

    # Convert categorical values to integers
    df['Sex'] = np.where(df['Sex'] == 'male', 0, 1)
    df['Embarked'] = df['Embarked'].map({'S': 1, 'Q': 2, 'C': 3})

    return df

# Create the feature set (x_df) and the labels (y_df)
x_df = preprocess(titanic_train_df).copy()
y_df = titanic_train_df['Survived'].copy()
x_df.head()
Head of the train_dataset

Let’s take a quick look at the data by creating histograms for the columns of our data set.

# Histograms for each column
register_matplotlib_converters()
nrows = 2; ncols = math.ceil(x_df.shape[1] / nrows)  # round up so that every column gets a subplot
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=False, figsize=(16, 10))
fig.subplots_adjust(hspace=0.2, wspace=0.2)
columns = x_df.columns
f = 0
for i in range(nrows):
    for j in range(ncols):
        if f <= x_df.shape[1]-1:
            assetname = columns[f]
            y = x_df[assetname]
            ax[i, j].hist(y, color='#039dfc', label=assetname, bins='auto')
            #ax[i, j].set_xlim([y.max(), y.min()])
            f += 1
            ax[i, j].set_title(assetname)
plt.show()
Histograms of all columns in our train dataset

The histograms tell us various things. For example, we see that most passengers were between 25 and 35 years old. In addition, we can see that most passengers had low-fare tickets, while some passengers had tickets that were significantly more expensive. This seems logical, since most passengers had 3rd class tickets.

3) Splitting the data

Next, we will split the data set into training data (x_train, y_train) and test data (x_test, y_test) using a split ratio of 70/30.

# Split the data into training (x_train, y_train) and test (x_test, y_test) sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
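
As a quick sanity check of the split sizes, the sketch below applies the same call to a synthetic stand-in with 891 rows, which is the size of the Kaggle training file (treat this as an assumption if your copy differs). The feature values are random placeholders, not Titanic data, and the names x_demo and y_demo are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 891 rows (assumed size of the Kaggle training file), 7 features
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(891, 7)))
y_demo = pd.Series(rng.integers(0, 2, size=891))

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, train_size=0.7, random_state=0)

# 70% of 891 is 623.7; scikit-learn rounds the training part down
print(len(x_tr), len(x_te))  # 623 268
```

The 268 test rows match the total number of cases in the confusion matrices shown later in this tutorial.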

4) Building a single Random Forest Model

Now that we have completed the preprocessing, we can train a first model. The model that we are about to train is a random forest, an algorithm that can be used for classification and regression tasks. In contrast to a single decision tree, a random forest is an ensemble model that combines the predictions of many different decision trees, which usually makes it more robust than any individual tree.

A random forest model has a wide range of hyperparameters with which we can control the characteristics of the decision trees and the ensemble model. However, for the sake of simplicity, we will control only two of them and use the default values for all other parameters:

  • The number of the estimators in the ensemble model (n_estimators)
  • The maximum depth of each decision tree in the ensemble model (max_depth)
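
To make the two parameters concrete, the following sketch fits a small forest on synthetic data (make_classification is a stand-in, not the Titanic data) and inspects the fitted trees. It also shows one way the ensemble combines its trees: in scikit-learn, the forest's predicted class probabilities are the average of the individual trees' probabilities. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the Titanic dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

forest = RandomForestClassifier(n_estimators=16, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# n_estimators controls how many trees the ensemble contains
n_trees = len(forest.estimators_)

# max_depth caps the depth of every individual tree
deepest = max(tree.get_depth() for tree in forest.estimators_)

# The forest's class probabilities are the mean over its trees
tree_mean = np.mean([tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)
matches = np.allclose(tree_mean, forest.predict_proba(X_demo))

print(n_trees, deepest, matches)
```

More trees generally give more stable predictions at the cost of training time, while a larger max_depth lets each tree fit finer detail and increases the risk of overfitting.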

The following code will first train the random forest model, then make a test prediction with the x_test dataset, and finally visualize the model performance in a confusion matrix:

# Train a single random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators = 100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
Confusion matrix of the best guess model

The numbers of cases correctly classified by our model are shown in the squares at the top left and bottom right of the matrix. Our best guess model correctly predicted that 151 passengers would not survive and that 64 would survive the sinking. In 53 cases the model was wrong. Overall, this corresponds to a model accuracy of about 80% (215 of 268 test cases). Considering that the choice of parameters was only a best guess, these results are surprisingly good. However, by using automated hyperparameter tuning, we should be able to identify a model that outperforms these results.
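
The accuracy figure follows directly from the counts in the confusion matrix. As a worked check that uses only the numbers quoted above:

```python
# Counts quoted in the text for the best guess model
correct_not_survived = 151
correct_survived = 64
wrong = 53

# Accuracy = correctly classified cases / all test cases
total = correct_not_survived + correct_survived + wrong
accuracy = (correct_not_survived + correct_survived) / total
print(total, round(accuracy, 3))  # 268 0.802
```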

5) Hyperparameter Tuning using Grid Search

Let’s find out whether we can beat the results of our best guess model using grid search. First, we will define a range of values for each parameter and then have the grid search train and evaluate a random forest model for every combination of these values.

# Define Parameters
max_depth=[2, 8, 16]
n_estimators = [64, 128, 256]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# Build the grid search; the grid supplies n_estimators and max_depth,
# so the base estimator only gets a fixed seed for reproducible results
dfrst = RandomForestClassifier(random_state=0)
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, scoring='accuracy', cv=5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df
#  mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_max_depth  param_n_estimators  split0_test_score  split1_test_score  split2_test_score  split3_test_score  split4_test_score  mean_test_score  std_test_score  rank_test_score
0       0.034520      0.001444         0.004089        0.000535                2                  64           0.746032              0.832           0.774194           0.798387           0.782258         0.786517        0.028433                8
1       0.072252      0.012060         0.006868        0.000788                2                 128           0.769841              0.824           0.790323           0.774194           0.798387         0.791332        0.019393                5
2       0.134407      0.003858         0.013264        0.000725                2                 256           0.777778              0.832           0.790323           0.774194           0.782258         0.791332        0.021072                5
3       0.039913      0.000510         0.004015        0.000643                8                  64           0.809524              0.824           0.798387           0.798387           0.862903         0.818620        0.023996                2
4       0.082298      0.002498         0.007933        0.000535                8                 128           0.769841              0.832           0.782258           0.798387           0.862903         0.808989        0.034049                3
5       0.162728      0.008288         0.014632        0.001161                8                 256           0.777778              0.840           0.798387           0.814516           0.870968         0.820225        0.032508                1
6       0.043434      0.000821         0.004595        0.000494               16                  64           0.753968              0.800           0.790323           0.766129           0.830645         0.788122        0.026858                7
7       0.091213      0.005009         0.008157        0.000339               16                 128           0.746032              0.808           0.798387           0.766129           0.854839         0.794543        0.037410                4
8       0.171045      0.005360         0.015742        0.000724               16                 256           0.738095              0.800           0.782258           0.766129           0.838710         0.784912        0.033714                9
Performance ranking of different models created by the grid search
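
Note that cv_results_ lists the configurations in grid order, not in rank order. Below is a minimal sketch of sorting such a table by rank. It uses a small synthetic dataset from make_classification as a stand-in for the Titanic features. The dataset, the reduced grid, and the names prefixed with demo are assumptions chosen to keep the example fast and self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the Titanic dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# Reduced grid so that the example runs quickly
demo_grid = {"max_depth": [2, 8], "n_estimators": [16, 32]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=demo_grid,
    scoring="accuracy",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Keep the most informative columns and sort by rank (1 = best)
cols = ["param_max_depth", "param_n_estimators",
        "mean_test_score", "std_test_score", "rank_test_score"]
ranked = pd.DataFrame(demo_search.cv_results_)[cols].sort_values("rank_test_score")
print(ranked)
```

The first row of the sorted table corresponds to best_params_ and best_score_ of the fitted search object.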

The table above is an overview of the tested models together with their cross-validation scores and their rank. Model number five achieved the best mean test score (0.820). The parameters of this model are max_depth = 8 and n_estimators = 256. Let’s select the best model and use it to make a prediction on the test data set. Finally, we will visualize the results in another confusion matrix that we can compare to the initial model.

# Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
Confusion matrix of the best grid search model

The confusion matrix tells us that our grid search model correctly predicted that 148 passengers would not survive and that 76 passengers would survive. In 44 cases the model was wrong. This corresponds to a model accuracy of about 83.6% (224 of 268 test cases). The best grid search model thus outperforms our initial best guess model by roughly three percentage points.

Summary

In this tutorial you have learned to automate the process of hyperparameter tuning of a machine learning model. Specifically, you have learned to define a grid of parameters and use it to find the best-performing set of parameters for a random forest that predicts the survival of Titanic passengers. We have seen that the grid search was able to find a model that outperforms our best guess model.

Grid search is a useful method that can be applied to optimize almost any machine learning model. So if you understand the process, you can make model development much more efficient.

Follow Florian Müller:

Data Scientist & Machine Learning Consultant

Hi, my name is Florian! I am a Zurich-based Data Scientist with a passion for Artificial Intelligence and Machine Learning. After completing my PhD in Business Informatics at the University of Bremen, I started working as a Machine Learning Consultant for the Swiss consulting firm ipt. When I'm not working on use cases for our clients, I work on my own analytics projects and report on them in this blog.
