The functionality of machine learning models can be controlled with their hyperparameters. The choice of these parameters often has a significant impact on model performance and, in practice, can make the difference between sufficient and outstanding performance. Data scientists therefore spend a large part of their time adjusting the various parameters of a machine learning model with the aim of finding the optimal set of parameters. This process is called hyperparameter tuning (also referred to as model tuning).
Manual hyperparameter tuning is time-consuming, since each model configuration has to be defined, trained, and evaluated. Especially for models where every percentage point counts, it can quickly take several hundred experiments to find a good configuration. To reduce the manual effort, it is generally a good idea to automate hyperparameter tuning, and in this blog post I will show how this works. We will test several configurations of a random forest model that predicts the survival of Titanic passengers, using grid search to automatically and exhaustively test a set of parameter values and identify the model that delivers the best performance.

Automated Hyperparameter Tuning using Grid Search
A common way of automatically searching for an optimal parameter configuration is to use a grid search. Grid search is an exhaustive search technique in which all possible combinations of the values in a parameter grid are tried out one by one. We can apply grid search to machine learning models and automatically test different model configurations. As a result, we get a ranking of the models based on their performance. For this to work, we need to provide the search grid with the following information:
- The hyperparameters that we want to test
- For each hyperparameter a range of values
- A performance metric so that the algorithm knows how to measure performance

The number of models created by the search grid can be calculated by multiplying the number of values defined for each parameter. Later in this tutorial we will define a search grid to tune the following hyperparameters of a random forest:
- n_estimators, which determines the number of decision trees,
- max_depth, which defines the maximum depth of each decision tree, that is, the longest allowed path from the root to a leaf
We specify the values [64, 128, 256] for n_estimators and [2, 8, 16] for max_depth. In addition, we specify accuracy as the metric to measure performance. As a result, the search grid will train and test a total of nine (3 x 3 = 9) different random forest configurations. Then we run the grid search and, while we relax, it does all the work and determines the best configuration.
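The size of such a grid can be checked before any model is trained. The following is a minimal sketch, assuming scikit-learn is installed and using the parameter values from the grid search step later in this tutorial. `ParameterGrid` is scikit-learn's helper for enumerating a grid; the variable names are chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Parameter values as used in the grid search step of this tutorial
param_grid = {
    "max_depth": [2, 8, 16],
    "n_estimators": [64, 128, 256],
}

# ParameterGrid enumerates every combination of the listed values
combinations = list(ParameterGrid(param_grid))
n_models = len(combinations)

for params in combinations:
    print(params)
print("Number of configurations:", n_models)
```

Keep in mind that with k-fold cross-validation each configuration is trained once per fold, so the number of model fits is the number of configurations multiplied by k.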

A search grid with two parameters and three parameter values
Python implementation
In the following, this tutorial shows how to optimize a random forest with the help of a search grid using the Titanic dataset as an example.
The example covers the following steps:
- Loading the Titanic Data
- Preprocessing and Exploring the Data
- Splitting the Data
- Training a single Random Forest Model
- Model Tuning using Grid Search
Python Environment
This tutorial assumes that you have set up your Python environment. I recommend using Anaconda. If you have not yet set it up, you can follow this tutorial. It is also assumed that you have the following packages installed: numpy, pandas, matplotlib, seaborn, and scikit-learn. The packages can be installed using the console command:
pip install <package name>
conda install <package name> (if you are using the Anaconda environment)
1) Loading the Titanic Data
We begin by loading the Titanic dataset from the Kaggle website, one of the best-known datasets for demonstrating classification. Once you have completed the download, save the dataset under a file path of your choice and adjust the file path in the code accordingly. If you are working in the Kaggle Python environment, you can also save the dataset directly into your Kaggle project.
The Titanic dataset contains the following information on passengers of the Titanic:
- Survived: Survival 0 = No, 1 = Yes (Prediction Label)
- Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- Sex: Sex
- Age: Age in years
- SibSp: # of siblings / spouses aboard the Titanic
- Parch: # of parents / children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
The column “Survived” contains the prediction label, which says whether a passenger survived the sinking or not. We can use this data to train a classifier that predicts which passengers survive the sinking and which do not. The following code will load the data into our Python project.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Set the file path (adjust it to where you saved the data)
filepath = "data/titanic-grid-search/"

# Load train and test datasets
titanic_train_df = pd.read_csv(filepath + 'titanic-train.csv')
titanic_test_df = pd.read_csv(filepath + 'titanic-test.csv')
titanic_train_df.head()

2) Preprocessing and Exploring the Data
Before we can train a model, we first need to preprocess the data. This requires several steps: First, we will delete some columns to reduce model complexity. Second, we will replace missing numeric values with the column mean and missing ports of embarkation with a default value. Third, we will transform the categorical features (Embarked and Sex) into numeric values. Finally, we will separate the prediction label from the training dataset and place it into a separate dataset named y_df.
# Define a preprocessing function
def preprocess(dfx):
    # Drop the label and the columns we do not use as features
    drop_cols = ['Survived', 'Cabin', 'PassengerId', 'Name', 'Ticket']
    new_df = dfx.drop(columns=drop_cols, errors='ignore').copy()

    # Replace missing numeric values with the column mean
    new_df = new_df.fillna(new_df.mean(numeric_only=True))
    # Replace missing ports of embarkation with 'C'
    new_df['Embarked'] = new_df['Embarked'].fillna('C')

    # Convert categorical values to integers
    new_df['Sex'] = np.where(new_df['Sex'] == 'male', 0, 1)
    new_df['Embarked'] = new_df['Embarked'].map({'S': 1, 'Q': 2, 'C': 3})
    return new_df

# Create the feature data (x_df) and the prediction label (y_df)
x_df = preprocess(titanic_train_df)
y_df = titanic_train_df['Survived'].copy()
x_df.head()

Let’s take a quick look at the data by creating histograms for the columns of our data set.
# Histograms for each column
register_matplotlib_converters()
nrows = 2
ncols = math.ceil(x_df.shape[1] / nrows)  # enough subplots for every column
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=False, figsize=(16, 10))
fig.subplots_adjust(hspace=0.2, wspace=0.2)
columns = x_df.columns
f = 0
for i in range(nrows):
    for j in range(ncols):
        # Skip leftover subplots once all columns have been plotted
        if f <= x_df.shape[1] - 1:
            assetname = columns[f]
            y = x_df[assetname]
            ax[i, j].hist(y, color='#039dfc', label=assetname, bins='auto')
            ax[i, j].set_title(assetname)
            f += 1
plt.show()

The histograms tell us various things. For example, we see that most passengers were between 25 and 35 years old. In addition, we can see that most passengers had low-fare tickets, while some passengers had tickets that were significantly more expensive. This seems logical, since most passengers had 3rd class tickets.
3) Splitting the data
Next, we will split the data set into training data (x_train, y_train) and test data (x_test, y_test) using a split ratio of 70/30.
# Split the data into training and test sets (70/30)
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
4) Building a single Random Forest Model
Now that we have completed the preprocessing, we can train a first model. The model that we are about to train is a random forest classifier.
4.1 About the Random Forest Algorithm
A random forest is a powerful machine learning algorithm that can be used for classification and regression tasks. In contrast to a single decision tree, a random forest is an ensemble model that combines the results of many different decision trees. The algorithm trains numerous decision trees, and each tree has a vote on the overall prediction result.
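To make the voting idea concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and all variable names are assumptions for illustration and are not part of the Titanic example. Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities, a soft form of voting, rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
forest.fit(X, y)

# Each fitted decision tree is available in forest.estimators_
tree_probas = np.array([tree.predict_proba(X) for tree in forest.estimators_])

# Average the per-tree class probabilities across the ensemble
avg_proba = tree_probas.mean(axis=0)

# Compare the manual average with the forest's own probabilities
matches = np.allclose(avg_proba, forest.predict_proba(X))
print("Trees in the forest:", len(forest.estimators_))
print("Averaged tree probabilities match the forest:", matches)
```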
A random forest model has a wide range of hyperparameters with which we can control the characteristics of the decision trees and the ensemble model. However, for the sake of simplicity, we will control only two of them and use the default value for all other parameters:
- The number of estimators in the ensemble model (n_estimators)
- The maximum depth of each decision tree in the ensemble model (max_depth)
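To see which hyperparameters exist and which defaults apply when we leave them untouched, the estimator can be inspected with `get_params()`. This is a minimal sketch, assuming a reasonably recent scikit-learn version; default values have changed between releases, so the printed values may differ on your system.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate with defaults and collect all hyperparameters
default_params = RandomForestClassifier().get_params()

# Show the two hyperparameters tuned in this tutorial
for name in ("n_estimators", "max_depth"):
    print(name, "=", default_params[name])

print("Total number of hyperparameters:", len(default_params))
```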
4.2 Implementing a Random Forest Model
The following code will first train the random forest model, then make a test prediction with the x_test dataset, and finally visualize the model performance in a confusion matrix:
# Train a single random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
# (the next line is a Jupyter magic; remove it when running as a plain script)
%matplotlib inline
class_names = [False, True]  # name of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

The numbers of cases correctly classified by our model are shown in the squares at the top left and bottom right of the matrix. As we see, our best-guess model correctly predicted that 151 passengers would not survive and that 64 would survive the sinking. In 53 cases the model was wrong. In total, these results correspond to a model accuracy of about 80%. Considering that the choice of parameters was only a best guess, these results are surprisingly good. However, by using automated hyperparameter tuning, we should be able to identify a model that outperforms these results.
5) Hyperparameter Tuning using Grid Search
Let’s find out if we can beat the results of our best-guess model using grid search. First, we will define a range of values for each parameter and then have the grid search train and evaluate a random forest model for every combination of parameters.
# Define the parameter grid
max_depth = [2, 8, 16]
n_estimators = [64, 128, 256]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# Build the grid search; the grid supplies max_depth and n_estimators
dfrst = RandomForestClassifier()
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv=5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df

# | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.034520 | 0.001444 | 0.004089 | 0.000535 | 2 | 64 | 0.746032 | 0.832 | 0.774194 | 0.798387 | 0.782258 | 0.786517 | 0.028433 | 8 |
1 | 0.072252 | 0.012060 | 0.006868 | 0.000788 | 2 | 128 | 0.769841 | 0.824 | 0.790323 | 0.774194 | 0.798387 | 0.791332 | 0.019393 | 5 |
2 | 0.134407 | 0.003858 | 0.013264 | 0.000725 | 2 | 256 | 0.777778 | 0.832 | 0.790323 | 0.774194 | 0.782258 | 0.791332 | 0.021072 | 5 |
3 | 0.039913 | 0.000510 | 0.004015 | 0.000643 | 8 | 64 | 0.809524 | 0.824 | 0.798387 | 0.798387 | 0.862903 | 0.818620 | 0.023996 | 2 |
4 | 0.082298 | 0.002498 | 0.007933 | 0.000535 | 8 | 128 | 0.769841 | 0.832 | 0.782258 | 0.798387 | 0.862903 | 0.808989 | 0.034049 | 3 |
5 | 0.162728 | 0.008288 | 0.014632 | 0.001161 | 8 | 256 | 0.777778 | 0.840 | 0.798387 | 0.814516 | 0.870968 | 0.820225 | 0.032508 | 1 |
6 | 0.043434 | 0.000821 | 0.004595 | 0.000494 | 16 | 64 | 0.753968 | 0.800 | 0.790323 | 0.766129 | 0.830645 | 0.788122 | 0.026858 | 7 |
7 | 0.091213 | 0.005009 | 0.008157 | 0.000339 | 16 | 128 | 0.746032 | 0.808 | 0.798387 | 0.766129 | 0.854839 | 0.794543 | 0.037410 | 4 |
8 | 0.171045 | 0.005360 | 0.015742 | 0.000724 | 16 | 256 | 0.738095 | 0.800 | 0.782258 | 0.766129 | 0.838710 | 0.784912 | 0.033714 | 9 |
The table above gives an overview of the tested models; the rank_test_score column ranks them by their mean cross-validation score. The model with index 5 achieved the best result, with a mean accuracy of about 0.82. Its parameters are a maximum depth of 8 and 256 estimators.
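The rows of `cv_results_` come in grid order, so it can help to sort them by `rank_test_score` before reading the table. The following is a minimal sketch on synthetic data; the dataset and the small grid are assumptions chosen to keep the example fast. It also passes `scoring="accuracy"` explicitly, which is the metric scikit-learn uses for classifiers by default.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
param_grid = {"max_depth": [2, 8], "n_estimators": [16, 32]}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)

# Keep the most relevant columns and sort so the best model comes first
columns = ["param_max_depth", "param_n_estimators",
           "mean_test_score", "std_test_score", "rank_test_score"]
ranked = (pd.DataFrame(grid.cv_results_)[columns]
          .sort_values("rank_test_score")
          .reset_index(drop=True))
print(ranked)
print("Best parameters:", grid.best_params_)
```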
Next, we will select the best model and use it to make a prediction on the test data set. Afterwards, we will visualize the results in another confusion matrix that we can compare to the initial model.
# Extract the best random forest
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
# (the next line is a Jupyter magic; remove it when running as a plain script)
%matplotlib inline
class_names = [False, True]  # name of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

The confusion matrix shows that our grid search model correctly predicted that 148 passengers would not survive and that 76 passengers would survive. In 44 cases the model was wrong. This results in an overall model accuracy of about 83.6% and shows that the best grid search model outperforms our initial best-guess model.
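The accuracy figures for both models follow directly from the confusion matrix counts. As a quick check, the following sketch uses only the numbers reported in the text; the helper function `accuracy_from_counts` is a hypothetical name introduced for illustration.

```python
def accuracy_from_counts(correct_negative, correct_positive, wrong):
    """Accuracy = correctly classified cases / all cases."""
    correct = correct_negative + correct_positive
    return correct / (correct + wrong)

# Counts reported for the best-guess model and the grid search model
baseline_acc = accuracy_from_counts(151, 64, 53)
tuned_acc = accuracy_from_counts(148, 76, 44)

print(f"Best-guess model:  {baseline_acc:.3f}")
print(f"Grid search model: {tuned_acc:.3f}")
print(f"Improvement:       {tuned_acc - baseline_acc:.3f}")
```

Both models are evaluated on the same 268 test passengers, so the difference amounts to nine more correct predictions.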
Summary
In this tutorial you have learned to automate the process of hyperparameter tuning of a machine learning model. We have demonstrated that grid search is a useful method that can be applied to optimize almost any machine learning model. So if you understand the process, you can make model development much more efficient. Specifically, you have learned to define a grid of parameters and use it to find the best set of parameters for a random forest that predicts the survival of Titanic passengers. We have seen that the grid search was easily capable of finding a model that outperforms our best-guess model.
I hope you found this tutorial useful. If you have any questions or suggestions, feel free to let me know in the comments.