Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python

Random search is an efficient method for automated hyperparameter tuning of machine learning models. Hyperparameters are model properties (e.g., the number of estimators for an ensemble model). Unlike model parameters, the machine learning algorithm does not discover the hyperparameters during training. Instead, we need to specify them in advance. Finding a good hyperparameter configuration is an explorative process that involves training and testing different model configurations. While tuning hyperparameters manually is possible, it is time-consuming and repetitive. A better way is to automate this process with grid search or random search. We have already covered grid search in a separate article. In this tutorial, we will learn how random search works and how we can use it to optimize a regression model in Python. The use case is a random decision forest regressor that predicts house prices.

The rest of this Python tutorial proceeds as follows: We begin by briefly explaining how random search works and when to prefer it over grid search. Then we turn to the Python hands-on part, where we will be working with a public house price dataset from Kaggle.com. Our goal is to train a regression model that predicts the sale price of a given house in the US based on several characteristics. Once we have trained a best-guess model in Python, we use the random search technique to identify a better-performing model. We use cross-validation to validate the models.

Hyperparameter Tuning

Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, we need to specify hyperparameters before the training. It is usually impossible to find the best configuration without testing different configurations.
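To make the distinction concrete, here is a minimal sketch using scikit-learn's random forest (the data is random dummy data, just for illustration):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters such as n_estimators and max_depth are set before training
model = RandomForestRegressor(n_estimators=100, max_depth=10)

# Model parameters (the trees and their split rules) are learned during fit()
X, y = np.random.rand(100, 5), np.random.rand(100)
model.fit(X, y)
print(len(model.estimators_))  # 100 fitted trees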

Searching for a suitable model configuration is called “hyperparameter tuning” or “hyperparameter optimization.” Different machine learning algorithms expose different hyperparameters and value ranges. For example, a random decision forest allows us to configure parameters such as the number of trees, the maximum tree depth, and the minimum number of samples required to split a node.

The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it becomes to find an optimal model. We can use random search to automate this process.
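As a small illustration of how parameter ranges span a search space, the following sketch counts the number of possible combinations (the parameter names follow scikit-learn's random forest; the values are arbitrary examples):

from itertools import product

# Example parameter ranges for a random decision forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20],
    'max_features': [3, 4, 5],
}

# The search space size is the product of the range sizes: 3 * 4 * 3 = 36
print(len(list(product(*param_grid.values()))))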

Random Search

The random search algorithm generates models from hyperparameter permutations randomly selected from a grid of parameter values. The idea behind the randomized approach is that testing random configurations is an efficient way to identify a good model. We can use random search both for regression and classification models.

Random search and grid search are the most popular techniques for hyperparameter tuning, and the two methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible combinations. That technique works well for testing a small number of configurations that are already known to work well.

As long as both search space and training time are small, the grid search technique is excellent for finding the best model. However, the number of model variants increases exponentially with the size of the search space. It is often more efficient for large search spaces or complex models to use random search.

Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. However, it is also much faster than grid search and efficient in delivering a suitable model in a short time.
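We can see the difference with scikit-learn's helper classes: ParameterGrid enumerates every combination, while ParameterSampler draws a fixed number of random configurations. A minimal sketch with arbitrary example values:

from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {'max_depth': [5, 10, 15, 20], 'n_estimators': [50, 100, 200]}

# Grid search would evaluate all 12 combinations
print(len(list(ParameterGrid(param_grid))))

# Random search evaluates only n_iter randomly sampled combinations
print(len(list(ParameterSampler(param_grid, n_iter=5, random_state=0))))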


Let’s see how random search works in Python. In the following, we will work with the house price dataset and train a regression model that predicts the sale price based on several house characteristics. Our model will use a random decision forest algorithm. Random decision forests have several hyperparameters, and we will start with a best-guess configuration. Once we have trained this initial best-guess model, we will use random search to optimize the hyperparameters and identify a configuration that outperforms it.

The tutorial covers the following steps:

  • Load the house price data
  • Explore the data
  • Prepare the data
  • Train a baseline model
  • Use a random search approach to identify an excellent model
  • Measure model performance

The Python code is available in the relataly GitHub repository.

Prerequisites

Before starting with the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: Pandas, NumPy, Matplotlib, and Seaborn.

In addition, we will be using the Python machine learning library Scikit-learn to implement the random forest and the random search technique.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)
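For example, the packages used in this tutorial can be installed in one go with pip:

pip install pandas numpy matplotlib seaborn scikit-learn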

House Price Prediction: About the Use Case and the Data

In this tutorial, we will be working with a house price dataset from Kaggle’s house price regression challenge. The training data contains about 1,460 houses sold between 2006 and 2010 in Ames, Iowa. The data includes the sale price and a long list of house characteristics, such as

  • Year – The year of construction,
  • SaleYear – The year in which the house was sold
  • Lot Area – The lot area of the house
  • Quality – The overall quality of the house from one (lowest) to ten (highest)
  • Road – The type of road, e.g., paved, etc.
  • Utility – The type of the utility
  • Park Lot Area – The parking space included with the property
  • Room number – The number of rooms

The dataset is freely available for download from the house price regression challenge on Kaggle.com or from the relataly GitHub repository.

Step #1 Load the Data

We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.

# A tutorial for this file is available at www.relataly.com

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.ensemble import RandomForestRegressor

# Source: 
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Load train and test datasets
path = "https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv"
df = pd.read_csv(path)
print(df.columns)
df.head()

Step #2 Explore the Data

Before jumping into preprocessing and model training, let’s quickly explore the data. A distribution plot can help us understand the frequency of regression values in our dataset.

# Create a histogram of the sale price distribution
ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2)
plt.title('Sale Price Distribution')

For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between sale price, lot area, and the house’s overall quality.

# Create a scatterplot of sale price vs. lot area, colored by overall quality
plt.figure(figsize=(16,6))
df_features = df[['SalePrice', 'LotArea', 'OverallQual']]
sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual')
plt.title('Sale Price Distribution')

As we would expect, the scatterplot shows that the sale price increases with the overall quality. On the other hand, the LotArea has only a minor effect on the sale price.
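Another quick way to gauge predictive power is a correlation matrix of the numeric features. Below is a minimal sketch, assuming the columns used later in this tutorial:

# Correlations between the sale price and selected numeric features
corr = df[['SalePrice', 'LotArea', 'OverallQual', 'GarageArea', 'OverallCond']].corr()
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='Blues')
plt.title('Feature Correlations')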

Step #3 Data Preprocessing

Next, we prepare the data for use as input to train a regression model. Because we want to keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables with integer dummy values.

To ensure that our regression model cannot see the target variable, we separate the house price (y) from the features (x) and drop it from the feature set. Last, we split the data into separate datasets for training and testing. The result is four datasets: x_train, y_train, x_test, and y_test.

def preprocessFeatures(df):
    # Define a list of relevant features
    feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond']
    # Encode categorical variables (e.g., Utilities) as integer dummies
    df_dummy = pd.get_dummies(df[feature_list])
    # Cleanse records with na values
    df_dummy = df_dummy.dropna()
    return df_dummy

df_base = preprocessFeatures(df)

# Separate the target variable (y) from the features (x)
x = df_base.drop(columns=['SalePrice'])
y = df_base['SalePrice']

# Split the data into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
x_train
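As announced in the outline, we first train an initial best-guess model that serves as our baseline. A minimal sketch; the hyperparameter values are arbitrary manual choices:

# Train a best-guess baseline model with manually chosen hyperparameters
baseline_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)
baseline_model.fit(x_train, y_train)

# Evaluate the baseline on the test set
baseline_pred = baseline_model.predict(x_test)
print('Baseline MAE: ' + str(np.round(mean_absolute_error(y_test, baseline_pred), 2)))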

Step #4 Hyperparameter Tuning with Random Search

Now we can tune the hyperparameters of the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then selects n random permutations from the grid and uses them to train and evaluate the model.

We use the RandomizedSearchCV algorithm from the scikit-learn package. The “CV” in the function name stands for cross-validation. Cross-validation involves splitting the data into subsets (folds) and rotating them between training and validation runs. In this way, each candidate model is trained and tested multiple times on different data partitions. When the search algorithm finally evaluates a model configuration, it aggregates these results into a single test score.
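To make the cross-validation idea concrete, this is roughly what happens for a single candidate configuration (a sketch using cross_val_score; the hyperparameter values are arbitrary):

from sklearn.model_selection import cross_val_score

# Score a single candidate configuration with 5-fold cross-validation
candidate = RandomForestRegressor(n_estimators=100, max_depth=10)
scores = cross_val_score(candidate, x_train, y_train, cv=5)
print(scores.mean())  # the aggregated score for this configuration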

We use a Random Decision Forest – a robust machine learning algorithm that can handle both classification and regression tasks. As a so-called ensemble model, the Random Forest aggregates the predictions of a set of independent estimators (decision trees). The estimator is an important argument that we pass to the RandomizedSearchCV function. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:

  • max_leaf_nodes = [2, 3, 4, 5, 6, 7]
  • min_samples_split = [5, 10, 20, 50]
  • max_depth = [5,10,15,20]
  • max_features = [3,4,5]
  • n_estimators = [50, 100, 200]

These parameter ranges define the search space from which the randomized search algorithm (RandomizedSearchCV) will select random configurations. Other parameters will use default values as defined by scikit-learn.

# Define the Estimator and the Parameter Ranges
dt = RandomForestRegressor()
number_of_iterations = 20
max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

# Define the param distribution dictionary
param_distributions = dict(max_leaf_nodes=max_leaf_nodes, 
                           min_samples_split=min_samples_split, 
                           max_depth=max_depth,
                           max_features=max_features,
                           n_estimators=n_estimators)

# Build the randomized search with 5-fold cross-validation
grid = RandomizedSearchCV(estimator=dt, 
                          param_distributions=param_distributions, 
                          n_iter=number_of_iterations, 
                          cv = 5)

grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print("Best score: {0}, using {1}".format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

The results_df created above contains the scores and hyperparameter configurations of all tested models. We can display the five best models and their respective hyperparameter configurations as shown below.
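The following snippet sorts the cross-validation results by rank (params, mean_test_score, and rank_test_score are standard columns of scikit-learn's cv_results_):

# Show the five best configurations by their cross-validation rank
top_5 = results_df.sort_values('rank_test_score')
print(top_5[['params', 'mean_test_score', 'rank_test_score']].head())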

Step #5 Select the Best Model and Measure Performance

Finally, we select the best model via the search object’s best_estimator_ attribute. We then calculate the MAE and the MAPE to understand how the model performs on the test dataset, and print a comparison between actual and predicted sale prices.

# Select the best Model and Measure Performance
best_model = grid_results.best_estimator_
y_pred = best_model.predict(x_test)
y_df = pd.DataFrame(y_test)
y_df['PredictedPrice'] = y_pred
y_df.head()
     SalePrice  PredictedPrice
529     200624   166037.831002
491     133000   135860.757958
459     110000   123030.336177
279     192000   206488.444327
655      88000   130453.604206
# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')

Mean Absolute Error (MAE): 29591.56
Mean Absolute Percentage Error (MAPE): 15.57 %

On average, the model deviates from the actual value by about 16 %. Considering that we only used a fraction of the available features and defined a small search space, there is much room for improvement.
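One way to widen the search space without enumerating every value is to pass scipy distributions instead of fixed lists; RandomizedSearchCV samples parameter values directly from them. A sketch, assuming the imports and data from above:

from scipy.stats import randint

# Sample integer hyperparameters from ranges instead of fixed lists
param_distributions = dict(n_estimators=randint(50, 300),
                           max_depth=randint(5, 25),
                           max_features=randint(3, 6))

grid = RandomizedSearchCV(estimator=RandomForestRegressor(),
                          param_distributions=param_distributions,
                          n_iter=50, cv=5)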

Summary

This article has shown how we can use random search in Python to efficiently search for a good hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how random search samples random permutations from a predefined parameter grid. The second part was a Python hands-on tutorial, in which you used random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price of houses depending on several characteristics. Then we defined parameter ranges and tested random permutations. In this way, we quickly identified a configuration that outperforms our initial baseline model.

Remember that random search efficiently identifies a well-performing model, but it does not necessarily return the best-performing one. The random search technique can be used to tune the hyperparameters of both regression and classification models.

I hope this article was helpful. If you have any questions or suggestions, please write them in the comments.

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal of sharing my experiences and creating a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.
