Finding the perfect hyperparameters for your machine learning model can be like searching for a needle in a haystack – unless you use random search. This efficient method automates the process of hyperparameter tuning, so you don’t have to spend hours manually testing different configurations. Hyperparameters are model properties (e.g., the number of estimators for an ensemble model). Model performance depends heavily on the hyperparameters. Unlike model parameters, the machine learning algorithm does not discover the hyperparameters during training. Instead, we need to specify them in advance. In this tutorial, we’ll show you how to use random search to optimize the hyperparameters of a regression model in Python. Our example is a support vector machine that predicts house prices, but the principles you’ll learn here can be applied to any model. So why waste time with manual hyperparameter tuning when you can let random search do the work for you?

The rest of this Python tutorial proceeds as follows: We begin by briefly explaining how random search works and when to prefer it over grid search. Then we turn to the Python hands-on part, where we will work with a public house price dataset from Kaggle.com. Our goal is to train a regression model that predicts the sale price of a given house in the US based on several characteristics. Once we have trained a best-guess model in Python, we use the random search technique to identify a better-performing model. We use cross-validation to validate the models.

## Hyperparameter Tuning

Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, we need to specify hyperparameters before the training. It is usually impossible to find the best configuration without testing different configurations.

Searching for a suitable model configuration is called “hyperparameter tuning” or “hyperparameter optimization.” Machine learning algorithms have varying hyperparameters and parameter values. For example, a random decision forest classifier allows us to configure varying parameters such as the number of trees, the maximum tree depth, and the minimum number of nodes required for a new branch.

The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it gets to find an optimal model. We can use random search to automatize this process.

## Techniques for Tuning Hyperparameters

Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning algorithm to optimize its performance on a specific dataset or task. Several techniques can be used for hyperparameter tuning, including:

**Grid Search:**grid search is a brute-force search algorithm that systematically evaluates a given set of hyperparameter values by training and evaluating a model for each combination of values. It is a simple and effective technique, but it can be computationally expensive, especially for large or complex datasets.**Random Search:**As mentioned, random search is an alternative to grid search that randomly samples a given set of hyperparameter values rather than evaluating all possible combinations. It can be more efficient than grid search, but it may not find the optimal set of hyperparameters.**Bayesian Optimization:**A bayesian optimization is a probabilistic approach to hyperparameter tuning, which uses Bayesian inference to model the distribution of hyperparameter values that are likely to produce a good performance. It can be more efficient and effective than grid search or random search, but it can be more challenging to implement and interpret.**Genetic Algorithms:**genetic algorithms are optimization algorithms inspired by the principles of natural selection and genetics. They use a population of candidate solutions, which are iteratively evolved and selected based on their fitness or performance, to find the optimal set of hyperparameters.

In this article, we specifically look at the Random Search technique.

## What is Random Search?

The random search algorithm generates models from hyperparameter permutations randomly selected from a grid of parameter values. The idea behind the randomized approach is that testing random configurations efficiently identifies a good model. We can use random search both for regression and classification models.

Random Search and Grid Search are the most popular techniques for hyperparametric tuning, and both methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible variants. The technique works well for testing a small number of configurations already known to work well.

As long as both search space and training time are small, the grid search technique is excellent for finding the best model. However, the number of model variants increases exponentially with the size of the search space. It is often more efficient for large search spaces or complex models to use random search.

Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. However, it is also much faster than grid search and efficient in delivering a suitable model in a short time.

## Tuning the Hyperparameters of a Random Decision Forest Regressor in Python using Random Search

Let’s see how random search works in Python. In the following, we will work with the house price data set and train a regression model that predicts the sale price based on several house characteristics.

Several factors can impact the accuracy of house price predictions, including:

- The quality and quantity of data: The more data you have about the housing market, the more relevant and accurate that data is, and the better your predictions are likely to be.
- The complexity of the model: More complex models may be able to capture more nuances in the data, but they may also be more prone to overfitting and may be more difficult to interpret.
- The choice of machine learning algorithm: Different algorithms may be better suited to different types of data and prediction tasks.
- The stability of the housing market: If the housing market is stable and predictable, it may be easier to make accurate predictions.

Our model will use a random decision forest algorithm. Random decision forests have several hyperparameters, and we will start with the best guess configuration. Once we have trained this initial best-guess model, we will use a random search to optimize the hyperparameters and identify a configuration that outperforms our model.

The tutorial covers the following steps:

- Load the house price data
- Explore the data
- Prepare the data
- Train a baseline model
- Use a random search approach to identify an excellent model
- Measure model performance

The Python code is available in the relataly GitHub repository.

### Prerequisites

Before starting the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using the Python Machine Learning library Scikit-learn to implement the random forest and the grid search technique.

You can install packages using console commands:

*pip install <package name>**conda install <package name>*(if you are using the anaconda packet manager)

### House Price Prediction: About the Use Case and the Data

House price prediction is the process of using statistical and machine learning techniques to predict the future value of a house. This can be useful for a variety of applications, such as helping homeowners and real estate professionals to make informed decisions about buying and selling properties. In order to make accurate predictions, it is important to have access to high-quality data about the housing market.

In this tutorial, we will work with a house price dataset from the house price regression challenge on Kaggle.com. The dataset is available via a git hub repository. It contains information about 4800 houses sold between 2016 and 2020 in the US. The data includes the sale price and a list of 48 house characteristics, such as:

- Year – The year of construction,
- SaleYear – The year in which the house was sold
- Lot Area – The lot area of the house
- Quality – The overall quality of the house from one (lowest) to ten (highest)
- Road – The type of road, e.g., paved, etc.
- Utility – The type of the utility
- Park Lot Area – The parking space included with the property
- Room number – The number of rooms

### Step #1 Load the Data

We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.

# A tutorial for this file is available at www.relataly.com import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns from sklearn.model_selection import RandomizedSearchCV, train_test_split from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error from sklearn import svm # Source: # https://www.kaggle.com/c/house-prices-advanced-regression-techniques # Load train and test datasets path = "https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv" df = pd.read_csv(path) print(df.columns) df.head()

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'], dtype='object') Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice 0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500 1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500 2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500 3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000 4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000 5 rows × 81 columns

### Step #2 Explore the Data

Before jumping into preprocessing and model training, let’s quickly explore the data. A distribution plot can help us understand our dataset’s frequency of regression values.

# Create histograms for feature columns separated by prediction label value ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2) plt.title('Sale Price Distribution')

For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between the sale price, lot area, and the house’s overall quality.

# Create histograms for feature columns separated by prediction label value plt.figure(figsize=(16,6)) df_features = df[['SalePrice', 'LotArea', 'OverallQual']] sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual') plt.title('Sale Price Distribution')

As expected, the scatterplot shows that the sale price increases with the overall quality. On the other hand, the LotArea has only a minor effect on the sale price.

### Step #3 Data Preprocessing

Next, we prepare the data for use as input to train a regression model. Because we want to keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables with integer dummy values.

To ensure that our regression model does not know the target variable, we separate house price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.

def preprocessFeatures(df): # Define a list of relevant features feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond'] df_dummy = pd.get_dummies(df[feature_list]) # Cleanse records with na values #df_prep = df_prep.dropna() return df_dummy df_base = preprocessFeatures(df) # Split the data into x_train and y_train data sets x_train, x_test, y_train, y_test = train_test_split( df_base.copy(), df_base['SalePrice'].copy(), train_size=0.7, random_state=0) x_train

OverallQual GarageArea LotArea OverallCond Utilities_AllPub Utilities_NoSeWa 682 6 431 2887 5 1 0 960 5 0 7207 7 1 0 1384 6 280 9060 5 1 0 1100 2 246 8400 5 1 0 416 6 440 7844 7 1 0

### Step #4 Train Different Regression Models using Random Search

Now that the dataset is ready, we can train the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then selects n random permutations from the grid and uses them to train the model.

We use the RandomSearchCV algorithm from the scikit-learn package. The “CV” in the function name stands for cross-validation. Cross-validation involves splitting the data into subsets (folds) and rotating them between training and validation runs. This way, each model is trained and tested multiple times on different data partitions. When the search algorithm finally evaluates the model configuration, it summarizes these results into a test score.

We use a Random Decision Forest – a robust machine learning algorithm that can handle classification and regression tasks. As a so-called ensemble model, the Random Forest considers predictions from a set of multiple independent estimators. The estimator is an important parameter to pass to the RandomSearchCV function. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:

- max_leaf_nodes = [2, 3, 4, 5, 6, 7]
- min_samples_split = [5, 10, 20, 50]
- max_depth = [5,10,15,20]
- max_features = [3,4,5]
- n_estimators = [50, 100, 200]

These parameter ranges define the search space from which the randomized search algorithm (RandomSearchCV) will select random configurations. Other parameters will use default values as defined by scikit-learn.

# Define the Estimator and the Parameter Ranges dt = RandomForestRegressor() number_of_iterations = 20 max_leaf_nodes = [2, 3, 4, 5, 6, 7] min_samples_split = [5, 10, 20, 50] max_depth = [5,10,15,20] max_features = [3,4,5] n_estimators = [50, 100, 200] # Define the param distribution dictionary param_distributions = dict(max_leaf_nodes=max_leaf_nodes, min_samples_split=min_samples_split, max_depth=max_depth, max_features=max_features, n_estimators=n_estimators) # Build the gridsearch grid = RandomizedSearchCV(estimator=dt, param_distributions=param_distributions, n_iter=number_of_iterations, cv = 5) grid_results = grid.fit(x_train, y_train) # Summarize the results in a readable format print("Best params: {0}, using {1}".format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_)) results_df = pd.DataFrame(grid_results.cv_results_)

Best params: [0.68738293 0.49581669 0.52138751 0.61235299 0.65360944 0.61165147 0.70392285 0.52278886 0.67687248 0.68219638 0.70031536 0.65842909 0.51939338 0.70801017 0.70911805 0.69543885 0.67983801 0.60744371 0.68270285 0.70741042], using {'n_estimators': 100, 'min_samples_split': 5, 'max_leaf_nodes': 7, 'max_features': 3, 'max_depth': 15} mean_fit_time std_fit_time mean_score_time std_score_time param_n_estimators param_min_samples_split param_max_leaf_nodes param_max_features param_max_depth params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score 0 0.049196 0.002071 0.004074 0.000820 50 20 5 4 15 {'n_estimators': 50, 'min_samples_split': 20, ... 0.662973 0.705533 0.669520 0.702608 0.696280 0.687383 0.017637 7 1 0.041115 0.000554 0.003046 0.000094 50 50 2 3 10 {'n_estimators': 50, 'min_samples_split': 50, ... 0.490984 0.527231 0.426270 0.523086 0.511513 0.495817 0.036978 20 2 0.043325 0.000779 0.003486 0.000447 50 50 2 5 20 {'n_estimators': 50, 'min_samples_split': 50, ... 0.484524 0.559358 0.485459 0.517253 0.560343 0.521388 0.033545 18 3 0.162083 0.005665 0.012420 0.004788 200 5 3 3 20 {'n_estimators': 200, 'min_samples_split': 5, ... 0.586586 0.638341 0.573437 0.626793 0.636608 0.612353 0.027021 14 4 0.166659 0.003026 0.010958 0.000084 200 10 4 3 15 {'n_estimators': 200, 'min_samples_split': 10,... 0.633305 0.679161 0.623236 0.661864 0.670481 0.653609 0.021636 13

These are the five best models and their respective hyperparameter configurations.

**Step #5 Select the best Model and Measure Performance**

Finally, we will choose the best model from the list using the “best_model” function. We then calculate the MAE and the MAPE to understand how the model performs on the overall test dataset. We then print a comparison between actual sale prices and predicted sale prices.

# Select the best Model and Measure Performance best_model = grid_results.best_estimator_ y_pred = best_model.predict(x_test) y_df = pd.DataFrame(y_test) y_df['PredictedPrice']=y_pred y_df.head()

SalePrice PredictedPrice 529 200624 166037.831002 491 133000 135860.757958 459 110000 123030.336177 279 192000 206488.444327 655 88000 130453.604206

Next, let’s take a look at the classification errors.

# Mean Absolute Error (MAE) MAE = mean_absolute_error(y_pred, y_test) print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2))) # Mean Absolute Percentage Error (MAPE) MAPE = mean_absolute_percentage_error(y_pred, y_test) print('Median Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')

Mean Absolute Error (MAE): 29591.56 Median Absolute Percentage Error (MAPE): 15.57 %

On average, the model deviates from the actual value by 16 %. Considering we only used a fraction of the available features and defined a small search space, there is much room for improvement.

## Summary

This article has shown how we can use grid Search in Python to efficiently search for the optimal hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how to use random search to try out all permutations of a predefined parameter grid. The second part was a Python hands-on tutorial, in which you learned to use random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price for houses depending on several characteristics. Then we defined parameter ranges and tested random permutations. In this way, we quickly identified a configuration that outperforms our initial baseline model.

Remember that a random search efficiently identifies a good-performing model but does not necessarily return the best-performing one. Tech random search techniques can be used to tune the hyperparameters of both regression and classification models.

## Sources and Further Reading

I hope this article was helpful. If you have any questions or suggestions, please write them in the comments.

- Andriy Burkov (2020) Machine Learning Engineering
- Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction
- Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
- David Forsyth (2019) Applied Machine Learning Springer

*The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.*