<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Random Decision Forests Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/machine-learning-algorithms/random-decision-forests/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/machine-learning-algorithms/random-decision-forests/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Thu, 13 Jul 2023 12:10:12 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Random Decision Forests Archives - relataly.com</title>
	<link>https://www.relataly.com/category/machine-learning-algorithms/random-decision-forests/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Feature Engineering and Selection for Regression Models with Python and Scikit-learn</title>
		<link>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/</link>
					<comments>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 26 Sep 2022 22:20:29 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Exploratory Data Analysis (EDA)]]></category>
		<category><![CDATA[Feature Engineering]]></category>
		<category><![CDATA[Feature Permutation Importance]]></category>
		<category><![CDATA[Linear Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Measuring Model Performance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Simple Regression]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Feature Engineering for Time Series Forecasting]]></category>
		<category><![CDATA[Feature Exploration]]></category>
		<category><![CDATA[Feature Selection]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Price Regression]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=8832</guid>

					<description><![CDATA[<p>Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients &#8211; in this case, carefully selected input features &#8211; you can create a model that&#8217;s both ... <a title="Feature Engineering and Selection for Regression Models with Python and Scikit-learn" class="read-more" href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/" aria-label="Read more about Feature Engineering and Selection for Regression Models with Python and Scikit-learn">Read more</a></p>
<p>The post <a href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/">Feature Engineering and Selection for Regression Models with Python and Scikit-learn</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients &#8211; in this case, carefully selected input features &#8211; you can create a model that&#8217;s both accurate and powerful. This is where feature engineering comes in. It&#8217;s the process of exploring, creating, and selecting the most relevant and useful features to use in your model. And just like a chef experimenting with different spices and flavors, the process of feature engineering is iterative and tailored to the problem at hand. In this guide, we&#8217;ll walk you through a step-by-step process using Python and Scikit-learn to create a strong set of features for a regression problem. By the end, you&#8217;ll have the skills to tackle any feature engineering challenge that comes your way.</p>



<p class="wp-block-paragraph">The remainder of this article proceeds as follows: We begin with a brief intro to feature engineering and describe valuable techniques. We then turn to the hands-on part, in which we develop a regression model for car sales. We apply various techniques that show how to handle outliers and missing values, perform correlation analysis, and discover and manipulate features. You will also find information about common challenges and helpful sklearn functions. Finally, we will compare our regression model to a baseline model that uses the original dataset.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/" target="_blank" rel="noreferrer noopener">Sentiment Analysis with Naive Bayes and Logistic Regression in Python</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:31px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">What is Feature Engineering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Feature engineering is the process of using domain knowledge of the data to create features (variables) that make machine learning algorithms work. This is an important step in the machine learning pipeline because the choice of good features can greatly affect the performance of the model. The goal is to identify features, tweak them, and select the most promising ones into a smaller feature subset. We can break this process down into several action items. </p>



<p class="wp-block-paragraph">Data Scientists can easily spend 70% to 80% of their time on feature engineering. The time is well spent, as changes to input data have a direct impact on performance. This process is often iterative and requires repeatedly revisiting the various tasks as understanding the data and the problem evolves. Knowing techniques and associated challenges helps in adequate feature engineering.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/mastering-prompt-engineering-for-chatgpt-a-practical-guide-for-businesses/13134/" target="_blank" rel="noreferrer noopener">Mastering Prompt Engineering for ChatGPT for Business Use</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="1024" data-attachment-id="12411" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/engineering-features-python-tutorial-machine-learning/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="engineering-features-python-tutorial-machine-learning" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png" src="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning-1024x1024.png" alt="Engineering features python tutorial machine learning. Image of an engineer working on a technical document. Midjourney. relataly.com" class="wp-image-12411" srcset="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 1024w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 140w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Feature engineering is about carefully choosing features instead of taking all the features at once. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading">Core Tasks</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The goal of feature engineering is to create a set of features that are representative of the underlying data and that can be used by the machine learning algorithm to make accurate predictions. Several tasks are commonly performed as part of the feature engineering process, including:</p>



<ul class="wp-block-list">
<li><strong>Data discovery</strong>: To solve real-world problems with analytics, it is crucial to understand the data. Once you have gathered your data, describing and visualizing the data are means to familiarize yourself with it and develop a general feel for the data. </li>



<li><strong>Data structuring:</strong> The data needs to be structured into a unified and usable format. Variables may have a wrong datatype, or the data is distributed across different data frames and must first be merged. In these cases, we first need to bring the data together and into the right shape.</li>



<li><strong>Data cleansing:</strong> Besides being structured, data needs to be cleaned. Records may be redundant or contaminated with errors and missing values that can hinder our model from learning effectively. The same goes for outliers that can distort statistics. </li>



<li><strong>Data transformation:</strong> We can increase the predictive power of our input features by transforming them. Activities may include applying mathematical functions, removing specific data, or grouping variables into bins. Or we create entirely new features out of several existing ones. </li>



<li><strong>Feature selection: </strong>Only some may contain valuable information from the many available variables. By sorting variables that are less relevant and selecting the most promising features, we can create models that are less complex and yield better results.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading">Exploratory Feature Engineering Toolset</h3>



<p class="wp-block-paragraph">Exploratory analysis for identifying and assessing relevant features knows several tools: </p>



<ul class="wp-block-list">
<li>Data Cleansing</li>



<li>Descriptive statistics</li>



<li>Univariate Analysis</li>



<li>Bi-variate Analysis</li>



<li>Multivariate Analysis</li>
</ul>



<h2 class="wp-block-heading">Data Cleansing</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Educational data is often remarkably perfect, without any errors or missing values. However, it is important to recognize that most real-world data has data quality issues. Some reasons for data quality issues are </p>



<ul class="wp-block-list">
<li>Standardization issues because the data was recorded from different peoples, sensor types, etc.</li>



<li>Sensor or system outages can lead to gaps in the data or create erroneous data points.</li>



<li>Human errors</li>
</ul>



<p class="wp-block-paragraph">An important part of feature engineering is to inspect the data and ensure its quality before use. This is what we understand as &#8220;data cleansing.&#8221; It includes several tasks that aim to improve the data quality, remove erroneous data points and bring the data into a more useful form. </p>



<ul class="wp-block-list">
<li>Cleaning errors, missing values, and other issues.</li>



<li>Handling possible imbalanced data </li>



<li>Removing obvious outliers</li>



<li>Standardisation, e.g., dates or adresses </li>
</ul>



<p class="wp-block-paragraph">Accomplishing these tasks requires a good understanding of the data. We, therefore, carry out data cleansing activities closely intertwined with other exploratory tasks, e.g., univariate and bivariate data analysis. Also, remember that visualizations can aid in the process, as they can greatly enhance your ability to analyze and understand the data. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h4 class="wp-block-heading">Descriptive Statistics</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">One of the first steps in familiarizing oneself with a new dataset is to use descriptive statistics. Descriptive statistics help understand the data and how the sample represents the real-world population. We can use several statistical measures to analyze and describe a dataset, including the following:</p>



<ul class="wp-block-list">
<li><strong>Measures of Central Tendency</strong> represent a typical value of the data.
<ul class="wp-block-list">
<li><strong>The mean:</strong> The average-based adds together all values in the sample and divides them by the number of samples.</li>



<li><strong>The median</strong>: The median is the value that lies in the middle of the range of all sample values</li>



<li><strong>The mode: </strong>is the most occurring value in a sample set (for categorical variables)</li>
</ul>
</li>



<li><strong>Measures of Variability</strong> tell us something about the spread of the data.
<ul class="wp-block-list">
<li><strong>Range:</strong> The difference between the minimum and maximum value</li>



<li><strong>Variance:</strong> This is the average of the squared difference of the mean.</li>



<li><strong>Standard Deviation:</strong> The square root of the variance.</li>
</ul>
</li>



<li>and <strong>Measures of Frequency</strong> inform us how often we can expect a value to be present in the data, e.g., value counts</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph"><strong>Univariate Analysis</strong></p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">As &#8220;uni&#8221; suggests, the univariate analysis focuses on a single variable. Rather than examining the relationships between the variables, univariate analysis employs descriptive statistics and visualizations to understand individual columns better.</p>



<p class="wp-block-paragraph">Which illustrations and measures we use depends on the type of the variable.</p>



<p class="wp-block-paragraph"><strong>Categorical variables (incl. binary)</strong></p>



<ul class="wp-block-list">
<li>Descriptive measures include counts in percent and absolute values</li>



<li>Visualizations include pie charts, bar charts (count plots)</li>
</ul>



<p class="wp-block-paragraph"><strong>Continuous variables</strong></p>



<ul class="wp-block-list">
<li>Descriptive measures include min, max, median, mean, variance, standard deviation, and quantiles.</li>



<li>Visualizations include box plots, line plots, and histograms.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="838" height="585" data-attachment-id="9261" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output.png" data-orig-size="838,585" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Normal distribution" data-image-description="&lt;p&gt;Normal distribution, univariate analysis&lt;/p&gt;
" data-image-caption="&lt;p&gt;Normal distribution, univariate analysis&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output.png" alt="" class="wp-image-9261" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output.png 838w, https://www.relataly.com/wp-content/uploads/2022/09/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output.png 768w" sizes="(max-width: 838px) 100vw, 838px" /><figcaption class="wp-element-caption">Normal distribution, univariate analysis</figcaption></figure>



<figure class="wp-block-image size-full"><img decoding="async" width="751" height="194" data-attachment-id="9293" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-12-15/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" data-orig-size="751,194" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" alt="" class="wp-image-9293" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png 751w, https://www.relataly.com/wp-content/uploads/2022/09/image-12.png 300w" sizes="(max-width: 751px) 100vw, 751px" /></figure>
</div>
</div>



<h4 class="wp-block-heading">Bi-variate Analysis </h4>



<p class="wp-block-paragraph">Bi-variate (two-variate) analysis is a kind of statistical analysis that focuses on the relationship between two variables, for example, between a feature column and the target variable. In the case of machine learning projects, bivariate analysis can help to identify features that are potentially predictive of the label or the regression target. </p>



<p class="wp-block-paragraph">Model performance will benefit from strong linear dependencies. In addition, we are also interested in examining the relationships among the features used to train the model. Different types of relations exist that can be examined using various plots and statistical measures:</p>



<h4 class="wp-block-heading">Numerical/Numerical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Both variables have numerical values. We can illustrate their relation using lineplots or dot plots. We can examine such relations with <a href="https://www.relataly.com/category/data-science/pearson-correlation/" target="_blank" rel="noreferrer noopener">correlation analysis</a>.</p>



<p class="wp-block-paragraph">The ideal feature subset contains features that are not correlated with each other but are heavily correlated with the target variable. We can use dimensionality reduction to reduce a dataset with many features to a lower-dimensional space in which the remaining features are less correlated.</p>



<p class="wp-block-paragraph">Traditional correlation analysis (e.g., Pearson) cannot consider non-linear relations. We can identify such a relation manually by visualizing the data, for example, using line plots. Once we denote a non-linear relation, we could try to apply mathematical transformations to one of the variables to make their relation more linear. </p>



<p class="wp-block-paragraph">For pairwise analysis, we must understand which variables we deal with. We can differentiate between three categories:</p>



<ul class="wp-block-list">
<li>Numerical/Categorical</li>



<li>Numerical/Numerical</li>



<li>Categorical/Categorical</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2022/09/image-2.png" alt="Heatmaps illustrate the relation between features and a target variable." class="wp-image-9269" width="372" height="328"/><figcaption class="wp-element-caption">Heatmaps illustrate the relation between features and a target variable.</figcaption></figure>
</div>
</div>



<div style="height:36px" aria-hidden="true" class="wp-block-spacer"></div>



<h4 class="wp-block-heading">Numerical/Categorical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Plots that visualize the relationship between a categorical and a numerical variable include barplots and lineplots. </p>



<p class="wp-block-paragraph">Especially helpful are histograms (count plots). They can highlight differences in the distribution of the numerical variable for different categories.</p>



<p class="wp-block-paragraph">A specific subcase is a numerical/date relation. Such relations are typically visualized using line plots. In addition, we want to look out for linear or non-linear dependencies. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9286" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-6-15/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" data-orig-size="764,406" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" alt="the lineplot is useful for feature exploration and engineering" class="wp-image-9286" width="379" height="201" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png 764w, https://www.relataly.com/wp-content/uploads/2022/09/image-6.png 300w" sizes="(max-width: 379px) 100vw, 379px" /><figcaption class="wp-element-caption">Line charts are useful when examining trends.</figcaption></figure>
</div>
</div>



<h4 class="wp-block-heading">Categorical/Categorical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The relation between two categorical variables can be studied, including density plots, histograms, and bar plots.</p>



<p class="wp-block-paragraph">For example, with car types (attributes: sedan and coupe) and colors (characteristics: red, blue, yellow), we can use a barplot to see if sedans are more often red than coupes. Differences in the distribution of characteristics can be a starting point for attempts to manipulate the features and improve model performance. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9291" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-11-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" data-orig-size="765,396" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" alt="the barplot is useful for feature exploration and engineering" class="wp-image-9291" width="374" height="194" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png 765w, https://www.relataly.com/wp-content/uploads/2022/09/image-11.png 300w" sizes="(max-width: 374px) 100vw, 374px" /><figcaption class="wp-element-caption">Bar and column charts are a great way to compare numeric values for discrete categories visually.</figcaption></figure>
</div>
</div>



<h4 class="wp-block-heading">Multivariate Analysis</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph"><em>Multivariate</em> analysis encompasses the simultaneous analysis of more than two variables. The approach can uncover multi-dimensional dependencies and is often used in advanced feature engineering. For example, you may find that two variables are weakly correlated with the target variable, but when combined, their relation intensifies. So you might try to create a new feature that uses the two variables as input. Plots that can visualize relations between several variables include dot plots and violin plots.</p>



<p class="wp-block-paragraph">In addition, multivariate analysis refers to techniques to reduce the dimensionality of a dataset. For example, principal component analysis (PCA) or factor analysis can condense the information in a data set into a smaller number of synthetic features.</p>



<p class="wp-block-paragraph">Now that we have a good understanding of what feature selection techniques are available, we can start the practical part and apply them.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9282" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-4-20/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" data-orig-size="738,409" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" alt="the scatterplot is useful for feature exploration and engineering" class="wp-image-9282" width="377" height="209" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png 738w, https://www.relataly.com/wp-content/uploads/2022/09/image-4.png 300w" sizes="(max-width: 377px) 100vw, 377px" /><figcaption class="wp-element-caption">Scatter charts are useful when you want to compare two numeric quantities and see a relationship or correlation between them.</figcaption></figure>



<figure class="wp-block-image size-full"><img decoding="async" width="743" height="405" data-attachment-id="9294" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-13-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" data-orig-size="743,405" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-13" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" alt="the violin plot is useful for feature exploration and engineering" class="wp-image-9294" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png 743w, https://www.relataly.com/wp-content/uploads/2022/09/image-13.png 300w" sizes="(max-width: 743px) 100vw, 743px" /></figure>
</div>
</div>



<div style="height:100px" aria-hidden="true" class="wp-block-spacer"></div>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/cryptocurrency-price-charts-with-color-overlay-python/2820/" target="_blank" rel="noreferrer noopener">Color-Coded Cryptocurrency Price Charts in Python</a></p>



<h2 class="wp-block-heading" id="h-feature-engineering-for-car-price-regression-with-python-and-scikit-learn">Feature Engineering for Car Price Regression with Python and Scikit-learn</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The value of a car on the market depends on various factors. The distance traveled with the vehicle and the year of manufacture is obvious dependencies. But beyond that, we can use many other factors to train a machine learning model that predicts the selling price of the used car market. The following hands-on Python tutorial will create such a model. We will work with a dataset containing used cars&#8217; characteristics in the following. For marketing, it is crucial to understand what car characteristics determine the price of a vehicle. Our goal is to model the car price from the available independent variables. We aim to build a model that performs well on a small but powerful input subset. </p>



<p class="wp-block-paragraph">Exploring and creating features varies between different application domains. For example, feature engineering in computer vision will differ greatly from feature engineering for regression or classification models or NLP models. So the example provided in this article is just for regression models.</p>



<p class="wp-block-paragraph">We follow an exploratory process that includes the following steps:</p>



<ol class="wp-block-list">
<li>Loading the data</li>



<li>Cleaning the data</li>



<li>Univariate analysis</li>



<li>Bivariate analysis</li>



<li>Selecting features</li>



<li>Data preparation </li>



<li>Model training</li>



<li>Measuring performance</li>
</ol>



<p class="wp-block-paragraph">Finally, we compare the performance of our model, which was trained on a minimal set of features, to a model that uses the original data.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="12810" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png" src="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney-512x512.png" alt="Yes, you can judge by the length of the beard that this guy is a legendary feature engineer. Image created with Midjourney." class="wp-image-12810" srcset="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Yes, you can judge by the length of the beard that this guy is a legendary feature engineer. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph">The Python code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_f9d778-26"><a class="kb-button kt-button button kb-btn_d0af05-38 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/015%20Hyperparameter%20Tuning%20of%20Regression%20Models%20using%20Random%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_7b2495-91 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before you proceed, ensure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python</a> environment (3.8 or higher) and the required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>



<li>Seaborn</li>



<li>Scikit-learn</li>
</ul>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading">About the Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we will be working with a dataset containing listings for 111763&nbsp;used cars. The data includes 13 variables, including the dependent target variable</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<ul class="wp-block-list">
<li><strong>prod_date:</strong> The year of production</li>



<li><strong>maker: </strong>The manufacturer&#8217;s name</li>



<li><strong>model: </strong>The car edition</li>



<li><strong>trim: </strong>Different versions of the model</li>



<li><strong>body_type: </strong>The body style of a vehicle</li>



<li><strong>transmission_type: </strong>The way the power is brought to the wheels</li>



<li><strong>state</strong>: The state in which the car is auctioned</li>



<li><strong>condition</strong>: The condition of the cars</li>



<li><strong>odometer</strong>: The distance the car has traveled since manufactured</li>



<li><strong>exterior_color</strong>: Exterior color</li>



<li><strong>interior_color</strong>: Interior color</li>



<li><strong>sale_price (target variable):</strong> The price a car was sold </li>



<li><strong>sale_date: </strong>The date on which the car has been sold</li>
</ul>
</div>
</div>



<p class="wp-block-paragraph">The dataset is available for download from <a href="https://www.kaggle.com/datasets/lepchenkov/usedcarscatalog" target="_blank" rel="noreferrer noopener">Kaggle.com</a>, but you can execute the code below and load the data from the relataly GitHub repository.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="510" data-attachment-id="12429" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" data-orig-size="505,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" alt="Car price prediction machine learning python tutorial. Image of different cars cartoon style. Midjourney. relataly.com" class="wp-image-12429" srcset="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 297w, https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Car price prediction is a solid use case for machine learning. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by importing the necessary libraries and downloading the dataset from the relataly GitHub repository. Next, we will read the dataset into a pandas DataFrame. In addition, we store the name of our regression target variable to &#8216;price_usd,&#8217; which is one of the columns in the initial dataset. The &#8220;.head ()&#8221; function displays the first records of our DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5
from codecs import ignore_errors
import math
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from pandas.api.types import is_string_dtype, is_numeric_dtype 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.inspection import permutation_importance
from sklearn.model_selection import ShuffleSplit
# Original Data Source: 
# https://www.kaggle.com/datasets/tunguz/used-car-auction-prices
# Load train and test datasets
df = pd.read_csv(&quot;https://raw.githubusercontent.com/flo7up/relataly_data/main/car_prices2/car_prices.csv&quot;)
df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker			model		trim		body_type		transmission_type	state	condition	odometer	exterior_color	interior	sellingprice	date
0	2015		Kia				Sorento		LX			SUV				automatic			ca		5.0			16639.0		white			black		21500	2014-12-16
1	2015		Nissan			Altima		2.5 S		Sedan			automatic			ca		1.0			5554.0		gray			black		10900	2014-12-30
2	2014		Audi			A6	3.0T 	Prestige 	quattro	Sedan	automatic			ca		4.8			14414.0		black			black		49750	2014-12-16</pre></div>



<p class="wp-block-paragraph">We now have a dataframe that contains 12 columns and the dependent target variable we want to predict. </p>



<h3 class="wp-block-heading" id="h-step-2-data-cleansing">Step #2 Data Cleansing</h3>



<p class="wp-block-paragraph">Now that we have loaded the data, we begin with the exploratory analysis. First, we will put it into shape. </p>



<h4 class="wp-block-heading" id="h-2-1-check-names-and-datatypes">2.1 Check Names and Datatypes</h4>



<p class="wp-block-paragraph">If the names in a dataset are not self-explaining, it is easy to get confused with all the data. Therefore, will rename some of the columns and provide clearer names. There is no default naming convention, but striving for consistency, simplicity, and understandability is generally a good idea. </p>



<p class="wp-block-paragraph">The following code line renames some of the columns. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># rename some columns for consistency
df.rename(columns={'exterior_color': 'ext_color', 
                   'interior': 'int_color', 
                   'sellingprice': 'sale_price'}, inplace=True)
df.head(1)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	transmission_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			automatic			ca		5.0			16639.0		white		black		21500		2014-12-16</pre></div>



<p class="wp-block-paragraph">Next, we will check and remove possible duplicates.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check and remove dublicates
print(len(df))
df = df.drop_duplicates()
print(len(df))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">OUT: 111763, 111763</pre></div>



<p class="wp-block-paragraph">There were no duplicates in the data, which is good.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check datatypes
df.dtypes</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year              int64
maker                 object
model                 object
trim                  object
body_type             object
transmission_type     object
state                 object
condition            float64
odometer             float64
ext_color             object
int_color             object
sale_price             int64
date                  object
dtype: object</pre></div>



<p class="wp-block-paragraph">We compare the datatypes to the first records we printed in the previous section. Be aware that categorical variables (e.g., of type &#8220;string&#8221;) are shown as &#8220;objects.&#8221; The data types look as expected.</p>



<p class="wp-block-paragraph">Finally, we define our target variable&#8217;s name, &#8220;sale_price.&#8221; The target variable will be our regression target, and we will use its name often. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># consistently define the target variable
target_name = 'sale_price'</pre></div>



<h4 class="wp-block-heading">2.2 Checking Missing Values</h4>



<p class="wp-block-paragraph">Some machine learning algorithms are sensitive to missing values. Handling missing values is, therefore a crucial step in exploratory feature engineering. </p>



<p class="wp-block-paragraph">Let&#8217;s first gain an overview of null values. With a larger DataFrame, it would be inefficient to review all the rows and columns individually for missing values. Instead, we use the sum function and visualize the results to get a quick overview of missing data in the DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check for missing values
null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
fig = plt.subplots(figsize=(16, 6))
ax = sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue')
pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)
ax.set_title('Overview of missing values')</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="384" data-attachment-id="9365" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/missing-values-bar-chart-for-car-price-regression/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png" data-orig-size="1026,385" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="missing-values-bar-chart-for-car-price-regression" data-image-description="&lt;p&gt;overview of missing values in the car price regression dataset&lt;/p&gt;
" data-image-caption="&lt;p&gt;overview of missing values in the car price regression dataset&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png" src="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression-1024x384.png" alt="overview of missing values in the car price regression dataset" class="wp-image-9365" srcset="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 1026w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">The bar chart shows that there are several variables with missing values. Variables with many missing values can negatively affect model performance, which is why we should try to treat them. </p>



<h4 class="wp-block-heading">2.3 Overview of Techniques for Handling Missing Values</h4>



<p class="wp-block-paragraph"> There are various ways to handle missing data. The most common options to handle missing values are:</p>



<ul class="wp-block-list">
<li><strong>Custom substitution value:</strong> Sometimes, the information that a value is missing can be important information to a predictive model. We can substitute missing values with a placeholder value such as &#8220;missing&#8221; or &#8220;unknown.&#8221; The approach works particularly well for variables with many missing values. </li>



<li><strong>Statistical filling: </strong>We can fill in a statistically chosen measure, such as the mean or median for numeric variables, or the mode for categorical variables.</li>



<li><strong>Replace using Probabilistic PCA:</strong> PCA uses a linear approximation function that tries to reconstruct the missing values from the data.</li>



<li><strong>Remove entire rows:</strong> It is crucial to ensure that we only use data we know is correct. In those cases, we can drop an entire row if it contains a missing value. This also solves the problem but comes at the cost of losing potentially important information &#8211; especially if the data quantity is small.</li>



<li><strong>Remove the entire column:</strong> It is another alternative way of resolving missing values. This is typically the least option, as we lose an entire feature. </li>
</ul>



<p class="wp-block-paragraph">How we handle missing values can dramatically affect our prediction results. To find the ideal method, it is often necessary to experiment with different techniques. Sometimes, the information that a value is missing can also be important. This occurs when the missing values are not randomly distributed in the data and show a pattern. In such a case, you should create an additional feature that states whether values are missing.</p>



<h4 class="wp-block-heading">2.4 Handle Missing Values</h4>



<p class="wp-block-paragraph">In this example, we will use the median value to fill in the missing values of our numeric variables and the mode to replace the missing values of categorical variables. When we check again, we can see that odometer and condition have no more missing values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># fill missing values with the mean for numeric columns
for col_name in df.columns:
    if (is_numeric_dtype(df[col_name])) and (df[col_name].isna().sum() &gt; 0):
        df[col_name].fillna(df[col_name].median(), inplace=True) # alternatively you could also drop the columns with missing values using .drop(columns=['engine_capacity']) 
print(df.isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year                0
maker                 2078
model                 2096
trim                  2157
body_type             2641
transmission_type    13135
state                    0
condition                0
odometer                 0
ext_color              173
int_color              173
sale_price               0
date                     0
dtype: int64</pre></div>



<p class="wp-block-paragraph">Next, we handle the missing values of transmission_type by filling them with the mode.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check the distribution of missing values for transmission type
print(df['transmission_type'].value_counts())
# fill values with the mode
df['transmission_type'].fillna(df['transmission_type'].mode()[0], inplace=True)
print(df['transmission_type'].isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">automatic    108198
manual         3565
Name: transmission_type, dtype: int64
0</pre></div>



<p class="wp-block-paragraph">We handle body_type analogs as transmission_type and fill the missing values with the mode. The mode is the value that appears most often in the data. The mode of transmission_type is &#8220;Sedan.&#8221; However, this value is not that prevalent, as half of the cars have other body types, e.g., &#8220;SUV.&#8221; Therefore, we will replace the missing values with &#8220;Unknown.&#8221;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check the distribution of missing values for body type
print(df['body_type'].value_counts())
# fill values with 'Unknown'
df['body_type'].fillna(&quot;Unknown&quot;, inplace=True)
print(df['body_type'].isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Sedan                 39955
SUV                   23836
sedan                  8377
suv                    4934
Hatchback              4241
                      ...  
cts-v coupe               2
Ram Van                   1
Transit Van               1
CTS Wagon                 1
beetle convertible        1
Name: body_type, Length: 74, dtype: int64
0</pre></div>



<p class="wp-block-paragraph">Now we have handled most of the missing values in our data. However, some variables are still left, with a few missing values. We will make things easy and simply drop all remaining records with missing values. Considering that we have more than 100k records and only a few variables, we can afford to do this without fear of a severe impact on our model performance. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># remove all other records with missing values
df.dropna(inplace=True)
print(df.isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year            0
maker                0
model                0
trim                 0
body_type            0
transmission_type    0
state                0
condition            0
odometer             0
ext_color            0
int_color            0
sale_price           0
date                 0
dtype: int64</pre></div>



<p class="wp-block-paragraph">Finally, we check again for missing values and see that everything has been filled. Now, we have a cleansed dataset with 13 columns. </p>



<h4 class="wp-block-heading">2.3 Save a Copy of the Cleaned Data</h4>



<p class="wp-block-paragraph">Before exploring the features, let&#8217;s make a copy of the cleaned data. We will later use this &#8220;full&#8221; dataset to compare the performance of our model with a baseline model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a copy of the dataset with all features for comparison reasons
df_all = df.copy()</pre></div>



<h3 class="wp-block-heading">Step #3 Getting started with Statistical Univariate Analysis</h3>



<p class="wp-block-paragraph">Now it&#8217;s time to analyze the data and explore potential useful features for our subset. Although the process follows a linear flow in this example, you may notice in practice that you must go back and forth between different steps of the feature exploration and engineering process. </p>



<p class="wp-block-paragraph">First, we will look at the variance of the features in the initial dataset. Machine learning models can only learn from variables that have adequate variance. So, low-variance features are often candidates to exclude from the feature subset.</p>



<p class="wp-block-paragraph">We use the .describe() method to display univariate descriptive statistics about the numerical columns in our dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># show statistics for numeric variables
print(df.columns)
df.describe()</pre></div>



<p class="wp-block-paragraph">Next, we check the categorical variables. All variables seem to have a good variance. We can measure the variance with statistical measures or observe it manually using bar charts and scatterplots.</p>



<p class="wp-block-paragraph">We can use histplots to visualize the distributions of the numeric variables. The example below shows the histplot for our target variable sale_price.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Explore the variance of the target variable
variable_name = 'sale_price'
fig, ax = plt.subplots(figsize=(14,5))
sns.histplot(data=df[[variable_name]].dropna(), ax=ax, color='royalblue', kde=True)
ax.get_legend().remove()
ax.set_title(variable_name + ' Distribution')
ax.set_xlim(0, df[variable_name].quantile(0.99))</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9395" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-23-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png" data-orig-size="1051,395" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-23" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-23-1024x385.png" alt="distribution of the target variable in sale price regression; example for feature exploration and preparation with python and sklearn" class="wp-image-9395" width="695" height="261" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 1051w" sizes="(max-width: 695px) 100vw, 695px" /></figure>



<p class="wp-block-paragraph">The histplot shows that sale prices are skewed to the left. This means there are many cheap cars and fewer expensive ones, which makes sense.</p>



<p class="wp-block-paragraph">Next, we create bar plots for categorical values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 3.2 Illustrate the Variance of Numeric Variables 
f_list_numeric = [x for x in df.columns if (is_numeric_dtype(df[x]) and df[x].nunique() &gt; 2)]
f_list_numeric
# box plot design
PROPS = {
    'boxprops':{'facecolor':'none', 'edgecolor':'royalblue'},
    'medianprops':{'color':'coral'},
    'whiskerprops':{'color':'royalblue'},
    'capprops':{'color':'royalblue'}
    }
sns.set_style('ticks', {'axes.edgecolor': 'grey',  
                        'xtick.color': '0',
                        'ytick.color': '0'})
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(len(f_list_numeric) / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(14, nrows*1))
for i, ax in enumerate(fig.axes):
    if i &lt; len(f_list_numeric):
        column_name = f_list_numeric[i]
        sns.boxplot(data=df[column_name], orient=&quot;h&quot;, ax = ax, color='royalblue', flierprops={&quot;marker&quot;: &quot;o&quot;}, **PROPS)
        ax.set(yticklabels=[column_name])
        fig.tight_layout()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9392" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/barplots-to-visualize-the-variance-of-categorical-variables/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png" data-orig-size="1434,425" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="barplots-to-visualize-the-variance-of-categorical-variables" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png" src="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables-1024x303.png" alt="" class="wp-image-9392" width="786" height="232" srcset="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 1434w" sizes="(max-width: 786px) 100vw, 786px" /></figure>



<p class="wp-block-paragraph">We can observe two things: First, the variance of transmission type is low, as most cars have an automatic transmission. So transmission_type is the first variable that we exclude from our feature subset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Drop features with low variety
df = df.drop(columns=['transmission_type'])
df.head(2)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			ca		5.0			16639.0		white		black		21500		2014-12-16
1	2015		Nissan	Altima	2.5 S	Sedan		ca		1.0			5554.0		gray		black		10900		2014-12-30</pre></div>



<p class="wp-block-paragraph">Second, int_color and ext_color have many categorical values. By grouping some of these values that hardly ever occur, we can help the model to focus on the most relevant patterns. However, before we do that, we need to take a closer look at how the target variable differs between the categories. </p>



<h3 class="wp-block-heading">Step #4 Bi-variate Analysis</h3>



<p class="wp-block-paragraph">Now that we have a general understanding of our dataset&#8217;s individual variables, let&#8217;s look at pairwise dependencies. We are particularly interested in the relationship between features and the target variables. Our goal is to keep features whose dependence on the target variable shows some pattern &#8211; linear or non-linear. On the other hand, we want to exclude features whose relationship with the target variable looks arbitrary. </p>



<p class="wp-block-paragraph">Visualizations have to take the datatypes of our variables into account. To illustrate the relation between categorical features and the target, we create boxplots and kdeplots. For numeric (continuous) features, we use scatterplots.</p>



<h4 class="wp-block-heading">4.1 Analyzing the Relation between Features and the Target Variable</h4>



<p class="wp-block-paragraph">We begin by taking a closer look at the int_color and ext_color. We use kdeplots to highlight the distribution of prices depending on different colors. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_kdeplot(column_name):
    fig, ax = plt.subplots(figsize=(20,8))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()
    
make_kdeplot('ext_color')
</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9418" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output-2-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png" data-orig-size="1168,507" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output-2-1024x444.png" alt="Density plots are useful during feature exloration and selection" class="wp-image-9418" width="637" height="275" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 768w" sizes="(max-width: 637px) 100vw, 637px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_kdeplot('int_color')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9419" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output2-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output2.png" data-orig-size="1168,507" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output2.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output2-1024x444.png" alt="Another density plot that shows the distribution of colors across our car dataset" class="wp-image-9419" width="655" height="283" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 1168w" sizes="(max-width: 655px) 100vw, 655px" /></figure>



<p class="wp-block-paragraph">In both cases, a few colors are prevalent and account for most observations. Moreover, distributions of the car price differ for these prevalent colors. These differences look promising as they may help our model to differentiate cheaper cars from more expensive ones. To simplify things, we group the colors that hardly occur into a color category called &#8220;other.&#8221;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Binning features
df['int_color'] = [x if  x in(['black', 'gray', 'white', 'silver', 'blue', 'red']) else 'other' for x in df['int_color']]
df['ext_color'] = [x if  x in(['black', 'gray', 'white', 'silver', 'blue', 'red']) else 'other' for x in df['ext_color']]</pre></div>



<p class="wp-block-paragraph">Next, we create plots for all remaining features. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Vizualising Distributions
f_list = [x for x in df.columns if ((is_numeric_dtype(df[x])) and x != target_name) or (df[x].nunique() &lt; 50)]
f_list_len = len(f_list)
print(f'numeric features: {f_list_len}')
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(f_list_len / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(18, nrows*5))
for i, ax in enumerate(fig.axes):
    if i &lt; f_list_len:
        column_name = f_list[i]
        print(column_name)
        # If a variable has more than 8 unique values draw a scatterplot, else draw a violinplot 
        if df[column_name].nunique() &gt; 100 and is_numeric_dtype(df[column_name]):
            # Draw a scatterplot for each variable and target_name
            sns.scatterplot(data=df, y=target_name, x=column_name, ax = ax)
        else: 
            # Draw a vertical violinplot (or boxplot) grouped by a categorical variable:
            myorder = df.groupby(by=[column_name])[target_name].median().sort_values().index
            sns.boxplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
            #sns.violinplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
        ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
        ax.set_title(column_name)
    fig.tight_layout()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9397" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/boxplots/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png" data-orig-size="1289,2153" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="boxplots" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png" src="https://www.relataly.com/wp-content/uploads/2022/09/boxplots-613x1024.png" alt="boxplots and scatterplots help us to understand the relationship between our features and the target variable" class="wp-image-9397" width="725" height="1211" srcset="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 613w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 180w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 920w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 1226w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 1289w" sizes="(max-width: 725px) 100vw, 725px" /></figure>



<p class="wp-block-paragraph">Again, for categorical variables, we want to see differences in the distribution of the categories. Based on the boxplot&#8217;s median and the quantiles, we can denote that prod_year, int_color, and condition show adequate variance. The scatterplot for the odometer value also looks good. So we want to keep these features. In contrast, the differences between &#8220;state&#8221; and &#8220;ext_color&#8221; are rather weak. Therefore, we exclude these variables from our subset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># drop columns with low variance
df.drop(columns=['state', 'ext_color'], inplace=True)</pre></div>



<p class="wp-block-paragraph">Finally, if you want to take a more detailed look at the numeric features, you can use jointplots. These are scatterplots with additional information about the distributions. The example below shows the jointplot for the odometer value vs price. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># detailed univariate and bivariate analysis of 'odometer' using a jointplot 
def make_jointplot(feature_name):
    p = sns.jointplot(data=df, y=feature_name, x=target_name, height=6, ratio=6, kind='reg', joint_kws={'line_kws':{'color':'coral'}})
    p.fig.suptitle(feature_name + ' Distribution')
    p.ax_joint.collections[0].set_alpha(0.3)
    p.ax_joint.set_ylim(df[feature_name].min(), df[feature_name].max())
    p.fig.tight_layout()
    p.fig.subplots_adjust(top=0.95)
make_jointplot ('odometer')
# Alternatively you can use hex_binning
# def make_joint_hexplot(feature_name):
#     p = sns.jointplot(data=df, y=feature_name, x=target_name, height=10, ratio=1, kind=&quot;hex&quot;)
#     p.ax_joint.set_ylim(0, df[feature_name].quantile(0.999))
#     p.ax_joint.set_xlim(0, df[target_name].quantile(0.999))
#     p.fig.suptitle(feature_name + ' Distribution')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11491" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-8-10/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" data-orig-size="425,427" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" alt="" class="wp-image-11491" width="499" height="502" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png 425w, https://www.relataly.com/wp-content/uploads/2022/12/image-8.png 140w" sizes="(max-width: 499px) 100vw, 499px" /></figure>



<p class="wp-block-paragraph">Here is another example of a jointplot for the variable &#8216;condition.&#8217;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># detailed univariate and bivariate analysis of 'condition' using a jointplot 
make_jointplot('condition')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9423" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/jointplot-condition/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" data-orig-size="425,427" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="jointplot-condition" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" src="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" alt="Dotplot that shows the relationship between two variables: car condition vs sale price" class="wp-image-9423" width="472" height="475" srcset="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png 425w, https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png 150w" sizes="(max-width: 472px) 100vw, 472px" /></figure>



<p class="wp-block-paragraph">The graphs show a linear relationship between the price for the condition and the odometer value. </p>



<h4 class="wp-block-heading" id="h-4-2-correlation-matrix">4.2 Correlation Matrix</h4>



<p class="wp-block-paragraph">Correlation analysis is a technique to quantify the dependency between numeric features and a target variable. Different ways exist to calculate the correlation coefficient. For example, we can use Pearson correlation (linear relation), Kendall correlation (ordinal association), or Spearman (monotonic dependence). </p>



<p class="wp-block-paragraph">The example below uses Pearson correlation, which concentrates on the linear relationship between two variables. The Pearson correlation score lies between -1 and 1. General interpretations of the absolute value of the correlation coefficient&nbsp;are:</p>



<ul class="wp-block-list">
<li>.00-.19 &#8220;very weak&#8221;</li>



<li>.20-.39 &#8220;weak&#8221;</li>



<li>.40-.59 &#8220;moderate&#8221;</li>



<li>.60-.79 &#8220;strong&#8221;</li>



<li>.80-1.0 &#8220;very strong&#8221;</li>
</ul>



<p class="wp-block-paragraph">More information on the Pearson correlation can be found <a href="https://www.relataly.com/category/data-science/pearson-correlation/" target="_blank" rel="noreferrer noopener">here</a> and in <a href="https://www.relataly.com/stock-market-correlation-matrix-in-python/103/" target="_blank" rel="noreferrer noopener">this article on the correlation between covid-19 and the stock market</a>.</p>



<p class="wp-block-paragraph">We will calculate a correlation matrix that provides the correlation coefficient for all features in our subset, incl. sale_price.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 4.1 Correlation Matrix
# correlation heatmap allows us to identify highly correlated explanatory variables and reduce collinearity
plt.figure(figsize = (9,8))
plt.yticks(rotation=0)
correlation = df.corr()
ax =  sns.heatmap(correlation, cmap='GnBu',square=True, linewidths=.1, cbar_kws={&quot;shrink&quot;: .82},annot=True,
            fmt='.1',annot_kws={&quot;size&quot;:10})
sns.set(font_scale=0.8)
for f in ax.texts:
        f.set_text(f.get_text())  </pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9400" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-24-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" data-orig-size="646,549" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-24" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" alt="Heatmap in Python that shows the correlation between selected variables in our car dataset" class="wp-image-9400" width="554" height="471" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png 646w, https://www.relataly.com/wp-content/uploads/2022/09/image-24.png 300w" sizes="(max-width: 554px) 100vw, 554px" /></figure>



<p class="wp-block-paragraph">All our remaining numeric features strongly correlate with price (positive or negative). However, this is not all that matters. Ideally, we want to have features that have a low correlation with each other. We can see that prod_year and condition are moderately correlated (coefficient: 0.5). Because prod_year is more correlated with price (coefficient: 0.6) than condition (coefficient: 0.5), we drop the condition variable. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">df.drop(columns='condition', inplace=True)</pre></div>



<h3 class="wp-block-heading">Step #5 Data Preprocessing </h3>



<p class="wp-block-paragraph">Now our subset contains the following variables:</p>



<ul class="wp-block-list">
<li>prod_year</li>



<li>maker</li>



<li>model</li>



<li>trim</li>



<li>body_type</li>



<li>odometer</li>



<li>int_color</li>



<li>sale_price</li>
</ul>



<p class="wp-block-paragraph">Next, we prepare the data for use as input to train a regression model. Before we train the model, we need to make a few final preparations. For example, we use a label encoder to replace the strong_values of the categorical variables with numeric values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># encode categorical variables 
def encode_categorical_variables(df):
    # create a list of categorical variables that we want to encode
    categorical_list = [x for x in df.columns if is_string_dtype(df[x])]
    le = LabelEncoder()
    # apply the encoding to the categorical variables
    # because the apply() function has no inplace argument,  we use the following syntax to transform the df
    df[categorical_list] = df[categorical_list].apply(LabelEncoder().fit_transform)
    return df
df_final_subset = encode_categorical_variables(df)
df_all_ = encode_categorical_variables(df_all)
# create a copy of the dataframe but without the target variable
df_without_target = df.drop(columns=[target_name])
df_final_subset.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	odometer	int_color	sale_price	date
0	2015		23		594		794		31			16639.0		0			21500		8
1	2015		34		59		98		32			5554.0		0			10900		17
2	2014		2		46		180		32			14414.0		0			49750		8
3	2015		34		59		98		32			11398.0		0			14100		13
4	2015		7		325		789		32			14538.0		0			7200		158</pre></div>



<h3 class="wp-block-heading" id="h-step-6-splitting-the-data-and-training-the-model">Step #6 Splitting the Data and Training the Model</h3>



<p class="wp-block-paragraph">To ensure that our regression model does not know the target variable, we separate car price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.</p>



<p class="wp-block-paragraph">Once the split function has prepared the datasets, we the regression model. Our model uses the Random Decision Forest algorithm from the scikit learn package. As a so-called ensemble model, the Random Forest is a robust Machine Learning algorithm. It considers predictions from a set of multiple independent estimators. </p>



<p class="wp-block-paragraph">The Random Forest algorithm has a wide range of hyperparameters. While we could optimize our model further by testing various configurations (hyperparameter tuning), this is not the focus of this article. Therefore, we will use the default hyperparameters for our model as defined by <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier" target="_blank" rel="noreferrer noopener">scikit-learn</a>. Please visit one of my recent articles on <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" target="_blank" rel="noreferrer noopener">hyperparameter tuning</a>, if you want to learn more about this topic.</p>



<p class="wp-block-paragraph">For comparison reasons, we train two models—one model with our subset of selected features. The second model uses all features, cleansed but without any further manipulations. </p>



<p class="wp-block-paragraph">We use shuffled cross-validation (cv=5) to evaluate our model&#8217;s performance on different data folds.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def splitting(df, name):
    # separate labels from training data
    X = df.drop(columns=[target_name])
    y = df[target_name] #Prediction label
    # split the data into x_train and y_train data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
    # print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
    print(name + '')
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test
# train the model
def train_model(X, y, X_train, y_train):
    estimator = RandomForestRegressor() 
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    scores = cross_val_score(estimator, X, y, cv=cv)
    estimator.fit(X_train, y_train)
    return scores, estimator
# train the model with the subset of selected features
X_sub, y_sub, X_train_sub, X_test_sub, y_train_sub, y_test_sub = splitting(df_final_subset, 'subset')
scores_sub, estimator_sub = train_model(X_sub, y_sub, X_train_sub, y_train_sub)
    
# train the model with all features
X_all, y_all, X_train_all, X_test_all, y_train_all, y_test_all = splitting(df_all_, 'fullset')
scores_all, estimator_all = train_model(X_all, y_all, X_train_all, y_train_all)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">subset
train:  (76592, 8) (76592,)
test:  (32826, 8) (32826,)</pre></div>



<h3 class="wp-block-heading" id="h-step-7-comparing-regression-models">Step #7 Comparing Regression Models</h3>



<p class="wp-block-paragraph">Finally, we want to see how the model performs and how its performance compares against the model that uses all variables. </p>



<h4 class="wp-block-heading" id="h-7-1-model-scoring">7.1 Model Scoring</h4>



<p class="wp-block-paragraph">We use different regression metrics to measure the performance. Then we create a barplot that compares the performance scores across the different validation folds (due to cross-validation). </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 7.1 Model Scoring 
def create_metrics(scores, estimator, X_test, y_test, col_name):
    scores_df = pd.DataFrame({col_name:scores})
    # predict on the test set
    y_pred = estimator.predict(X_test)
    y_df = pd.DataFrame(y_test)
    y_df['PredictedPrice']=y_pred
    # Mean Absolute Error (MAE)
    MAE = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))
    # Mean Absolute Percentage Error (MAPE)
    MAPE = mean_absolute_percentage_error(y_test, y_pred)
    print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')
    
    # calculate the feature importance scores
    r = permutation_importance(estimator, X_test, y_test, n_repeats=30, random_state=0)
    data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
    data_im['feature_names'] = X_test.columns
    data_im = data_im.sort_values('feature_permuation_score', ascending=False)
    
    return scores_df, data_im
scores_df_sub, data_im_sub = create_metrics(scores_sub, estimator_sub, X_test_sub, y_test_sub, 'subset')
scores_df_all, data_im_all = create_metrics(scores_all, estimator_all, X_test_all, y_test_all, 'fullset')
scores_df = pd.concat([scores_df_sub, scores_df_all],  axis=1)
# visualize how the two models have performed in each fold
fig, ax = plt.subplots(figsize=(10, 6))
scores_df.plot(y=[&quot;subset&quot;, &quot;fullset&quot;], kind=&quot;bar&quot;, ax=ax)
ax.set_title('Cross validation scores')
ax.set(ylim=(0, 1))
ax.tick_params(axis=&quot;x&quot;, rotation=0, labelsize=10, length=0)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean Absolute Error (MAE): 1643.39
Mean Absolute Percentage Error (MAPE): 24.36 %
Mean Absolute Error (MAE): 1813.78
Mean Absolute Percentage Error (MAPE): 25.23 %</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9436" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-29-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" data-orig-size="746,468" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-29" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" alt="barplot that visualizes cross validation for a car price regression model" class="wp-image-9436" width="494" height="310" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png 746w, https://www.relataly.com/wp-content/uploads/2022/09/image-29.png 300w" sizes="(max-width: 494px) 100vw, 494px" /></figure>



<p class="wp-block-paragraph">The subset model achieves an absolute percentage error of around 24%, which is not so bad. But more importantly, our model performs better than the model that uses all features. However, the subset model is less complex as it only uses eight features instead of 12. So it is easier to understand and less costly to train.</p>



<h4 class="wp-block-heading">7.2 Feature Permutation Importance Scores</h4>



<p class="wp-block-paragraph">Next, we calculate feature importance scores. In this way, we can determine which features attribute the most to the predictive power of our model. Feature importance scores are a useful tool in the feature engineering process, as they provide insights into how the features in our subset contribute to the overall performance of our predictive model. Features with low importance scores can be eliminated from the subset or replaced with other features.</p>



<p class="wp-block-paragraph">Again we will compare our subset model to the model that uses all available features from the initial dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># compare the feature importance scores of the subset model to the fullset model
fig, axs = plt.subplots(1, 2, figsize=(20, 8))
sns.barplot(data=data_im_sub, y='feature_names', x=&quot;feature_permuation_score&quot;, ax=axs[0])
axs[0].set_title(&quot;Feature importance scores of the subset model&quot;)
sns.barplot(data=data_im_all, y='feature_names', x=&quot;feature_permuation_score&quot;, ax=axs[1])
axs[1].set_title(&quot;Feature importance scores of the fullset model&quot;)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="421" data-attachment-id="9437" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/cross-validation-scores-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png" data-orig-size="1200,493" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cross-validation-scores-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png" src="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1-1024x421.png" alt="Barplots that compare feature importance between the full dataset model and the subset model" class="wp-image-9437" srcset="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 1200w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">In the subset model, most features are relevant to the model&#8217;s performance. Only date and int_color do not seem to have a significant impact. For the full set model, five out of 12 features hardly contribute to the model performance (date, int_color, ext_color, state, transmission_type). </p>



<p class="wp-block-paragraph">Once you have a strong subset of features, you can automate the feature selection process using different techniques, e.g., forward or backward selection. Automated feature selection techniques will test different model variants with varying feature combinations to determine the best input dataset. This step is often done at the end of the feature engineering process. However, this is something for another article. </p>



<h2 class="wp-block-heading" id="h-conclusions">Conclusions</h2>



<p class="wp-block-paragraph">That&#8217;s it for now! This tutorial has presented an exploratory approach to feature exploration, engineering, and selection. You have gained an overview of tools and graphs that are useful in identifying and preparing features. The second part was a Python hands-on tutorial. We followed an exploratory feature engineering process to build a regression model for car prices. We used various techniques to discover and sort features and make a vital feature subset. These techniques include data cleansing, descriptive statistics, and univariate and bivariate analysis (incl. correlation). We also used some techniques for feature manipulation, including binning. Finally, we compared our subset model to one that uses all available data. </p>



<p class="wp-block-paragraph">If you take away one learning from this article, remember that in machine learning, less is often more. So training classic machine learning models on carefully curated feature subsets likely outperforms models that use all available information. </p>



<p class="wp-block-paragraph">I hope this article was helpful. I am always trying to improve and learn from my audience. So, if you have any questions or suggestions, please write them in the comments. </p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3eD49Kv" target="_blank" rel="noreferrer noopener">Zheng and Casari (2018) Feature Engineering for Machine Learning</a></li>



<li><a href="https://amzn.to/3TrBdDY" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3T38bLe" target="_blank" rel="noreferrer noopener">Chip Huyen (2022) Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph">Stock-market prediction is a typical regression problem. To learn more about feature engineering for stock-market prediction, check out <a href="https://www.relataly.com/feature-engineering-for-multivariate-time-series-models-with-python/1813/" target="_blank" rel="noreferrer noopener">this article on multivariate stock-market forecasting</a>.</p>
<p>The post <a href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/">Feature Engineering and Selection for Regression Models with Python and Scikit-learn</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">8832</post-id>	</item>
		<item>
		<title>Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</title>
		<link>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/</link>
					<comments>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Thu, 07 Apr 2022 17:55:36 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Cross-Validation]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Hyperparameter Tuning]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Random Forest Regression]]></category>
		<category><![CDATA[Random Search]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Tuning Random Decision Forests]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=6875</guid>

					<description><![CDATA[<p>Perfecting your machine learning model&#8217;s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble ... <a title="Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python" class="read-more" href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" aria-label="Read more about Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Perfecting your machine learning model&#8217;s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble model, and heavily influence its performance. Unlike model parameters, which are discovered during training by the machine learning algorithm, hyperparameters require pre-specification.</p>



<p class="wp-block-paragraph">In this comprehensive Python tutorial, we&#8217;ll guide you on how to harness the power of Random Search to optimize a regression model&#8217;s hyperparameters. Our illustrative example utilizes a Support Vector Machine (SVM) for predicting house prices. However, the fundamental principles you&#8217;ll learn can be seamlessly applied to any model. So why painstakingly fine-tune hyperparameters manually when Random Search can handle the task efficiently?</p>



<p class="wp-block-paragraph">Here&#8217;s a preview of what this Python tutorial entails:</p>



<ol class="wp-block-list">
<li>A brief overview of how Random Search operates and instances where it might be preferable to Grid Search.</li>



<li>A hands-on Python tutorial featuring a public house price dataset from Kaggle.com. The aim here is to train a regression model capable of predicting US house prices based on various properties.</li>



<li>Training a &#8216;best-guess&#8217; model in Python, followed by using Random Search to discover a model with enhanced performance.</li>



<li>Finally, we&#8217;ll implement cross-validation to validate our models&#8217; performance.</li>
</ol>



<p class="wp-block-paragraph">By the end of this tutorial, you&#8217;ll be well-equipped to let Random Search efficiently fine-tune your model&#8217;s hyperparameters, freeing up your time for other crucial tasks.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-hyperparameter-tuning">Hyperparameter Tuning </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, we need to specify hyperparameters before the training. It is usually impossible to find the best configuration without testing different configurations. </p>



<p class="wp-block-paragraph">Searching for a suitable model configuration is called &#8220;hyperparameter tuning&#8221; or &#8220;hyperparameter optimization.&#8221; Machine learning algorithms have varying hyperparameters and parameter values. For example, a random decision forest classifier allows us to configure varying parameters such as the number of trees, the maximum tree depth, and the minimum number of nodes required for a new branch. </p>



<p class="wp-block-paragraph">The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it gets to find an optimal model. We can use random search to automatize this process.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="506" height="501" data-attachment-id="12416" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/random-hyperparameter-tuning-machine-learning/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" data-orig-size="506,501" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="random-hyperparameter-tuning-machine-learning" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" src="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" alt="Random search can be an efficient way to tune the hyperparameters of a machine learning model. Image generated with Midjourney. Image of exploding dices with different colors. " class="wp-image-12416" srcset="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 506w, https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 140w" sizes="(max-width: 506px) 100vw, 506px" /><figcaption class="wp-element-caption">Random search can be an efficient way to tune the hyperparameters of a machine learning model. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Techniques for Tuning Hyperparameters</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning algorithm to optimize its performance on a specific dataset or task. Several techniques can be used for hyperparameter tuning, including:</p>



<ol class="wp-block-list">
<li><strong><a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">Grid Search</a>:</strong> grid search is a brute-force search algorithm that systematically evaluates a given set of hyperparameter values by training and evaluating a model for each combination of values. It is a simple and effective technique, but it can be computationally expensive, especially for large or complex datasets.</li>



<li><strong>Random Search: </strong>As mentioned, random search is an alternative to grid search that randomly samples a given set of hyperparameter values rather than evaluating all possible combinations. It can be more efficient than grid search, but it may not find the optimal set of hyperparameters.</li>



<li><strong>Bayesian Optimization:</strong> A bayesian optimization is a probabilistic approach to hyperparameter tuning, which uses Bayesian inference to model the distribution of hyperparameter values that are likely to produce a good performance. It can be more efficient and effective than grid search or random search, but it can be more challenging to implement and interpret.</li>



<li><strong>Genetic Algorithms:</strong> genetic algorithms are optimization algorithms inspired by the principles of natural selection and genetics. They use a population of candidate solutions, which are iteratively evolved and selected based on their fitness or performance, to find the optimal set of hyperparameters.</li>
</ol>



<p class="wp-block-paragraph">In this article, we specifically look at the Random Search technique.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="892" height="512" data-attachment-id="12417" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/image-of-an-old-mechanic-tuning-a-car-hyperparameter-tuning-machine-learning-python-tutorial-scikit-learn/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" data-orig-size="892,512" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" src="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" alt="You can spend much time tuning a machine learning model. Image generated with Midjourney. portrait of an old mechanic working on a car. relataly.com" class="wp-image-12417" srcset="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 892w, https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 768w" sizes="(max-width: 892px) 100vw, 892px" /><figcaption class="wp-element-caption">You can spend much time tuning a machine learning model. Image generated with Midjourney.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Random Search?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The random search algorithm generates models from hyperparameter permutations randomly selected from a grid of parameter values. The idea behind the randomized approach is that testing random configurations efficiently identifies a good model. We can use random search both for regression and classification models.</p>



<p class="wp-block-paragraph">Random Search and Grid Search are the most popular techniques for hyperparametric tuning, and both methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible variants. The technique works well for testing a small number of configurations already known to work well. </p>



<p class="wp-block-paragraph">As long as both search space and training time are small, the grid search technique is excellent for finding the best model. However, the number of model variants increases exponentially with the size of the search space. It is often more efficient for large search spaces or complex models to use random search.</p>



<p class="wp-block-paragraph">Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. However, it is also much faster than grid search and efficient in delivering a suitable model in a short time.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-full is-resized"><img decoding="async" data-attachment-id="6924" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/image-16-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" data-orig-size="640,469" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-16" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" alt="random decision forest python,
hyperparameter tuning,
comparison between random search and grid search" class="wp-image-6924" width="409" height="301" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png 640w, https://www.relataly.com/wp-content/uploads/2022/04/image-16.png 300w" sizes="(max-width: 409px) 100vw, 409px" /><figcaption class="wp-element-caption">Random Search vs. Exhaustive Grid Search</figcaption></figure>
</div></div>
</div>



<div style="height:48px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-tuning-the-hyperparameters-of-a-random-decision-forest-regressor-in-python-using-random-search">Tuning the Hyperparameters of a Random Decision Forest Regressor in Python using Random Search</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph"><br>In this tutorial, we delve into the use of the Random Search algorithm in Python, specifically for predicting house prices. We&#8217;ll be using a dataset rich in diverse house characteristics. Various elements, such as data quality and quantity, model intricacy, the selection of machine learning algorithms, and housing market stability, significantly influence the accuracy of house price predictions.</p>



<p class="wp-block-paragraph">Our initial model employs a Random Decision Forest algorithm, which we&#8217;ll optimize using a random search approach for hyperparameters tuning. By identifying and implementing a more advantageous configuration, we aim to enhance our model&#8217;s performance significantly.</p>



<p class="wp-block-paragraph">Here&#8217;s a concise outline of the steps we&#8217;ll undertake:</p>



<ol class="wp-block-list">
<li>Loading the house price dataset</li>



<li>Exploring the dataset intricacies</li>



<li>Preparing the data for modeling</li>



<li>Training a baseline Random Decision Forest model</li>



<li>Implementing a random search approach for model optimization</li>



<li>Measuring and evaluating the performance of our optimized model</li>
</ol>



<p class="wp-block-paragraph">Through this step-by-step guide, you&#8217;ll learn to enhance model performance, further refining your understanding of Random Search algorithm implementation in Python.</p>



<p class="wp-block-paragraph">The Python code is available in the relataly GitHub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_f9d778-26"><a class="kb-button kt-button button kb-btn_b5d394-00 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/016%20Hyperparameter%20Tuning%20of%20Random%20Decision%20Forests%20using%20Grid%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_ce738a-fc kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="504" data-attachment-id="12422" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/isometric-house-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" data-orig-size="505,504" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-house-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" alt="Now that we have trained a house price prediction model, we can use it to asses the price of new houses. Image generated with Midjourney. Python machine learning tutorial. relataly.com" class="wp-image-12422" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Once we have trained a house price prediction model, we can use it to asses the price of new houses. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before starting the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the Python Machine Learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> to implement the random forest and the grid search technique. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-house-price-prediction-about-the-use-case-and-the-data">House Price Prediction: About the Use Case and the Data</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">House price prediction is the process of using statistical and machine learning techniques to predict the future value of a house. This can be useful for a variety of applications, such as helping homeowners and real estate professionals to make informed decisions about buying and selling properties. In order to make accurate predictions, it is important to have access to high-quality data about the housing market.</p>



<p class="wp-block-paragraph">In this tutorial, we will work with a house price dataset from the <a href="//www.kaggle.com/c/house-prices-advanced-regression-techniques" target="_blank" rel="noreferrer noopener">house price regression challenge on Kaggle.com</a>. The dataset is available via a git hub repository. It contains information about 4800 houses sold between 2016 and 2020 in the US. The data includes the sale price and a list of 48 house characteristics, such as:</p>



<ul class="wp-block-list">
<li>Year &#8211; The year of construction, </li>



<li>SaleYear &#8211; The year in which the house was sold </li>



<li>Lot Area &#8211; The lot area of the house </li>



<li>Quality &#8211; The overall quality of the house from one (lowest) to ten (highest)</li>



<li>Road &#8211; The type of road, e.g., paved, etc. </li>



<li>Utility &#8211; The type of the utility </li>



<li>Park Lot Area &#8211; The parking space included with the property </li>



<li>Room number &#8211; The number of rooms </li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="505" data-attachment-id="12421" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/isometric-house-collection-python-machine-learning-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" data-orig-size="505,505" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-house-collection-python-machine-learning-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" alt="Predicting house prices with machine learning. Image generated with Midjourney. Isometric view of houses. relataly.com" class="wp-image-12421" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Predicting house prices with machine learning. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn import svm

# Source: 
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Load train and test datasets
path = &quot;https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv&quot;
df = pd.read_csv(path)
print(df.columns)
df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60			RL			65.0		8450	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2008	WD			Normal			208500
1	2	20			RL			80.0		9600	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		5		2007	WD			Normal			181500
2	3	60			RL			68.0		11250	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		9		2008	WD			Normal			223500
3	4	70			RL			60.0		9550	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2006	WD			Abnorml			140000
4	5	60			RL			84.0		14260	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		12		2008	WD			Normal			250000
5 rows × 81 columns</pre></div>



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p class="wp-block-paragraph">Before jumping into preprocessing and model training, let&#8217;s quickly explore the data. A distribution plot can help us understand our dataset&#8217;s frequency of regression values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create histograms for feature columns separated by prediction label value
ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2)
plt.title('Sale Price Distribution')</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="944" height="440" data-attachment-id="8424" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/histplot/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" data-orig-size="944,440" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="histplot" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" src="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" alt="random search hyperparameter tuning python. random forest regression,
sale price distribution" class="wp-image-8424" srcset="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 944w, https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 768w" sizes="(max-width: 944px) 100vw, 944px" /></figure>



<p class="wp-block-paragraph">For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between the sale price, lot area, and the house&#8217;s overall quality.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create histograms for feature columns separated by prediction label value
plt.figure(figsize=(16,6))
df_features = df[['SalePrice', 'LotArea', 'OverallQual']]
sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual')
plt.title('Sale Price Distribution')</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="966" height="387" data-attachment-id="8430" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/overall-quality-vs-lotarea-depending-on-sale-price/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" data-orig-size="966,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Overall-Quality-vs-LotArea-depending-on-Sale-Price" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" src="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" alt="random search hyperparameter tuning python. random forest regression,
scatter plot, feature selection" class="wp-image-8430" srcset="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 966w, https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 768w" sizes="(max-width: 966px) 100vw, 966px" /></figure>



<p class="wp-block-paragraph">As expected, the scatterplot shows that the sale price increases with the overall quality. On the other hand, the LotArea has only a minor effect on the sale price. </p>



<h3 class="wp-block-heading">Step #3 Data Preprocessing</h3>



<p class="wp-block-paragraph">Next, we prepare the data for use as input to train a regression model. Because we want to keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables with integer dummy values.</p>



<p class="wp-block-paragraph">To ensure that our regression model does not know the target variable, we separate house price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def preprocessFeatures(df):   
    # Define a list of relevant features
    feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond']
    df_dummy = pd.get_dummies(df[feature_list])
    # Cleanse records with na values
    #df_prep = df_prep.dropna()
    return df_dummy

df_base = preprocessFeatures(df)

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split( df_base.copy(), df_base['SalePrice'].copy(), train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		OverallQual	GarageArea	LotArea	OverallCond	Utilities_AllPub	Utilities_NoSeWa
682		6			431			2887	5			1					0
960		5			0			7207	7			1					0
1384	6			280			9060	5			1					0
1100	2			246			8400	5			1					0
416		6			440			7844	7			1					0</pre></div>



<h3 class="wp-block-heading" id="h-step-4-train-different-regression-models-using-random-search">Step #4 Train Different Regression Models using Random Search</h3>



<p class="wp-block-paragraph">Now that the dataset is ready, we can train the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then selects n random permutations from the grid and uses them to train the model. </p>



<p class="wp-block-paragraph">We use the RandomSearchCV algorithm from the scikit-learn package. The &#8220;CV&#8221; in the function name stands for cross-validation. Cross-validation involves splitting the data into subsets (folds) and rotating them between training and validation runs. This way, each model is trained and tested multiple times on different data partitions. When the search algorithm finally evaluates the model configuration, it summarizes these results into a test score.</p>



<p class="wp-block-paragraph">We use a Random Decision Forest &#8211; a robust machine learning algorithm that can handle classification and regression tasks. As a so-called ensemble model, the Random Forest considers predictions from a set of multiple independent estimators. The estimator is an important parameter to pass to the RandomSearchCV function. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:</p>



<ul class="wp-block-list">
<li>max_leaf_nodes = [2, 3, 4, 5, 6, 7]</li>



<li>min_samples_split = [5, 10, 20, 50]</li>



<li>max_depth = [5,10,15,20]</li>



<li>max_features = [3,4,5]</li>



<li>n_estimators = [50, 100, 200]</li>
</ul>



<p class="wp-block-paragraph">These parameter ranges define the search space from which the randomized search algorithm (RandomSearchCV) will select random configurations. Other parameters will use default values as defined by <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier" target="_blank" rel="noreferrer noopener">scikit-learn</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define the Estimator and the Parameter Ranges
dt = RandomForestRegressor()
number_of_iterations = 20
max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

# Define the param distribution dictionary
param_distributions = dict(max_leaf_nodes=max_leaf_nodes, 
                           min_samples_split=min_samples_split, 
                           max_depth=max_depth,
                           max_features=max_features,
                           n_estimators=n_estimators)

# Build the gridsearch
grid = RandomizedSearchCV(estimator=dt, 
                          param_distributions=param_distributions, 
                          n_iter=number_of_iterations, 
                          cv = 5)

grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print(&quot;Best params: {0}, using {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Best params: [0.68738293 0.49581669 0.52138751 0.61235299 0.65360944 0.61165147
 0.70392285 0.52278886 0.67687248 0.68219638 0.70031536 0.65842909
 0.51939338 0.70801017 0.70911805 0.69543885 0.67983801 0.60744371
 0.68270285 0.70741042], using {'n_estimators': 100, 'min_samples_split': 5, 'max_leaf_nodes': 7, 'max_features': 3, 'max_depth': 15}
	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_n_estimators	param_min_samples_split	param_max_leaf_nodes	param_max_features	param_max_depth	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.049196		0.002071		0.004074		0.000820		50					20						5	4	15	{'n_estimators': 50, 'min_samples_split': 20, ...	0.662973	0.705533	0.669520	0.702608	0.696280	0.687383	0.017637	7
1	0.041115		0.000554		0.003046		0.000094		50					50						2	3	10	{'n_estimators': 50, 'min_samples_split': 50, ...	0.490984	0.527231	0.426270	0.523086	0.511513	0.495817	0.036978	20
2	0.043325		0.000779		0.003486		0.000447		50					50						2	5	20	{'n_estimators': 50, 'min_samples_split': 50, ...	0.484524	0.559358	0.485459	0.517253	0.560343	0.521388	0.033545	18
3	0.162083		0.005665		0.012420		0.004788		200					5						3	3	20	{'n_estimators': 200, 'min_samples_split': 5, ...	0.586586	0.638341	0.573437	0.626793	0.636608	0.612353	0.027021	14
4	0.166659		0.003026		0.010958		0.000084		200					10						4	3	15	{'n_estimators': 200, 'min_samples_split': 10,...	0.633305	0.679161	0.623236	0.661864	0.670481	0.653609	0.021636	13</pre></div>



<p class="wp-block-paragraph">These are the five best models and their respective hyperparameter configurations.</p>



<h3 class="wp-block-heading" id="h-step-5-select-the-best-model-and-measure-performance"><strong>Step #5 Select the best Model and Measure Performance</strong></h3>



<p class="wp-block-paragraph">Finally, we will choose the best model from the list using the &#8220;best_model&#8221; function. We then calculate the MAE and the MAPE to understand how the model performs on the overall test dataset. We then print a comparison between actual sale prices and predicted sale prices.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Select the best Model and Measure Performance
best_model = grid_results.best_estimator_
y_pred = best_model.predict(x_test)
y_df = pd.DataFrame(y_test)
y_df['PredictedPrice']=y_pred
y_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	SalePrice	PredictedPrice
529	200624		166037.831002
491	133000		135860.757958
459	110000		123030.336177
279	192000		206488.444327
655	88000		130453.604206</pre></div>



<p class="wp-block-paragraph">Next, let&#8217;s take a look at the classification errors.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_pred, y_test)
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = mean_absolute_percentage_error(y_pred, y_test)
print('Median Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean Absolute Error (MAE): 29591.56 
Median Absolute Percentage Error (MAPE): 15.57 %</pre></div>



<p class="wp-block-paragraph">On average, the model deviates from the actual value by 16 %. Considering we only used a fraction of the available features and defined a small search space, there is much room for improvement.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This article has shown how we can use grid Search in Python to efficiently search for the optimal hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how to use random search to try out all permutations of a predefined parameter grid. The second part was a Python hands-on tutorial, in which you learned to use random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price for houses depending on several characteristics. Then we defined parameter ranges and tested random permutations. In this way, we quickly identified a configuration that outperforms our initial baseline model. </p>



<p class="wp-block-paragraph">Remember that a random search efficiently identifies a good-performing model but does not necessarily return the best-performing one. Tech random search techniques can be used to tune the hyperparameters of both regression and classification models.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph"></p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">I hope this article was helpful. If you have any questions or suggestions, please write them in the comments. </p>



<div style="display: inline-block;">
  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1999579577&amp;asins=1999579577&amp;linkId=91d862698bf9010ff4c09539e4c49bf4&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1839217715&amp;asins=1839217715&amp;linkId=356ba074068849ff54393f527190825d&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6875</post-id>	</item>
		<item>
		<title>How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?</title>
		<link>https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/</link>
					<comments>https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Fri, 31 Dec 2021 17:37:00 +0000</pubDate>
				<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Healthcare]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Measuring Model Performance]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Confusion Matrix]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=846</guid>

					<description><![CDATA[<p>Have you ever received a spam email and wondered how your email provider was able to identify it as spam? Well, the answer is likely machine learning! One common type of machine learning problem is called classification. The goal is to predict the correct class labels for a given set of observations. For example, we ... <a title="How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?" class="read-more" href="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/" aria-label="Read more about How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?">Read more</a></p>
<p>The post <a href="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/">How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Have you ever received a spam email and wondered how your email provider was able to identify it as spam? Well, the answer is likely machine learning! One common type of machine learning problem is called classification. The goal is to predict the correct class labels for a given set of observations. For example, we could train a classifier to identify whether an email is spam or not or to classify images of animals into different species. But before we can use a classifier in a real-world setting, we need to evaluate its performance to understand how well it can correctly classify observations. There are several tools and techniques we can use to do this, including the confusion matrix, error metrics, and the ROC curve. In this article, we&#8217;ll dive into these evaluation methods and see how they can help us understand the capabilities of our classifier.</p>



<p class="wp-block-paragraph">This tutorial is divided into two parts: a conceptual introduction to evaluating classification performance and a hands-on example using Python and Scikit-Learn. In the first part, we will discuss some of the common error metrics that are used to evaluate the performance of a classifier. This includes the confusion matrix, error metrics, and the ROC curve. The second part of the tutorial is hands-on. We use Python and Scikit-Learn to build a breast cancer detection model classifying tissue samples as benign or malignant. We then apply various techniques to evaluate the model&#8217;s performance.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="500" height="496" data-attachment-id="12651" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/target-machine-learning-error-prediction-midjourney-relataly/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png" data-orig-size="500,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="target-machine-learning-error-prediction-midjourney-relataly" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png" src="https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png" alt="" class="wp-image-12651" srcset="https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png 500w, https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png 140w" sizes="(max-width: 500px) 100vw, 500px" /><figcaption class="wp-element-caption">Models can be wrong, but we should know how often they are. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Why even bother Measuring Classification Performance?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Measuring classification performance in machine learning is important because it allows us to evaluate how well a model is able to predict the class of a given input accurately. This is important because the ultimate goal of many machine learning models is to make accurate predictions in real-world applications.</p>



<p class="wp-block-paragraph">There are several reasons why it is important to measure classification performance. First, by measuring performance, we can determine whether a model is able to make accurate predictions. If a model cannot make accurate predictions, it may not be useful for the task it was designed for. Second, by measuring performance, we can compare the performance of different models and choose the best one for a given task. This can be especially important when working with large, complex datasets where multiple models may be applicable.</p>



<p class="wp-block-paragraph">In order to measure classification performance, we need to use a performance metric appropriate for the task at hand. Next&#8217;s let&#8217;s understand what this means.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img decoding="async" data-attachment-id="7738" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/breast-cancer-classifier-confusion-matrix/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" data-orig-size="419,385" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="breast-cancer-classifier-confusion-matrix" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" src="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" alt="Confusion matrix for a two-class classifier, measuring model performance, classification error metrics, Scikit-learn, python, breast cancer dataset" class="wp-image-7738" width="334" height="307" srcset="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png 419w, https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png 300w" sizes="(max-width: 334px) 100vw, 334px" /><figcaption class="wp-element-caption">Example confusion matrix of a two-class classifier</figcaption></figure>
</div></div>
</div>



<p class="wp-block-paragraph"></p>



<h2 class="wp-block-heading">Techniques for Measuring Classification Performance</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This first part of the tutorial presents essential techniques for measuring the performance of classification models, including confusion matrix, error metrics, and roc curves. But why are there so many different techniques? Isn&#8217;t it enough to calculate the rate between correct and false classifications? </p>



<p class="wp-block-paragraph">The answer depends on the balance of the class labels and their importance. Let&#8217;s compare a simple two-class case vs. a more complex one. In the most simple case, the following applies:</p>



<ul class="wp-block-list">
<li>The class labels in the sample are perfectly balanced (for example, 50 positives and 50 negatives).</li>



<li>Both class labels are equally important, so it does not matter if the model is better at predicting class one or two.</li>
</ul>



<p class="wp-block-paragraph">In this case, we can measure the model performance as the rate between correctly predicted labels and those that a model falsely predicted. It is as simple as that. However, most classification problems are more complex:</p>



<ul class="wp-block-list">
<li>The class labels are imbalanced, so the model encounters one class more often than the other.</li>



<li>One class is more important than the other. For example, consider a binary classification problem that aims to identify the few positive cases from a sample with many negative ones. Especially in disease detection, it is crucial that the model correctly identifies the few positive cases, even if some of the observations classified as positive are negative.</li>
</ul>



<p class="wp-block-paragraph">Confusion matrix and error techniques help us objectively evaluate such models built for more complex problems.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:14px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading">The Confusion Matrix</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">A confusion matrix is an essential tool for evaluating a classification model. The confusion matrix is a table with four combinations of predicted and actual values for a problem where the output may include two classes (negative and positive). As a result, each prediction falls into one of the following four squares:</p>



<ul class="wp-block-list">
<li><strong>True Positives (TP)</strong>: the outcome from a prediction is&nbsp;<em>&#8220;positive,&#8221; </em>and the actual value is also&nbsp;&#8220;positive.&#8221;</li>



<li><strong>False Positives (FP):</strong> The model predicted a positive value, but this prediction is false.</li>



<li><strong>True Negatives (TN):</strong> Predicted was a negative value, which is correct.</li>



<li><strong>False Negatives (FN):</strong> The model predicted a negative value while the actual class was positive.</li>
</ul>



<p class="wp-block-paragraph">We can assign each classification to a cell in the matrix. The diagonal contains the correctly classified cases whose actual class matches the predicted class. All other cells outside the diagonal represent possible errors. Using the confusion matrix, you can see at a glance how well the model works and what errors it makes. </p>



<p class="wp-block-paragraph">The confusion matrix is the basis for calculating various error metrics, which we will look at in more detail in the following section.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="642" height="570" data-attachment-id="7705" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/image-24/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-24.png" data-orig-size="642,570" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-24" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-24.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-24.png" alt="Confusion matrix" class="wp-image-7705" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-24.png 642w, https://www.relataly.com/wp-content/uploads/2022/04/image-24.png 300w" sizes="(max-width: 642px) 100vw, 642px" /><figcaption class="wp-element-caption">Confusion matrix</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">Metrics for Measuring Classification Errors</h3>



<p class="wp-block-paragraph">To objectively measure the performance of a classifier, we can count up the cases in the different squares and use this information to calculate essential error metrics, including accuracy, precision, recall, f-1 score, and specificity.</p>



<p class="wp-block-paragraph"></p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px">Precision</h4>



<p class="wp-block-paragraph">Precision is a metric for the rate of missed positive values. Mathematically, it is the sum of true positives divided by the sum of False Positives and True Positives. </p>



<p class="wp-block-paragraph">In other words, it measures the ability of a classification model to identify the relevant data points without misclassifying too many irrelevant cases.&nbsp;</p>



<div class="wp-block-mathml-mathmlblock">\[Precision = {TP  \over FP + TP}\]<script id="wp-hooks-js" src="https://www.relataly.com/wp-includes/js/dist/hooks.min.js?ver=7496969728ca0f95732d"></script>
<script id="wp-i18n-js" src="https://www.relataly.com/wp-includes/js/dist/i18n.min.js?ver=781d11515ad3d91786ec"></script>
<script id="wp-i18n-js-after">
wp.i18n.setLocaleData( { 'text direction\u0004ltr': [ 'ltr' ] } );
//# sourceURL=wp-i18n-js-after
</script>
<script  async id="mathjax-js" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"></script>
</div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px">Accuracy</h4>



<p class="wp-block-paragraph">Accuracy tells us the rate of the positive values that were classified correctly. It is calculated as the sum of all correct classifications divided by the number of false positives. </p>



<p class="wp-block-paragraph">The usefulness of Accuracy ends when the class labels are imbalanced so that one class is underrepresented. The Accuracy can be misleading as it can become nearly 100% even if the classification model has not identified any of the data points in the underrepresented class. If your data is imbalanced, you should combine accuracy with the Recall.</p>



<div class="wp-block-mathml-mathmlblock">\[Accuracy= {TP + TN \over TP + FN + FP + TN}\]</div>



<div class="wp-block-mathml-mathmlblock">\[= {Correct Classifications \over Total  Sample Size}\]</div>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px">F1-Score</h4>



<p class="wp-block-paragraph">The&nbsp;F1-Score&nbsp;combines Precision and Recall into a single metric. It is calculated as the harmonic mean of Precision and Recall. </p>



<p class="wp-block-paragraph">The F1-Score is a single overall metric based on precision and recall. We can use this metric to compare the performance of two classifiers with different recall and precision. </p>



<div class="wp-block-mathml-mathmlblock">\[F1Score = {TP + TN \over FN}\] </div>



<div class="wp-block-mathml-mathmlblock">\[= {2 * Precision * Recall\over Precision + Recall}\]</div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px;font-style:normal;font-weight:500">Recall (Sensitivity)</h4>



<p class="wp-block-paragraph">Recall, sometimes called &#8220;Sensitivity,&#8221; measures the percentage of correctly classified positives among the entire sum of actual positives. We calculate it as the number of True Positives divided by the False Negatives and True Positives.</p>



<p class="wp-block-paragraph">The Recall is particularly helpful if we deal with an imbalanced dataset, for example, when the goal is to identify a few critical cases among a large sample. </p>



<div class="wp-block-mathml-mathmlblock">\[Recall= {TP \over FN + TP}\]</div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px">Specificity </h4>



<p class="wp-block-paragraph">We calculate the number of negative samples. It is also called the True-Negative Rate and plays a vital role in the ROC Curve, which we will look at in more detail in the following section.</p>



<div class="wp-block-mathml-mathmlblock">\[Specificity= {TP \over FN + TP}\]</div>
</div>
</div>



<p class="wp-block-paragraph">None of the five metrics is sufficient to measure the model performance. We, therefore, use different metrics in combination. Note the following rules:</p>



<ul class="wp-block-list">
<li>If the classes in the dataset are balanced, measure performance using Accuracy.</li>



<li>If the dataset is imbalanced or one class is more important than the other, look at Recall and Precision. </li>



<li>For classification problems where you want to compare different models with similar recall and precision, use the F1Score.</li>
</ul>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h3 class="wp-block-heading" id="h-decision-boundary">Decision Boundary</h3>



<p class="wp-block-paragraph">A classifier determines class labels by calculating the probabilities of samples falling into a particular category. Since the probabilities are continuous values between 0.0 and 1.0, we use a decision boundary to convert them to class labels. The default threshold for a binary classifier is 0.5. Samples with probabilities above 0.5 are assigned to the first class, and samples below 0.5 to the second class.</p>



<p class="wp-block-paragraph">In practice, we often encounter classification problems, where the cost of an error varies between class labels. In such cases, we can alter the decision boundary to give one of the classes a higher priority. Consider the case of credit card fraud detection. In this case, it is critical for service providers to reliably detect the few fraud cases among the many legitimate credit card transactions. We can alter the decision threshold to increase the probability that the model detects fraud (high True Positive rate). The cost of detecting more fraud is a higher number of transactions that the model misclassifies as fraud. However, in this particular example, this is acceptable because the service provider can quickly resolve misunderstandings with the customer.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="7860" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/image-29/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-29.png" data-orig-size="2350,1319" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-29" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-29.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-29-1024x575.png" alt="Comparison of different decision boundaries (0.5 vs 0.25 vs 0.9) and illustration of the effects on the classification error and confusion matrix, python tutorial" class="wp-image-7860" width="866" height="494"/><figcaption class="wp-element-caption">Comparison of different decision boundaries (0.5 vs. 0.25 vs. 0.9) and illustration of the effects on the classification error and confusion matrix</figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading">The ROC Curve</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">The ROC curve is another helpful tool to measure classification performance and is particularly useful for comparing different classification models&#8217; performance. ROC stands for &#8220;Receiver Operating Characteristic.&#8221; The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve emerges when we plot the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. </p>



<p class="wp-block-paragraph">The more the ROC curve tends to the upper left corner, the better the performance of the classification model. A perfect classifier would show a point in the upper left corner or coordinate (0,1), which is the ideal point for a diagnostic test. This is because a point at (0,1) indicates that the classifier has a 100% true positive rate and a 0% false positive rate. A curve near the diagonal indicates that the True Positive Rate and False Positive Rate are equal, which corresponds to the expected prediction result of a random classifier with no predictive power. If the ROC curve remains significantly below the diagonal, this indicates a classifier with inverse prediction power.</p>



<p class="wp-block-paragraph">The ROC for classification models is not necessarily a curve and often runs as a jumpy line with several plateaus.  Plateaus range where changes to the threshold do not change the classification results. Curves with plateaus can signify tiny sample sizes, but they may also have other reasons.</p>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="7847" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/image-27-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-27.png" data-orig-size="915,1047" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-27" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-27.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-27-895x1024.png" alt="classification performance tutorial python machine learning roc curve based on confusion matrix" class="wp-image-7847" width="345" height="395" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-27.png 895w, https://www.relataly.com/wp-content/uploads/2022/04/image-27.png 262w, https://www.relataly.com/wp-content/uploads/2022/04/image-27.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-27.png 915w" sizes="(max-width: 345px) 100vw, 345px" /><figcaption class="wp-element-caption">Example of an ROC curve </figcaption></figure>
</div></div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="7846" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/image-26-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-26.png" data-orig-size="1033,1022" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-26" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-26.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-26-1024x1013.png" alt="Interpretation of the ROC Curve, classification performance tutorial python machine learning roc curve based on confusion matrix" class="wp-image-7846" width="391" height="387" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-26.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-26.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-26.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-26.png 1033w" sizes="(max-width: 391px) 100vw, 391px" /><figcaption class="wp-element-caption">Interpretation of the ROC curve</figcaption></figure>
</div></div>
</div>



<h2 class="wp-block-heading">Measuring Classification Performance in Python (Two-Class)</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we will show how to implement various techniques for evaluating classification models using a breast cancer dataset and a simple logistic regression model in Python with Scikit-Learn. Abnormal changes in the breast may be a sign of cancer and need to be investigated. However, changes are not necessarily malignant and, in many cases, are benign. We will work with a breast cancer dataset and train a machine learning classifier to make this distinction (benign/malignant). We will use the model to predict the type of breast cancer based on various characteristics and explore how machine learning can be applied in the life sciences to support medical diagnostics. After training the model, we will use the Confusion Matrix, Error Metrics, and the ROC Curve to measure its performance.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_e50ee7-21"><a class="kb-button kt-button button kb-btn_74527c-1b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/020%20Measuring%20Classifier%20Performance%20with%20Confusion%20Matrix%2C%20Error%20Metrics%20and%20ROC%20Curve.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_1ba336-60 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>



<h3 class="wp-block-heading">About the Breast Cancer Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">The breast cancer dataset contains 569 samples, with 30 features derived from digitized images of tissue samples. The features in the dataset describe the characteristics of the cell nuclei present in the image, including color, size, and symmetry. In addition, the dataset includes a binary target variable that indicates whether the sample is benign or malignant. 212 Samples are malignant, and 357 are benign. </p>



<p class="wp-block-paragraph">You can find more information on the dataset on the <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)" target="_blank" rel="noreferrer noopener">UCI.edu webpage</a>. The breast cancer dataset is included in the scikit-learn package, so there is no need to download the data upfront.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image is-resized"><img decoding="async" src="https://www.kurzweilai.net/images/breast-cancer-images-enlarged.png" alt="benign tissue samples vs malignant tissue samples, machine learning classification, measuring model performance, python, Scikit-learn, random decision forest classifier" width="524" height="275"/><figcaption class="wp-element-caption">Exemplary images of benign and malignant samples. Source: <a href="https://www.kurzweilai.net/pigeons-diagnose-breast-cancer-on-x-rays-as-well-as-radiologists" target="_blank" rel="noreferrer noopener">kurzweilai </a></figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li>pandas</li>



<li>NumPy</li>



<li>math</li>



<li>matplotlib</li>



<li>scikit-learn</li>
</ul>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">pip install &lt;package name&gt;
conda install &lt;package name&gt; (if you are using the anaconda packet manager)</pre></div>



<h3 class="wp-block-heading">Step #1 Loading the Data</h3>



<p class="wp-block-paragraph">We begin by loading the cancer dataset from scikit-learn. Then we display a list of the features and plot the balance of our classification target, the two tissue types. &#8220;1&#8221; is type &#8220;benign,&#8221; and 0 corresponds to type &#8220;malignant.&#8221;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, plot_roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn import datasets

df = datasets.load_breast_cancer(as_frame=True)

df_dia = df.data
df_dia['cancer_type'] = df.target

plt.figure(figsize=(16,2))
plt.title(f'labels')
fig = sns.countplot(y=&quot;cancer_type&quot;, data=df_dia)

df_dia.head()</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="934" height="170" data-attachment-id="7759" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/label-balance/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png" data-orig-size="934,170" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="label-balance" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png" src="https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png" alt="" class="wp-image-7759" srcset="https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png 934w, https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png 768w" sizes="(max-width: 934px) 100vw, 934px" /></figure>



<p class="wp-block-paragraph">The barplot shows more benign observations among the sample than malignant ones.</p>



<h3 class="wp-block-heading" id="h-step-2-data-preparation-and-model-training">Step #2<strong> </strong>Data Preparation and Model Training</h3>



<p class="wp-block-paragraph">Next, we will prepare the data and use it for training a random decision forest classifier. It is important to remember that the performance of a classifier is dependent on the specific data it is trained on. Therefore, it is crucial to evaluate the classifier using a separate, unseen test dataset to avoid overfitting and ensure that the classifier generalizes well to new data. The code below therefore splits the data into train and test datasets.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Select a small number of features that we use as input to the classification model
features = ['carwidth', 'carlength']
df_base = df[features + ['Price_label']]

# Separate labels from training data
X = df_base[features] #Training data
y = df_base['Price_label'] #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)</pre></div>



<p class="wp-block-paragraph">Now that we have prepared the data, it is time to train our classifier. We use a random forest algorithm from the Scikit-learn package. If you want to learn more about this topic, check out the <a href="https://www.relataly.com/category/machine-learning-algorithms/random-decision-forests/" target="_blank" rel="noreferrer noopener">relataly tutorials on random forests</a>. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create the Random Forest Classifier
dfrst = RandomForestClassifier(n_estimators=3, max_depth=4, min_samples_split=6, class_weight='balanced')
ranfor = dfrst.fit(X_train, y_train)
y_pred = ranfor.predict(X_test)</pre></div>



<p class="wp-block-paragraph">After running the code, you have a trained classifier.</p>



<h3 class="wp-block-heading">Step #3 Creating a Confusion Matrix</h3>



<p class="wp-block-paragraph">Next, we will create the confusion matrix and several standard error metrics. First, we create the matrix by running the code below. Remember that the matrix will contain only the tabular data without any visualization. To illustrate the results in a heatmap, we first need to plot the matrix. We will use the heatmap function from the seaborn package for this task.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create heatmap from the confusion matrix
def createConfMatrix(class_names, matrix):
    class_names=[0, 1] 
    tick_marks = [0.5, 1.5]
    fig, ax = plt.subplots(figsize=(7, 6))
    sns.heatmap(pd.DataFrame(matrix), annot=True, cmap=&quot;Blues&quot;, fmt='g')
    ax.xaxis.set_label_position(&quot;top&quot;)
    plt.title('Confusion matrix')
    plt.ylabel('Actual label'); plt.xlabel('Predicted label')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
createConfMatrix(matrix=cnf_matrix, class_names=[0, 1])</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7738" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/breast-cancer-classifier-confusion-matrix/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" data-orig-size="419,385" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="breast-cancer-classifier-confusion-matrix" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" src="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" alt="Confusion matrix for a two-class classifier, measuring model performance, classification error metrics, Scikit-learn, python, breast cancer dataset" class="wp-image-7738" width="455" height="418" srcset="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png 419w, https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png 300w" sizes="(max-width: 455px) 100vw, 455px" /><figcaption class="wp-element-caption">The confusion matrix shows the following: In 93 samples, the model correctly predicted a malignant label, and in 181 cases the model predicted that the tissue sample was benign. In 3 cases, the model failed to recognize a malignant sample, and in 8 cases the model raised a false alarm.</figcaption></figure>



<p class="wp-block-paragraph">Next, we calculate the error metrics (accuracy, precision, recall, f1-score). You can do this by using the separate functions from the Scikit-learn package. Alternatively, you can also use the classification report, which contains all these error metrics.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Calculate Standard Error Metrics
print('accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))

# Classification Report (Alternative)
results_log = classification_report(y_test, y_pred, output_dict=True)
results_df_log = pd.DataFrame(results_log).transpose()
print(results_df_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">accuracy: 0.94 
precision: 0.97 
recall: 0.94
f1_score: 0.95</pre></div>



<h3 class="wp-block-heading">Step #4 ROC and AUC</h3>



<p class="wp-block-paragraph">Finally, let&#8217;s calculate the ROC and the Area under the Curve (AUC). </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Compute ROC curve
fig, ax = plt.subplots(figsize=(10, 6))
RocCurveDisplay.from_estimator(ranfor, X_test, y_test, ax=ax)
plt.title('ROC Curve for the Car Price Classifier')
plt.show()</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="609" height="387" data-attachment-id="7751" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/output-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output-7.png" data-orig-size="609,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-7" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output-7.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output-7.png" alt="ROC Curve for the Breast Cancer Classifier" class="wp-image-7751" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output-7.png 609w, https://www.relataly.com/wp-content/uploads/2022/04/output-7.png 300w" sizes="(max-width: 609px) 100vw, 609px" /></figure>



<p class="wp-block-paragraph">The ROC tells us, that the model already performs quite well. However, we want to know it precisely. By running the code below, you can calculate the AUC.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Calculate probability scores 
y_scores = cross_val_predict(ranfor, X_test, y_test, cv=3, method='predict_proba')
# Because of the structure of how the model returns the y_scores, we need to convert them into binary values
y_scores_binary = [1 if x[0] &lt; 0.5 else 0 for x in y_scores]
# Now, we can calculate the area under the ROC curve
auc = roc_auc_score(y_test, y_scores_binary, average=&quot;macro&quot;)
auc # Be aware that due to the random nature of cross validation, the results will change when you run the code</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">0.9035191562634525</pre></div>



<h2 class="wp-block-heading">Summary</h2>



<p class="wp-block-paragraph">This tutorial has shown how to evaluate the performance of a two-label classification model. We started by introducing the concept of the confusion matrix and how it can be used to evaluate the performance of a classifier. We then discussed various error metrics, such as accuracy, precision, and recall, and how we can use them to gain a better understanding of the classifier&#8217;s performance. Next, we discussed the ROC curve and how it can be used to visualize the trade-offs between precision and recall for different thresholds of the classifier. We also discussed how we could use the ROC curve to compare the performance of different classifiers. In the second part, we have applied the different tools and techniques to the practical example of a breast cancer classifier. We used the confusion matrix and error metrics to evaluate the classifier and the ROC curve to compare its performance. </p>



<p class="wp-block-paragraph">Overall, this tutorial has provided an overview of the tools and techniques that are commonly used to evaluate the performance of a classification model. By understanding and applying these tools and techniques, we can gain a better understanding of how well a classifier is performing and make informed decisions about whether it is ready for production.</p>



<p class="wp-block-paragraph">I hope this article helped you understand how to measure the performance of classification models. If you have any questions or feedback, please let me know. And if you are looking for error metrics to measure regression performance, check out <a href="https://www.relataly.com/category/data-science/measuring-model-performance/" target="_blank" rel="noreferrer noopener">this tutorial on regression errors</a>.</p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>



<li><a href="https://www.kurzweilai.net/pigeons-diagnose-breast-cancer-on-x-rays-as-well-as-radiologists" target="_blank" rel="noreferrer noopener">https://www.kurzweilai.net/pigeons-diagnose-breast-cancer-on-x-rays-as-well-as-radiologists</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/">How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">846</post-id>	</item>
		<item>
		<title>Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</title>
		<link>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/</link>
					<comments>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 07 Mar 2021 16:16:19 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Decision Trees]]></category>
		<category><![CDATA[Fighting Crime]]></category>
		<category><![CDATA[Gradient Boosting]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[mplleaflet]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Crime Data]]></category>
		<category><![CDATA[Geographic Maps]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Kaggle]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Smart City]]></category>
		<category><![CDATA[Spatial Data]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2960</guid>

					<description><![CDATA[<p>In this tutorial, we&#8217;ll be using machine learning to predict and map out crime in San Francisco. We&#8217;ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the powerful Scikit-Learn library, we&#8217;ll train a classification model using ... <a title="Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python" class="read-more" href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/" aria-label="Read more about Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/">Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we&#8217;ll be using machine learning to predict and map out crime in San Francisco. We&#8217;ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the powerful Scikit-Learn library, we&#8217;ll train a classification model using the <a href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/" target="_blank" rel="noreferrer noopener">XGboost algorithm</a> to predict 39 types of crimes based on when and where it occurred. We&#8217;ll then use the Plotly library to visualize the results on a map of the city, highlighting areas with higher rates of certain crimes. This type of prediction and mapping is similar to what the San Francisco Police Department uses in their practice of predictive policing, where they allocate resources to at-risk areas in an effort to prevent crime.</p>



<p class="wp-block-paragraph">As we embark on this thrilling journey, we&#8217;ll start by downloading and preprocessing the San Francisco crime data. Next, we&#8217;ll channel the data to train two distinct classification models. The first model will utilize a standard Random Forest Classifier, while the second will leverage the exceptional XGBoost package. We&#8217;ll experiment with various models that boast different hyperparameters. Ultimately, we&#8217;ll visualize our predictions on a striking SF crime map and assess the performance of our diverse models. So, buckle up and let&#8217;s dive into the exhilarating world of crime prediction and mapping!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="508" height="513" data-attachment-id="12478" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" data-orig-size="508,513" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" alt="crime prediction san francisco city map xgboost python tutorial. Image generated using Midjourney. relataly.com" class="wp-image-12478" srcset="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 297w, https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Predictive policing can make police work much more efficient and effective. Image generated using <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Predictive Policing?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The use case we are looking at in this article falls into predictive policing. Predictive policing uses data, algorithms, and other technological tools to predict where and when crimes are likely to occur. The goal of predictive policing is to help law enforcement agencies better allocate their resources and focus their efforts on areas where crime is likely to happen, with the ultimate goal of reducing crime and improving public safety. This approach to policing is based on the idea that by using data and other tools to identify patterns and trends, law enforcement agencies can better anticipate where crimes are likely to occur and take steps to prevent them from happening.</p>



<p class="wp-block-paragraph">The benefits of predictive policing include the ability to allocate law enforcement resources better, the potential to reduce crime and improve public safety, and the ability to identify trends and patterns that may not be immediately obvious to law enforcement officers. Additionally, by using data and other tools to anticipate where crimes are likely to occur, law enforcement agencies can take proactive steps to prevent those crimes from happening, which can save time and money.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-creating-a-crime-map-for-predictive-policing-using-xgboost-in-python">Creating a Crime Map for Predictive Policing using XGBoost in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this practical tutorial, we&#8217;ll construct an XGBoost multi-label classifier to predict crime types in San Francisco. Urban crime, such as in San Francisco, is a dynamic and multifaceted issue that can dramatically vary based on location, time, and other factors. Our aim is to develop a predictive algorithm capable of forecasting specific crime types based on a given location and time parameters. The end product is an interactive San Francisco crime map providing a snapshot of crime hotspots throughout the city.</p>



<p class="wp-block-paragraph">Law enforcement agencies, like the San Francisco Police Department, use similar maps for strategic resource allocation to curb crime rates effectively. Additionally, this SF crime map will underscore crime clusters &#8211; areas notorious for particular types of crime incidents. By the end of this tutorial, you&#8217;ll have a deeper understanding of using machine learning in practical scenarios and aiding real-world decision-making.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_262c97-e5"><a class="kb-button kt-button button kb-btn_e6ce86-27 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/018%20Forecasting%20Criminal%20Activity%20in%20San%20Francisco.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_3b4a4f-fe kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="500" height="487" data-attachment-id="12476" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" data-orig-size="500,487" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" src="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" alt="" class="wp-image-12476" srcset="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png 500w, https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png 300w" sizes="(max-width: 500px) 100vw, 500px" /><figcaption class="wp-element-caption">Crime doesn&#8217;t sleep in San Francisco. That&#8217;s why predictive policing can make a real impact. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before starting the Python coding part, ensure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>



<li>Seaborn</li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using XGBoost (&#8216;xgboost&#8217;) and the machine learning library scikit-learn. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by downloading the San Francisco crime challenge data on kaggle.com. Once you have downloaded the dataset, place the CSV files (train.csv) into your Python working folder.</p>



<p class="wp-block-paragraph">The dataset was collected by the SFO police department between 2003 and 2015. According to the data description from the SF crime challenge, the dataset contains the following variables:</p>



<ul class="wp-block-list">
<li><strong>Dates</strong>: timestamp of the crime incident</li>



<li><strong>Category</strong>: Category of the crime incident (only in train.csv) that we will use as the target variable</li>



<li><strong>Descript</strong>: detailed description of the crime incident&nbsp;(only in train.csv)</li>



<li><strong>DayOfWeek:</strong> the day of the week</li>



<li><strong>PdDistrict</strong>: the name of the Police Department District</li>



<li><strong>Resolution</strong>: how the crime incident was resolved&nbsp;(only in train.csv)</li>



<li><strong>Address</strong>: the approximate street address of the crime incident&nbsp;</li>



<li><strong>X</strong>: Longitude</li>



<li><strong>Y</strong>: Latitude</li>
</ul>



<p class="wp-block-paragraph">The next step is to load the data into a dataframe. Then we use the head() command to print the first five lines and ensure you can see the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import plotly.express as px

# The Data is part of the Kaggle Competition: https://www.kaggle.com/c/sf-crime/data
df_base = pd.read_csv(&quot;data/crime/sf-crime/train.csv&quot;)

print(df_base.describe())
df_base.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		X              Y
count  	878049.000000  878049.000000
mean     -122.422616      37.771020
std         0.030354       0.456893
min      -122.513642      37.707879
25%      -122.432952      37.752427
50%      -122.416420      37.775421
75%      -122.406959      37.784369
max      -120.500000      90.000000</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Dates				Category	Descript			DayOfWeek	PdDistrict	Resolution	Address				X			Y
0	2015-05-13 23:53:00	WARRANTS	WARRANT ARREST		Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
1	2015-05-13 23:53:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
2	2015-05-13 23:33:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	VANNESS AV... ST	-122.424363	37.800414
3	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT...	Wednesday	NORTHERN	NONE		1500 Block... ST	-122.426995	37.800873
4	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT ...	Wednesday	PARK		NONE		100 Block... ST		-122.438738	37.771541</pre></div>



<p class="wp-block-paragraph">If the data was loaded correctly, you should see the first five records of the dataframe, as shown above.</p>



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p class="wp-block-paragraph">At the beginning of a new project, we usually don&#8217;t understand the data well and need to acquire that understanding. Therefore, next, we will explore the data and familiarize ourselves with its characteristics. </p>



<p class="wp-block-paragraph">The following examples will help us better understand our data&#8217;s characteristics. For example, you can use whisker charts and a correlation matrix to understand better the correlation between variables, such as between weekdays and prediction categories. Feel free to create more charts.</p>



<h4 class="wp-block-heading" id="h-2-1-prediction-labels">2.1 Prediction Labels</h4>



<p class="wp-block-paragraph">Running the code below shows a bar plot of the prediction labels. The plot shows the frequency in which the class labels occur in the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># print the value counts of the categories
plt.figure(figsize=(15,5))
ax = sns.countplot(x = df_base['Category'], orient='v', order = df_base['Category'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3167" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-13-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" data-orig-size="910,467" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-13" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" alt="crime types in San Francisco" class="wp-image-3167" width="914" height="470" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 910w, https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 300w, https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 768w" sizes="(max-width: 914px) 100vw, 914px" /></figure>



<p class="wp-block-paragraph">As shown above, our class labels are highly imbalanced, affecting model accuracy. When we evaluate the performance of our model, we need to consider this.</p>



<h4 class="wp-block-heading" id="h-2-2-when-a-crime-occured-considering-dates-and-time">2.2 When a Crime Occured &#8211; Considering Dates and Time</h4>



<p class="wp-block-paragraph">We assume that when a crime occurs impacts the type of crime. For this reason, we look at how crimes distribute across different days of the week and times of the day. First, we look at crime numbers per weekday. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print Crime Counts per Weekday
plt.figure(figsize=(6,3))
ax = sns.countplot(y = df_base['DayOfWeek'], orient='h', order = df_base['DayOfWeek'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3164" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-12-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" data-orig-size="437,238" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" alt="" class="wp-image-3164" width="669" height="364" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png 437w, https://www.relataly.com/wp-content/uploads/2021/04/image-12.png 300w" sizes="(max-width: 669px) 100vw, 669px" /></figure>



<p class="wp-block-paragraph">Fewer crimes happen on Sundays, and most are on Fridays. So it seems that even criminals like to have a weekend. For the sake of clarity, we thereby limit the categories. Let&#8217;s take a look at the time when certain crimes are reported.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Convert the time to minutes
df_base['Hour_Min'] = pd.to_datetime(df_base['Dates']).dt.hour  + pd.to_datetime(df_base['Dates']).dt.minute / 60

# Print Crime Counts per Time and Category
df_base_filtered = df_base[df_base['Category'].isin([
    'PROSTITUTION', 
    'VEHICLE THEFT', 
    'DRUG/NARCOTIC', 
    'WARRENTS', 
    'BURGLERY', 
    'FRAUD', 
    'ASSAULT',
    'LARCENY/THEFT',
    'VANDALISM'])]

plt.figure(figsize=(16,10))
ax = sns.displot(x = 'Hour_Min', hue=&quot;Category&quot;, data = df_base_filtered, kind=&quot;kde&quot;, height=8, aspect=1.5)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3174" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-17-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" data-orig-size="983,568" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" alt="different crime types in San Francisco and how often they occur during the day" class="wp-image-3174" width="906" height="522"/></figure>



<p class="wp-block-paragraph">In addition, the time when a crime happens affects the likelihood of certain types. For example, we can see that FRAUD rarely occurs at night and usually during the day. We can see that criminals often go to work in the afternoon and at midnight. On the other hand, certain crimes, such as VEHICLE THEFT, mainly occur at night and late afternoon but less often in the morning.</p>



<p class="wp-block-paragraph">If you want to gain an overview of additional features, you can use the pair plot function. Because our dataset is large, we reduce the computation time by plotting 1/100 of the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sns.pairplot(data = df_base_filtered[0::100], height=4, aspect=1.5, hue='Category')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8386" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/pairplot-by-category-san-francisco-crime-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png" data-orig-size="1464,844" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="pairplot-by-category-san-francisco-crime-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png" src="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map-1024x590.png" alt="pairplot by category, san francisco crime map" class="wp-image-8386" width="1080" height="622" srcset="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 1464w" sizes="(max-width: 1080px) 100vw, 1080px" /></figure>



<h4 class="wp-block-heading" id="h-2-3-where-a-crime-occured-considering-address">2.3 Where a Crime Occured &#8211; Considering Address</h4>



<p class="wp-block-paragraph">Next, we look at the address information, from which we can often extract additional information. We do this by printing some sample address values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Extracting information from the streetnames
for i in df_base['Address'][0:10]:
    print(i)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">OAK ST / LAGUNA ST
OAK ST / LAGUNA ST
VANNESS AV / GREENWICH ST
1500 Block of LOMBARD ST
100 Block of BRODERICK ST
0 Block of TEDDY AV
AVALON AV / PERU AV
KIRKWOOD AV / DONAHUE ST
600 Block of 47TH AV
JEFFERSON ST / LEAVENWORTH ST</pre></div>



<p class="wp-block-paragraph">The street names alone are not so helpful. However, the address data does provide additional information. For example, it tells us whether the location is a street intersection or not. In addition, it contains the type of street. This information is valuable because now we can extract parts of the text and use them as separate features.</p>



<p class="wp-block-paragraph">We could do a lot more, but we&#8217;ve got a good enough idea of the data.</p>



<h3 class="wp-block-heading" id="h-step-3-data-preprocessing">Step #3 Data Preprocessing</h3>



<p class="wp-block-paragraph">Probably the most exciting and important aspect of model development is feature engineering. Compared to model parameterization, the right features can often achieve more significant leaps in performance. </p>



<h4 class="wp-block-heading" id="h-3-1-remarks-on-data-preprocessing-for-xgboost">3.1 Remarks on Data Preprocessing for XGBoost </h4>



<p class="wp-block-paragraph">When preprocessing the data, it is helpful to know which algorithms to use because some algorithms are picky about the shape of the data. We will prepare the data to train a gradient-boosting model (XGBoost). This algorithm uses a random forest ensemble, which can only handle integer and Boolean values, but no categorical data. Therefore we need to encode our values. We also need to map the categorical labels to integer values.</p>



<p class="wp-block-paragraph">We don&#8217;t need to scale the continuous feature variables because gradient boosting and decision trees, generally, are not sensitive to variables that have different scales.</p>



<h4 class="wp-block-heading" id="h-3-2-feature-engineering">3.2 Feature Engineering</h4>



<p class="wp-block-paragraph">Based on the data exploration that we have done in the previous section, we create three feature types:</p>



<ul class="wp-block-list">
<li><strong>Date &amp; Time:</strong> When a crime happens is essential. For example, when there is a lot of traffic on the street, there is a higher likelihood of traffic-related crimes. For example, when it is Saturday, more people will usually come to the nightlife district, which attracts certain crimes, e.g., drug-related. Therefore, we will create different features for the time, the day, the month, and the year. </li>



<li><strong>Address</strong>: As mentioned, we will extract additional features from the address column. First, we create different features for the street type (for example, ST, AV, WY, TR, DR). In addition, we check whether the address contains the word &#8220;Block.&#8221; In addition, we will let our model know whether the address is a street crossing.</li>



<li><strong>Latitude &amp; <strong>Longitude</strong></strong>: We will transform the latitude and longitude values into polar coordinates. We will also remove some outliers from the dataset whose latitude is far off the grid. Above all, this will make it easier for our model to make sense of the location.</li>
</ul>



<p class="wp-block-paragraph">Considering these features, the primary input to our crime-type prediction model is the information on when and where a crime occurs. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Processing Function for Features
def cart2polar(x, y):
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

def preprocessFeatures(dfx):
    
    # Time Feature Engineering
    df = pd.get_dummies(dfx[['DayOfWeek' , 'PdDistrict']])
    df['Hour_Min'] = pd.to_datetime(dfx['Dates']).dt.hour + pd.to_datetime(dfx['Dates']).dt.minute / 60
    # We add a feature that contains the expontential time
    df['Hour_Min_Exp'] = np.exp(df['Hour_Min'])
    
    df['Day'] = pd.to_datetime(dfx['Dates']).dt.day
    df['Month'] = pd.to_datetime(dfx['Dates']).dt.month
    df['Year'] = pd.to_datetime(dfx['Dates']).dt.year

    month_one_hot_encoded = pd.get_dummies(pd.to_datetime(dfx['Dates']).dt.month, prefix='Month')
    df = pd.concat([df, month_one_hot_encoded], axis=1, join=&quot;inner&quot;)
    
    # Convert Carthesian Coordinates to Polar Coordinates
    df[['X', 'Y']] = dfx[['X', 'Y']] # we maintain the original coordindates as additional features
    df['dist'], df['phi'] = cart2polar(dfx['X'], dfx['Y'])
  
    # Extracting Street Types
    df['Is_ST'] = dfx['Address'].str.contains(&quot; ST&quot;, case=True)
    df['Is_AV'] = dfx['Address'].str.contains(&quot; AV&quot;, case=True)
    df['Is_WY'] = dfx['Address'].str.contains(&quot; WY&quot;, case=True)
    df['Is_TR'] = dfx['Address'].str.contains(&quot; TR&quot;, case=True)
    df['Is_DR'] = dfx['Address'].str.contains(&quot; DR&quot;, case=True)
    df['Is_Block'] = dfx['Address'].str.contains(&quot; Block&quot;, case=True)
    df['Is_crossing'] = dfx['Address'].str.contains(&quot; / &quot;, case=True)
    
    return df

# Processing Function for Labels
def encodeLabels(dfx):
    df = pd.DataFrame (columns = [])
    factor = pd.factorize(dfx['Category'])
    return factor

# Remove Outliers by Longitude
df_cleaned = df_base[df_base['Y']&lt;70]

# Encode Labels as Integer
factor = encodeLabels(df_cleaned)
y_df = factor[0]
labels = list(factor[1])
# for val, i in enumerate(labels):
#     print(val, i)</pre></div>



<p class="wp-block-paragraph">We could also try to further improve our features by using additional data sources, such as weather data. However, there is no guarantee that this will improve the model results, and it did not in the case of criminal records. Therefore, we have omitted this part.</p>



<h3 class="wp-block-heading" id="h-step-4-visualize-crime-types-on-a-map-of-san-francisco">Step #4 Visualize Crime Types on a Map of San Francisco</h3>



<p class="wp-block-paragraph">Next, we create a San Francisco crime map using the cartesian coordinates indicating where a crime has occurred. First, we only plot the data without a geographical map. Later we will use these spatial data to create a dot plot and overlay it with a map of San Francisco. Visualizing the crime types on a map helps us understand how crime types distribute across the city. </p>



<h4 class="wp-block-heading" id="h-4-1-plot-crime-types-using-a-scatter-plot">4.1 Plot Crime Types using a Scatter Plot</h4>



<p class="wp-block-paragraph">Next, we want to gain an overview of possible spatial patterns and hotspots. We expect to see streets and neighborhoods where certain crimes are more common than in the more expensive areas of the city. In addition, we expect to see places in the city where certain crime types occur relatively rarely. To gain an overview of the crime distribution in San Francisco, we use a scatter plot to display the crime coordinates on a blank chart. </p>



<p class="wp-block-paragraph">Running the code below creates the crime map of San Francisco with all crime types. Depending on the speed of your machine, the creation of the map may take several minutes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot Criminal Activities by Lat and Long
df_filtered = df_cleaned.sample(frac=0.05)  
#df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05) # to filter 

groups = df_filtered.groupby('Category')

fig, ax = plt.subplots(sharex=False, figsize=(20, 12))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name, alpha=0.9)
ax.legend()
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8389" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/output-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/output.png" data-orig-size="1170,683" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/output.png" src="https://www.relataly.com/wp-content/uploads/2022/05/output-1024x598.png" alt="Crime Map of San Francisco (sf crime map) - Kaggle Crime Prediction Challenge. Classification with XGBoost" class="wp-image-8389" width="1111" height="649" srcset="https://www.relataly.com/wp-content/uploads/2022/05/output.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 1170w" sizes="(max-width: 1111px) 100vw, 1111px" /></figure>



<p class="wp-block-paragraph">The plot shows that certain streets in San Francisco are more prone to specific crime types than others. It is also clear that there are certain crime hotspots in the city, especially in the center. We can also see that few crimes are reported in public park areas. </p>



<h4 class="wp-block-heading" id="h-4-2-create-a-crime-map-of-san-francisco-using-plotly">4.2 Create a Crime Map of San Francisco using Plotly</h4>



<p class="wp-block-paragraph">Next, we will create a San Francisco crime map using the Plotly Python library. Because the plugin can handle a limited amount of data simultaneously, we will reduce our data to a fraction of 1% and a few selected crime types. </p>



<p class="wp-block-paragraph">Running the code below opens a _map.html file in your browser that displays the SF crime map. The result is a zoomable geographic map of San Francisco that shows how the selected crime types distribute across the city.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 4.2 Create a Crime Map of San Francisco using Plotly
# Limit the data to a fraction and selected categories
df_filtered = df_cleaned.sample(frac=0.01) 
fig = px.scatter_mapbox(df_filtered, lat=&quot;Y&quot;, lon=&quot;X&quot;, hover_name=&quot;Category&quot;, color='Category', hover_data=[&quot;Y&quot;, &quot;X&quot;], zoom=12, height=800)
fig.update_layout(mapbox_style=&quot;open-street-map&quot;)
fig.update_layout(margin={&quot;r&quot;:0,&quot;t&quot;:0,&quot;l&quot;:0,&quot;b&quot;:0})
fig.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8401" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/map-sf-crime/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png" data-orig-size="1600,650" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="map-sf-crime" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png" src="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime-1024x416.png" alt="Crime Map of San Francisco (SF Crime Map) - Kaggle Crime Prediction Challenge. Crime Classification XGBoost" class="wp-image-8401" width="1097" height="445" srcset="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1536w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1600w" sizes="(max-width: 1097px) 100vw, 1097px" /></figure>



<p class="wp-block-paragraph">The SF crime map shows different types of crimes, including prostitution, vehicle theft, and fraud. The interactive map allows you to change zoom levels and filter the type of crime displayed on the map. For example, if you filter DRUG/NARCOTIC-related crimes, you can see that these crimes mainly occur in the city center near the financial district and the nightlife area.</p>



<h3 class="wp-block-heading" id="h-step-5-split-the-data">Step #5 Split the Data</h3>



<p class="wp-block-paragraph">Before training our predictive model, we will split our data into separate datasets for training and testing. For this purpose, we use the train_test_split function of scikit-learn and configure a split ratio of 70%. Then we output the data, which we employ in the next step to train and validate a model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create train_df &amp; test_df
x_df = preprocessFeatures(df_cleaned).copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		DayOfWeek_Friday	DayOfWeek_Monday	DayOfWeek_Saturday	DayOfWeek_Sunday	DayOfWeek_Thursday	DayOfWeek_Tuesday	DayOfWeek_Wednesday	PdDistrict_BAYVIEW	PdDistrict_CENTRAL	PdDistrict_INGLESIDE	...	Y			dist		phi			Is_ST	Is_AV	Is_WY	Is_TR	Is_DR	Is_Block	Is_crossing
276998	0					0					0					0					0					1					0					0					0					0						...	37.785023	128.110900	2.842200	True	False	False	False	False	True		False
81579	0					0					0					0					0					1					0					0					0					0						...	37.748470	128.185052	2.842677	False	True	False	False	False	True		False
206676	0					0					0					1					0					0					0					0					0					0						...	37.762744	128.113657	2.842389	True	False	False	False	False	True		False
732006	0					0					0					0					0					0					1					0					0					0						...	37.784140	128.109653	2.842204	True	False	False	False	False	False		True
796194	1					0					0					0					0					0					0					0					0					0						...	37.791333	128.125982	2.842185	True	False	False	False	False	True		False
5 rows × 45 columns</pre></div>



<h3 class="wp-block-heading" id="h-step-6-train-a-random-forest-classifier">Step #6 Train a Random Forest Classifier</h3>



<p class="wp-block-paragraph">We can train the predictive models now that we have prepared the data. We train a basic model based on the Random Forest algorithm in the first step. The Random Forest is a robust algorithm that can handle regression and classification problems. One of our <a href="https://www.relataly.com/anyone-about-to-leave-predicting-the-customer-churn-of-a-telecommunications-provider/2378/" target="_blank" rel="noreferrer noopener">recent articles provides more information on Random Forests</a> and how you can find the optimal configuration of their hyperparameters. In this tutorial, we use the Random Forest to establish a baseline against which we can measure the performance of our XGboost model.  We, therefore, use the Random Forest with a simple parameter configuration without tuning the hyperparameters.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a single random forest classifier - parameters are a best guess
clf = RandomForestClassifier(max_depth=100, random_state=0, n_estimators = 200)
clf.fit(x_train, y_train.ravel())
y_pred = clf.predict(x_test)

results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output exceeds the size limit. Open the full output data in a text editor
              precision    recall  f1-score   support

           0       0.15      0.10      0.12     12657
           1       0.29      0.35      0.32     37898
           2       0.38      0.63      0.47     52237
           3       0.46      0.40      0.43     16136
           4       0.16      0.08      0.10     13426
           5       0.25      0.21      0.23     27798
           6       0.10      0.04      0.06      6850
           7       0.23      0.22      0.23     23087
           8       0.19      0.12      0.15      2586
           9       0.20      0.13      0.15     10942
          10       0.08      0.03      0.05      9559
          11       0.00      0.00      0.00      1300
          12       0.20      0.10      0.14      3200
          13       0.37      0.43      0.40     16282
          14       0.02      0.02      0.02      1350
          15       0.01      0.00      0.00      2912
          16       0.05      0.03      0.04      2217
          17       0.61      0.52      0.56      7865
          18       0.11      0.06      0.08      4954
          19       0.04      0.03      0.03       723
          20       0.28      0.19      0.23       581
          21       0.05      0.02      0.03       708
          22       0.25      0.13      0.17      1333
...
    accuracy                           0.31    263395
   macro avg       0.15      0.12      0.13    263395
weighted avg       0.28      0.31      0.28    263395</pre></div>



<p class="wp-block-paragraph">The baseline model is a random forest classifier with 31% percent accuracy on the test dataset.</p>



<h3 class="wp-block-heading" id="h-step-7-train-an-xgboost-classifier">Step #7 Train an XGBoost Classifier</h3>



<p class="wp-block-paragraph">Now that we have a baseline model, we can train our gradient boosting classifier using the XGBoost package. We expect this model to perform better than the baseline. </p>



<h4 class="wp-block-heading" id="h-7-1-about-gradient-boosting">7.1 About Gradient Boosting</h4>



<p class="wp-block-paragraph">XGBoost is an implementation of a <a href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/" target="_blank" rel="noreferrer noopener">gradient-boosting algorithm</a> that uses a decision-tree-based ensemble machine learning algorithm. The algorithm searches for an optimal ensemble of trees. In this process, the algorithm iteratively adds trees to the model or removes them to reduce the prediction error of the previous tree constellation. The algorithm repeats these steps until it can make no further improvements. Thus, training does not optimize the model against the predictions but the previous model&#8217;s residuals (prediction errors).</p>



<p class="wp-block-paragraph" id="h-xgboost-an-implementation-of-a-gradient-boosting-algorithm-that-creates-a-random-forest-which-is-an-ensemble-of-decision-trees-the-basic-idea-of-a-decision-forest-is-to-have-an-ensemble-of-decision-trees-that-come-to-a-conclusion-on-the-prediction-label-by-casting-a-vote-gradient-boosting-is-used-to-find-an-optimal-ensemble-of-trees-and-parameters-boosting-is-the-process-of-iteratively-adding-trees-to-correct-the-prediction-error-of-a-previous-ensemble-of-trees-until-no-further-improvements-can-be-achieved-the-models-are-thereby-trained-to-predict-the-residuals-prediction-errors-of-the-previous-model">But XGBoost does more! It is an extreme version of gradient boosting that uses additional optimization techniques to achieve the best result with minimal effort. In contrast to the random decision forest, the XGBoost classification algorithm determines an optimal number of trees in the training process. We do not have to specify this number in advance.</p>



<p class="wp-block-paragraph" id="h-xgboost-an-implementation-of-a-gradient-boosting-algorithm-that-creates-a-random-forest-which-is-an-ensemble-of-decision-trees-the-basic-idea-of-a-decision-forest-is-to-have-an-ensemble-of-decision-trees-that-come-to-a-conclusion-on-the-prediction-label-by-casting-a-vote-gradient-boosting-is-used-to-find-an-optimal-ensemble-of-trees-and-parameters-boosting-is-the-process-of-iteratively-adding-trees-to-correct-the-prediction-error-of-a-previous-ensemble-of-trees-until-no-further-improvements-can-be-achieved-the-models-are-thereby-trained-to-predict-the-residuals-prediction-errors-of-the-previous-model">A disadvantage of XGBoost is that it tends to overfit the data. Therefore, testing against unseen data is essential. This tutorial will test only against a single test sample for simplicity, but using cross-validation would be a better choice.</p>



<h4 class="wp-block-heading" id="h-7-2-train-the-xgboost-classifier">7.2 Train the XGBoost Classifier</h4>



<p class="wp-block-paragraph">Various Gradient Boosting Algorithms are available for Python, including one from scikit-learn. However, scikit-learn does not support multi-threading, which makes the training process slower than necessary. For this reason, we will use the gradient boosting classifier from the XGBoost package.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Configure the XGBoost model
param = {'booster': 'gbtree', 
         'tree_method': 'gpu_hist',
         'predictor': 'gpu_predictor',
         'max_depth': 140, 
         'eta': 0.3, 
         'objective': '{multi:softmax}', 
         'eval_metric': 'mlogloss', 
         'num_round': 30,
         'feature_selector ': 'cyclic'
        }

xgb_clf = XGBClassifier(param)
xgb_clf.fit(x_train, y_train.ravel())
score = xgb_clf.score(x_test, y_test.ravel())
print(score)

# Create predictions on the test dataset
y_pred = xgb_clf.predict(x_test)

# Print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output exceeds the size limit. Open the full output data in a text editor
0.30852142219859907
              precision    recall  f1-score   support

           0       0.17      0.01      0.02     12657
           1       0.30      0.42      0.35     37898
           2       0.33      0.72      0.46     52237
           3       0.31      0.27      0.29     16136
           4       0.21      0.03      0.05     13426
           5       0.24      0.18      0.21     27798
           6       0.17      0.01      0.01      6850
           7       0.21      0.19      0.20     23087
           8       0.26      0.01      0.02      2586
           9       0.22      0.08      0.12     10942
          10       0.13      0.00      0.00      9559
          11       0.07      0.00      0.01      1300
          12       0.20      0.08      0.11      3200
          13       0.34      0.43      0.38     16282
          14       0.00      0.00      0.00      1350
          15       0.12      0.00      0.01      2912
          16       0.15      0.02      0.03      2217
          17       0.57      0.34      0.43      7865
          18       0.19      0.03      0.05      4954
          19       0.00      0.00      0.00       723
          20       0.50      0.24      0.32       581
          21       0.10      0.01      0.01       708
...
    accuracy                           0.31    263395
   macro avg       0.18      0.11      0.11    263395
weighted avg       0.27      0.31      0.25    263395
</pre></div>



<p class="wp-block-paragraph">Now that we have trained our classification model, let&#8217;s see how it performs. For this purpose, we will generate predictions (y_pred) on the test dataset (x_test). Afterward, we use the predictions and the valid values (y_test) to create a classification report.</p>



<p class="wp-block-paragraph">Our model achieves an accuracy score of 31%. At first hand, this might not look so good, but considering that we have 39 categories and only sparse information available, this performance is quite impressive.   </p>



<h3 class="wp-block-heading" id="h-step-8-measure-model-performance">Step #8 Measure Model Performance</h3>



<p class="wp-block-paragraph">So how well does our XGboost model perform? To measure the performance of our model, we create a confusion matrix that visualizes the performance of the XGboost classifier. If you want to learn more about measuring the performance of classification models, check out<a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener"> this tutorial on measuring classification performance</a>.</p>



<p class="wp-block-paragraph">Running the code below creates the confusion matrix that shows the number of correct and false predictions for each crime category.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test.reshape(-1), y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index = np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (16,12))
plt.tight_layout()
sns.set(font_scale=1.4) #for label size
sns.heatmap(df_cm, cbar=True, cmap= &quot;inferno&quot;, annot=False, fmt='.0f' #, annot_kws={&quot;size&quot;: 13}
           )</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3213" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-22-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" data-orig-size="903,709" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-22" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" alt="Evaluating the performance of our XGboost classifier; sfo crime map" class="wp-image-3213" width="669" height="525" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 903w, https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 300w, https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 768w" sizes="(max-width: 669px) 100vw, 669px" /></figure>



<p class="wp-block-paragraph">The confusion matrix shows that our model frequently predicts crime category two and neglects the other crime types. The reason is the uneven distribution of crime types in the training data. As a result, when we evaluate the model, we need to pay attention to the importance of the different crime types. For example, we might train the model to predict certain crime types accurately, although this might come at a lower accuracy when predicting other crime types. However, such optimizations depend on the technical context and the goals one wants to achieve with the prediction model. </p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This tutorial has presented the machine learning use case &#8220;Predictive Policing&#8221; and showed how to implement it in Python. We have trained an XGBoost model that predicts crime types in San Francisco based on the information on when and where specific crimes have occurred. We also illustrated our data on an interactive crime map of San Francisco with the Plotly Python library. The Crime Map is an intuitive way of visualizing crime in a city and highlighting particular hotspots. Finally, we have used the prediction model to make test predictions and evaluate the model performance against other algorithms, such as a classic Random Decision Forest. The XGBoost model achieves a prediction accuracy of about 31%—a respectable performance, considering that the prediction problem involves 39 crime classes.</p>



<p class="wp-block-paragraph">We hope this tutorial was helpful. If you have any questions or suggestions on what we could improve, feel free to post them in the comments. We appreciate your feedback.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8410" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-12-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" data-orig-size="658,592" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" alt="Crime Map of San Francisco - Kaggle Crime Prediction Challenge. Crim Classification XGBoost" class="wp-image-8410" width="346" height="311" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png 658w, https://www.relataly.com/wp-content/uploads/2022/05/image-12.png 300w" sizes="(max-width: 346px) 100vw, 346px" /><figcaption class="wp-element-caption">Predictive policing with machine learning &#8211; Crime map of San Francisco, created with Python and Plotly</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">Looking for more esciting map vizualizations? Consider the relataly tutorial on <a href="https://www.relataly.com/visualize-covid-19-data-on-a-geographic-heat-maps/291/" target="_blank" rel="noreferrer noopener">visualizing COVID-19 data on geographic heatmaps using GeoPandas</a>.</p>



<div style="display: inline-block;">
  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1999579577&amp;asins=1999579577&amp;linkId=91d862698bf9010ff4c09539e4c49bf4&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1839217715&amp;asins=1839217715&amp;linkId=356ba074068849ff54393f527190825d&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/">Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2960</post-id>	</item>
		<item>
		<title>Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python</title>
		<link>https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/</link>
					<comments>https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 02 Aug 2020 13:24:28 +0000</pubDate>
				<category><![CDATA[Churn Prediction]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Sources]]></category>
		<category><![CDATA[Feature Permutation Importance]]></category>
		<category><![CDATA[Hyperparameter Tuning]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[AI in Marketing]]></category>
		<category><![CDATA[Digital Transformation]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Model Interpretation]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Permutation Feature Importance]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Two-Label Classification]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2378</guid>

					<description><![CDATA[<p>Customer retention is a prime objective for service companies, and understanding the patterns that lead to customer churn can be the key to maintaining long-lasting client relationships. Businesses incur significant costs when customers discontinue their services, hence it&#8217;s vital to identify potential churn risks and take preemptive actions to retain these customers. Machine Learning models ... <a title="Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python" class="read-more" href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/" aria-label="Read more about Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/">Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Customer retention is a prime objective for service companies, and understanding the patterns that lead to customer churn can be the key to maintaining long-lasting client relationships. Businesses incur significant costs when customers discontinue their services, hence it&#8217;s vital to identify potential churn risks and take preemptive actions to retain these customers. Machine Learning models can be instrumental in identifying these patterns and providing valuable insights into customer behavior.</p>



<p class="wp-block-paragraph">An intriguing technique, Permutation Feature Importance, allows us to discern the significance of different features of our machine learning model, thereby shedding light on their influence on customer churn. This tutorial guides you through the intricacies of this technique and its implementation.</p>



<p class="wp-block-paragraph">The structure of this tutorial is as follows:</p>



<ul class="wp-block-list">
<li>We begin by discussing the business problem of customer churn and its implications.</li>



<li>We introduce the concept of Permutation Feature Importance, a powerful tool to identify essential features in our machine learning model.</li>



<li>We transition into the hands-on coding segment, where we build a churn prediction model using Python.</li>



<li>Our model undergoes a classification process and hyperparameter tuning to select the most effective parameters.</li>



<li>Utilizing the trained model, we predict the churn probabilities for a test set of customers.</li>



<li>Finally, we create a feature ranking based on their impact on the model&#8217;s performance.</li>
</ul>



<p class="wp-block-paragraph">By employing permutation feature importance, this tutorial offers a deep-dive into the correlation between input variables and model predictions, providing actionable insights for effective customer churn management.</p>



<p class="wp-block-paragraph">Also: </p>



<ul class="wp-block-list">
<li><a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">Using Fairlearn to Build Fair Machine Machine Learning Models with Python: Step-by-Step Towards More Responsible AI</a></li>



<li><a href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/" target="_blank" rel="noreferrer noopener">How to Use Hierarchical Clustering For Customer Segmentation in Python</a></li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="2402" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/image-47/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/08/image.png" data-orig-size="448,173" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/08/image.png" src="https://www.relataly.com/wp-content/uploads/2020/08/image.png" alt="machine learning. It is particularly effective when combined with feature permutation importance" class="wp-image-2402" width="324" height="127"/><figcaption class="wp-element-caption">Customer churn prediction is a compelling use case for machine learning. It is particularly effective when combined with feature permutation importance.</figcaption></figure>
</div></div>
</div>



<div style="height:26px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-what-is-churn-prediction">What is Churn Prediction?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">A company&#8217;s effort to persuade a new customer to sign a contract is many times higher than the costs incurred in retaining existing customers. According to industry experts, winning a new customer is four times more expensive than keeping an existing one. Providers that can identify churn candidates and manage to retain them can significantly reduce costs. </p>



<p class="wp-block-paragraph">A crucial point is whether the provider succeeds in getting the churn candidates to stay. Sometimes it may be enough to contact the churn candidate and inquire about customer satisfaction. In other cases, this may not be enough, and the provider needs to increase the service value, for example, by offering free services or a discount. However, actions should be well thought out, as they can also negatively affect. For instance, if a customer hardly ever uses his contract, a call from the provider may even increase the desire to cancel the contract. Machine learning can help assess cases individually and identify the optimal anti-churn action. </p>



<h2 class="wp-block-heading" id="h-about-permutation-feature-importance">About Permutation Feature Importance</h2>



<p class="wp-block-paragraph">Feature importance is a helpful technique for understanding the contribution of input variables (features) to a predictive model. The results from this technique can be as valuable as the predictions themselves, as they can help us understand the business context better. For example, let&#8217;s say we have trained a model that predicts which of our customers will likely churn. Wouldn&#8217;t it be interesting to know why specific customers are more likely to churn than others? Permutation feature importance can help us answer this question by providing us with a ranking of the input variables in our model by their usefulness. The order can validate assumptions about the business context and uncover causal relations in the data.</p>



<p class="wp-block-paragraph">Compared to neural networks, one of the most significant advantages of traditional prediction models, such as a decision tree, is their interpretability. Neural networks are black boxes because it is tough to understand the relationship between input and model predictions. In traditional models, on the other hand, we can calculate the meaning of the features and use it to interpret the model and optimize its performance, for example, by removing features from the model that are not important. We, therefore, start with a simple model first and move on to more complex models once we understand the data.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h2 class="wp-block-heading" id="h-implementing-a-customer-churn-prediction-model-in-python">Implementing a Customer Churn Prediction Model in Python</h2>



<p class="wp-block-paragraph">In the following, we will implement a customer churn prediction model. We will train a decision forest model on a data set from Kaggle and optimize it using <a aria-label="undefined (opens in a new tab)" href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">grid search</a>. The data contains customer-level information for a telecom provider and a binary prediction label of which customers canceled their contracts and did not. Finally, we will calculate the feature importance to understand how the model works. </p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_bddeda-14"><a class="kb-button kt-button button kb-btn_b5bf96-e2 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/017%20Permutation%20Feature%20Importance%20-%20Customer%20Churn%20Prediction%20using%20Random%20Decision%20Forest.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_8e2f54-ca kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, you can follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Make sure you install all required packages. In this tutorial, we will be working with the following packages:&nbsp;</p>



<ul class="wp-block-list">
<li>Pandas</li>



<li>NumPy</li>



<li>Matplotlib</li>



<li>Seaborn</li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using <strong><em>Keras&nbsp;</em></strong>(2.0 or higher) with <strong><em>Tensorflow</em> </strong>backend and the machine learning library <strong><em>Scikit-learn</em></strong>.</p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-step-1-loading-the-customer-churn-data">Step #1 Loading the Customer Churn Data</h3>



<p class="wp-block-paragraph">We begin by loading a customer churn <a href="https://www.kaggle.com/barun2104/telecom-churn" target="_blank" rel="noreferrer noopener">dataset from Kaggle</a>. If you work with the Kaggle Python environment, you can directly save the dataset into your Kaggle project. After completing the download, put the dataset under the file path of your choice, but don&#8217;t forget to adjust the file path variable in the code. </p>



<p class="wp-block-paragraph">The dataset contains 3333 records and the following attributes.</p>



<ul class="wp-block-list">
<li><strong>Churn</strong>: The prediction label: 1 if the customer canceled service, 0 if not.</li>



<li><strong>AccountWeeks</strong>: number of weeks the customer has had an active account</li>



<li><strong>ContractRenewal</strong>: 1 if customer recently renewed contract, 0 if not</li>



<li><strong>DataPlan</strong>: 1 if the customer has a data plan, 0 if not</li>



<li><strong>DataUsage</strong>: gigabytes of monthly data usage</li>



<li><strong>CustServCalls</strong>: number of calls into customer service</li>



<li><strong>DayMins</strong>: average daytime minutes per month</li>



<li><strong>DayCalls</strong>: average number of daytime calls</li>



<li><strong>MonthlyCharge</strong>: average monthly bill</li>



<li><strong>OverageFee</strong>: The most considerable overage fee in the last 12 months</li>
</ul>



<p class="wp-block-paragraph">The following code will load the data from your local folder into your anaconda Python project:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np 
import pandas as pd 
import math
from pandas.plotting import register_matplotlib_converters
import matplotlib.pyplot as plt 
import matplotlib.colors as mcolors
import matplotlib.dates as mdates 

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.inspection import permutation_importance
import seaborn as sns


# set file path
filepath = &quot;data/Churn-prediction/&quot;

# Load train and test datasets
train_df = pd.read_csv(filepath + 'telecom_churn.csv')
train_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Churn	AccountWeeks	ContractRenewal	DataPlan	DataUsage	CustServCalls	DayMins	DayCalls	MonthlyCharge	OverageFee	RoamMins
0	0		128				1				1			2.7			1				265.1		110		89.0			9.87		10.0
1	0		107				1				1			3.7			1				161.6		123		82.0			9.78		13.7
2	0		137				1				0			0.0			0				243.4		114		52.0			6.06		12.2
3	0		84				0				0			0.0			2				299.4		71		57.0			3.10		6.6
4	0		75				0				0			0.0			3				166.7		113		41.0			7.42		10.1</pre></div>



<h3 class="wp-block-heading" id="h-step-2-exploring-the-data">Step #2 Exploring the Data</h3>



<p class="wp-block-paragraph">Before we begin with the preprocessing, we will quickly explore the data. For this purpose, we will create histograms for the different attributes in our data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># # Create histograms for feature columns separated by prediction label value
df_plot = train_df.copy()

# class_columnname = 'Churn'
sns.pairplot(df_plot, hue=&quot;Churn&quot;, height=2.5, palette='muted')</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="990" data-attachment-id="6808" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/pairplots-churn-prediction/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png" data-orig-size="1828,1768" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="pairplots-churn-prediction" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png" src="https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction-1024x990.png" alt="" class="wp-image-6808" srcset="https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 1828w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Histograms of the churn prediction dataset separated by prediction label (red=churn, blue= no churn)</figcaption></figure>



<p class="wp-block-paragraph">We can see that the data distribution for several attributes looks quite good and resembles a normal distribution, for example, for OverageFeed, DayMins, and DayCalls. However, the distribution for the prediction label is unbalanced. This is because more customers remain with their contract (prediction label class = 0) than those that cancel their contract (prediction label class = 1). </p>



<h3 class="wp-block-heading" id="h-step-3-data-preprocessing">Step #3 Data Preprocessing</h3>



<p class="wp-block-paragraph">The next step is to preprocess the data. I have reduced this part to a minimum to keep this tutorial simple. For example, I do not treat the unbalanced label classes. However, this would be appropriate to improve the model performance in a real business context. The imbalanced data is also why I chose a decision forest as a model type. Decision forests can handle unbalanced data relatively well compared to traditional models such as logistic regression. </p>



<p class="wp-block-paragraph">The following code splits the data into the train (x_train) and test data (x_test) and creates the respective datasets, which only contain the label class (y_train, y_test). The ratio is 0.7, resulting in 2333 records in the training dataset and 1000 in the test dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create Training Dataset
x_df = train_df[train_df.columns[train_df.columns.isin(['AccountWeeks', 'ContractRenewal', 'DataPlan','DataUsage', 'CustServCalls', 'DayCalls', 'MonthlyCharge', 'OverageFee', 'RoamMins'])]].copy()
y_df = train_df['Churn'].copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		AccountWeeks	ContractRenewal	DataPlan	DataUsage	CustServCalls	DayCalls	MonthlyCharge	OverageFee	RoamMins
2918	58				1				0			0.00		4				112			53.0			13.29		0.0
1884	51				0				1			3.32		2				60			74.2			10.03		12.3
2823	87				1				0			0.00		2				80			50.0			9.35		16.6
2319	83				1				1			2.35		3				105			91.5			12.65		8.7
2980	84				1				0			0.00		3				86			62.0			13.78		14.3
...		...				...				...			...			...				...			...				...			...
835	27	1				0				0.00		1			75				31.0		10.43			9.9
3264	89				1				1			1.59		0				98			50.9			10.36		5.9
1653	93				0				0			0.00		1				78			42.0			10.99		11.1
2607	91				1				0			0.00		3				100			53.0			11.97		9.9
2732	130				0				0			0.00		5				106			68.0			18.19		16.9</pre></div>



<h3 class="wp-block-heading" id="h-step-4-fit-an-optimized-decision-forest-model-for-churn-prediction-using-grid-search">Step #4 Fit an Optimized Decision Forest Model for Churn Prediction using Grid Search</h3>



<p class="wp-block-paragraph">Now comes the exciting part. We will train a series of 36 decision forests and then choose the best-performing model. The technique used in this process is called hyperparameter tuning (more specifically, grid search), and I have recently published <a aria-label="undefined (opens in a new tab)" href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">a separate article on this topic</a>.</p>



<p class="wp-block-paragraph">The following code defines the parameters the grid search will test (max_depth, n_estimators, and min_samples_split). Then the code runs the grid search and trains the decision forests. Finally, we print out the model ranking along with model parameters. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define parameters
max_depth=[2, 4, 8, 16]
n_estimators = [64, 128, 256]
min_samples_split = [5, 20, 30]

param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, min_samples_split=min_samples_split)

# Build the gridsearch
dfrst = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, class_weight='balanced')
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv = 5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
results_df = pd.DataFrame(grid_results.cv_results_)
results_df.sort_values(by=['rank_test_score'], ascending=True, inplace=True)

# Reduce the results to selected columns
results_filtered = results_df[results_df.columns[results_df.columns.isin(['param_max_depth', 'param_min_samples_split', 'param_n_estimators','std_fit_time', 'rank_test_score', 'std_test_score', 'mean_test_score'])]].copy()
results_filtered</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">std_fit_time	param_max_depth	param_min_samples_split	param_n_estimators	mean_test_score	std_test_score	rank_test_score
28				0.004742		16						5					128	0.931415	0.006950		1
27				0.002620		16						5					64	0.925848	0.008177		2
29				0.015711		16						5					256	0.925846	0.006156		3
20				0.006258		8						5					256	0.923704	0.007961		4
19				0.001816		8						5					128	0.921988	0.006458		5
18				0.002161		8						5					64	0.919847	0.007716		6
31				0.003728		16						20					128	0.902690	0.011642		7
30				0.002057		16						20					64	0.901836	0.009789		8
32				0.004940		16						20					256	0.899691	0.009813		9
21				0.001994		8						20					64	0.898408	0.008710		10
22				0.003761		8						20					128	0.897121	0.007529		11
23				0.003828		8						20					256	0.895833	0.009159		12
33				0.003798		16						30					64	0.885546	0.010394		13
26				0.005560		8						30					256	0.885541	0.014937		14
...</pre></div>



<p class="wp-block-paragraph">The best-performing model is model number 29, which scores 92,7 %. Its hyperparameters are as follows:</p>



<ul class="wp-block-list">
<li>max_depth = 16</li>



<li>min_samples_split = 5</li>



<li>n_estimators 256</li>
</ul>



<p class="wp-block-paragraph">We will proceed with this model. So what does this model tell us?</p>



<p class="wp-block-paragraph">We can gain an overview of the distributions of our customers according to their churn probability. Just use the following code:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Predicting Probabilities
y_pred_prob = best_clf.predict_proba(x_test) 
churnproba = y_pred_prob[:,1]

# Create histograms for feature columns separated by prediction label value
sns.histplot(data=churnproba)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="517" height="324" data-attachment-id="6810" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/image-12-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-12.png" data-orig-size="517,324" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-12.png" alt="Customer Base According to their Churn Rate" class="wp-image-6810" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-12.png 517w, https://www.relataly.com/wp-content/uploads/2022/04/image-12.png 300w" sizes="(max-width: 517px) 100vw, 517px" /><figcaption class="wp-element-caption">Customer Base According to their Churn Rate</figcaption></figure>



<p class="wp-block-paragraph">Customers who tend to churn have a churn probability greater than 0.5. They are further to the right in the diagram. So, we don&#8217;t have to worry about the customers on the far left (&lt;0.5).</p>



<h3 class="wp-block-heading" id="h-step-5-best-model-performance-insights">Step #5 Best Model Performance Insights</h3>



<p class="wp-block-paragraph">Let&#8217;s take a more detailed look at the performance of the best model. We do this by calculating the confusion matrix. </p>



<p class="wp-block-paragraph">If you want to learn more about measuring the performance of classification models, check out<a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener"> this tutorial</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
class_names=[False, True] 
tick_marks = [0.5, 1.5]
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap=&quot;Blues&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label'); plt.xlabel('Predicted label')
plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="486" height="452" data-attachment-id="2387" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/image-14-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/07/image-14.png" data-orig-size="486,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-14" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/07/image-14.png" src="https://www.relataly.com/wp-content/uploads/2020/07/image-14.png" alt="Confusion matrix on churn probabilities calculated with feature permutation importance" class="wp-image-2387" srcset="https://www.relataly.com/wp-content/uploads/2020/07/image-14.png 486w, https://www.relataly.com/wp-content/uploads/2020/07/image-14.png 300w" sizes="(max-width: 486px) 100vw, 486px" /></figure>



<p class="wp-block-paragraph">From 1000 customers in the test dataset, our model correctly classified 100 customers as churn candidates. For 832 customers, the model accurately predicted that these customers are unlikely to churn. In 30 cases, the model falsely classified customers as churn candidates, and 38 were missed and falsely classified as non-churn candidates. The result is a model accuracy of 93,2 % (based on a 0.5 threshold). </p>



<h3 class="wp-block-heading" id="h-step-6-permutation-feature-importance">Step #6 Permutation Feature Importance</h3>



<p class="wp-block-paragraph">Now that we have trained a model that gives good results, we want to understand the importance of the model&#8217;s features. With the following code, we calculate the Feature Importance score. Then we visualize the results in a barplot.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Load the data
r = permutation_importance(best_clf, x_test, y_test, n_repeats=30, random_state=0)

# Set the color range
clist = [(0, &quot;purple&quot;), (1, &quot;blue&quot;)]
rvb = mcolors.LinearSegmentedColormap.from_list(&quot;&quot;, clist)
colors = rvb(data_im['feature_permuation_score']/len(x_test.columns))

# Plot the barchart
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
data_im['feature_names'] = x_test.columns
data_im = data_im.sort_values('feature_permuation_score', ascending=False)

fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y=data_im['feature_names'], x=&quot;feature_permuation_score&quot;, data=data_im, palette='nipy_spectral')
ax.set_title(&quot;Random Forest Feature Importances&quot;)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="1013" height="334" data-attachment-id="6801" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/output-2-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output-2.png" data-orig-size="1013,334" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output-2.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output-2.png" alt="" class="wp-image-6801" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output-2.png 1013w, https://www.relataly.com/wp-content/uploads/2022/04/output-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/output-2.png 768w" sizes="(max-width: 1013px) 100vw, 1013px" /></figure>



<p class="wp-block-paragraph">The feature ranking can provide the starting point for deeper analysis. As we can see, the most important features are the monthly fee, data usage, and customer service calls (CustServCalls). Of particular interest is the importance of customer service calls, as this could indicate that customers who encounter customer service have negative experiences.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">This article has shown how to implement a churn prediction model using Python and scikit-learn Machine Learning. We have calculated the permutation feature importance to analyze which features contribute to the performance of our model. You have learned that permutation feature importance can provide data scientists with new insights into the context of a prediction model. Therefore, the technique is often a good starting point for forthleading investigations. </p>



<p class="wp-block-paragraph">I am always interested in improving my articles and learning from my audience. If you liked this article, show your appreciation by leaving a comment. And if you didn&#8217;t, let me know too. Cheers </p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>



<li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li>



<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph">And if you are interested in text mining and customer satisfaction, consider taking a look at my recent blog about sentiment analysis:</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-relataly-com wp-block-embed-relataly-com"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="HQ0lUMzbZR"><a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/">Sentiment Analysis with Naive Bayes and Logistic Regression in Python</a></blockquote><iframe loading="lazy" class="wp-embedded-content" sandbox="allow-scripts" security="restricted"  title="&#8220;Sentiment Analysis with Naive Bayes and Logistic Regression in Python&#8221; &#8212; relataly.com" src="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/embed/#?secret=WMWtohaT3c#?secret=HQ0lUMzbZR" data-secret="HQ0lUMzbZR" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div></figure>
<p>The post <a href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/">Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2378</post-id>	</item>
		<item>
		<title>Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python</title>
		<link>https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/</link>
					<comments>https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 06 Jul 2020 21:16:52 +0000</pubDate>
				<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Hyperparameter Tuning]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Logistics]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Risk Management]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Random Forest Classification]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Titanic Dataset]]></category>
		<category><![CDATA[Tuning Random Decision Forests]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2261</guid>

					<description><![CDATA[<p>Are you struggling to find the best hyperparameters for your machine learning model? With Python&#8217;s Scikit-learn library, you can use grid search to fine-tune your model and improve its performance. In this article, we&#8217;ll guide you through the process of hyperparameter tuning for a classification model, using a random decision forest that predicts the survival ... <a title="Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python" class="read-more" href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" aria-label="Read more about Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/">Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Are you struggling to find the best hyperparameters for your machine learning model? With Python&#8217;s Scikit-learn library, you can use grid search to fine-tune your model and improve its performance. In this article, we&#8217;ll guide you through the process of hyperparameter tuning for a classification model, using a random decision forest that predicts the survival of Titanic passengers as an example.</p>



<p class="wp-block-paragraph">We&#8217;ll start by explaining the concept of grid search and how it works. Then, we&#8217;ll dive into the development and optimization of the random decision forest using Python. By defining a parameter grid and feeding it to the grid search algorithm, we can explore all possible hyperparameter combinations and find the optimal configuration for our model.</p>



<p class="wp-block-paragraph">Finally, we&#8217;ll compare the performance of different model configurations to determine the best one for our classification task. Whether you&#8217;re new to machine learning or looking to boost the performance of an existing model, this step-by-step guide to hyperparameter tuning with grid search will help you achieve better results. Let&#8217;s get started!</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/" target="_blank" rel="noreferrer noopener">Multivariate Anomaly Detection on Time-Series Data in Python</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2353" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/image-7-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/07/image-7.png" data-orig-size="837,539" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-7" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/07/image-7.png" src="https://www.relataly.com/wp-content/uploads/2020/07/image-7.png" alt="Grid Search - parameter grid for hyperparameter tuning" class="wp-image-2353" width="386" height="248" srcset="https://www.relataly.com/wp-content/uploads/2020/07/image-7.png 837w, https://www.relataly.com/wp-content/uploads/2020/07/image-7.png 300w, https://www.relataly.com/wp-content/uploads/2020/07/image-7.png 768w, https://www.relataly.com/wp-content/uploads/2020/07/image-7.png 80w" sizes="(max-width: 386px) 100vw, 386px" /><figcaption class="wp-element-caption">Exemplary parameter grid for the tuning of a random decision forest with four hyperparameters</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What are Hyperparameters?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Hyperparameters play a crucial role in the performance of a machine learning model. They are adjustable parameters that influence the model training process and control how a machine learning algorithm learns and how it behaves. </p>



<p class="wp-block-paragraph">Unlike the internal parameters (coefficients, etc.) that the algorithm automatically optimizes during model training, hyperparameters are model characteristics (e.g., the number of estimators for an ensemble model) that we must set in advance. </p>



<p class="wp-block-paragraph">Which hyperparameters are available, depends on the algorithm.  For example, a random decision forest model may have hyperparameters such as the number of trees and tree depth, while a neural network model may have hyperparameters such as the number of hidden layers and nodes in each layer. Finding the optimal configuration of hyperparameters can be a challenging task, as there is often no way to know in advance what the ideal values should be. </p>



<p class="wp-block-paragraph">This requires experimentation with different hyperparameter settings, which can be time-consuming if done manually. Grid search is a useful tool for automating this process and efficiently finding the best hyperparameter configuration for a given model.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9871" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/picture2-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png" data-orig-size="696,472" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Picture2-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png" src="https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png" alt="" class="wp-image-9871" width="385" height="260" srcset="https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png 696w, https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png 300w" sizes="(max-width: 385px) 100vw, 385px" /><figcaption class="wp-element-caption">Hyperparameters are the little screws that we can adjust to tune a predictive model.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading" id="h-efficient-hyperparameter-tuning-with-exhaustive-grid-search">Efficient Hyperparameter Tuning with Exhaustive&nbsp;Grid Search</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">When we train a machine learning model, it is usually unclear which hyperparameters lead to good results. While there are estimates and rules of thumb, there is often no way to avoid trying out hyperparameters in experiments. However, machine learning models often have several hyperparameters that affect the model&#8217;s performance in a nonlinear way.</p>



<p class="wp-block-paragraph">We can use grid search to automate searching for optimal model hyperparameters. The search grid algorithm exhaustively generates models from parameter permutations of a grid of parameter values. Let&#8217;s take a look at how this works.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Hyperparameter Tuning with Grid Search: How it Works</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The idea behind the grid search technique is quite simple. We have a model with parameters, and the challenge is to test various configurations until we are satisfied with the result. Grid search is exhaustive in that it tests all permutations of a parameter grid. The number of model variants results from the parameter grid and the specified parameters.</p>



<p class="wp-block-paragraph">The grid search algorithm requires us to provide the following information:</p>



<ul class="wp-block-list">
<li>The hyperparameters that we want to configure (e.g., tree depth)</li>



<li>For each hyperparameter, a range of values (e.g., [50, 100, 150])</li>



<li>A performance metric so that the algorithm knows how to measure performance (e.g., accuracy for a classification model)</li>
</ul>



<p class="wp-block-paragraph">For example, imagine we have a range of [16, 32, and 64] for n_estimators and a range of [8, 16, and 32] for max_depth. Then, the search grid will test 9 different parameter configurations.</p>



<h3 class="wp-block-heading">Early Stopping</h3>



<p class="wp-block-paragraph">Running parameter optimization against an entire grid can be time-consuming, but there are ways to shorten the process. Depending on how much time you want to invest in the search process, you can test all combinations exhaustively or shorten the process with an early stopping logic. A stopping logic defines that the search ends early when a specific criterion is met. Such a criterion could be, for example, that newly trained models underperform the average performance of previously trained models by a certain value. In this case, the search stops and returns the best models found up to that point. When you define a large search grid with many parameters, defining an early stopping logic is recommended. </p>



<h3 class="wp-block-heading" id="h-strengths-and-weaknesses-of-grid-search">Strengths and Weaknesses of Grid Search</h3>



<p class="wp-block-paragraph">The advantage of the grid search is that the algorithm automatically identifies the optimal parameter configuration from the parameter grid. However, the number of possible configurations increases exponentially with the number of values in the parameter grid. So, in practice, defining a sparse parameter grid or defining stopping criteria is essential.</p>



<p class="wp-block-paragraph">Grid Search is only one of several techniques that can be used to tune the hyperparameters of a predictive model. Alternative techniques include  <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" target="_blank" rel="noreferrer noopener">Random Search</a>. In contrast to Grid Search, Random Search is a none exhaustive hyperparameter-tuning technique, which randomly selects and tests specific configurations from a predefined search space. Further optimization techniques are Bayesian Search and Gradient Descent.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="2354" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/image-8-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/07/image-8.png" data-orig-size="384,196" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/07/image-8.png" src="https://www.relataly.com/wp-content/uploads/2020/07/image-8.png" alt="Grid Search - A search grid with two hyperparameters and three hyperparameter values" class="wp-image-2354" width="397" height="201"/><figcaption class="wp-element-caption">A parameter grid with two hyperparameters and respectively three hyperparameter values</figcaption></figure>
</div></div>
</div>



<h2 class="wp-block-heading">Evaluation Metrics</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The question of which metric to optimize against inevitably arises when we talk about optimization. Generally, all common metrics available for classification or regression come into question.</p>



<p class="wp-block-paragraph">Metrics for regression (<a href="https://www.relataly.com/regression-error-metrics-python/923/" target="_blank" rel="noreferrer noopener">more detailed description</a>)</p>



<ul class="wp-block-list">
<li>Mean Absolute Error (MAE) </li>



<li>Root Mean Squared Absolute Error (RMSAE) </li>



<li>Relative Squared Error (RSE).</li>
</ul>



<p class="wp-block-paragraph">Metrics for classification (<a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">more detailed description</a>)</p>



<ul class="wp-block-list">
<li>Accuracy</li>



<li>Precision</li>



<li>F-1 Score</li>



<li>Recall</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Tuning the Hyperparameters of a Random Decision Forest Classifier in Python using Grid Search</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">Now that we have familiarized ourselves with the basic concept of hyperparameter tuning, let&#8217;s move on to the Python hands-on part! In this part, we will work with the Titanic dataset. We will apply the grid search optimization technique to a classification model. We will develop our Machine Learning model based on the Titanic dataset.</p>



<p class="wp-block-paragraph">The sinking of the Titanic was one of the most catastrophic ship disasters, leading to more than 1500 casualties (The exact number is unknown due to several passengers being unregistered). The Titanic dataset contains a list of passengers with passenger information such as age, gender, cabin, ticket cost, etc., and whether they survived the Titanic sinking. The information about the passengers shows certain patterns that allow conclusions about the likelihood of the passengers surviving the accident. These data can be used to train a predictive model.</p>



<p class="wp-block-paragraph">In the following, we will use the survival flag as a label and passenger information as input for a classification model. The goal is to predict whether a passenger will survive the Titanic sinking or not. The algorithm will be a random decision forest algorithm that classifies the passengers into two groups, survivors and non-survivors. Once we have trained a baseline model, we will apply grid search to optimize the hyperparameters of this model and select the best model.</p>
</div>
</div>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_9836db-d0"><a class="kb-button kt-button button kb-btn_12cb2e-6a kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/016%20Hyperparameter%20Tuning%20of%20Random%20Decision%20Forests%20using%20Grid%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_f9b732-8b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://cdn.britannica.com/s:800x1000/72/153172-050-EB2F2D95/Titanic.jpg" alt="Operated by the White Star Line, RMS Titanic was the largest and most luxurious ocean liner of her time." width="357" height="227"/><figcaption class="wp-element-caption">Source: <em>The National Archives/Heritage-Images/Imagestate</em></figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have a Python environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the Python Machine Learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> to implement the random forest and the grid search technique. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-about-the-titanic-dataset">About the Titanic Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this article, we will be working with the popular titanic dataset for classification. The Titanic dataset is a well-known dataset that contains information about the passengers on the Titanic, a British passenger liner that sank in the North Atlantic Ocean in 1912 after colliding with an iceberg. The dataset includes variables such as the passenger&#8217;s name, age, fare, and class, as well as whether or not the passenger survived.</p>



<p class="wp-block-paragraph">The titanic dataset contains the following information on passengers of the titanic:</p>



<ul class="wp-block-list">
<li><strong>Survival</strong>: Survival 0 = No, 1 = Yes (Prediction Label)</li>



<li><strong>Pclass</strong>: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd</li>



<li><strong>Sex</strong>: Sex</li>



<li><strong>Age</strong>: Age in years</li>



<li><strong>SibSp</strong>: # of siblings/spouses aboard the Titanic</li>



<li><strong>Parch</strong>: # of parents/children aboard the Titanic</li>



<li><strong>Ticket</strong>: Ticket number</li>



<li><strong>Fare</strong>: Passenger fare</li>



<li><strong>Cabin</strong>: Cabin number</li>



<li><strong>Embarked</strong>: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton</li>
</ul>



<p class="wp-block-paragraph">The Survival column contains the prediction label, which states whether a passenger survived the sinking of the Titanic or not.</p>



<p class="wp-block-paragraph">You can download the titanic dataset from <a href="https://www.kaggle.com/c/titanic" target="_blank" rel="noreferrer noopener">the Kaggle website</a>. Once you have completed the download, you can place the dataset in the file path of your choice. Using the Kaggle Python environment, you can directly save the dataset into your Kaggle project.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7036" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/picture28/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png" data-orig-size="693,720" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Picture28" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png" src="https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png" alt="We can assume that the cabin location of the passengers had an impact on their chance to survive the sinking. Developing a machine learning model for prediction of titanic passenger survival and optimizing its hyperparameters using grid search" class="wp-image-7036" width="377" height="391" srcset="https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png 693w, https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png 289w" sizes="(max-width: 377px) 100vw, 377px" /><figcaption class="wp-element-caption">We can assume that the cabin location of the passengers had an impact on their chance of surviving the sinking. </figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Step #1 Load the Titanic Data</h2>



<p class="wp-block-paragraph">The following code will load the titanic data into our python project. If you have placed the data outside the path shown below, don&#8217;t forget to adjust the file path in the code.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

# set file path
filepath = &quot;data/titanic-grid-search/&quot;

# Load train and test datasets
titanic_train_df = pd.read_csv(filepath + 'titanic-train.csv')
titanic_test_df = pd.read_csv(filepath + 'titanic-test.csv')
titanic_train_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	PassengerId	Survived	Pclass	Name							Sex	Age	SibSp	Parch	Ticket				Fare	Cabin	Embarked
0	1			0			3		Braund, Mr. Owen Harris			male	22.0	1	0	A/5 21171			7.2500	NaN		S
1	2			1			1		Cumings, Mrs. John Bradley ...	female	38.0	1	0	PC 17599			71.2833	C85		C
2	3			1			3		Heikkinen, Miss. Laina			female	26.0	0	0	STON/O2. 3101282	7.9250	NaN		S
3	4			1			1		Futrelle, Mrs. Jacques ...		female	35.0	1	0	113803				53.1000	C123	S
4	5			0			3		Allen, Mr. William Henry		male	35.0	0	0	373450				8.0500	NaN		S</pre></div>



<h3 class="wp-block-heading" id="h-step-2-preprocessing-and-exploring-the-data">Step #2 Preprocessing and Exploring the Data</h3>



<p class="wp-block-paragraph">Before we can train a model, we preprocess the data: </p>



<ul class="wp-block-list">
<li>Firstly, we clean the missing values in the data and replace them with the mean. </li>



<li>Second, we transform categorical features (<em>Embarked </em>and <em>Sex</em>) into numeric values. In addition, we will delete some columns to reduce model complexity. </li>



<li>Finally, we delete the prediction label from the training dataset and place it into a separate dataset named y_df.</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define a function for preprocessing the train and test data 
def preprocess(df):
    
    # Delete some columns that we will not use
    new_df = df[df.columns[~df.columns.isin(['Cabin', 'PassengerId', 'Name', 'Ticket'])]].copy()
    
    # Replace missing values
    for i in new_df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns:
        new_df[i].fillna(new_df[i].mean(), inplace=True)
    new_df['Embarked'].fillna('C', inplace=True)
    
    # Decode categorical values as integer values
    new_df_b = new_df.copy()
    new_df_b['Sex'] = np.where(new_df_b['Sex']=='male', 0, 1) 
    
    cleanups = {&quot;Sex&quot;:     {&quot;m&quot;: 0, &quot;f&quot;: 1},
                &quot;Embarked&quot;: {&quot;S&quot;: 1, &quot;Q&quot;: 2, &quot;C&quot;: 3}}
    new_df_b.replace(cleanups, inplace=True)
    x = new_df_b.drop(columns=['Survived'])
    y = new_df_b['Survived']  
    
    return x, y

# Create the training dataset train_df and the label dataset
x_df, y_df = preprocess(train_df)
x_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		Pclass	Sex	Age		SibSp	Parch	Fare	Embarked
0		3		0	22.0	1		0		7.2500	1
1		1		1	38.0	1		0		71.2833	3
2		3		1	26.0	0		0		7.9250	1
3		1		1	35.0	1		0		53.1000	1
4		3		0	35.0	0		0		8.0500	1</pre></div>



<p class="wp-block-paragraph">Let&#8217;s take a quick look at the data by creating paired plots for the columns of our data set. Pair plots help us to understand the relationships between pairs of variables in a dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># # Create histograms for feature columns separated by prediction label value
df_plot = titanic_train_df.copy()

# class_columnname = 'Churn'
sns.pairplot(df_plot, hue=&quot;Survived&quot;, height=2.5, palette='muted')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="6803" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/hyperparameter-tuning-random-decision-forests/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png" data-orig-size="1124,1062" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="hyperparameter-tuning-random-decision-forests" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png" src="https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests-1024x968.png" alt="paired plot created with seaborn" class="wp-image-6803" width="768" height="726" srcset="https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png 1124w" sizes="(max-width: 768px) 100vw, 768px" /></figure>



<p class="wp-block-paragraph">The histograms tell us various things. For example, most passengers were between 25 and 35 years old. In addition, we can see that most passengers had low-fare tickets, while some passengers had significantly more expensive tickets. </p>



<h3 class="wp-block-heading" id="h-step-3-splitting-the-data">Step #3 Splitting the Data</h3>



<p class="wp-block-paragraph">Next, we will split the data set into training data (x_train, y_train) and test data (x_test, y_test) using a split ratio of 70/30.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)</pre></div>



<h3 class="wp-block-heading" id="h-step-4-building-a-single-random-forest-model">Step #4 Building a Single Random Forest Model</h3>



<p class="wp-block-paragraph">After completing the preprocessing, we can train the first model. The model uses a random forest algorithm. The random forest algorithm has a large number of hyperparameters.</p>



<h4 class="wp-block-heading" id="h-4-1-about-the-random-forest-algorithm">4.1 About the Random Forest Algorithm</h4>



<p class="wp-block-paragraph">A random forest is a robust predictive algorithm that can handle classification and regression tasks. As a so-called ensemble model, the random forest considers predictions from a group of several independent estimators. </p>



<p class="wp-block-paragraph">Random decision forests have several hyperparameters that we can use to influence their behavior. However, not all of these hyperparameters have the same influence on model performance. Limiting the number of models by defining a sparse parameter grid is essential to reduce the amount of time needed to test the hyperparameters. </p>



<p class="wp-block-paragraph">Therefore, we restrict the hyperparameters optimized by the grid search approach to the following two:</p>



<ul class="wp-block-list">
<li><strong>n_estimators</strong> determine the number of decision trees in the forest</li>



<li><strong>max_depth</strong> defines the maximum number of branches in each decision tree</li>
</ul>



<p class="wp-block-paragraph">In the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier" target="_blank" rel="noreferrer noopener">scikit-learn documentation</a>, you also find a full list of available hyperparameters. For the rest of these hyperparameters, we will use the default value defined by scikit-learn.</p>



<h4 class="wp-block-heading" id="h-4-2-implementing-a-random-forest-model">4.2 Implementing a Random Forest Model</h4>



<p class="wp-block-paragraph">We train a simple baseline model and make a test prediction with the x_test dataset. Then we visualize the performance of the baseline model in a confusion matrix:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a single random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators = 100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="486" height="452" data-attachment-id="8471" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/output-1-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/output-1.png" data-orig-size="486,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/output-1.png" src="https://www.relataly.com/wp-content/uploads/2022/05/output-1.png" alt="Confusion matrix of the best-guess random forest model before hyperparameter tuning" class="wp-image-8471" srcset="https://www.relataly.com/wp-content/uploads/2022/05/output-1.png 486w, https://www.relataly.com/wp-content/uploads/2022/05/output-1.png 300w" sizes="(max-width: 486px) 100vw, 486px" /><figcaption class="wp-element-caption">Confusion matrix of the best-guess random forest model</figcaption></figure>



<p class="wp-block-paragraph">Our best-guess model accurately predicted that 151 passengers would not survive. The dark-blue number in the top-left is the group of titanic passengers that did not survive the sinking, and our model classified them correctly as non-survivors. The green area below shows the passengers who survived the sinking and were correctly classified. The other sections show the number of times our model was wrong. </p>



<p class="wp-block-paragraph">In total, these results correspond to a model accuracy of 80%. Considering that this was a best-guess model, these results are pretty good. However, we can further optimize these results by using the grid search approach for hyperparameter tuning.</p>



<h3 class="wp-block-heading" id="h-step-5-hyperparameter-tuning-a-classification-model-using-the-grid-search-technique">Step #5 Hyperparameter Tuning a Classification Model using the Grid Search Technique</h3>



<p class="wp-block-paragraph">By comparing the performance of different model configurations, we can find the best set of hyperparameters that yields the highest accuracy. This approach is a powerful tool for fine-tuning machine learning models and improving their performance. So let&#8217;s get started and see if we can beat the results of our best-guess model using the grid search technique! </p>



<h4 class="wp-block-heading">Training and Tuning the Model</h4>



<p class="wp-block-paragraph">Next, we will use the grid search technique to optimize a random decision forest model that predicts the survival of Titanic passengers. We&#8217;ll define a grid of hyperparameter values in Python and then use the Scikit-learn library to train and test the model with different hyperparameter configurations. First, we will define a parameter range:</p>



<ul class="wp-block-list">
<li>max_depth = [2, 8, 16]</li>



<li>n_estimators = [64, 128, 256]</li>
</ul>



<p class="wp-block-paragraph">We leave the other parameters at their default value. In addition, we need to define against which metric we want the grid search algorithm to evaluate the model performance. Since we have no personal preference and our dataset is well-balanced, we choose the mean test score as the evaluation metric. Then we run the grid search algorithm. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define Parameters
max_depth=[2, 8, 16]
n_estimators = [64, 128, 256]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# Build the grid search
dfrst = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv = 5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print(&quot;Best: {0}, using {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Best: [0.79611613 0.78005161 0.79290323 0.81387097 0.82187097 0.81867097 
 0.78818065 0.78816774 0.78498065], using {'max_depth': 8, 'n_estimators': 128}
 
 	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_max_depth	param_n_estimators	params	split0_test_score				split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.057045		0.001108		0.005001		0.000001		2				64					{'max_depth': 2, 'n_estimators': 64}	0.824				0.800				0.784				0.774194			0.798387		0.796116	0.016883	4
1	0.112051		0.002088		0.009490		0.000775		2				128					{'max_depth': 2, 'n_estimators': 128}	0.760				0.824				0.784				0.750000			0.782258		0.780052	0.025523	9
2	0.221600		0.003740		0.016487		0.000448		2				256					{'max_depth': 2, 'n_estimators': 256}	0.792				0.824				0.784				0.774194			0.790323		0.792903	0.016756	5
3	0.061998		0.001410		0.005801		0.000400		8				64					{'max_depth': 8, 'n_estimators': 64}	0.784				0.824				0.792				0.806452			0.862903		0.813871	0.028044	3
4	0.122886		0.002652		0.009587		0.000480		8				128					{'max_depth': 8, 'n_estimators': 128}	0.784				0.848				0.808				0.806452			0.862903		0.821871	0.029089	1
5	0.250295		0.007654		0.018557		0.000836		8				256					{'max_depth': 8, 'n_estimators': 256}	0.800				0.824				0.800				0.806452			0.862903		0.818671	0.023797	2
6	0.065602		0.000505		0.005800		0.000399		16				64					{'max_depth': 16, 'n_estimators': 64}	0.736				0.808				0.784				0.766129			0.846774		0.788181	0.037557	6
7	0.127662		0.003297		0.008600		0.004080		16				128					{'max_depth': 16, 'n_estimators': 128}	0.752				0.800				0.784				0.758065			0.846774		0.788168	0.034078	7
8	0.259617		0.003121		0.018873		0.000537		16				256					{'max_depth': 16, 'n_estimators': 256}	0.752				0.784				0.776				0.766129			0.846774		0.784981	0.032690	8</pre></div>



<p class="wp-block-paragraph">The list above is an overview of the tested model configurations, ranked by their prediction scores. Model number five achieved the best results. The parameters of this model are a maximum depth of 8 and several estimators of 256. </p>



<h4 class="wp-block-heading">Model Evaluation</h4>



<p class="wp-block-paragraph">We select the best model and use it to predict the test data set. We visualize the results in another confusion matrix. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2297" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/image-5-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/07/image-5.png" data-orig-size="486,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/07/image-5.png" src="https://www.relataly.com/wp-content/uploads/2020/07/image-5.png" alt="confusion matrix on the best model returned by the grid search hyperparameter tuning approach in Python" class="wp-image-2297" width="529" height="491" srcset="https://www.relataly.com/wp-content/uploads/2020/07/image-5.png 486w, https://www.relataly.com/wp-content/uploads/2020/07/image-5.png 300w" sizes="(max-width: 529px) 100vw, 529px" /><figcaption class="wp-element-caption">Confusion matrix of the best grid search model</figcaption></figure>



<p class="wp-block-paragraph">The confusion matrix shows the best model results from the grid search technique. The result is an overall model accuracy of 83,5 %, which shows that the best grid search model outperforms our initial best guess model. This optimal model has correctly classified that 148 passengers would not survive and 76 passengers would survive. In 44 cases, the model was wrong.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">This article has shown how we can use grid Search in Python to efficiently search for the optimal hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how to use grid search to try out all permutations of a predefined parameter grid. </p>



<p class="wp-block-paragraph">In the hands-on part of this article, we developed a random decision forest that predicts the survival of Titanic passengers using Python and scikit-learn. The grid search technique applies not only to classification models but can also be used to optimize the performance of regression models. First, we developed a baseline model with best-guess parameters. Subsequently, we defined a parameter grid and used the grid search technique to tune the hyperparameters of the random decision forest. In this way, we quickly identified a configuration that outperforms our initial baseline model. In this way, we have demonstrated how Gid Search can help optimize the classification model parameters. </p>



<p class="wp-block-paragraph">I hope this article was helpful. I am always interested to learn and improve. So, if you have any questions or suggestions, please write them in the comments. </p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<div style="display: inline-block;">
  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1999579577&amp;asins=1999579577&amp;linkId=91d862698bf9010ff4c09539e4c49bf4&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1839217715&amp;asins=1839217715&amp;linkId=356ba074068849ff54393f527190825d&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph"></p>
<p>The post <a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/">Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2261</post-id>	</item>
	</channel>
</rss>
