<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Classification (two-class) Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/machine-learning/classification/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/machine-learning/classification/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Sat, 27 May 2023 10:38:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Classification (two-class) Archives - relataly.com</title>
	<link>https://www.relataly.com/category/machine-learning/classification/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</title>
		<link>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/</link>
					<comments>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Thu, 07 Apr 2022 17:55:36 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Cross-Validation]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Hyperparameter Tuning]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Random Forest Regression]]></category>
		<category><![CDATA[Random Search]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Tuning Random Decision Forests]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=6875</guid>

					<description><![CDATA[<p>Perfecting your machine learning model&#8217;s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble ... <a title="Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python" class="read-more" href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" aria-label="Read more about Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Perfecting your machine learning model&#8217;s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble model, and heavily influence its performance. Unlike model parameters, which are discovered during training by the machine learning algorithm, hyperparameters require pre-specification.</p>



<p class="wp-block-paragraph">In this Python tutorial, we&#8217;ll show you how to use Random Search to optimize a regression model&#8217;s hyperparameters. Our illustrative example uses a Random Decision Forest regressor to predict house prices. However, the fundamental principles you&#8217;ll learn apply to any model with tunable hyperparameters. So why painstakingly fine-tune hyperparameters manually when Random Search can handle the task efficiently?</p>



<p class="wp-block-paragraph">Here&#8217;s a preview of what this Python tutorial entails:</p>



<ol class="wp-block-list">
<li>A brief overview of how Random Search operates and instances where it might be preferable to Grid Search.</li>



<li>A hands-on Python tutorial featuring a public house price dataset from Kaggle.com. The aim here is to train a regression model capable of predicting US house prices based on various properties.</li>



<li>Training a &#8216;best-guess&#8217; model in Python, followed by using Random Search to discover a model with enhanced performance.</li>



<li>Finally, we&#8217;ll implement cross-validation to validate our models&#8217; performance.</li>
</ol>



<p class="wp-block-paragraph">By the end of this tutorial, you&#8217;ll be well-equipped to let Random Search efficiently fine-tune your model&#8217;s hyperparameters, freeing up your time for other crucial tasks.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-hyperparameter-tuning">Hyperparameter Tuning </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, we need to specify hyperparameters before the training. It is usually impossible to find the best configuration without testing different configurations. </p>



<p class="wp-block-paragraph">Searching for a suitable model configuration is called &#8220;hyperparameter tuning&#8221; or &#8220;hyperparameter optimization.&#8221; Different machine learning algorithms expose different hyperparameters, each with its own range of valid values. For example, a random decision forest lets us configure the number of trees, the maximum tree depth, and the minimum number of samples required to split a node. </p>



<p class="wp-block-paragraph">The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it gets to find an optimal model. We can use random search to automate this process.</p>
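
<p class="wp-block-paragraph">To make the notion of a search space concrete, here is a minimal sketch that counts the configurations in a small grid. The hyperparameter names follow the random forest estimators in Scikit-learn; the candidate values are illustrative assumptions and are not taken from this tutorial.</p>

```python
import math

# Hypothetical candidate values for three random forest hyperparameters
search_space = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [4, 8, 16, None],
    "min_samples_split": [2, 5, 10],
}

# The number of distinct configurations is the product of the list lengths
n_configurations = math.prod(len(values) for values in search_space.values())
print(n_configurations)  # 4 * 4 * 3 = 48
```

<p class="wp-block-paragraph">Adding a fourth hyperparameter with five candidate values would already raise the count to 240, which is why exhaustive testing quickly becomes impractical.</p>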
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="506" height="501" data-attachment-id="12416" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/random-hyperparameter-tuning-machine-learning/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" data-orig-size="506,501" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="random-hyperparameter-tuning-machine-learning" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" src="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" alt="Random search can be an efficient way to tune the hyperparameters of a machine learning model. Image generated with Midjourney. Image of exploding dices with different colors. " class="wp-image-12416" srcset="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 506w, https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 140w" sizes="(max-width: 506px) 100vw, 506px" /><figcaption class="wp-element-caption">Random search can be an efficient way to tune the hyperparameters of a machine learning model. 
Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Techniques for Tuning Hyperparameters</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning algorithm to optimize its performance on a specific dataset or task. Several techniques can be used for hyperparameter tuning, including:</p>



<ol class="wp-block-list">
<li><strong><a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">Grid Search</a>:</strong> Grid search is a brute-force search algorithm that systematically evaluates a given set of hyperparameter values by training and evaluating a model for each combination of values. It is a simple and effective technique, but it can be computationally expensive, especially for large search spaces or models that are slow to train.</li>



<li><strong>Random Search: </strong>As mentioned, random search is an alternative to grid search that randomly samples a given set of hyperparameter values rather than evaluating all possible combinations. It can be more efficient than grid search, but it may not find the optimal set of hyperparameters.</li>



<li><strong>Bayesian Optimization:</strong> Bayesian optimization is a probabilistic approach to hyperparameter tuning that uses Bayesian inference to model which hyperparameter values are likely to produce good performance. It can be more efficient and effective than grid search or random search, but it can be more challenging to implement and interpret.</li>



<li><strong>Genetic Algorithms:</strong> Genetic algorithms are optimization algorithms inspired by the principles of natural selection and genetics. They maintain a population of candidate solutions that are iteratively evolved and selected based on their fitness or performance to find a well-performing set of hyperparameters.</li>
</ol>



<p class="wp-block-paragraph">In this article, we specifically look at the Random Search technique.</p>
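
<p class="wp-block-paragraph">The difference between the first two techniques can be sketched with two helper classes from Scikit-learn: <em>ParameterGrid</em> enumerates every combination, while <em>ParameterSampler</em> draws a fixed number of random candidates from the same space. The grid values and the number of samples below are assumptions chosen for illustration.</p>

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical grid of candidate values (assumed for illustration)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 16],
}

# Grid search enumerates every combination: 3 * 3 = 9 candidates
grid_candidates = list(ParameterGrid(param_grid))

# Random search draws only n_iter candidates from the same space
random_candidates = list(ParameterSampler(param_grid, n_iter=4, random_state=0))

print(len(grid_candidates), len(random_candidates))
for candidate in random_candidates:
    print(candidate)
```

<p class="wp-block-paragraph">Each candidate is a plain dictionary of hyperparameter values. In this sketch, random search would train four models where grid search would train nine.</p>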
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="892" height="512" data-attachment-id="12417" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/image-of-an-old-mechanic-tuning-a-car-hyperparameter-tuning-machine-learning-python-tutorial-scikit-learn/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" data-orig-size="892,512" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" src="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" alt="You can spend much time tuning a machine learning model. Image generated with Midjourney. portrait of an old mechanic working on a car. 
relataly.com" class="wp-image-12417" srcset="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 892w, https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 768w" sizes="(max-width: 892px) 100vw, 892px" /><figcaption class="wp-element-caption">You can spend much time tuning a machine learning model. Image generated with Midjourney.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Random Search?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The random search algorithm trains models on hyperparameter combinations that are randomly sampled from a grid or distribution of parameter values. The idea behind the randomized approach is that testing a limited number of random configurations is often enough to identify a good model. We can use random search both for regression and classification models.</p>



<p class="wp-block-paragraph">Random Search and Grid Search are the most popular techniques for hyperparameter tuning, and both methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible combinations. The technique works well for testing a small number of configurations already known to work well. </p>



<p class="wp-block-paragraph">As long as both search space and training time are small, the grid search technique is excellent for finding the best model. However, the number of model variants increases exponentially with the number of hyperparameters. For large search spaces or models that are expensive to train, random search is often more efficient.</p>



<p class="wp-block-paragraph">Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. In return, it is usually much faster than grid search and often delivers a suitable model in a short time.</p>
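
<p class="wp-block-paragraph">As a rough sketch of the idea, the following example runs <em>RandomizedSearchCV</em> from Scikit-learn on a small synthetic regression problem. The synthetic data, the sampling ranges, and the number of iterations are assumptions made to keep the example fast; they are not the settings used for the house price model.</p>

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data stands in for a real dataset (assumption)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Distributions to sample from; the ranges are illustrative assumptions
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 10),
    "min_samples_split": randint(2, 10),
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # number of random configurations to try
    cv=3,  # 3-fold cross-validation for each configuration
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MAE of the best configuration
```

<p class="wp-block-paragraph">The parameter <em>n_iter</em> controls the trade-off described above: more iterations cover more of the search space but take proportionally longer to run.</p>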
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-full is-resized"><img decoding="async" data-attachment-id="6924" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/image-16-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" data-orig-size="640,469" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-16" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" alt="random decision forest python,
hyperparameter tuning,
comparison between random search and grid search" class="wp-image-6924" width="409" height="301" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png 640w, https://www.relataly.com/wp-content/uploads/2022/04/image-16.png 300w" sizes="(max-width: 409px) 100vw, 409px" /><figcaption class="wp-element-caption">Random Search vs. Exhaustive Grid Search</figcaption></figure>
</div></div>
</div>



<div style="height:48px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-tuning-the-hyperparameters-of-a-random-decision-forest-regressor-in-python-using-random-search">Tuning the Hyperparameters of a Random Decision Forest Regressor in Python using Random Search</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we delve into the use of the Random Search algorithm in Python, specifically for predicting house prices. We&#8217;ll be using a dataset rich in diverse house characteristics. Various elements, such as data quality and quantity, model intricacy, the selection of machine learning algorithms, and housing market stability, significantly influence the accuracy of house price predictions.</p>



<p class="wp-block-paragraph">Our initial model employs a Random Decision Forest algorithm, which we&#8217;ll optimize using a random search approach for hyperparameter tuning. By identifying and implementing a more advantageous configuration, we aim to enhance our model&#8217;s performance significantly.</p>



<p class="wp-block-paragraph">Here&#8217;s a concise outline of the steps we&#8217;ll undertake:</p>



<ol class="wp-block-list">
<li>Loading the house price dataset</li>



<li>Exploring the dataset intricacies</li>



<li>Preparing the data for modeling</li>



<li>Training a baseline Random Decision Forest model</li>



<li>Implementing a random search approach for model optimization</li>



<li>Measuring and evaluating the performance of our optimized model</li>
</ol>



<p class="wp-block-paragraph">Through this step-by-step guide, you&#8217;ll learn to enhance model performance, further refining your understanding of Random Search algorithm implementation in Python.</p>



<p class="wp-block-paragraph">The Python code is available in the relataly GitHub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_f9d778-26"><a class="kb-button kt-button button kb-btn_b5d394-00 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/016%20Hyperparameter%20Tuning%20of%20Random%20Decision%20Forests%20using%20Grid%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_ce738a-fc kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="504" data-attachment-id="12422" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/isometric-house-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" data-orig-size="505,504" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-house-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" alt="Now that we have trained a house price prediction model, we can use it to assess the price of new houses. Image generated with Midjourney. Python machine learning tutorial. relataly.com" class="wp-image-12422" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Once we have trained a house price prediction model, we can use it to assess the price of new houses. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before starting the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the Python Machine Learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> to implement the random forest and the random search technique, and Seaborn for visualization. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
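
<p class="wp-block-paragraph">If you are unsure whether everything is installed, a short check like the following can help. It is a minimal sketch that assumes the import names listed in the code; note that Scikit-learn is imported as <em>sklearn</em>.</p>

```python
import importlib.util

# Import names of the packages used in this tutorial
# (scikit-learn is imported as "sklearn")
required = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```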
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-house-price-prediction-about-the-use-case-and-the-data">House Price Prediction: About the Use Case and the Data</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">House price prediction is the process of using statistical and machine learning techniques to predict the future value of a house. This can be useful for a variety of applications, such as helping homeowners and real estate professionals to make informed decisions about buying and selling properties. In order to make accurate predictions, it is important to have access to high-quality data about the housing market.</p>



<p class="wp-block-paragraph">In this tutorial, we will work with a house price dataset from the <a href="//www.kaggle.com/c/house-prices-advanced-regression-techniques" target="_blank" rel="noreferrer noopener">house price regression challenge on Kaggle.com</a>. The dataset is available via a GitHub repository. It contains information about 4800 houses sold between 2016 and 2020 in the US. The data includes the sale price and a list of 48 house characteristics, such as:</p>



<ul class="wp-block-list">
<li>Year &#8211; The year of construction </li>



<li>SaleYear &#8211; The year in which the house was sold </li>



<li>Lot Area &#8211; The lot area of the house </li>



<li>Quality &#8211; The overall quality of the house from one (lowest) to ten (highest)</li>



<li>Road &#8211; The type of road, e.g., paved, etc. </li>



<li>Utility &#8211; The type of the utility </li>



<li>Park Lot Area &#8211; The parking space included with the property </li>



<li>Room number &#8211; The number of rooms </li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="505" data-attachment-id="12421" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/isometric-house-collection-python-machine-learning-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" data-orig-size="505,505" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-house-collection-python-machine-learning-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" alt="Predicting house prices with machine learning. Image generated with Midjourney. Isometric view of houses. relataly.com" class="wp-image-12421" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Predicting house prices with machine learning. 
Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.ensemble import RandomForestRegressor

# Source: 
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Load the training dataset
path = &quot;https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv&quot;
df = pd.read_csv(path)
print(df.columns)
df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60			RL			65.0		8450	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2008	WD			Normal			208500
1	2	20			RL			80.0		9600	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		5		2007	WD			Normal			181500
2	3	60			RL			68.0		11250	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		9		2008	WD			Normal			223500
3	4	70			RL			60.0		9550	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2006	WD			Abnorml			140000
4	5	60			RL			84.0		14260	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		12		2008	WD			Normal			250000
5 rows × 81 columns</pre></div>



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p class="wp-block-paragraph">Before jumping into preprocessing and model training, let&#8217;s quickly explore the data. A distribution plot helps us understand how the values of the target variable (the sale price) are distributed across the dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot the distribution of the target variable (sale price)
ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2)
plt.title('Sale Price Distribution')</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="944" height="440" data-attachment-id="8424" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/histplot/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" data-orig-size="944,440" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="histplot" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" src="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" alt="random search hyperparameter tuning python. random forest regression,
sale price distribution" class="wp-image-8424" srcset="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 944w, https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 768w" sizes="(max-width: 944px) 100vw, 944px" /></figure>



<p class="wp-block-paragraph">For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between the sale price, lot area, and the house&#8217;s overall quality.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Scatterplot of sale price vs. lot area, colored by overall quality
plt.figure(figsize=(16,6))
df_features = df[['SalePrice', 'LotArea', 'OverallQual']]
sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual')
plt.title('Sale Price vs. Lot Area by Overall Quality')</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="966" height="387" data-attachment-id="8430" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/overall-quality-vs-lotarea-depending-on-sale-price/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" data-orig-size="966,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Overall-Quality-vs-LotArea-depending-on-Sale-Price" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" src="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" alt="random search hyperparameter tuning python. random forest regression,
scatter plot, feature selection" class="wp-image-8430" srcset="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 966w, https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 768w" sizes="(max-width: 966px) 100vw, 966px" /></figure>



<p class="wp-block-paragraph">As expected, the scatterplot shows that the sale price increases with the overall quality. The lot area, on the other hand, appears to have only a minor effect on the sale price.</p>
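


<p class="wp-block-paragraph">A scatterplot gives a visual impression only. As a rough complement, the linear relation between each feature and the target can be quantified with a correlation coefficient. The following is a minimal sketch under stated assumptions: it uses a synthetic stand-in frame (df_demo, with made-up values) so that it runs on its own; with the real data, the df loaded in Step #1 would take its place.</p>

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the house price data (an assumption for illustration);
# with the real dataset, use the `df` loaded in Step #1 instead.
rng = np.random.default_rng(0)
n = 500
overall_qual = rng.integers(1, 11, size=n)
lot_area = rng.normal(10000, 3000, size=n).clip(1500, None)
sale_price = 20000 * overall_qual + 1.5 * lot_area + rng.normal(0, 20000, size=n)

df_demo = pd.DataFrame({'SalePrice': sale_price,
                        'OverallQual': overall_qual,
                        'LotArea': lot_area})

# Pearson correlation of each feature with the target, strongest first
corr_with_price = df_demo.corr()['SalePrice'].drop('SalePrice').sort_values(ascending=False)
print(corr_with_price)
```

<p class="wp-block-paragraph">Note that a correlation coefficient only captures linear relations, so it should be read as a first hint rather than a final verdict on a feature.</p>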



<h3 class="wp-block-heading">Step #3 Data Preprocessing</h3>



<p class="wp-block-paragraph">Next, we prepare the data for use as input to train a regression model. Because we want to keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables with integer dummy values.</p>



<p class="wp-block-paragraph">To ensure that our regression model does not know the target variable, we separate house price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def preprocessFeatures(df):   
    # Define a list of relevant features
    feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond']
    df_dummy = pd.get_dummies(df[feature_list])
    # Optionally drop records with missing values
    # df_dummy = df_dummy.dropna()
    return df_dummy

df_base = preprocessFeatures(df)

# Separate the features (x) from the target (y) and split both into train and test sets
x = df_base.drop(columns=['SalePrice'])
y = df_base['SalePrice']
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		OverallQual	GarageArea	LotArea	OverallCond	Utilities_AllPub	Utilities_NoSeWa
682		6			431			2887	5			1					0
960		5			0			7207	7			1					0
1384	6			280			9060	5			1					0
1100	2			246			8400	5			1					0
416		6			440			7844	7			1					0</pre></div>
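


<p class="wp-block-paragraph">To see the encoding and the split in isolation, here is a small sketch on a toy frame. The values are made up for illustration; only the column names mirror the dataset.</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (made-up values) with one numeric and one categorical feature
toy = pd.DataFrame({
    'SalePrice': [200000, 150000, 120000, 310000, 95000,
                  180000, 220000, 140000, 175000, 260000],
    'OverallQual': [7, 6, 5, 9, 4, 6, 7, 5, 6, 8],
    'Utilities': ['AllPub', 'AllPub', 'NoSeWa', 'AllPub', 'AllPub',
                  'AllPub', 'NoSeWa', 'AllPub', 'AllPub', 'AllPub'],
})

# get_dummies leaves numeric columns untouched and expands the
# categorical column into one indicator column per category
encoded = pd.get_dummies(toy)

# Separate the target from the features before splitting
x = encoded.drop(columns=['SalePrice'])
y = encoded['SalePrice']
x_tr, x_te, y_tr, y_te = train_test_split(x, y, train_size=0.7, random_state=0)

print(list(x.columns))
print(x_tr.shape, x_te.shape)
```

<p class="wp-block-paragraph">Dropping the target column from x before the split is what keeps the model from seeing the value it is supposed to predict.</p>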



<h3 class="wp-block-heading" id="h-step-4-train-different-regression-models-using-random-search">Step #4 Train Different Regression Models using Random Search</h3>



<p class="wp-block-paragraph">Now that the dataset is ready, we can train the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then draws n random combinations of parameter values from these ranges and trains one model for each of them.</p>



<p class="wp-block-paragraph">We use the RandomizedSearchCV class from the scikit-learn package. The &#8220;CV&#8221; in the class name stands for cross-validation. Cross-validation involves splitting the data into subsets (folds) and rotating them between training and validation runs. This way, each model is trained and tested multiple times on different data partitions. When the search algorithm finally evaluates a model configuration, it averages the results from the individual folds into a mean test score.</p>
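


<p class="wp-block-paragraph">The fold rotation can be sketched as follows. This is an illustration on a small synthetic regression problem (an assumption, not the house price data); KFold and cross_val_score are used here to make the individual folds visible.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic regression problem (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Five folds: each sample is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(cv.split(X)):
    print('fold', fold, 'train size', len(train_idx), 'validation size', len(val_idx))

# One score per fold (R^2 by default for regressors); the mean of the
# fold scores is the kind of value reported as mean_test_score
model = RandomForestRegressor(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores, scores.mean())
```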



<p class="wp-block-paragraph">We use a Random Decision Forest &#8211; a robust machine learning algorithm that can handle classification and regression tasks. As a so-called ensemble model, the Random Forest considers predictions from a set of multiple independent estimators. The estimator is an important parameter to pass to RandomizedSearchCV. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:</p>



<ul class="wp-block-list">
<li>max_leaf_nodes = [2, 3, 4, 5, 6, 7]</li>



<li>min_samples_split = [5, 10, 20, 50]</li>



<li>max_depth = [5,10,15,20]</li>



<li>max_features = [3,4,5]</li>



<li>n_estimators = [50, 100, 200]</li>
</ul>



<p class="wp-block-paragraph">These parameter ranges define the search space from which the randomized search algorithm (RandomizedSearchCV) will select random configurations. Other parameters will use default values as defined by <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier" target="_blank" rel="noreferrer noopener">scikit-learn</a>.</p>
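


<p class="wp-block-paragraph">To get a feeling for the size of this search space, the sketch below counts all possible combinations and then draws a random sample from them. It uses scikit-learn&#8217;s ParameterGrid and ParameterSampler helpers purely for illustration; the random_state is an assumption added to make the draw repeatable.</p>

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_distributions = {
    'max_leaf_nodes': [2, 3, 4, 5, 6, 7],
    'min_samples_split': [5, 10, 20, 50],
    'max_depth': [5, 10, 15, 20],
    'max_features': [3, 4, 5],
    'n_estimators': [50, 100, 200],
}

# Number of distinct configurations an exhaustive grid search would train
n_grid = len(ParameterGrid(param_distributions))
print('grid size:', n_grid)  # 6 * 4 * 4 * 3 * 3 = 864

# Random search draws only n_iter of them (random_state is fixed here
# purely to make the illustration repeatable)
sampled = list(ParameterSampler(param_distributions, n_iter=20, random_state=0))
print('sampled:', len(sampled))
print(sampled[0])
```

<p class="wp-block-paragraph">With 20 iterations, the search covers only a small fraction of the 864 possible configurations, which is where the time savings compared to an exhaustive search come from.</p>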



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define the Estimator and the Parameter Ranges
dt = RandomForestRegressor()
number_of_iterations = 20
max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

# Define the param distribution dictionary
param_distributions = dict(max_leaf_nodes=max_leaf_nodes, 
                           min_samples_split=min_samples_split, 
                           max_depth=max_depth,
                           max_features=max_features,
                           n_estimators=n_estimators)

# Build the randomized search
grid = RandomizedSearchCV(estimator=dt, 
                          param_distributions=param_distributions, 
                          n_iter=number_of_iterations, 
                          cv = 5)

grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print(&quot;Mean test scores: {0}\nBest params: {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean test scores: [0.68738293 0.49581669 0.52138751 0.61235299 0.65360944 0.61165147
 0.70392285 0.52278886 0.67687248 0.68219638 0.70031536 0.65842909
 0.51939338 0.70801017 0.70911805 0.69543885 0.67983801 0.60744371
 0.68270285 0.70741042]
Best params: {'n_estimators': 100, 'min_samples_split': 5, 'max_leaf_nodes': 7, 'max_features': 3, 'max_depth': 15}
	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_n_estimators	param_min_samples_split	param_max_leaf_nodes	param_max_features	param_max_depth	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.049196		0.002071		0.004074		0.000820		50					20						5	4	15	{'n_estimators': 50, 'min_samples_split': 20, ...	0.662973	0.705533	0.669520	0.702608	0.696280	0.687383	0.017637	7
1	0.041115		0.000554		0.003046		0.000094		50					50						2	3	10	{'n_estimators': 50, 'min_samples_split': 50, ...	0.490984	0.527231	0.426270	0.523086	0.511513	0.495817	0.036978	20
2	0.043325		0.000779		0.003486		0.000447		50					50						2	5	20	{'n_estimators': 50, 'min_samples_split': 50, ...	0.484524	0.559358	0.485459	0.517253	0.560343	0.521388	0.033545	18
3	0.162083		0.005665		0.012420		0.004788		200					5						3	3	20	{'n_estimators': 200, 'min_samples_split': 5, ...	0.586586	0.638341	0.573437	0.626793	0.636608	0.612353	0.027021	14
4	0.166659		0.003026		0.010958		0.000084		200					10						4	3	15	{'n_estimators': 200, 'min_samples_split': 10,...	0.633305	0.679161	0.623236	0.661864	0.670481	0.653609	0.021636	13</pre></div>



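<p class="wp-block-paragraph">The table shows the first five of the 20 tested configurations together with their cross-validation scores. The rank_test_score column indicates how each configuration ranks among all tested configurations (1 is best).</p>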
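


<p class="wp-block-paragraph">To list the configurations ordered by their cross-validation rank, the results frame can be sorted by the rank_test_score column. The sketch below shows this on a small synthetic problem with a reduced search space (both are assumptions so that the example runs on its own); the same sorting would apply to the results_df created above.</p>

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the sketch runs on its own (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=150)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={'max_depth': [2, 4, 8], 'n_estimators': [10, 20, 40]},
    n_iter=6, cv=3, random_state=0)
search.fit(X, y)

# Sort by rank so that the best configuration comes first
results_df = pd.DataFrame(search.cv_results_)
top = results_df.sort_values('rank_test_score')[['params', 'mean_test_score', 'rank_test_score']]
print(top.head())
```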



<h3 class="wp-block-heading" id="h-step-5-select-the-best-model-and-measure-performance"><strong>Step #5 Select the best Model and Measure Performance</strong></h3>



<p class="wp-block-paragraph">Finally, we select the best model from the search results via the &#8220;best_estimator_&#8221; attribute. We first use it to predict prices for the test dataset and print a comparison between actual and predicted sale prices. Afterward, we calculate the MAE and the MAPE to understand how the model performs on the overall test dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Select the best Model and Measure Performance
best_model = grid_results.best_estimator_
y_pred = best_model.predict(x_test)
y_df = pd.DataFrame(y_test)
y_df['PredictedPrice']=y_pred
y_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	SalePrice	PredictedPrice
529	200624		166037.831002
491	133000		135860.757958
459	110000		123030.336177
279	192000		206488.444327
655	88000		130453.604206</pre></div>



<p class="wp-block-paragraph">Next, let&#8217;s take a look at the prediction errors.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Mean Absolute Error (MAE); scikit-learn expects the order (y_true, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean Absolute Error (MAE): 29591.56 
Mean Absolute Percentage Error (MAPE): 15.57 %</pre></div>



<p class="wp-block-paragraph">On average, the predictions deviate from the actual sale price by roughly 16 %. Considering we only used a fraction of the available features and defined a small search space, there is much room for improvement.</p>
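


<p class="wp-block-paragraph">To make the two error metrics more tangible, the sketch below recomputes them by hand for the five rows of the comparison table above and checks the result against scikit-learn. It is an illustration on five samples only, so the numbers differ from the scores on the full test set.</p>

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# The five actual and predicted prices from the comparison table above
y_true = np.array([200624, 133000, 110000, 192000, 88000], dtype=float)
y_hat = np.array([166037.831002, 135860.757958, 123030.336177,
                  206488.444327, 130453.604206])

# MAE: average absolute deviation, in the unit of the target
mae_manual = np.mean(np.abs(y_true - y_hat))

# MAPE: average absolute deviation relative to the actual value
mape_manual = np.mean(np.abs(y_true - y_hat) / np.abs(y_true))

# scikit-learn expects the arguments in the order (y_true, y_pred)
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(mape_manual, mean_absolute_percentage_error(y_true, y_hat))

# Swapping the arguments leaves the MAE unchanged but changes the MAPE,
# because the denominator is then the prediction instead of the actual value
mape_swapped = mean_absolute_percentage_error(y_hat, y_true)
print(mape_swapped)
```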



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This article has shown how we can use random search in Python to efficiently search for a good hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how random search tries out a random sample of configurations from a predefined search space instead of testing every combination. The second part was a Python hands-on tutorial, in which you learned to use random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price for houses depending on several characteristics. Then we defined parameter ranges and tested random combinations. In this way, we quickly identified a configuration that outperforms our initial baseline model.</p>



<p class="wp-block-paragraph">Remember that a random search efficiently identifies a good-performing model but does not necessarily return the best-performing one. Random search techniques can be used to tune the hyperparameters of both regression and classification models.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>






<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">I hope this article was helpful. If you have any questions or suggestions, please write them in the comments. </p>



<p>The post <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6875</post-id>	</item>
		<item>
		<title>How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?</title>
		<link>https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/</link>
					<comments>https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Fri, 31 Dec 2021 17:37:00 +0000</pubDate>
				<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Healthcare]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Measuring Model Performance]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Confusion Matrix]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=846</guid>

					<description><![CDATA[<p>Have you ever received a spam email and wondered how your email provider was able to identify it as spam? Well, the answer is likely machine learning! One common type of machine learning problem is called classification. The goal is to predict the correct class labels for a given set of observations. For example, we ... <a title="How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?" class="read-more" href="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/" aria-label="Read more about How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?">Read more</a></p>
<p>The post <a href="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/">How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Have you ever received a spam email and wondered how your email provider was able to identify it as spam? Well, the answer is likely machine learning! One common type of machine learning problem is called classification. The goal is to predict the correct class labels for a given set of observations. For example, we could train a classifier to identify whether an email is spam or not or to classify images of animals into different species. But before we can use a classifier in a real-world setting, we need to evaluate its performance to understand how well it can correctly classify observations. There are several tools and techniques we can use to do this, including the confusion matrix, error metrics, and the ROC curve. In this article, we&#8217;ll dive into these evaluation methods and see how they can help us understand the capabilities of our classifier.</p>



<p class="wp-block-paragraph">This tutorial is divided into two parts: a conceptual introduction to evaluating classification performance and a hands-on example using Python and Scikit-Learn. In the first part, we will discuss common tools for evaluating the performance of a classifier: the confusion matrix, error metrics, and the ROC curve. The second part of the tutorial is hands-on. We use Python and Scikit-Learn to build a breast cancer detection model classifying tissue samples as benign or malignant. We then apply various techniques to evaluate the model&#8217;s performance.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="500" height="496" data-attachment-id="12651" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/target-machine-learning-error-prediction-midjourney-relataly/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png" data-orig-size="500,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="target-machine-learning-error-prediction-midjourney-relataly" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png" src="https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png" alt="" class="wp-image-12651" srcset="https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png 500w, https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/target-machine-learning-error-prediction-midjourney-relataly.png 140w" sizes="(max-width: 500px) 100vw, 500px" /><figcaption class="wp-element-caption">Models can be wrong, but we should know how often they are. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Why Bother Measuring Classification Performance?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Measuring classification performance lets us evaluate how well a model predicts the class of a given input. This matters because the ultimate goal of many machine learning models is to make accurate predictions in real-world applications.</p>



<p class="wp-block-paragraph">There are two main reasons to measure classification performance. First, it tells us whether a model makes accurate predictions at all. A model that cannot do so may not be useful for the task it was designed for. Second, it allows us to compare different models and choose the best one for a given task. This is especially important when working with large, complex datasets where multiple models may be applicable.</p>



<p class="wp-block-paragraph">To measure classification performance, we need a performance metric appropriate for the task at hand. Next, let&#8217;s understand what this means.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img decoding="async" data-attachment-id="7738" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/breast-cancer-classifier-confusion-matrix/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" data-orig-size="419,385" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="breast-cancer-classifier-confusion-matrix" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" src="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" alt="Confusion matrix for a two-class classifier, measuring model performance, classification error metrics, Scikit-learn, python, breast cancer dataset" class="wp-image-7738" width="334" height="307" srcset="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png 419w, https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png 300w" sizes="(max-width: 334px) 100vw, 334px" /><figcaption class="wp-element-caption">Example confusion matrix of a two-class classifier</figcaption></figure>
</div></div>
</div>






<h2 class="wp-block-heading">Techniques for Measuring Classification Performance</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This first part of the tutorial presents essential techniques for measuring the performance of classification models, including the confusion matrix, error metrics, and ROC curves. But why are there so many different techniques? Isn&#8217;t it enough to calculate the ratio of correct to incorrect classifications? </p>



<p class="wp-block-paragraph">The answer depends on the balance of the class labels and their importance. Let&#8217;s compare a simple two-class case with a more complex one. In the simplest case, the following applies:</p>



<ul class="wp-block-list">
<li>The class labels in the sample are perfectly balanced (for example, 50 positives and 50 negatives).</li>



<li>Both class labels are equally important, so it does not matter if the model is better at predicting class one or two.</li>
</ul>



<p class="wp-block-paragraph">In this case, we can measure the model performance as the share of correctly predicted labels among all predictions. It is as simple as that. However, most classification problems are more complex:</p>



<ul class="wp-block-list">
<li>The class labels are imbalanced, so the model encounters one class more often than the other.</li>



<li>One class is more important than the other. For example, consider a binary classification problem that aims to identify the few positive cases from a sample with many negative ones. Especially in disease detection, it is crucial that the model correctly identifies the few positive cases, even if some of the observations classified as positive are negative.</li>
</ul>



<p class="wp-block-paragraph">The confusion matrix and error metrics help us objectively evaluate models built for such more complex problems.</p>
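


<p class="wp-block-paragraph">To see why a single ratio can mislead on imbalanced data, consider the following minimal sketch. The labels are hypothetical and chosen only for illustration: 95 negatives, 5 positives, and a naive model that always predicts the negative class.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from sklearn.metrics import accuracy_score, recall_score

# Hypothetical, heavily imbalanced sample: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A naive model that always predicts the majority class
y_pred = [0] * 100

# Share of all samples that were classified correctly
print('Accuracy:', accuracy_score(y_true, y_pred))  # 0.95

# Share of the actual positives that the model found
print('Recall:', recall_score(y_true, y_pred))  # 0.0</pre></div>



<p class="wp-block-paragraph">The model looks strong when judged by the share of correct predictions alone, although it does not find a single positive case.</p>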
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:14px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading">The Confusion Matrix</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">A confusion matrix is an essential tool for evaluating a classification model. The confusion matrix is a table with four combinations of predicted and actual values for a problem where the output may include two classes (negative and positive). As a result, each prediction falls into one of the following four squares:</p>



<ul class="wp-block-list">
<li><strong>True Positives (TP):</strong> The model predicted a positive value, and the actual class is positive.</li>



<li><strong>False Positives (FP):</strong> The model predicted a positive value, but the actual class is negative.</li>



<li><strong>True Negatives (TN):</strong> The model predicted a negative value, and the actual class is negative.</li>



<li><strong>False Negatives (FN):</strong> The model predicted a negative value, but the actual class is positive.</li>
</ul>



<p class="wp-block-paragraph">We can assign each classification to a cell in the matrix. The diagonal contains the correctly classified cases whose actual class matches the predicted class. All other cells outside the diagonal represent possible errors. Using the confusion matrix, you can see at a glance how well the model works and what errors it makes. </p>



<p class="wp-block-paragraph">The confusion matrix is the basis for calculating various error metrics, which we will look at in more detail in the following section.</p>
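


<p class="wp-block-paragraph">As a minimal sketch, the four cells can be counted with scikit-learn&#8217;s <code>confusion_matrix</code> function. The labels below are hypothetical and serve only to illustrate the counting; they are not taken from the breast cancer dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# With labels=[0, 1], scikit-learn arranges the matrix as [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print('TP:', tp, 'FP:', fp, 'TN:', tn, 'FN:', fn)  # TP: 3 FP: 1 TN: 5 FN: 1</pre></div>



<p class="wp-block-paragraph">Note that scikit-learn places the true negatives in the top-left cell, so the layout may differ from confusion matrices that list the positive class first.</p>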
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="642" height="570" data-attachment-id="7705" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/image-24/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-24.png" data-orig-size="642,570" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-24" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-24.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-24.png" alt="Confusion matrix" class="wp-image-7705" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-24.png 642w, https://www.relataly.com/wp-content/uploads/2022/04/image-24.png 300w" sizes="(max-width: 642px) 100vw, 642px" /><figcaption class="wp-element-caption">Confusion matrix</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">Metrics for Measuring Classification Errors</h3>



<p class="wp-block-paragraph">To objectively measure the performance of a classifier, we can count the cases in the four cells of the confusion matrix and use these counts to calculate essential error metrics, including Accuracy, Precision, Recall, F1-Score, and Specificity.</p>






<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px">Precision</h4>



<p class="wp-block-paragraph">Precision is the share of positive predictions that are actually positive. Mathematically, it is the number of True Positives divided by the sum of False Positives and True Positives. </p>



<p class="wp-block-paragraph">In other words, it measures the ability of a classification model to identify the relevant data points without misclassifying too many irrelevant cases.&nbsp;</p>



<div class="wp-block-mathml-mathmlblock">\[Precision = {TP  \over FP + TP}\]<script id="wp-hooks-js" src="https://www.relataly.com/wp-includes/js/dist/hooks.min.js?ver=7496969728ca0f95732d"></script>
<script id="wp-i18n-js" src="https://www.relataly.com/wp-includes/js/dist/i18n.min.js?ver=781d11515ad3d91786ec"></script>
<script id="wp-i18n-js-after">
wp.i18n.setLocaleData( { 'text direction\u0004ltr': [ 'ltr' ] } );
//# sourceURL=wp-i18n-js-after
</script>
<script  async id="mathjax-js" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"></script>
</div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px">Accuracy</h4>



<p class="wp-block-paragraph">Accuracy tells us the share of all samples that were classified correctly. It is calculated as the number of correct classifications (True Positives plus True Negatives) divided by the total number of samples. </p>



<p class="wp-block-paragraph">The usefulness of Accuracy ends when the class labels are imbalanced so that one class is underrepresented. The Accuracy can be misleading as it can become nearly 100% even if the classification model has not identified any of the data points in the underrepresented class. If your data is imbalanced, you should combine accuracy with the Recall.</p>



<div class="wp-block-mathml-mathmlblock">\[Accuracy= {TP + TN \over TP + FN + FP + TN}\]</div>



<div class="wp-block-mathml-mathmlblock">\[= {\text{Correct Classifications} \over \text{Total Sample Size}}\]</div>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px">F1-Score</h4>



<p class="wp-block-paragraph">The&nbsp;F1-Score&nbsp;combines Precision and Recall into a single metric. It is calculated as the harmonic mean of Precision and Recall. </p>



<p class="wp-block-paragraph">Because it balances both, we can use the F1-Score to compare the performance of two classifiers that trade off Recall and Precision differently. </p>



<div class="wp-block-mathml-mathmlblock">\[F1Score = {2TP \over 2TP + FP + FN}\] </div>



<div class="wp-block-mathml-mathmlblock">\[= {2 * Precision * Recall\over Precision + Recall}\]</div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px;font-style:normal;font-weight:500">Recall (Sensitivity)</h4>



<p class="wp-block-paragraph">Recall, sometimes called &#8220;Sensitivity&#8221; or the True Positive Rate, measures the share of actual positives that the model classified correctly. We calculate it as the number of True Positives divided by the sum of False Negatives and True Positives.</p>



<p class="wp-block-paragraph">The Recall is particularly helpful if we deal with an imbalanced dataset, for example, when the goal is to identify a few critical cases among a large sample. </p>



<div class="wp-block-mathml-mathmlblock">\[Recall= {TP \over FN + TP}\]</div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading has-base-2-background-color has-background" style="font-size:24px">Specificity </h4>



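<p class="wp-block-paragraph">Specificity measures the share of actual negatives that the model classified correctly. We calculate it as the number of True Negatives divided by the sum of True Negatives and False Positives. It is also called the True Negative Rate and plays a vital role in the ROC Curve, because the False Positive Rate equals one minus the Specificity. We will look at the ROC Curve in more detail in the following section.</p>



<div class="wp-block-mathml-mathmlblock">\[Specificity= {TN \over TN + FP}\]</div>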
</div>
</div>
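


<p class="wp-block-paragraph">The following sketch computes the five metrics directly from the four cells of a confusion matrix, using the standard definitions. The counts are hypothetical and chosen only for illustration.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, tn, fn = 30, 10, 50, 10

precision = tp / (tp + fp)                  # correct share of positive predictions
recall = tp / (tp + fn)                     # share of actual positives found
specificity = tn / (tn + fp)                # share of actual negatives found
accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all samples classified correctly
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

metrics = [('Precision', precision), ('Recall', recall),
           ('Specificity', specificity), ('Accuracy', accuracy), ('F1-Score', f1)]
for name, value in metrics:
    print(f'{name}: {value:.3f}')</pre></div>



<p class="wp-block-paragraph">With these counts, Precision, Recall, and F1-Score all equal 0.750, Specificity is 0.833, and Accuracy is 0.800. In practice, the scikit-learn functions <code>precision_score</code>, <code>recall_score</code>, <code>f1_score</code>, and <code>accuracy_score</code> compute the same values from the true and predicted labels.</p>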



<p class="wp-block-paragraph">None of the five metrics is sufficient on its own to measure the model performance. We therefore use different metrics in combination. Note the following rules:</p>



<ul class="wp-block-list">
<li>If the classes in the dataset are balanced, measure performance using Accuracy.</li>



<li>If the dataset is imbalanced or one class is more important than the other, look at Recall and Precision. </li>



<li>For classification problems where you want to compare models that trade off Recall and Precision differently, use the F1-Score.</li>
</ul>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h3 class="wp-block-heading" id="h-decision-boundary">Decision Boundary</h3>



<p class="wp-block-paragraph">A classifier determines class labels by estimating the probability that a sample belongs to a particular class. Since the probabilities are continuous values between 0.0 and 1.0, we use a decision boundary (also called a decision threshold) to convert them to class labels. The default threshold for a binary classifier is 0.5. Samples with a predicted positive-class probability above 0.5 are assigned to the positive class, and samples below 0.5 to the negative class.</p>



<p class="wp-block-paragraph">In practice, we often encounter classification problems where the cost of an error varies between class labels. In such cases, we can alter the decision boundary to give one of the classes a higher priority. Consider the case of credit card fraud detection. Here, it is critical for service providers to reliably detect the few fraud cases among the many legitimate credit card transactions. We can lower the decision threshold to increase the probability that the model detects fraud (high True Positive Rate). The cost of detecting more fraud is a higher number of legitimate transactions that the model misclassifies as fraud. However, in this particular example, this is acceptable because the service provider can quickly resolve misunderstandings with the customer.</p>
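


<p class="wp-block-paragraph">The effect of moving the threshold can be sketched in a few lines. The probabilities and labels below are hypothetical (1 stands for fraud), and <code>counts_at</code> is an illustrative helper rather than part of scikit-learn.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels (1 = fraud) and predicted fraud probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90])

def counts_at(threshold):
    """Illustrative helper: return (TP, FP, FN, TN) for a given decision threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(tp), int(fp), int(fn), int(tn)

for threshold in (0.5, 0.25):
    tp, fp, fn, tn = counts_at(threshold)
    print(f'threshold={threshold}: TP={tp}, FP={fp}, FN={fn}, TN={tn}')</pre></div>



<p class="wp-block-paragraph">In this toy sample, lowering the threshold from 0.5 to 0.25 finds the one fraud case that was missed before (TP rises from 3 to 4) but also raises the number of false alarms from 1 to 3.</p>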


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="7860" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/image-29/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-29.png" data-orig-size="2350,1319" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-29" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-29.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-29-1024x575.png" alt="Comparison of different decision boundaries (0.5 vs 0.25 vs 0.9) and illustration of the effects on the classification error and confusion matrix, python tutorial" class="wp-image-7860" width="866" height="494"/><figcaption class="wp-element-caption">Comparison of different decision boundaries (0.5 vs. 0.25 vs. 0.9) and illustration of the effects on the classification error and confusion matrix</figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading">The ROC Curve</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">The ROC curve is another helpful tool to measure classification performance and is particularly useful for comparing different classification models&#8217; performance. ROC stands for &#8220;Receiver Operating Characteristic.&#8221; The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve emerges when we plot the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. </p>



<p class="wp-block-paragraph">The more the ROC curve tends to the upper left corner, the better the performance of the classification model. A perfect classifier would show a point in the upper left corner or coordinate (0,1), which is the ideal point for a diagnostic test. This is because a point at (0,1) indicates that the classifier has a 100% true positive rate and a 0% false positive rate. A curve near the diagonal indicates that the True Positive Rate and False Positive Rate are equal, which corresponds to the expected prediction result of a random classifier with no predictive power. If the ROC curve remains significantly below the diagonal, this indicates a classifier with inverse prediction power.</p>



<p class="wp-block-paragraph">The ROC for classification models is not necessarily a smooth curve and often runs as a jumpy line with several plateaus. Plateaus are ranges in which changes to the threshold do not change the classification results. Curves with plateaus can be a sign of tiny sample sizes, but they may also have other causes.</p>
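


<p class="wp-block-paragraph">The points of an ROC curve can be computed with scikit-learn&#8217;s <code>roc_curve</code> function. The sketch below uses a small set of hypothetical labels and scores, so the resulting line has only a handful of points and shows the step-like shape described above.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90])

# roc_curve evaluates the false and true positive rates at a series of thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Each (FPR, TPR) pair is one point of the ROC curve
for f, t in zip(fpr, tpr):
    print(f'FPR={f:.2f}  TPR={t:.2f}')</pre></div>



<p class="wp-block-paragraph">The first point is (0, 0), where the threshold is so high that nothing is classified as positive, and the last point is (1, 1), where everything is classified as positive. Plotting <code>tpr</code> against <code>fpr</code>, for example with matplotlib, gives the ROC curve.</p>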
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="7847" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/image-27-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-27.png" data-orig-size="915,1047" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-27" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-27.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-27-895x1024.png" alt="classification performance tutorial python machine learning roc curve based on confusion matrix" class="wp-image-7847" width="345" height="395" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-27.png 895w, https://www.relataly.com/wp-content/uploads/2022/04/image-27.png 262w, https://www.relataly.com/wp-content/uploads/2022/04/image-27.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-27.png 915w" sizes="(max-width: 345px) 100vw, 345px" /><figcaption class="wp-element-caption">Example of an ROC curve </figcaption></figure>
</div></div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="7846" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/image-26-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-26.png" data-orig-size="1033,1022" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-26" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-26.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-26-1024x1013.png" alt="Interpretation of the ROC Curve, classification performance tutorial python machine learning roc curve based on confusion matrix" class="wp-image-7846" width="391" height="387" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-26.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-26.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-26.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-26.png 1033w" sizes="(max-width: 391px) 100vw, 391px" /><figcaption class="wp-element-caption">Interpretation of the ROC curve</figcaption></figure>
</div></div>
</div>



<h2 class="wp-block-heading">Measuring Classification Performance in Python (Two-Class)</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we will show how to implement various techniques for evaluating classification models using a breast cancer dataset and a random forest classifier in Python with Scikit-Learn. Abnormal changes in the breast may be a sign of cancer and need to be investigated. However, changes are not necessarily malignant and, in many cases, are benign. We will train a machine learning classifier to make this distinction (benign/malignant) based on various characteristics of the tissue samples and explore how machine learning can be applied in the life sciences to support medical diagnostics. After training the model, we will use the Confusion Matrix, Error Metrics, and the ROC Curve to measure its performance.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_e50ee7-21"><a class="kb-button kt-button button kb-btn_74527c-1b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/020%20Measuring%20Classifier%20Performance%20with%20Confusion%20Matrix%2C%20Error%20Metrics%20and%20ROC%20Curve.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_1ba336-60 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>



<h3 class="wp-block-heading">About the Breast Cancer Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">The breast cancer dataset contains 569 samples, with 30 features derived from digitized images of tissue samples. The features in the dataset describe the characteristics of the cell nuclei present in the image, including size, texture, and symmetry. In addition, the dataset includes a binary target variable that indicates whether the sample is benign or malignant. 212 samples are malignant, and 357 are benign. </p>



<p class="wp-block-paragraph">You can find more information on the dataset on the <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)" target="_blank" rel="noreferrer noopener">UCI.edu webpage</a>. The breast cancer dataset is included in the scikit-learn package, so there is no need to download the data upfront.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image is-resized"><img decoding="async" src="https://www.kurzweilai.net/images/breast-cancer-images-enlarged.png" alt="benign tissue samples vs malignant tissue samples, machine learning classification, measuring model performance, python, Scikit-learn, random decision forest classifier" width="524" height="275"/><figcaption class="wp-element-caption">Exemplary images of benign and malignant samples. Source: <a href="https://www.kurzweilai.net/pigeons-diagnose-breast-cancer-on-x-rays-as-well-as-radiologists" target="_blank" rel="noreferrer noopener">kurzweilai </a></figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li>pandas</li>



<li>NumPy</li>



<li>math</li>



<li>matplotlib</li>



<li>scikit-learn</li>



<li>seaborn</li>
</ul>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">pip install &lt;package name&gt;
conda install &lt;package name&gt; (if you are using the Anaconda package manager)</pre></div>



<h3 class="wp-block-heading">Step #1 Loading the Data</h3>



<p class="wp-block-paragraph">We begin by loading the cancer dataset from scikit-learn. Then we display a list of the features and plot the balance of our classification target, the two tissue types. &#8220;1&#8221; is type &#8220;benign,&#8221; and 0 corresponds to type &#8220;malignant.&#8221;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
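from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; this alias keeps calls to it working
plot_roc_curve = RocCurveDisplay.from_estimator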
from sklearn.model_selection import cross_val_predict
from sklearn import datasets

df = datasets.load_breast_cancer(as_frame=True)

df_dia = df.data
df_dia['cancer_type'] = df.target

plt.figure(figsize=(16,2))
plt.title(f'labels')
fig = sns.countplot(y=&quot;cancer_type&quot;, data=df_dia)

df_dia.head()</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="934" height="170" data-attachment-id="7759" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/label-balance/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png" data-orig-size="934,170" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="label-balance" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png" src="https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png" alt="" class="wp-image-7759" srcset="https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png 934w, https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/label-balance.png 768w" sizes="(max-width: 934px) 100vw, 934px" /></figure>



<p class="wp-block-paragraph">The bar plot shows that the sample contains more benign observations than malignant ones (357 versus 212 in the scikit-learn dataset).</p>



<h3 class="wp-block-heading" id="h-step-2-data-preparation-and-model-training">Step #2 Data Preparation and Model Training</h3>



<p class="wp-block-paragraph">Next, we will prepare the data and use it to train a random decision forest classifier. A classifier&#8217;s performance depends on the specific data it is trained on. Therefore, it is crucial to evaluate the classifier on a separate, unseen test dataset: this reveals overfitting and shows whether the classifier generalizes well to new data. The code below therefore splits the data into train and test datasets.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Select a small number of features that we use as input to the classification model
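# (the two columns below are an illustrative choice; any feature columns of the breast cancer dataset work)
features = ['mean radius', 'mean texture']
df_base = df_dia[features + ['cancer_type']]

# Separate the labels from the input features
X = df_base[features] # Input features
y = df_base['cancer_type'] # Prediction label

# Split the data into train and test data sets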
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)</pre></div>



<p class="wp-block-paragraph">Now that we have prepared the data, it is time to train our classifier. We use a random forest algorithm from the Scikit-learn package. If you want to learn more about this topic, check out the <a href="https://www.relataly.com/category/machine-learning-algorithms/random-decision-forests/" target="_blank" rel="noreferrer noopener">relataly tutorials on random forests</a>. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create the Random Forest Classifier
dfrst = RandomForestClassifier(n_estimators=3, max_depth=4, min_samples_split=6, class_weight='balanced')
ranfor = dfrst.fit(X_train, y_train)
y_pred = ranfor.predict(X_test)</pre></div>



<p class="wp-block-paragraph">After running the code, you have a trained classifier.</p>



<h3 class="wp-block-heading">Step #3 Creating a Confusion Matrix</h3>



<p class="wp-block-paragraph">Next, we will create the confusion matrix and several standard error metrics. First, we create the matrix by running the code below. The confusion_matrix function returns only the tabular data without any visualization, so we plot the matrix as a heatmap using the heatmap function from the seaborn package.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create heatmap from the confusion matrix
def createConfMatrix(class_names, matrix):
    # Use the class names passed by the caller instead of overwriting them
    tick_marks = [0.5, 1.5]
    fig, ax = plt.subplots(figsize=(7, 6))
    sns.heatmap(pd.DataFrame(matrix), annot=True, cmap=&quot;Blues&quot;, fmt='g')
    ax.xaxis.set_label_position(&quot;top&quot;)
    plt.title('Confusion matrix')
    plt.ylabel('Actual label'); plt.xlabel('Predicted label')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
createConfMatrix(matrix=cnf_matrix, class_names=[0, 1])</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7738" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/breast-cancer-classifier-confusion-matrix/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" data-orig-size="419,385" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="breast-cancer-classifier-confusion-matrix" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" src="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png" alt="Confusion matrix for a two-class classifier, measuring model performance, classification error metrics, Scikit-learn, python, breast cancer dataset" class="wp-image-7738" width="455" height="418" srcset="https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png 419w, https://www.relataly.com/wp-content/uploads/2022/04/breast-cancer-classifier-confusion-matrix.png 300w" sizes="(max-width: 455px) 100vw, 455px" /><figcaption class="wp-element-caption">The confusion matrix shows the following: In 93 samples, the model correctly predicted a malignant label, and in 181 cases the model predicted that the tissue sample was benign. In 3 cases, the model failed to recognize a malignant sample, and in 8 cases the model raised a false alarm.</figcaption></figure>
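<p class="wp-block-paragraph">To make the layout of the matrix explicit, the following minimal sketch counts the four cells by hand in plain Python. It follows the convention of scikit-learn&#8217;s confusion_matrix, where rows are actual labels and columns are predicted labels. The function name confusion_counts and the toy labels are assumptions made for illustration; they are not part of the breast cancer example.</p>

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Count label pairs; rows are actual labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Hypothetical toy labels (1 = positive class, 0 = negative class)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred_toy = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

matrix = confusion_counts(y_true_toy, y_pred_toy)
# With labels (0, 1) the layout is [[TN, FP], [FN, TP]]
(tn, fp), (fn, tp) = matrix
print(matrix)
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```

<p class="wp-block-paragraph">Reading the cells this way helps when interpreting the heatmap: the diagonal holds the correct predictions, and the off-diagonal cells hold the two kinds of errors.</p>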



<p class="wp-block-paragraph">Next, we calculate the error metrics (accuracy, precision, recall, f1-score). You can do this by using the separate functions from the Scikit-learn package. Alternatively, you can also use the classification report, which contains all these error metrics.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Calculate Standard Error Metrics
print('accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))

# Classification Report (Alternative)
results_log = classification_report(y_test, y_pred, output_dict=True)
results_df_log = pd.DataFrame(results_log).transpose()
print(results_df_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">accuracy: 0.94 
precision: 0.97 
recall: 0.94
f1_score: 0.95</pre></div>
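<p class="wp-block-paragraph">The four metrics are simple ratios of the confusion matrix cells. The sketch below computes them from raw counts, assuming the usual definitions with class 1 as the positive class. The function name metrics_from_counts and the counts are hypothetical toy values, so the printed numbers will not match the classifier output above.</p>

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# Hypothetical toy counts: 5 true positives, 1 false positive, 1 false negative, 3 true negatives
toy_metrics = metrics_from_counts(tp=5, fp=1, fn=1, tn=3)
for name, value in toy_metrics.items():
    print('{}: {:.2f}'.format(name, value))
```

<p class="wp-block-paragraph">Because precision and recall use different denominators, a model can score well on one and poorly on the other, which is why the F1-score combines both.</p>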



<h3 class="wp-block-heading">Step #4 ROC and AUC</h3>



<p class="wp-block-paragraph">Finally, let&#8217;s plot the ROC curve and calculate the Area under the Curve (AUC). </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Compute ROC curve
fig, ax = plt.subplots(figsize=(10, 6))
RocCurveDisplay.from_estimator(ranfor, X_test, y_test, ax=ax)
plt.title('ROC Curve for the Breast Cancer Classifier')
plt.show()</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="609" height="387" data-attachment-id="7751" data-permalink="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/output-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output-7.png" data-orig-size="609,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-7" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output-7.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output-7.png" alt="ROC Curve for the Breast Cancer Classifier" class="wp-image-7751" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output-7.png 609w, https://www.relataly.com/wp-content/uploads/2022/04/output-7.png 300w" sizes="(max-width: 609px) 100vw, 609px" /></figure>
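<p class="wp-block-paragraph">A ROC curve is traced by sweeping the decision threshold over the predicted scores and recording the false positive rate and true positive rate at each step. The following sketch does this by hand in plain Python. The function name roc_points and the toy scores are assumptions for illustration; scikit-learn&#8217;s own implementation is more elaborate.</p>

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for every threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # a threshold above all scores predicts no positives
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical toy data: true labels and predicted probabilities for class 1
y_true_toy = [0, 0, 1, 1, 0, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

points = roc_points(y_true_toy, scores_toy)
for fpr, tpr in points:
    print('FPR: {:.2f}  TPR: {:.2f}'.format(fpr, tpr))
```

<p class="wp-block-paragraph">Lowering the threshold can only add predicted positives, so the curve always runs from the bottom-left corner (0, 0) to the top-right corner (1, 1).</p>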



<p class="wp-block-paragraph">The ROC curve of our classifier indicates that the model already performs quite well. To quantify this, we calculate the AUC by running the code below.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Calculate probability scores 
y_scores = cross_val_predict(ranfor, X_test, y_test, cv=3, method='predict_proba')
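# cross_val_predict returns one probability column per class: column 0 is class 0 (malignant), column 1 is class 1 (benign)
# Convert the probabilities into hard 0/1 predictions using a threshold of 0.5
y_scores_binary = [1 if x[0] &lt; 0.5 else 0 for x in y_scores]
# Now, we can calculate the area under the ROC curve for these hard predictions
# (passing the class-1 probabilities y_scores[:, 1] instead would evaluate all thresholds rather than a single one)
auc = roc_auc_score(y_test, y_scores_binary, average=&quot;macro&quot;)
auc # Be aware that the result changes between runs because the random forest is trained without a fixed random_state</pre></div>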



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">0.9035191562634525</pre></div>
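<p class="wp-block-paragraph">The AUC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below implements this pairwise definition in plain Python and compares probability scores with thresholded 0/1 predictions. The function name pairwise_auc and the toy data are assumptions for illustration only.</p>

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of positive-negative pairs that are ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: true labels and predicted probabilities for class 1
y_true_toy = [0, 0, 1, 1, 0, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

# Hard predictions obtained with a single threshold of 0.5
hard_toy = [1 if s >= 0.5 else 0 for s in scores_toy]

auc_scores = pairwise_auc(y_true_toy, scores_toy)
auc_hard = pairwise_auc(y_true_toy, hard_toy)
print('AUC from probabilities: {:.3f}'.format(auc_scores))
print('AUC from hard labels:   {:.3f}'.format(auc_hard))
```

<p class="wp-block-paragraph">Thresholding discards the ranking information within each predicted class, so the two values generally differ. When probability scores are available, they usually give the more informative AUC.</p>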



<h2 class="wp-block-heading">Summary</h2>



<p class="wp-block-paragraph">This tutorial has shown how to evaluate the performance of a two-label classification model. We started by introducing the confusion matrix and how it can be used to evaluate a classifier. We then discussed various error metrics, such as accuracy, precision, and recall, and how they help us understand the classifier&#8217;s performance. Next, we discussed the ROC curve, which visualizes the trade-off between the true positive rate and the false positive rate across different classification thresholds, and how it can be used to compare different classifiers. In the second part, we applied these tools and techniques to the practical example of a breast cancer classifier. We used the confusion matrix and error metrics to evaluate the classifier, and the ROC curve and AUC to assess its performance across thresholds. </p>



<p class="wp-block-paragraph">Overall, this tutorial has provided an overview of the tools and techniques that are commonly used to evaluate the performance of a classification model. By understanding and applying these tools and techniques, we can gain a better understanding of how well a classifier is performing and make informed decisions about whether it is ready for production.</p>



<p class="wp-block-paragraph">I hope this article helped you understand how to measure the performance of classification models. If you have any questions or feedback, please let me know. And if you are looking for error metrics to measure regression performance, check out <a href="https://www.relataly.com/category/data-science/measuring-model-performance/" target="_blank" rel="noreferrer noopener">this tutorial on regression errors</a>.</p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>



<li><a href="https://www.kurzweilai.net/pigeons-diagnose-breast-cancer-on-x-rays-as-well-as-radiologists" target="_blank" rel="noreferrer noopener">https://www.kurzweilai.net/pigeons-diagnose-breast-cancer-on-x-rays-as-well-as-radiologists</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/">How to Measure the Performance of a Machine Learning Classifier with Python and Scikit-Learn?</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/measuring-classification-performance-in-machine-learning-with-python-and-scikit-learn/846/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">846</post-id>	</item>
		<item>
		<title>Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</title>
		<link>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/</link>
					<comments>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Wed, 16 Jun 2021 09:33:00 +0000</pubDate>
				<category><![CDATA[Anomaly Detection]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Fraud Detection]]></category>
		<category><![CDATA[K-Nearest Neighbors (KNN)]]></category>
		<category><![CDATA[Local Outlier Factor]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Isolation Forest]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in Business]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Unsupervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=4233</guid>

					<description><![CDATA[<p>Credit card fraud has become one of the most common use cases for anomaly detection systems. The number of fraud attempts has risen sharply, resulting in billions of dollars in losses. Early detection of fraud attempts with machine learning is therefore becoming increasingly important. In this article, we take on the fight against international credit ... <a title="Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud" class="read-more" href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/" aria-label="Read more about Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud">Read more</a></p>
<p>The post <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/">Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Credit card fraud has become one of the most common use cases for anomaly detection systems. The number of fraud attempts has risen sharply, resulting in billions of dollars in losses. Early detection of fraud attempts with <a href="https://www.relataly.com/category/machine-learning/" target="_blank" rel="noreferrer noopener">machine learning</a> is therefore becoming increasingly important. In this article, we take on the fight against international credit card fraud and develop a multivariate anomaly detection model in Python that spots fraudulent payment transactions. The model will use the Isolation Forest algorithm, one of the most effective techniques for detecting outliers. Isolation Forests are ensemble models with several hyperparameters that affect model performance. Rather than tuning them manually, we will use grid search for hyperparameter tuning. To assess the performance of our model, we will also compare it with other models.</p>



<p class="wp-block-paragraph">The remainder of this article is structured as follows: We start with a brief introduction to anomaly detection and look at the Isolation Forest algorithm. Equipped with these theoretical foundations, we then turn to the practical part, in which we train and validate an isolation forest that detects credit card fraud. We use an unsupervised learning approach, where the model learns to distinguish regular from suspicious card transactions. We will train our model on a public dataset from Kaggle that contains credit card transactions. Finally, we will compare the performance of our model against two nearest neighbor algorithms (LOF and <a href="https://www.relataly.com/category/machine-learning-algorithms/k-nearest-neighbors-knn/" target="_blank" rel="noreferrer noopener">KNN</a>).</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="12702" data-permalink="https://www.relataly.com/financial-crime-serious-business-python-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="financial crime serious business python relataly midjourney-min" data-image-description="&lt;p&gt;Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney&lt;/p&gt;
" data-image-caption="&lt;p&gt;Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min-512x512.png" alt="Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney" class="wp-image-12702" srcset="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Financial crime is a real problem, although cats are relatively seldom involved. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-multivariate-anomaly-detection">Multivariate Anomaly Detection</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before we take a closer look at the use case and our unsupervised approach, let&#8217;s briefly discuss anomaly detection. Anomaly detection deals with finding data points that deviate markedly from the bulk of the data, for example, from the mean or median of a distribution. In machine learning, the term is often used synonymously with outlier detection. </p>



<p class="wp-block-paragraph">Some anomaly detection models work with a single feature (univariate data), for example, in monitoring electronic signals. However, most anomaly detection models use multivariate data, which means they have two (bivariate) or more (multivariate) features. They find a wide range of applications, including the following:</p>



<ul class="wp-block-list">
<li>Predictive Maintenance and Detection of Malfunctions and Decay</li>



<li>Detection of Retail Bank Credit Card Fraud</li>



<li>Detection of Pricing Errors</li>



<li>Cyber Security, for example, Network Intrusion Detection</li>



<li>Detecting Fraudulent Market Behavior in Investment Banking</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="410" data-attachment-id="9019" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" data-orig-size="1184,474" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="identifying anomalous datapoints in two dimensions, credit card fraud detection" data-image-description="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-image-caption="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output4-1024x410.png" alt="identifying anomalous data points in two dimensions, credit card fraud detection" class="wp-image-9019" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1184w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Identifying anomalous data points in two dimensions in credit card fraud detection</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-unsupervised-algorithms-for-anomaly-detection">Unsupervised Algorithms for Anomaly Detection</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Outlier detection can be framed as a classification problem: each data point is either an outlier or an inlier. We can approach it with both supervised and unsupervised machine learning techniques. Explaining the multitude of outlier detection techniques would go beyond the scope of this article, but the following chart provides an overview of standard unsupervised algorithms.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="7649" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-21-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png" data-orig-size="1983,994" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-21" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-21-1024x513.png" alt="Unsupervised Algorithms for Anomaly Detection. Outlier detection using random isolation forests" class="wp-image-7649" width="759" height="380" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1983w" sizes="(max-width: 759px) 100vw, 759px" /><figcaption class="wp-element-caption">Unsupervised Algorithms for Anomaly Detection. </figcaption></figure>



<p class="wp-block-paragraph">A prerequisite for supervised learning is that we know which data points are outliers and which belong to regular data. In credit card fraud detection, this information is available because banks can validate with their customers whether a suspicious transaction is fraudulent. In many other outlier detection cases, it remains unclear which outliers are genuine anomalies and which are just noise or other uninteresting events in the data. </p>



<p class="wp-block-paragraph">Whether we know which data points in our dataset are outliers affects the choice of algorithms we can use to solve the outlier detection problem. If class labels are unavailable, unsupervised learning techniques are the natural choice. If class labels are available, we can use both unsupervised and supervised learning algorithms.</p>



<p class="wp-block-paragraph">In the following, we will focus on Isolation Forests.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-the-isolation-forest-iforest-algorithm">The Isolation Forest (&#8220;iForest&#8221;) Algorithm</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Isolation forests (sometimes called iForests) are among the most powerful techniques for identifying anomalies in a dataset. They belong to the group of so-called ensemble models. The predictions of ensemble models do not rely on a single model. Instead, they combine the results of multiple independent models (decision trees). Nevertheless, isolation forests should not be confused with traditional <a href="https://www.relataly.com/category/machine-learning-algorithms/decision-forests/" target="_blank" rel="noreferrer noopener">random decision forests</a>. While random forests predict given class labels (supervised learning), isolation forests learn to distinguish outliers from inliers (regular data) in an unsupervised learning process.</p>



<p class="wp-block-paragraph">An Isolation Forest contains multiple independent isolation trees. To build an Isolation Tree, the algorithm recursively divides the training data at random points until the data points are isolated from each other. The number of partitions required to isolate a point tells us whether it is an anomalous or a regular point. The underlying assumption is that random splits isolate an anomalous data point much sooner than a nominal one.</p>






<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4616" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-40-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png" data-orig-size="1106,461" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-40" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-40-1024x427.png" alt="Isolation Tree and Isolation Forest (Tree Ensemble), Outlier detection using random isolation forests" class="wp-image-4616" width="754" height="314" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 1106w" sizes="(max-width: 754px) 100vw, 754px" /><figcaption class="wp-element-caption">Isolation Tree and Isolation Forest (Tree Ensemble)</figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-how-the-isolation-forest-algorithm-works">How the Isolation Forest Algorithm Works</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The illustration shows an example of training an Isolation Tree on univariate data, i.e., data with only one feature. The algorithm has already split the data at five random points between the minimum and maximum values of a random sample. The isolated points are colored in purple. In this example, it took two partitions to isolate the point on the far left. The other purple points were separated after four and five splits.</p>



<p class="wp-block-paragraph">The partitioning process ends when the algorithm has isolated all points from each other or when all remaining points have equal values. The algorithm has calculated and assigned an outlier score to each point at the end of the process, based on how many splits it took to isolate it.</p>
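<p class="wp-block-paragraph">The partitioning idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the scikit-learn implementation: the function names <code>isolation_depth</code> and <code>average_depth</code> and the toy data are assumptions made for this example, and the sketch skips the subsampling and depth limit that real isolation forests use.</p>

```python
import random


def isolation_depth(values, target, rng, depth=0):
    """Count the random splits needed to isolate `target` within `values`."""
    lo, hi = min(values), max(values)
    # Stop when the target is alone or all remaining values are equal
    if len(values) == 1 or lo == hi:
        return depth
    # Pick a random split point between the current minimum and maximum
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the target
    if target >= split:
        side = [v for v in values if v >= split]
    else:
        side = [v for v in values if split > v]
    return isolation_depth(side, target, rng, depth + 1)


def average_depth(values, target, n_trees=200, seed=42):
    """Average the isolation depth over many random trees (hypothetical helper)."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees


# Toy univariate sample: 25.0 lies far away from the other values
data = [9.8, 10.1, 10.3, 10.4, 10.6, 10.9, 11.0, 25.0]
outlier_depth = average_depth(data, 25.0)
inlier_depth = average_depth(data, 10.4)
print(f"average splits for the outlier: {outlier_depth:.2f}")
print(f"average splits for an inlier:   {inlier_depth:.2f}")
```

<p class="wp-block-paragraph">On average, the isolated value needs fewer splits than a value in the middle of the cluster, which is the signal the outlier score is built on.</p>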



<p class="wp-block-paragraph">When using an isolation forest model on unseen data to detect outliers, the algorithm will assign an anomaly score to the new data points. These scores will be calculated based on the ensemble trees we built during model training.</p>
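<p class="wp-block-paragraph">To make this more concrete, here is a minimal sketch of scoring unseen points with scikit-learn&#8217;s <code>IsolationForest</code>. The synthetic training data, the two test points, and the parameter values are assumptions chosen for illustration; they are not the settings used later in this tutorial.</p>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "regular" data: 500 points scattered around the origin (assumption)
rng = np.random.RandomState(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Build the ensemble of isolation trees on the training data
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X_train)

# Two unseen points: one close to the training data, one far away
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])

scores = model.decision_function(X_new)  # lower values are more anomalous
labels = model.predict(X_new)            # 1 = inlier, -1 = outlier
print(scores)
print(labels)
```

<p class="wp-block-paragraph">The distant point should receive the lower score and the label -1, because the trees isolate it after only a few splits.</p>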



<p class="wp-block-paragraph">So how does this process work when our dataset involves multiple features? For multivariate anomaly detection, partitioning the data remains almost the same. The main difference is that, before each split, the algorithm first selects a random feature and then partitions the data along that feature. Consequently, multivariate isolation forests split the data along multiple dimensions (features).</p>
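<p class="wp-block-paragraph">A single multivariate split could look roughly like the following sketch. The function name <code>random_partition</code> and the random toy data are hypothetical; the point is only to show the two random choices (feature first, then threshold).</p>

```python
import numpy as np


def random_partition(X, rng):
    """Split X once: pick a random feature, then a random threshold on it."""
    feature = rng.integers(X.shape[1])
    lo, hi = X[:, feature].min(), X[:, feature].max()
    threshold = rng.uniform(lo, hi)
    # Rows at or above the threshold go right, all others go left
    mask = X[:, feature] >= threshold
    return feature, threshold, X[~mask], X[mask]


rng = np.random.default_rng(0)
demo_X = rng.normal(size=(10, 3))  # 10 points with 3 features (toy data)
feature, threshold, left, right = random_partition(demo_X, rng)
print(f"split on feature {feature} at {threshold:.3f}: "
      f"{len(left)} points left, {len(right)} points right")
```

<p class="wp-block-paragraph">Repeating this step recursively on each side yields one isolation tree.</p>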
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4615" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-39-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" data-orig-size="685,430" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-39" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" alt="Exemplary partitioning process of an isolation tree (5 Steps); Outlier detection using random isolation forests;  Outlier detection using isolation forests" class="wp-image-4615" width="375" height="235" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 685w, https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 80w" sizes="(max-width: 375px) 100vw, 375px" /><figcaption class="wp-element-caption">Exemplary partitioning process of an isolation tree (5 Steps)</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading" id="h-credit-card-fraud-detection-using-isolation-forests">Credit Card Fraud Detection using Isolation Forests</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Monitoring transactions has become a crucial task for financial institutions. In 2019 alone, more than 271,000 cases of credit card theft were reported in the U.S., causing billions of dollars in losses and making credit card fraud one of the most common types of identity theft. The vast majority of fraud cases are attributable to organized crime, which often specializes in this particular crime.</p>



<p class="wp-block-paragraph">Anything that deviates from the customer&#8217;s normal payment behavior can make a transaction suspicious, including an unusual location, time, or country in which the customer conducted the transaction. Credit card providers use anomaly detection systems to monitor their customers&#8217; transactions and look for potential fraud attempts. As soon as they detect a fraud attempt, they can halt the transaction and inform their customer. In the following, we train an Isolation Forest for credit card fraud detection using Python.</p>



<p class="wp-block-paragraph">Now that we have established the context for our machine learning problem, we can begin implementing an anomaly detection model in Python.</p>



<p class="wp-block-paragraph">The code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_60179c-c7"><a class="kb-button kt-button button kb-btn_ef16c1-1b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/04%20Anomaly%20Detection/026%20Credit%20Card%20Fraud%20Detection%20with%20Isolation%20Forests.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_adf11b-23 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="512" height="496" data-attachment-id="12451" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" data-orig-size="512,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" src="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" alt="" class="wp-image-12451" srcset="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png 512w, https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png 300w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Each application of a credit card creates a new data point to review. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> and <a data-type="URL" data-id="https://seaborn.pydata.org/" href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn</a> for visualization. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-dataset-credit-card-transactions">Dataset: Credit Card Transactions</h3>



<p class="wp-block-paragraph">In the following, we will work with a public dataset containing anonymized credit card transactions made by European cardholders in September 2013. You can download the dataset from <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud" target="_blank" rel="noreferrer noopener">Kaggle.com.</a> </p>



<p class="wp-block-paragraph">The dataset contains 28 features (V1-V28) obtained from the source data using Principal Component Analysis (PCA). In addition, the data includes the time of each transaction (the Time column, measured in seconds elapsed since the first transaction in the dataset) and the transaction amount. </p>



<p class="wp-block-paragraph">Transactions are labeled fraudulent or genuine, with 492 fraudulent cases out of 284,807 transactions. The positive class (frauds) accounts for only 0.172% of all credit card transactions, so the classes are highly unbalanced.</p>
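<p class="wp-block-paragraph">The imbalance is easy to verify with a quick calculation. The snippet below only uses the two counts stated above; the variable names are chosen for this example.</p>

```python
# Counts taken from the dataset description above
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions
print(f"Share of fraudulent transactions: {fraud_rate:.3%}")
print(f"Roughly one fraud per {round(1 / fraud_rate)} transactions")
```

<p class="wp-block-paragraph">A rate of this size is also a reasonable starting point when an algorithm asks for the expected proportion of outliers, such as the <code>contamination</code> parameter of scikit-learn&#8217;s <code>IsolationForest</code>.</p>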






<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p class="wp-block-paragraph">In the following, we will go through the steps of training an anomaly detection model for credit card fraud. We will carry out several activities:</p>



<ol class="wp-block-list">
<li>Loading and preprocessing the data: this involves cleaning, transforming, and preparing the data for analysis, in order to make it suitable for use with the isolation forest algorithm.</li>



<li>Feature engineering: this involves extracting and selecting relevant features from the data, such as transaction amounts, merchant categories, and time of day, in order to create a set of inputs for the anomaly detection algorithm.</li>



<li>Model training: we will train several machine learning models using different algorithms (including the isolation forest) on the preprocessed and engineered data. The models will learn the normal patterns and behaviors in credit card transactions. This activity includes hyperparameter tuning.</li>



<li>Model evaluation and testing: this involves evaluating the performance of the trained model on a test dataset in order to assess its accuracy, precision, recall, and other metrics and to identify any potential issues or improvements. As part of this activity, we compare the performance of the isolation forest to other models.</li>
</ol>



<p class="wp-block-paragraph">We begin by setting up imports and loading the data into our Python project. Then we&#8217;ll quickly verify that the dataset looks as expected. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from datetime import date, timedelta, datetime
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor, KNeighborsClassifier
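from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay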

# The Data can be downloaded from Kaggle.com: https://www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv
path = 'data/credit-card-transactions/'
df = pd.read_csv(f'{path}creditcard.csv')
# Keep a copy under the name df_base, which the later steps refer to
df_base = df.copy()
df</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		Time	V1			V2			V3			V4			V5			V6			V7			V8			V9			...	V21	V22	V23	V24	V25	V26	V27	V28	Amount	Class
0		0.0		-1.359807	-0.072781	2.536347	1.378155	-0.338321	0.462388	0.239599	0.098698	0.363787	...	-0.018307	0.277838	-0.110474	0.066928	0.128539	-0.189115	0.133558	-0.021053	149.62	0
1		0.0		1.191857	0.266151	0.166480	0.448154	0.060018	-0.082361	-0.078803	0.085102	-0.255425	...	-0.225775	-0.638672	0.101288	-0.339846	0.167170	0.125895	-0.008983	0.014724	2.69	0
2		1.0		-1.358354	-1.340163	1.773209	0.379780	-0.503198	1.800499	0.791461	0.247676	-1.514654	...	0.247998	0.771679	0.909412	-0.689281	-0.327642	-0.139097	-0.055353	-0.059752	378.66	0
3		1.0		-0.966272	-0.185226	1.792993	-0.863291	-0.010309	1.247203	0.237609	0.377436	-1.387024	...	-0.108300	0.005274	-0.190321	-1.175575	0.647376	-0.221929	0.062723	0.061458	123.50	0
4		2.0		-1.158233	0.877737	1.548718	0.403034	-0.407193	0.095921	0.592941	-0.270533	0.817739	...	-0.009431	0.798278	-0.137458	0.141267	-0.206010	0.502292	0.219422	0.215153	69.99	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
284802	172786.0	-11.881118	10.071785	-9.834783	-2.066656	-5.364473	-2.606837	-4.918215	7.305334	1.914428	...	0.213454	0.111864	1.014480	-0.509348	1.436807	0.250034	0.943651	0.823731	0.77	0
284803	172787.0	-0.732789	-0.055080	2.035030	-0.738589	0.868229	1.058415	0.024330	0.294869	0.584800	...	0.214205	0.924384	0.012463	-1.016226	-0.606624	-0.395255	0.068472	-0.053527	24.79	0
284804	172788.0	1.919565	-0.301254	-3.249640	-0.557828	2.630515	3.031260	-0.296827	0.708417	0.432454	...	0.232045	0.578229	-0.037501	0.640134	0.265745	-0.087371	0.004455	-0.026561	67.88	0
284805	172788.0	-0.240440	0.530483	0.702510	0.689799	-0.377961	0.623708	-0.686180	0.679145	0.392087	...	0.265245	0.800049	-0.163298	0.123205	-0.569159	0.546668	0.108821	0.104533	10.00	0
284806	172792.0	-0.533413	-0.189733	0.703337	-0.506271	-0.012546	-0.649617	1.577006	-0.414650	0.486180	...	0.261057	0.643078	0.376777	0.008797	-0.473649	-0.818267	-0.002415	0.013649	217.00	0</pre></div>



<p class="wp-block-paragraph">Everything should look good so that we can continue.</p>
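<p class="wp-block-paragraph">If you prefer an explicit check over eyeballing the table, a small helper along the following lines can confirm that the expected columns are present and complete. The helper <code>check_transactions</code> is a hypothetical name, and the tiny random DataFrame only stands in for the real CSV so that the snippet runs on its own.</p>

```python
import numpy as np
import pandas as pd

# Column layout as shown in the table above: Time, V1..V28, Amount, Class
EXPECTED_COLUMNS = ['Time'] + [f'V{i}' for i in range(1, 29)] + ['Amount', 'Class']


def check_transactions(frame):
    """Raise if columns are missing or incomplete; return the class counts."""
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    if frame[EXPECTED_COLUMNS].isna().any().any():
        raise ValueError('The data contains missing values')
    if not set(frame['Class'].unique()).issubset({0, 1}):
        raise ValueError('Class should only contain 0 and 1')
    return frame['Class'].value_counts().to_dict()


# Stand-in data with the same column layout (assumption, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(5, len(EXPECTED_COLUMNS))), columns=EXPECTED_COLUMNS)
demo['Class'] = [0, 0, 0, 0, 1]

summary = check_transactions(demo)
print(summary)
```

<p class="wp-block-paragraph">With the real data, you would call the helper on the loaded DataFrame instead of the stand-in.</p>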



<h3 class="wp-block-heading" id="h-step-2-data-exploration">Step #2: Data Exploration</h3>



<p class="wp-block-paragraph">The purpose of data exploration in anomaly detection is to gain a better understanding of the data and the underlying patterns and trends that it contains. This can help to identify potential anomalies or outliers in the data and to determine the appropriate approaches and algorithms for detecting them.</p>



<p class="wp-block-paragraph">In the following, we will create histograms that visualize the distribution of the different features. </p>



<h4 class="wp-block-heading" id="h-2-1-features">2.1 Features</h4>



<p class="wp-block-paragraph">First, we will create a series of frequency histograms for our dataset&#8217;s features (V1 &#8211; V28). We will look at Class, Time, and Amount separately later, so we drop them for now. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create histograms on all features
# Pass the columns by keyword; the positional axis argument was removed in pandas 2.0
df_hist = df_base.drop(columns=['Time', 'Amount', 'Class'])
df_hist.hist(figsize=(20,20), bins = 50, color = &quot;c&quot;, edgecolor='black')
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9013" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output1-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png" data-orig-size="1170,1131" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output1-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output1-1-1024x990.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Feature frequency distributions on credit card data" class="wp-image-9013" width="1011" height="976" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 1170w" sizes="(max-width: 1011px) 100vw, 1011px" /></figure>



<p class="wp-block-paragraph">Next, we will look at the correlation between the 28 features. We expect the features to be uncorrelated due to the use of PCA. Let&#8217;s verify that by creating a heatmap on their correlation values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># feature correlation
f_cor = df_hist.corr()
sns.heatmap(f_cor)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9014" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output2-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" data-orig-size="782,251" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output2-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" alt="" class="wp-image-9014" width="796" height="255" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 782w, https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 768w" sizes="(max-width: 796px) 100vw, 796px" /></figure>



<p class="wp-block-paragraph">As we expected, our features are uncorrelated. </p>
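<p class="wp-block-paragraph">Why does PCA lead to uncorrelated features? The following sketch illustrates the effect on synthetic data. The mixing matrix and sample size are arbitrary assumptions; the credit card data itself is not needed to see the principle.</p>

```python
import numpy as np
from sklearn.decomposition import PCA

# Create three correlated features by mixing independent noise (toy data)
rng = np.random.default_rng(42)
base = rng.normal(size=(1000, 3))
mixing = np.array([[1.0, 0.8, 0.3],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = base @ mixing

# Project the data onto its principal components
components = PCA(n_components=3).fit_transform(X)

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(components, rowvar=False)

# Largest absolute correlation between two different components
max_offdiag = np.abs(corr_after - np.eye(3)).max()
print(np.round(corr_before, 2))
print(np.round(corr_after, 2))
```

<p class="wp-block-paragraph">The original features are clearly correlated, while the correlation matrix of the principal components is the identity matrix up to floating-point error.</p>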



<h4 class="wp-block-heading" id="h-2-2-class-labels">2.2 Class Labels</h4>



<p class="wp-block-paragraph">Next, let&#8217;s print an overview of the class labels to understand better how balanced the two classes are.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot the balance of class labels
fig1, ax1 = plt.subplots(figsize=(14, 7))
plt.pie(df[['Class']].value_counts(), explode=[0,0.1], labels=[0,1], autopct='%1.2f%%', shadow=True, startangle=45)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9015" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" data-orig-size="394,394" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" alt="balance of class labels in the credit card fraud dataset" class="wp-image-9015" width="494" height="494" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output6.png 394w, https://www.relataly.com/wp-content/uploads/2022/07/output6.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output6.png 150w" sizes="(max-width: 494px) 100vw, 494px" /></figure>



<p class="wp-block-paragraph">We see that the dataset is highly unbalanced. While this would be a problem for traditional classification techniques, it is a natural use case for outlier detection algorithms like the Isolation Forest.</p>
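<p class="wp-block-paragraph">One reason the imbalance is troublesome is that accuracy becomes misleading. As a rough illustration, consider a hypothetical &#8220;model&#8221; that labels every transaction as genuine. The label arrays below are built from the class counts stated earlier, not from the actual file.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts from the dataset description
n_transactions, n_frauds = 284_807, 492

# True labels: 492 frauds (1), everything else genuine (0)
y_true = np.zeros(n_transactions, dtype=int)
y_true[:n_frauds] = 1

# A trivial baseline that never predicts fraud
y_pred = np.zeros(n_transactions, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
print(f"accuracy: {accuracy:.4f}")
print(f"recall on the fraud class: {recall:.4f}")
```

<p class="wp-block-paragraph">The baseline is right in more than 99.8% of the cases and still finds no fraud at all, which is why we will also look at precision and recall when we evaluate the models.</p>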



<h4 class="wp-block-heading" id="h-2-3-time-and-amount">2.3 Time and Amount</h4>



<p class="wp-block-paragraph">Finally, we will create some plots to gain insights into time and amount. Let&#8217;s first have a look at the time variable. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot distribution of the Time variable, which contains transaction data for two days
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(14, 7))
sns.histplot(data=df_base[df_base['Class'] == 0], x='Time', kde=True, ax=ax[0])
sns.histplot(data=df_base[df_base['Class'] == 1], x='Time', kde=True, ax=ax[1])
plt.show()</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9016" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" data-orig-size="842,423" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: histogram of credit card transactions" class="wp-image-9016" width="933" height="469" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output3.png 842w, https://www.relataly.com/wp-content/uploads/2022/07/output3.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output3.png 768w" sizes="(max-width: 933px) 100vw, 933px" /></figure>



<p class="wp-block-paragraph">The dataset covers two days, and the distribution plots reflect this well. We can see that most transactions happen during the day, which is plausible.</p>



<p class="wp-block-paragraph">Next, let&#8217;s examine the correlation between transaction size and fraud cases. To do this, we create a scatterplot that distinguishes between the two classes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot amount (x-axis) against time (y-axis), split by class
fig, ax = plt.subplots(figsize=(20, 7))
ax.set(xlim=(0, 1500))
# Plot only every 15th transaction of each class to keep the chart readable
sns.scatterplot(data=df_base[df_base['Class']==0][::15], x='Amount', y='Time', hue=&quot;Class&quot;, palette=[&quot;#BECEE9&quot;], alpha=.5, ax=ax)
sns.scatterplot(data=df_base[df_base['Class']==1][::15], x='Amount', y='Time', hue=&quot;Class&quot;, palette=[&quot;#EF1B1B&quot;], zorder=100, ax=ax)
plt.legend(['no fraud', 'fraud'], loc='lower right')
fig.suptitle('Transaction Amount over Time split by Class')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9019" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" data-orig-size="1184,474" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="identifying anomalous datapoints in two dimensions, credit card fraud detection" data-image-description="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-image-caption="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output4-1024x410.png" alt="Transaction Amount over Time split by Class; Credit card fraud detection with isolation forests" class="wp-image-9019" width="1053" height="422" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1184w" sizes="(max-width: 1053px) 100vw, 1053px" /></figure>



<p class="wp-block-paragraph">The scatterplot suggests that fraudulent transactions tend to involve relatively low amounts. In other words, there appears to be some inverse relationship between the fraud class and the transaction amount.</p>



<h3 class="wp-block-heading" id="h-step-3-preprocessing">Step #3: Preprocessing</h3>



<p class="wp-block-paragraph">Now that we have a rough idea of the data, we will prepare it for training the model. For the training of the isolation forest, we drop the class label from the base dataset and then divide the data into separate datasets for training (70%) and testing (30%). We do not have to normalize or standardize the data when using a decision tree-based algorithm such as the Isolation Forest. Keep in mind, however, that the distance-based models we train for comparison (LOF and KNN) are sensitive to feature scales, so they run on unscaled features here.</p>



<p class="wp-block-paragraph">We will use all features from the dataset. So our model will be a multivariate anomaly detection model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Separate the classes from the train set
df_classes = df_base['Class']
df_train = df_base.drop(['Class'], axis=1)

# split the data into train and test 
X_train, X_test, y_train, y_test = train_test_split(df_train, df_classes, test_size=0.30, random_state=42)</pre></div>



<h3 class="wp-block-heading" id="h-step-4-model-training">Step #4: Model Training</h3>



<p class="wp-block-paragraph">Once we have prepared the data, it&#8217;s time to start training the Isolation Forest. However, to compare the performance of our model with other algorithms, we will train several different models. In total, we will prepare and compare the following five outlier detection models:</p>



<ul class="wp-block-list">
<li>Isolation Forest (default)</li>



<li>Isolation Forest (hypertuned)</li>



<li>Local Outlier Factor (default)</li>



<li>K Nearest Neighbour (default)</li>



<li>K Nearest Neighbour (hypertuned)</li>
</ul>



<p class="wp-block-paragraph">For hyperparameter tuning of the models, we use Grid Search.</p>



<h4 class="wp-block-heading" id="h-4-1-train-an-isolation-forest">4.1 Train an Isolation Forest</h4>



<p class="wp-block-paragraph">Next, we train our isolation forest algorithm. An isolation forest is a machine learning algorithm for anomaly detection. It is related to the random forest algorithm, a widely used ensemble learning method that combines multiple decision trees to make predictions.</p>



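<p class="wp-block-paragraph">The isolation forest algorithm works by randomly selecting a feature and a split value for that feature, and then using the split value to divide the data into two subsets. Each tree repeats this partitioning recursively until the data points are isolated. Because anomalies are few and differ from the majority, they usually need fewer splits to be isolated than regular points. The trees in the ensemble are then combined: a data point with a short average path length across the trees receives a high anomaly score.</p>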
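


<p class="wp-block-paragraph">To make the isolation idea more tangible, here is a minimal from-scratch sketch in plain Python. It is not scikit-learn's implementation: the function names isolation_depth and average_depth and the toy data are assumptions made for illustration, and the sketch skips the subsampling and score normalization that a real isolation forest applies. It only counts how many random splits are needed until a point sits alone in its partition.</p>

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    # Count the random splits needed until the point is alone in its partition
    if len(data) == 1 or depth == max_depth:
        return depth
    feature = rng.randrange(len(point))            # randomly select a feature
    values = [row[feature] for row in data]
    split = rng.uniform(min(values), max(values))  # randomly select a split value
    # Keep only the rows that fall on the same side of the split as the point
    same_side = [row for row in data
                 if (row[feature] >= split) == (point[feature] >= split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=42):
    # Average the isolation depth over many independent random trees
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical toy data: a dense 5x5 cluster plus one far-away point
cluster = [(1.0 + 0.1 * i, 2.0 + 0.1 * j) for i in range(5) for j in range(5)]
outlier = (9.0, 9.0)
data = cluster + [outlier]

depth_outlier = average_depth(outlier, data)
depth_inlier = average_depth(cluster[12], data)  # point in the middle of the cluster
print(f'average depth outlier: {depth_outlier:.2f}')
print(f'average depth inlier:  {depth_inlier:.2f}')
```

<p class="wp-block-paragraph">In this toy setup, the far-away point should need far fewer splits than the point in the middle of the cluster, which is the signal an isolation forest converts into an anomaly score.</p>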



<p class="wp-block-paragraph">The isolation forest algorithm is designed to be efficient and effective for detecting anomalies in high-dimensional datasets. Its advantages include the ability to handle large and complex datasets and the fact that it does not require labeled data for training. It is widely used in applications such as fraud detection, intrusion detection, and anomaly detection in manufacturing.</p>



<h5 class="wp-block-heading" id="h-4-1-1-isolation-forest-baseline">4.1.1 Isolation Forest (baseline)</h5>



<p class="wp-block-paragraph">First, we train a baseline model. A baseline model is a simple reference model that serves as a starting point for evaluating more sophisticated models. It provides a benchmark that allows us to assess whether later models are actually more accurate, effective, or efficient.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># train the model on the nominal train set
model_isf = IsolationForest().fit(X_train)</pre></div>



<p class="wp-block-paragraph">We create a function to measure the performance of our baseline model and illustrate the results in a confusion matrix. Later, when we go into hyperparameter tuning, we can use this function to objectively compare the performance of more sophisticated models.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def measure_performance(model, X_test, y_true, map_labels):
    # Predict on the test set
    y_pred = pd.Series(model.predict(X_test), index=X_test.index)
    if map_labels:
        # Outlier detectors return 1 (inlier) and -1 (outlier); map them to 0 (regular) and 1 (fraud)
        y_pred = y_pred.map({1: 0, -1: 1})

    # scikit-learn expects (y_true, y_pred); transpose the matrix so that rows show
    # the predicted classes and columns the actual classes, matching the axis labels
    matrix = confusion_matrix(y_true, y_pred).T

    sns.heatmap(matrix,
                xticklabels=['Regular [0]', 'Fraud [1]'], 
                yticklabels=['Regular [0]', 'Fraud [1]'], 
                annot=True, fmt=&quot;d&quot;, linewidths=.5, cmap=&quot;YlGnBu&quot;)
    plt.ylabel('Predicted')
    plt.xlabel('Actual')
    
    print(classification_report(y_true, y_pred))
    
    # score is assumed to be sklearn's precision_recall_fscore_support,
    # which returns (precision, recall, f1, support)
    model_score = score(y_true, y_pred, average='macro')
    print(f'f1_score: {np.round(model_score[2]*100, 2)}%')
    
    return model_score

model_name = 'Isolation Forest (baseline)'
print(f'{model_name} model')

map_labels = True
model_score = measure_performance(model_isf, X_test, y_test, map_labels)

# Start the comparison table; DataFrame.append was removed in pandas 2.0
performance_df = pd.DataFrame([{'model_name': model_name, 
                                'precision': model_score[0], 
                                'recall': model_score[1], 
                                'f1_score': model_score[2]}])</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4585" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-31-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" data-orig-size="668,699" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-31" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" alt="confusion matrix for our credit card fraud detection algorithm" class="wp-image-4585" width="490" height="514" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png 668w, https://www.relataly.com/wp-content/uploads/2021/06/image-31.png 287w" sizes="(max-width: 490px) 100vw, 490px" /></figure>



<h5 class="wp-block-heading" id="h-4-1-2-isolation-forest-hypertuning">4.1.2 Isolation Forest (Hypertuning)</h5>



<p class="wp-block-paragraph">Next, we will train another Isolation Forest Model using grid search hyperparameter tuning to test different parameter configurations. </p>



<p class="wp-block-paragraph">The hyperparameters of an isolation forest include:</p>



<ul class="wp-block-list">
<li><strong>n_estimators</strong>: The number of decision trees in the forest.</li>



<li><strong>max_samples</strong>: The number of samples to draw from the dataset to train each decision tree.</li>



<li><strong>contamination: </strong>The expected proportion of anomalies in the dataset.</li>



<li><strong>max_features: </strong>The number (or fraction) of features to draw from the dataset to train each decision tree.</li>



<li><strong>bootstrap: </strong>Whether or not to use bootstrap sampling when drawing samples to train the decision trees.</li>
</ul>



<p class="wp-block-paragraph">These hyperparameters can be adjusted to improve the performance of the isolation forest. The optimal values for these hyperparameters will depend on the specific characteristics of the dataset and the task at hand, which is why we require several experiments.</p>
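


<p class="wp-block-paragraph">Grid search approaches these experiments systematically: it evaluates every combination of the candidate values. The following sketch only enumerates the combinations for the grid used in this tutorial, without training any model. The variable names combinations and cv_folds are chosen for illustration.</p>

```python
from itertools import product

# Candidate values per hyperparameter (the same grid as in this tutorial)
param_grid = {
    'n_estimators': [50, 100],
    'max_features': [1.0, 5, 10],
    'bootstrap': [True],
}

# Build every combination of the candidate values (the Cartesian product)
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

for params in combinations:
    print(params)

# With k-fold cross-validation, every configuration is trained once per fold
cv_folds = 3
total_fits = len(combinations) * cv_folds
print(f'{len(combinations)} configurations x {cv_folds} folds = {total_fits} model fits')
```

<p class="wp-block-paragraph">The number of model fits grows multiplicatively with every additional parameter and candidate value, so it pays off to keep the grid small on a dataset of this size.</p>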



<p class="wp-block-paragraph">The code below will evaluate the different parameter configurations based on their f1_score and automatically choose the best-performing model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define the parameter grid
n_estimators=[50, 100]
max_features=[1.0, 5, 10]
bootstrap=[True]
param_grid = dict(n_estimators=n_estimators, max_features=max_features, bootstrap=bootstrap)

# Build the base estimator; the grid search sets the parameters listed in param_grid
model_isf = IsolationForest(contamination=contamination_rate, n_jobs=-1)

# Define an f1 scorer that first maps the predictions (1 = inlier, -1 = outlier)
# to the class labels used in y_train (0 = regular, 1 = fraud)
def f1_outlier(y_true, y_pred):
    return f1_score(y_true, np.where(y_pred == -1, 1, 0), average='macro')

f1sc = make_scorer(f1_outlier)

grid = GridSearchCV(estimator=model_isf, param_grid=param_grid, cv = 3, scoring=f1sc)
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print(&quot;Best: {0}, using {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'Isolation Forest (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = True # if True - maps 1 to 0 and -1 to 1 - required for outlier detection models
model_score = measure_performance(best_model, X_test, y_test, map_labels)
# model_score holds (precision, recall, f1, support)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name': model_name, 
                                    'precision': model_score[0], 
                                    'recall': model_score[1], 
                                    'f1_score': model_score[2]}])], ignore_index=True)
results_df</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Best: [0.61083219 0.55718259 0.55912644 0.52670328 0.5317127 ], using {'n_neighbors': 1}
KNN (tuned) model
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85385
           1       0.21      0.48      0.29        58

    accuracy                           1.00     85443
   macro avg       0.60      0.74      0.64     85443
weighted avg       1.00      1.00      1.00     85443

f1_score: 64.39%</pre></div>



<h4 class="wp-block-heading" id="h-4-2-lof-model">4.2 LOF Model</h4>



<p class="wp-block-paragraph">We train the Local Outlier Factor Model using the same training data and evaluation procedure. The local outlier factor (LOF) is a measure of the local deviation of a data point with respect to its neighbors. It is used to identify points in a dataset that are significantly different from their surrounding points and that may therefore be considered outliers. </p>



<p class="wp-block-paragraph">The LOF is a useful tool for detecting outliers in a dataset, as it considers the local context of each data point rather than the global distribution of the data. This makes it more sensitive to outliers that are only conspicuous within a specific region of the dataset. However, isolation forests can often outperform LOF models.</p>
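


<p class="wp-block-paragraph">The following from-scratch sketch illustrates how a local outlier factor can be computed from the density around each point. It follows the standard LOF definition (k-distance, reachability distance, local reachability density), but it is a simplified illustration and not scikit-learn's implementation. The function names knn and local_outlier_factors and the toy points are assumptions made for this example.</p>

```python
import math

def knn(points, i, k):
    # Indices of the k nearest neighbours of point i, sorted by distance
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]

def local_outlier_factors(points, k):
    neighbours = [knn(points, i, k) for i in range(len(points))]
    # k-distance: distance from each point to its k-th nearest neighbour
    k_dist = [math.dist(points[i], points[nb[-1]]) for i, nb in enumerate(neighbours)]
    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for i, nb in enumerate(neighbours):
        reach = [max(k_dist[j], math.dist(points[i], points[j])) for j in nb]
        lrd.append(len(nb) / sum(reach))
    # LOF: mean density of the neighbours relative to the point's own density
    return [sum(lrd[j] for j in nb) / (len(nb) * lrd[i])
            for i, nb in enumerate(neighbours)]

# Hypothetical toy data: five points close together and one isolated point
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
scores = local_outlier_factors(points, k=3)
for point, lof in zip(points, scores):
    print(point, round(lof, 2))
```

<p class="wp-block-paragraph">Points inside the cluster obtain values close to 1 because their density matches that of their neighbors, while the isolated point obtains a much larger value.</p>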



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a local outlier factor model (novelty=True enables predictions on unseen data)
model_lof = LocalOutlierFactor(n_neighbors=3, contamination=contamination_rate, novelty=True)
model_lof.fit(X_train)

# Evaluate model performance
model_name = 'LOF (baseline)'
print(f'{model_name} model')

map_labels = True 
model_score = measure_performance(model_lof, X_test, y_test, map_labels)
# model_score holds (precision, recall, f1, support)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name': model_name, 
                                    'precision': model_score[0], 
                                    'recall': model_score[1], 
                                    'f1_score': model_score[2]}])], ignore_index=True)</pre></div>



<h4 class="wp-block-heading" id="h-4-3-knn-model">4.3 KNN Model</h4>



<p class="wp-block-paragraph">Next, we train the KNN models. K-Nearest Neighbors (KNN) is a machine learning algorithm for classification and regression. It is a form of instance-based learning: it stores the training instances themselves and uses them to make predictions, rather than building a model that summarizes or generalizes the data. Note that, unlike the Isolation Forest and LOF, the KNN classifier is supervised, meaning that it uses the class labels during training.</p>
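


<p class="wp-block-paragraph">As a rough illustration of instance-based learning, the sketch below classifies a query point by a majority vote among its nearest stored training instances. It is a simplified from-scratch version rather than scikit-learn's KNeighborsClassifier, and the function name knn_predict and the toy data are hypothetical.</p>

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    # Sort the stored training instances by their distance to the query point
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    # Majority vote among the labels of the k closest instances
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class 0 = regular, class 1 = fraud
train_X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = [0, 0, 0, 1, 1]

pred_regular = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
pred_fraud = knn_predict(train_X, train_y, (5.1, 5.0), k=3)
print(pred_regular, pred_fraud)
```

<p class="wp-block-paragraph">There is no separate training step in this sketch: all the work happens at prediction time, when the distances to the stored instances are computed.</p>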



<p class="wp-block-paragraph">Below we add two K-Nearest Neighbor models to our list. We use the default hyperparameter configuration for the first model. The second model will most likely perform better because we optimize its hyperparameters using the grid search technique.</p>



<h5 class="wp-block-heading" id="h-4-3-1-knn-default">4.3.1 KNN (default)</h5>



<p class="wp-block-paragraph">First, we train the default model using the same training data as before. We set n_neighbors=5, which is also scikit-learn&#8217;s default value. The default settings may not be the most appropriate for our specific dataset, which is why we will experiment with different numbers of neighbors in the next step.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a KNN Model
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X=X_train, y=y_train)

# Evaluate model performance
model_name = 'KNN (baseline)'
print(f'{model_name} model')

map_labels = False # if True - maps 1 to 0 and -1 to 1 - set to False for classification models (e.g., KNN)
model_score = measure_performance(model_knn, X_test, y_test, map_labels)
# model_score holds (precision, recall, f1, support)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name': model_name, 
                                    'precision': model_score[0], 
                                    'recall': model_score[1], 
                                    'f1_score': model_score[2]}])], ignore_index=True)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4595" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-35-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" data-orig-size="533,532" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-35" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Confusion matrix for our credit card fraud detection algorithm" class="wp-image-4595" width="586" height="585" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 533w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 150w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 120w" sizes="(max-width: 586px) 100vw, 586px" /></figure>



<h5 class="wp-block-heading" id="h-4-3-1-knn-hypertuned">4.3.2 KNN (hypertuned)</h5>



<p class="wp-block-paragraph">In the next step, we will train a second KNN model to improve its performance by fine-tuning its hyperparameters. Despite having only a few parameters, hyperparameter tuning can enhance the model&#8217;s ability to make accurate predictions. In this case, we will concentrate on optimizing the number of nearest neighbors considered in the KNN algorithm. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define hypertuning parameters
n_neighbors=[1, 2, 3, 4, 5]
param_grid = dict(n_neighbors=n_neighbors)

# Build the gridsearch; the grid search sets n_neighbors from param_grid
model_knn = KNeighborsClassifier()
# Score by macro F1, since accuracy is uninformative for highly imbalanced classes
grid = GridSearchCV(estimator=model_knn, param_grid=param_grid, cv = 5, 
                    scoring=make_scorer(f1_score, average='macro'))
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print(&quot;Best: {0}, using {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'KNN (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = False # if True - maps 1 to 0 and -1 to 1 - set to False for classification models (e.g., KNN)
model_score = measure_performance(best_model, X_test, y_test, map_labels)
# model_score holds (precision, recall, f1, support)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name': model_name, 
                                    'precision': model_score[0], 
                                    'recall': model_score[1], 
                                    'f1_score': model_score[2]}])], ignore_index=True)
results_df</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4596" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-36-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" data-orig-size="379,262" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-36" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" alt="confusion matrix for our credit card fraud detection algorithm" class="wp-image-4596" width="521" height="360" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png 379w, https://www.relataly.com/wp-content/uploads/2021/06/image-36.png 300w" sizes="(max-width: 521px) 100vw, 521px" /></figure>



<h3 class="wp-block-heading" id="h-step-5-measuring-and-comparing-performance">Step #5: Measuring and Comparing Performance</h3>



<p class="wp-block-paragraph">Finally, we will compare the performance of our models with a bar chart that shows the f1_score, precision, and recall. If you want to learn more about classification performance, <a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">this tutorial discusses the different metrics in more detail</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">print(performance_df)

performance_df = performance_df.sort_values('model_name')

fig, ax = plt.subplots(figsize=(12, 4))
tidy = performance_df.melt(id_vars='model_name').rename(columns=str.title)
sns.barplot(y='Model_Name', x='Value', hue='Variable', data=tidy, ax=ax, palette='nipy_spectral', linewidth=1, edgecolor=&quot;w&quot;)
plt.title('Model Outlier Detection Performance (Macro)')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4679" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-54-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png" data-orig-size="1446,650" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-54" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-54-1024x460.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Performance comparison between different algorithms for credit card fraud detection" class="wp-image-4679" width="982" height="441" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 1446w" sizes="(max-width: 982px) 100vw, 982px" /></figure>



<p class="wp-block-paragraph">All three metrics play an important role in evaluating performance because, on the one hand, we want to capture as many fraud cases as possible, but we also don&#8217;t want to raise false alarms too frequently. </p>



<ul class="wp-block-list">
<li>As we can see, the optimized Isolation Forest delivers particularly well-balanced results.</li>



<li>The default Isolation Forest has a high f1_score and detects many fraud cases but frequently raises false alarms.</li>



<li>The opposite is true for the KNN model. Only a few fraud cases are detected here, but the model is often correct when noticing a fraud case.</li>



<li>The default LOF model performs slightly worse than the other models. Compared to the optimized Isolation Forest, it performs worse in all three metrics.</li>
</ul>
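<p class="wp-block-paragraph">The trade-off described above can be made concrete with a short sketch. It assumes that the three metrics are precision, recall, and the F1 score, and it uses made-up confusion-matrix counts (a hypothetical cautious model and a hypothetical eager model), not the results from this article.</p>

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only (100 real fraud cases each):
# the cautious model flags few transactions, the eager model flags many.
cautious = precision_recall_f1(tp=20, fp=5, fn=80)
eager = precision_recall_f1(tp=80, fp=120, fn=20)

print("cautious (precision, recall, f1):", [round(v, 2) for v in cautious])
print("eager    (precision, recall, f1):", [round(v, 2) for v in eager])
```

<p class="wp-block-paragraph">The cautious model is usually right when it raises an alarm but misses most fraud cases, while the eager model catches most fraud cases at the cost of many false alarms. The F1 score combines both aspects into a single number.</p>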



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">Credit card fraud detection is important because it helps to protect consumers and businesses, to maintain trust and confidence in the financial system, and to reduce financial losses. It is a critical part of ensuring the security and reliability of credit card transactions. </p>



<p class="wp-block-paragraph">This article has shown how to use Python and the Isolation Forest algorithm to implement a credit card fraud detection system. We developed a multivariate anomaly detection model to spot fraudulent credit card transactions. You learned how to prepare the data for training and testing an Isolation Forest model and how to validate this model. Finally, our comparison indicated that the optimized Isolation Forest is a robust algorithm for anomaly detection that holds up well against traditional techniques such as KNN and LOF.</p>



<p class="wp-block-paragraph">I hope you enjoyed the article and can apply what you learned to your projects. Have a great day!</p>



<p>The post <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/">Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4233</post-id>	</item>
		<item>
		<title>Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python</title>
		<link>https://www.relataly.com/image-classification-with-deep-learning/2485/</link>
					<comments>https://www.relataly.com/image-classification-with-deep-learning/2485/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 13 Dec 2020 14:09:31 +0000</pubDate>
				<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Convolutional Neural Network (CNN)]]></category>
		<category><![CDATA[Data Sources]]></category>
		<category><![CDATA[Image Recognition]]></category>
		<category><![CDATA[Keras]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Tensorflow]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Image Dataset]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Two-Label Classification]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2485</guid>

					<description><![CDATA[<p>This tutorial shows how to use Convolutional Neural Networks (CNNs) with Python for image classification. CNNs belong to the field of deep learning, a subarea of machine learning, and have become a cornerstone of many exciting innovations. Applications range from self-driving cars and biometric security to automated tagging in social media. And the ... <a title="Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python" class="read-more" href="https://www.relataly.com/image-classification-with-deep-learning/2485/" aria-label="Read more about Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/image-classification-with-deep-learning/2485/">Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This tutorial shows how to use Convolutional Neural Networks (CNNs) with Python for image classification. CNNs belong to the field of deep learning, a subarea of machine learning, and have become a cornerstone of many exciting innovations. Applications range from self-driving cars and biometric security to automated tagging in social media. And the importance of CNNs grows steadily! So there are plenty of reasons to understand how this technology works and how we can implement it. </p>



<p class="wp-block-paragraph">This article proceeds as follows: The first part introduces the core concepts behind CNNs and explains their use in image classification. The second part is a hands-on tutorial in which you will build your own CNN to distinguish images of cats and dogs. The model we develop achieves around 82% validation accuracy. We will work with TensorFlow and Python to combine different layers, such as convolutional layers, dense layers, and max-pooling layers. Furthermore, we will prevent the network from overfitting the training data by using Dropout between the layers. We will also load the model and make predictions on a fresh set of images. Finally, we analyze and illustrate the performance of our image classifier. </p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/automated-prompt-generation-for-dall-e-using-chatgpt-in-python-a-step-by-step-api-tutorial/12143/" target="_blank" rel="noreferrer noopener">Generating Detailed Images with OpenAI DALL-E and ChatGPT in Python: A Step-By-Step API Tutorial</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-image-classification-with-convolutional-neural-networks">Image Classification with Convolutional Neural Networks</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The history of image recognition dates back to the mid-1960s when the first attempts were made to identify objects by coding their characteristic shapes and lines. However, this task turned out to be incredibly complex. Our human brain is trained so well to recognize things that one can easily forget how diverse the observation conditions can be. Here are some examples:</p>



<ul class="wp-block-list">
<li>Photos can be taken from various viewpoints</li>



<li>Living things can have multiple forms and poses</li>



<li>Objects come in different forms, colors, and sizes</li>



<li>Parts of an object may be hidden in the picture</li>



<li>The light conditions vary from image to image</li>



<li>There may be one or multiple objects in the same image</li>
</ul>



<p class="wp-block-paragraph">At the beginning of the 1990s, the focus of research shifted to statistical approaches and learning algorithms.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="13345" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/machine_learning_computer_vision_dazzling_magic_neural_network-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="machine_learning_computer_vision_dazzling_magic_neural_network-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min-512x512.png" alt="The idea of computer vision is inspired by the fact that the visual cortex has cells activated by specific shapes and their orientation in the visual field. 
" class="wp-image-13345" srcset="https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/machine_learning_computer_vision_dazzling_magic_neural_network-min.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">The idea of computer vision is inspired by the fact that the visual cortex has cells activated by specific shapes and their orientation in the visual field. </figcaption></figure>
</div>
</div>






<h3 class="wp-block-heading" id="h-the-emergence-of-cnns">The Emergence of CNNs</h3>



<p class="wp-block-paragraph">The basic concept of a neural network in computer vision has existed since the 1980s. It goes back to research by Hubel and Wiesel in the 1950s and 1960s on the visual system of cats. They found that the visual cortex has cells activated by specific shapes and their orientation in the visual field. Some of their findings inspired the development of crucial computer vision technologies, such as hierarchical features with different levels of abstraction [1, 2]. However, it took another three decades of research and the availability of faster computers before modern CNNs emerged.</p>



<p class="wp-block-paragraph">The year 2012 was a defining moment for the use of CNNs in image recognition. That year, a CNN won the <a href="http://www.image-net.org/challenges/LSVRC/" target="_blank" rel="noreferrer noopener">ILSVRC</a> competition for computer vision for the first time. The challenge was to classify more than a hundred thousand images into 1000 object categories. With a top-5 error rate of only 15.3%, the winning model was a CNN called &#8220;AlexNet.&#8221;</p>



<p class="wp-block-paragraph">AlexNet was the first model to achieve more than 75% accuracy. In the following years, CNNs also succeeded in several other competitions. In 2015, for example, the CNN ResNet exceeded human-level performance in the ILSVRC competition. Only a decade earlier, this achievement had been considered almost impossible. So how was this performance increase possible? To understand this surge in performance, let us first look at what a picture is.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2653" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-15-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-15.png" data-orig-size="1081,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-15" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-15.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-15-1024x479.png" alt="" class="wp-image-2653" width="848" height="395" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-15.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-15.png 768w" sizes="(max-width: 848px) 100vw, 848px" /><figcaption class="wp-element-caption">Top-performing models in the ImageNet image classification challenge (Alyafeai &amp; Ghouti, 2019)</figcaption></figure>



<h3 class="wp-block-heading" id="h-what-is-an-image">What is an Image?</h3>



<p class="wp-block-paragraph">A digital image is a three-dimensional array of integer values. One dimension of this array represents the pixel width, and one dimension represents the height of the picture. The third dimension holds the color channels, whose number is defined by the image format. As shown below, we can thus represent the format of a digital image as &#8220;width x height x depth.&#8221; Next, let&#8217;s have a quick look at different image formats.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2649" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-11-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-11.png" data-orig-size="1152,437" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-11.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-11-1024x388.png" alt="an image is a multidimensional integer array" class="wp-image-2649" width="861" height="326" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-11.png 1024w, https://www.relataly.com/wp-content/uploads/2020/12/image-11.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-11.png 768w, https://www.relataly.com/wp-content/uploads/2020/12/image-11.png 1152w" sizes="(max-width: 861px) 100vw, 861px" /><figcaption class="wp-element-caption">A digital image is a multidimensional integer array.</figcaption></figure>
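<p class="wp-block-paragraph">To make the array structure tangible, here is a minimal sketch that stores a tiny RGB image as nested Python lists. The pixel values are made up for illustration, and plain lists are used only to keep the example free of dependencies. In practice, libraries such as NumPy hold the same data in a single array and usually put the row (height) axis first.</p>

```python
# A tiny hypothetical RGB image with 2 rows (height) and 3 columns (width).
# image[row][column] holds one pixel as [red, green, blue].
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],
]

height = len(image)        # number of pixel rows
width = len(image[0])      # number of pixels per row
depth = len(image[0][0])   # number of color channels per pixel

print(f"width x height x depth = {width} x {height} x {depth}")
print("top-left pixel (R, G, B):", image[0][0])
print("total integer values:", width * height * depth)
```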



<h3 class="wp-block-heading" id="h-overview-of-different-image-formats">Overview of Different Image Formats</h3>



<p class="wp-block-paragraph">We can train CNNs with different image formats, but the input data are always multidimensional arrays of integer values. One of the most commonly used color formats in deep learning is &#8220;RGB.&#8221; RGB stands for the three color channels: &#8220;Red,&#8221; &#8220;Green,&#8221; and &#8220;Blue.&#8221; RGB images are divided into three layers of integer values, one layer for each color channel. In a standard 24-bit RGB image (8 bits per channel), the integer values in each layer range from 0 to 255. Together, the three layers can reproduce 256 x 256 x 256 = 16,777,216 different colors. </p>



<p class="wp-block-paragraph">In contrast to RGB images, grey-scale images only have a single color layer. This layer represents the brightness of each pixel in the image. Consequently, the format of a grey-scale image is width x height x 1. Using grey-scale images instead of RGB images can speed up the training process because less data needs to be processed. However, image data with multiple color channels provide the model with more information, which can lead to better predictions. The RGB format is often a good compromise between prediction quality and performance. Next, let&#8217;s look at how CNNs handle digital images in the learning process.</p>
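<p class="wp-block-paragraph">The following sketch illustrates why grey-scale images mean less data. It converts a small, made-up RGB image into a single brightness layer. The weights 0.299, 0.587, and 0.114 are the common ITU-R BT.601 luma coefficients. They are an assumption here, since the tutorial does not prescribe a particular conversion.</p>

```python
def to_grayscale(rgb_image):
    """Convert nested-list RGB pixels to one brightness value per pixel."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_image
    ]


# Hypothetical 2x2 RGB image: red, green, blue, and white pixels.
rgb = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]
gray = to_grayscale(rgb)

rgb_values = sum(len(pixel) for row in rgb for pixel in row)
gray_values = sum(len(row) for row in gray)
print("grey-scale layer:", gray)
print("values to process:", rgb_values, "(RGB) vs.", gray_values, "(grey-scale)")
```

<p class="wp-block-paragraph">The grey-scale version holds only a third of the values, which is where the speed-up comes from. The color information that distinguished the red, green, and blue pixels is reduced to brightness alone.</p>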



<h3 class="wp-block-heading" id="h-convolutional-neural-networks">Convolutional Neural Networks</h3>



<p class="wp-block-paragraph">As mentioned before, a CNN is a specific form of an artificial neural network. The main difference between a CNN and a standard multi-layer perceptron is the convolutional layers. CNNs can have other layers, but it is the convolutions that make a CNN so good at detecting objects. They allow the network to identify patterns based on features that work regardless of where in the image they occur. Let&#8217;s see how this works in more detail.</p>



<h4 class="wp-block-heading" id="h-convolutional-layers">Convolutional Layers</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">Convolutional layers scan an image with small matrices of learnable weights called filters (or kernels), each of which covers a small group of pixels at a time. Filters act as feature detectors for the original image. Their primary purpose is to extract meaningful features from the input images.</p>



<p class="wp-block-paragraph">During training, the CNN slides each filter over the image and, at every location, calculates the dot product between the filter and the pixels beneath it. The results of these calculations are stored in a so-called feature map (sometimes called an activation map). A feature map represents where in the image a particular feature was identified. Subsequently, the values from the feature map are transformed with an activation function (usually ReLU), and the algorithm uses them as input to the next layer.</p>


<div class="wp-block-image">
<figure class="alignleft size-large is-resized"><img decoding="async" data-attachment-id="6596" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-1.png" data-orig-size="1237,502" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-1.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-1-1024x416.png" alt="Illustration of operations in the convolutional layers" class="wp-image-6596" width="811" height="332"/><figcaption class="wp-element-caption">Illustration of operations in the convolutional layers</figcaption></figure>
</div></div>
</div>
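<p class="wp-block-paragraph">The sliding-filter operation described above can be sketched in a few lines of plain Python. This is a simplified illustration with stride 1 and no padding, not the implementation used by TensorFlow. The image values and the hand-picked edge filter are hypothetical, and in a real CNN the filter weights are learned during training.</p>

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    return the feature map of dot products."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch beneath it.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical 4x4 grey-scale patch with a bright vertical stripe.
image = [
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
]
# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

raw_map = convolve2d(image, kernel)
activated = relu(raw_map)
print("feature map:", raw_map)
print("after ReLU: ", activated)
```

<p class="wp-block-paragraph">The feature map is large where the stripe begins (dark to bright), zero inside the stripe, and negative where it ends. ReLU then sets the negative responses to zero, so only the locations where the feature was found remain active.</p>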



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">Features become more complex with the increasing depth of the network. In the first layer of the network, convolutions will detect generic geometric forms and low-level features based on edges, corners, squares, or circles. The subsequent layers of the network will look at more sophisticated shapes and may, for example, include features that resemble the form of an eye of a cat or the nose of a dog. In this way, convolutions provide the network with features at different levels of detail that enable powerful detection patterns.</p>


<div class="wp-block-image">
<figure class="alignleft size-large is-resized"><img decoding="async" data-attachment-id="2661" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-16-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-16.png" data-orig-size="1908,819" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-16" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-16.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-16-1024x440.png" alt="Convolutions at the example of an image that contains the number &quot;3&quot;" class="wp-image-2661" width="874" height="374" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-16.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-16.png 768w, https://www.relataly.com/wp-content/uploads/2020/12/image-16.png 1536w, https://www.relataly.com/wp-content/uploads/2020/12/image-16.png 1908w" sizes="(max-width: 874px) 100vw, 874px" /><figcaption class="wp-element-caption">Exemplary convolutions of an image that contains the number &#8220;3.&#8221;</figcaption></figure>
</div></div>
</div>



<h4 class="wp-block-heading" id="h-pooling-downsampling">Pooling / Downsampling</h4>



<p class="wp-block-paragraph">A convolutional layer is usually followed by a pooling operation, which reduces the amount of data by filtering out unnecessary information. This process is also called downsampling or subsampling. There are various forms of pooling. In the most common variant &#8211; max-pooling &#8211; only the highest value in a predefined grid (e.g., 2&#215;2) is processed, and the remaining values are discarded. For example, imagine a 2&#215;2 grid with values 0.1, 0.5, 0.4, and 0.8. For this grid, the algorithm would only pass on the 0.8 and use it as part of the input to the next layer. The advantages of pooling are reduced data and faster training times. Because pooling reduces the complexity of the network, it allows for the construction of deeper architectures with more layers. In addition, pooling offers a certain protection against overfitting during training.</p>
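<p class="wp-block-paragraph">As a minimal sketch of max-pooling, the function below keeps only the largest value of each non-overlapping 2&#215;2 grid. The top-left grid of the hypothetical feature map reuses the values 0.1, 0.5, 0.4, and 0.8 from the example above, and the remaining values are made up.</p>

```python
def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size grid."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [0.1, 0.5, 0.3, 0.2],
    [0.4, 0.8, 0.6, 0.1],
    [0.2, 0.1, 0.9, 0.7],
    [0.3, 0.0, 0.4, 0.5],
]
pooled = max_pool2d(feature_map)
print(pooled)
```

<p class="wp-block-paragraph">The 4&#215;4 input shrinks to a 2&#215;2 output, so the next layer only has to process a quarter of the values.</p>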



<h4 class="wp-block-heading" id="h-dropout">Dropout</h4>



<p class="wp-block-paragraph">Dropout is another technique that helps prevent the network from overfitting the training data. When we activate Dropout for a layer, the algorithm randomly switches off a share of the neurons in that layer at each training step. As a result, the network needs to learn patterns that rely less on individual neurons and thus generalize better. The dropout rate controls the percentage of switched-off neurons in each training iteration. We can configure Dropout for each layer separately. </p>



<p class="wp-block-paragraph">CNNs with many layers and training epochs tend to overfit the training data. Especially here, Dropout is crucial to avoid overfitting and to achieve good prediction results with data that the network does not know yet. A typical value for the rate lies between 10% and 30%.</p>
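<p class="wp-block-paragraph">The idea behind Dropout can be sketched as follows. This is a simplified illustration of so-called inverted dropout, in which the surviving activations are scaled up so that their expected sum stays the same. The function name, the activation values, and the fixed seed are assumptions made for this example.</p>

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate` and
    scale the survivors by 1 / (1 - rate)."""
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(42)      # fixed seed so the sketch is repeatable
activations = [1.0] * 10     # hypothetical outputs of one layer
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

<p class="wp-block-paragraph">Dropout is only applied during training. When the model makes predictions, all neurons stay active.</p>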



<h4 class="wp-block-heading" id="h-multi-layer-perceptron-mlp">Multi-Layer Perceptron (MLP)</h4>



<p class="wp-block-paragraph">The CNN architecture ends with multiple fully connected dense layers. These layers form a Multilayer Perceptron (MLP), whose task is to condense the results from the previous convolutions and output one of the classes. Consequently, the number of neurons in the final dense layer usually corresponds to the number of different classes to be predicted. It is also possible to use a single neuron in the final layer for two-class prediction problems. In this case, the last neuron typically outputs a value between 0 and 1, which is rounded to a binary label of 0 or 1.</p>
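<p class="wp-block-paragraph">For the two-class case with a single output neuron, the sketch below shows how a raw score can be turned into a binary label. It assumes a sigmoid activation and a decision threshold of 0.5, which is a common setup but not the only possible one. The raw scores are hypothetical.</p>

```python
import math


def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Turn the raw score of the final neuron into a binary label."""
    return 1 if sigmoid(z) >= threshold else 0


# Hypothetical raw scores from the last dense layer.
for z in (-2.0, 0.3, 4.0):
    print(f"score={z:+.1f}  probability={sigmoid(z):.3f}  label={predict_label(z)}")
```

<p class="wp-block-paragraph">Which class is encoded as 0 and which as 1 depends on how the training labels were assigned.</p>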



<h2 class="wp-block-heading" id="h-building-a-cnn-with-tensorflow-that-classifies-cats-and-dogs">Building a CNN with Tensorflow that Classifies Cats and Dogs</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Now that you are familiar with the basic concepts behind convolutional neural networks, we can commence with the practical part and build an image classifier. In the following, we will train a CNN to distinguish images of cats and dogs. We first define a CNN model and then feed it a few thousand photos from a public dataset with labeled images of cats and dogs.</p>



<p class="wp-block-paragraph">Distinguishing cats and dogs may not sound difficult, but many challenges exist. Imagine the almost infinite circumstances in which animals can be photographed, not to mention the many forms a cat can take. These variations lead to the fact that even humans sometimes confuse a cat with a dog or vice versa. So don&#8217;t expect our model to be perfect right from the start. Our model will score around 82% accuracy on the validation dataset.</p>



<p class="wp-block-paragraph">The code is available in the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_d70aa4-6e"><a class="kb-button kt-button button kb-btn_b926ba-d4 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/06%20Computer%20Vision/200%20Classifying%20Cats%20%26%20Dogs%20Binary.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_80b142-4f kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image is-resized"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2022/04/Image-Recognition-Convolutional-Neural-Networks.png" alt="Image Recognition Convolutional Neural Networks - classifying cats and dogs python " width="382" height="125"/><figcaption class="wp-element-caption">Cat or Dog? That&#8217;s what our CNN will predict.</figcaption></figure>
</div>
</div>



<div style="height:29px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, you can follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using <em><a href="https://keras.io/" target="_blank" rel="noreferrer noopener">Keras&nbsp;</a></em>(2.0 or higher) with <a href="https://www.tensorflow.org/" target="_blank" rel="noreferrer noopener"><em>Tensorflow</em> </a>backend and the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a>.</p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
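<p class="wp-block-paragraph">To verify that the installation worked, a quick check along the following lines may help. It is only a minimal sketch: the helper name <em>find_missing</em> is made up for illustration, and the list assumes the import names that appear in the import section of this tutorial.</p>

<pre class="wp-block-preformatted">
```python
from importlib.util import find_spec

# import names of the packages used in this tutorial (assumed list)
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "tensorflow", "PIL"]

def find_missing(packages):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if find_spec(name) is None]

missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```
</pre>

<p class="wp-block-paragraph">Any name that shows up in the printed list still needs to be installed with pip or conda. Note that the import name can differ from the package name, for example sklearn (scikit-learn) and PIL (Pillow).</p>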



<h4 class="wp-block-heading" id="h-download-the-dataset">Download the Dataset</h4>



<p class="wp-block-paragraph">We will train our image classification model with a public dataset from <a href="http://www.kaggle.com" target="_blank" rel="noreferrer noopener">Kaggle.com</a>. The dataset contains more than 25,000 JPG pictures of cats and dogs. The images are uniformly named and numbered, for example, dog.1.jpg, dog.2.jpg, dog.3.jpg, cat.1.jpg, cat.2.jpg, and so on. You can download the picture set directly from Kaggle: <a href="https://www.kaggle.com/c/dogs-vs-cats/overview" target="_blank" rel="noreferrer noopener">cats-vs-dogs</a>. </p>



<h4 class="wp-block-heading" id="h-setup-the-folder-structure">Set Up the Folder Structure</h4>



<p class="wp-block-paragraph">There are different ways data can be structured and loaded during model training. One approach (1) is to split the images into classes and create a separate folder for each class, class_a, class_b, etc. Another method (2) is to put all images into a single folder and define a DataFrame that splits the data into test and train. Because the cats and dogs dataset files already contain the classes in their name, I decided to go for the second approach. </p>



<p class="wp-block-paragraph">Before we begin with the coding part, we create a folder structure that looks as follows:</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2676" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-17-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-17.png" data-orig-size="532,286" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-17.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-17.png" alt="structure of the data that we will use to train the convolutional neural network" class="wp-image-2676" width="409" height="220" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-17.png 532w, https://www.relataly.com/wp-content/uploads/2020/12/image-17.png 300w" sizes="(max-width: 409px) 100vw, 409px" /><figcaption class="wp-element-caption">The folder structure of our cats and dogs prediction project</figcaption></figure>



<p class="wp-block-paragraph">If you want to use the default paths from this Python tutorial, make sure that your notebook resides in the parent folder of the &#8220;data&#8221; folder.</p>
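<p class="wp-block-paragraph">If you would rather create the folders from Python than by hand, a sketch like the following could do it. The train path mirrors the <em>train_path</em> used later in this tutorial; the location of the sample folder next to it and the helper name <em>create_project_folders</em> are assumptions made for illustration.</p>

<pre class="wp-block-preformatted">
```python
import tempfile
from pathlib import Path

def create_project_folders(base_dir):
    """Create the train and sample folders below base_dir and return their paths."""
    root = Path(base_dir) / "data" / "images" / "cats-and-dogs"
    train_dir = root / "train"    # receives the labeled cat and dog images
    sample_dir = root / "sample"  # receives your own pictures (assumed location)
    for folder in (train_dir, sample_dir):
        folder.mkdir(parents=True, exist_ok=True)
    return train_dir, sample_dir

# demonstrated in a temporary directory; pass the folder of your notebook instead
with tempfile.TemporaryDirectory() as project_dir:
    train_dir, sample_dir = create_project_folders(project_dir)
    print(train_dir.is_dir(), sample_dir.is_dir())
```
</pre>

<p class="wp-block-paragraph">Calling the function with the folder that contains your notebook creates the structure relative to that folder. Existing folders are left untouched.</p>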



<p class="wp-block-paragraph">After you have created the folder structure, open the cats-vs-dogs ZIP file. The ZIP file contains the folders &#8220;train,&#8221; &#8220;test,&#8221; and &#8220;sample.&#8221; Unzip the JPG files from the &#8220;train&#8221; folder (20,000 images) and the &#8220;test&#8221; folder (5,000 images) to the &#8220;train&#8221; folder of your project. Afterward, the train folder should contain 25,000 images. The sample folder is intended for your own sample images, for example, of your pet. We will later use the images from the sample folder to test the model on new real-world data. </p>



<p class="wp-block-paragraph">We have fulfilled all requirements and can start with the coding part.</p>



<h3 class="wp-block-heading" id="h-step-1-make-imports-and-check-training-device">Step #1 Make Imports and Check Training Device</h3>



<p class="wp-block-paragraph">We begin by setting up the imports for this project. I have put the package imports at the beginning to give you a quick overview of the packages you need to install.</p>



<p class="wp-block-paragraph">Using the GPU instead of the CPU allows for faster training times. However, setting up TensorFlow to work with a GPU can cause problems, and not everyone has a GPU. If no GPU is available, TensorFlow usually runs all code on the CPU automatically. Should you, for any reason, prefer to switch to CPU training manually, uncomment the line os.environ[&#8220;CUDA_VISIBLE_DEVICES&#8221;]=&#8220;-1&#8221; at the top of the imports, so that it runs before TensorFlow is imported. As a result, TensorFlow will run all code on the CPU and ignore all available GPUs. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import os
#os.environ[&quot;CUDA_VISIBLE_DEVICES&quot;]=&quot;-1&quot; 

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D
from tensorflow.keras.layers import Conv2D, Activation, Dropout, Flatten, Dense, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import Accuracy
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

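# let TensorFlow allocate GPU memory on demand instead of reserving it all up front
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)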

from random import randint
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from PIL import Image
import random as rdn</pre></div>



<p class="wp-block-paragraph">Running the command below checks the TensorFlow version and the number of available GPUs in our system. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check the tensorflow version
print('Tensorflow Version: ' + tf.__version__)

# check the number of available GPUs
physical_devices = tf.config.list_physical_devices('GPU')
print(&quot;Num GPUs:&quot;, len(physical_devices))</pre></div>



<pre class="wp-block-preformatted">Tensorflow Version: 2.4.0-rc3
Num GPUs: 1</pre>



<p class="wp-block-paragraph">My GPU is an RTX 3080. When I wrote this article, the GPU was not yet supported by the standard TensorFlow release. I have therefore used the pre-release version of TensorFlow (2.4.0-rc3). I expect the following standard release (2.4) to work fine. </p>



<p class="wp-block-paragraph">In my case, the GPU check returns 1 because I have a single GPU in my computer. If TensorFlow doesn&#8217;t recognize any GPU, this command will return 0. TensorFlow will then run on the CPU.</p>



<h3 class="wp-block-heading" id="h-step-2-define-the-prediction-classes">Step #2 Define the Prediction Classes</h3>



<p class="wp-block-paragraph">Next, we will define the path to the folder that contains our train and validation images. In addition, we will define a DataFrame &#8220;image_df,&#8221; which lists all the pictures from the &#8220;train&#8221; folder together with their classes. With the help of this DataFrame, we can later split the data simply by defining which images from the train folder belong to the training dataset and which belong to the validation dataset. Important note: the DataFrame &#8220;image_df&#8221; only includes the names of the images and the classes, but not the photos themselves.</p>



<p class="wp-block-paragraph">It&#8217;s good to check the distribution of classes in the training data set. For this purpose, we create a bar plot, which illustrates the number of images per class. And yes, I admit, I chose some custom colors to make it look fancy.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set the directory for train and validation images
train_path = 'data/images/cats-and-dogs/train/'
#test_path = 'data/cats-and-dogs/test/'

# function to create a list of image labels 
def createImageDf(path):
    filenames = os.listdir(path)
    categories = []

    for fname in filenames:
        category = fname.split('.')[0]
        if category == 'dog':
            categories.append(1)
        else:
            categories.append(0)
    df = pd.DataFrame({
        'filename':filenames,
        'category':categories
    })
    return df

# create the image_df DataFrame and look at its first rows
image_df = createImageDf(train_path)
image_df.head(5)

sns.countplot(y='category', data=image_df, palette=['#2FE5C7',&quot;#2F8AE5&quot;], orient=&quot;h&quot;)</pre></div>






<figure class="wp-block-image size-full"><img decoding="async" width="376" height="262" data-attachment-id="11572" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-9-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-9.png" data-orig-size="376,262" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-9" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-9.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-9.png" alt="" class="wp-image-11572" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-9.png 376w, https://www.relataly.com/wp-content/uploads/2022/12/image-9.png 300w" sizes="(max-width: 376px) 100vw, 376px" /></figure>



<p class="wp-block-paragraph">The number of images in the two classes is balanced, so we don&#8217;t need to rebalance the data. That&#8217;s nice!</p>
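<p class="wp-block-paragraph">If you prefer numbers over a plot, the class balance can also be checked by counting the filename prefixes. The following is a minimal sketch on a toy file list; the helper name <em>count_classes</em> is hypothetical, and in the tutorial you would pass the result of os.listdir(train_path) instead.</p>

<pre class="wp-block-preformatted">
```python
from collections import Counter

def count_classes(filenames):
    """Count images per class, using the prefix before the first dot as the label."""
    return Counter(name.split(".")[0] for name in filenames)

# toy file list that follows the naming scheme of the dataset
example_files = ["cat.1.jpg", "cat.2.jpg", "dog.1.jpg", "dog.2.jpg"]
counts = count_classes(example_files)
print(counts)
```
</pre>

<p class="wp-block-paragraph">Equal counts for both prefixes indicate a balanced dataset.</p>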



<h3 class="wp-block-heading" id="h-step-3-plot-sample-images">Step #3 Plot Sample Images</h3>



<p class="wp-block-paragraph">Before jumping into preprocessing, I prefer to check that the data has been loaded correctly. We will do this by plotting some random images from the train folder. This step is not strictly necessary, but it&#8217;s good practice.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">n_pictures = 16 # number of pictures to be shown
columns = int(n_pictures / 2)
rows = 2
plt.figure(figsize=(40, 12))
for i in range(n_pictures):
    num = i + 1
    ax = plt.subplot(rows, columns, i + 1)
    if i &lt; columns:
        image_name = 'cat.' + str(rdn.randint(1, 1000)) + '.jpg'
    else: 
        image_name = 'dog.' + str(rdn.randint(1, 1000)) + '.jpg'
    plt.xlabel(image_name)    
    plt.imshow(load_img(train_path + image_name)) 

#if you get a deprecated warning, you can ignore it</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="1024" height="315" data-attachment-id="7123" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/cats-and-dogs-neural-networks-classification/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png" data-orig-size="1024,315" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cats-and-dogs-neural-networks-classification" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png" src="https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png" alt="classifying cats and dogs convolutional neural networks" class="wp-image-7123" srcset="https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/cats-and-dogs-neural-networks-classification.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">I never expected to have so many pictures of cats and dogs one day, but I guess neither did you 🙂 Neural networks require a fixed input shape where each neuron corresponds to a pixel value. </p>



<p class="wp-block-paragraph">As we can see from the sample images, the images in our dataset have different sizes and aspect ratios. For the images to fit into the input shape of our neural network, we need to put the images into a standard format. But before that, we split the data into two datasets for train and test.</p>



<h3 class="wp-block-heading" id="h-step-4-split-the-data">Step #4 Split the Data</h3>



<p class="wp-block-paragraph">Image classification requires splitting the data into a train and a validation set. We define a split ratio of 1/5 so that 80% of the data goes into the training dataset and 20% goes into the validation dataset. We shuffle the data to create two DataFrames with a random mix of cat and dog pictures. In addition, we transform the classes of the images into categorical values 0-&gt;&#8220;cat&#8221; and 1-&gt;&#8220;dog&#8221;. The result is two new DataFrames: train_df (20,000 images) and validate_df (5,000 images).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">image_df[&quot;category&quot;] = image_df[&quot;category&quot;].replace({0:'cat',1:'dog'})

train_df, validate_df = train_test_split(image_df, test_size=0.20, random_state=42)
train_df = train_df.reset_index(drop=True)
total_train = train_df.shape[0]

validate_df = validate_df.reset_index(drop=True)
total_validate = validate_df.shape[0]
train_df.head()

print(len(train_df), len(validate_df))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output: 20000 5000</pre></div>
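<p class="wp-block-paragraph">To make the idea behind the split more tangible, here is a simplified sketch of a shuffled 80/20 split in plain Python. It is not the scikit-learn implementation and will not reproduce the exact partition of train_test_split; the function name <em>shuffle_split</em> and the toy file list are assumptions for illustration.</p>

<pre class="wp-block-preformatted">
```python
import random

def shuffle_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items and split it into a training and a validation list."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_validate = int(round(len(shuffled) * test_size))
    return shuffled[n_validate:], shuffled[:n_validate]

# toy list with 50 cat and 50 dog filenames
files = ["cat.%d.jpg" % i for i in range(50)] + ["dog.%d.jpg" % i for i in range(50)]
train_files, validate_files = shuffle_split(files)
print(len(train_files), len(validate_files))
```
</pre>

<p class="wp-block-paragraph">Every file ends up in exactly one of the two lists, and the fixed seed plays the same role as the random_state argument in the tutorial code.</p>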



<h3 class="wp-block-heading" id="h-step-5-preprocess-the-images">Step #5 Preprocess the Images</h3>



<p class="wp-block-paragraph">The next step is to define two data generators for these DataFrames, which use the names given in the train and validation DataFrames to feed the images from the &#8220;train&#8221; path into our neural network. The data generator has various configuration options. We will perform the following operations:</p>



<ul class="wp-block-list">
<li>Rescale the images by dividing their RGB color values (0-255) by 255</li>



<li>Shuffle the images (again)</li>



<li>Bring the images into a uniform shape of 128 x 128 pixels</li>



<li>We define a batch size of 32, so the generator delivers 32 images at a time.</li>



<li>The class mode is &#8220;binary&#8221; so our two prediction labels are encoded as&nbsp;float32&nbsp;scalars with values 0 or 1. As a result, we will only have a single end neuron in our network.</li>



<li>We perform some data augmentation techniques on the training data (incl. horizontal flip, shearing, and zoom). In this way, the model sees different variants of the images during training, which helps to prevent overfitting.</li>
</ul>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2700" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-19-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-19.png" data-orig-size="833,262" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-19" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-19.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-19.png" alt="" class="wp-image-2700" width="753" height="236" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-19.png 833w, https://www.relataly.com/wp-content/uploads/2020/12/image-19.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-19.png 768w" sizes="(max-width: 753px) 100vw, 753px" /><figcaption class="wp-element-caption">Some augmentation techniques</figcaption></figure>
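<p class="wp-block-paragraph">Two of the operations listed above, rescaling and the horizontal flip, are simple enough to sketch in plain Python. The example below works on a tiny made-up grayscale image stored as nested lists and is only meant to illustrate the idea; the Keras data generator performs these steps internally on real image arrays.</p>

<pre class="wp-block-preformatted">
```python
def rescale(image, factor=1.0 / 255):
    """Multiply every pixel value by factor, mapping 0-255 to the range 0-1."""
    return [[value * factor for value in row] for row in image]

def horizontal_flip(image):
    """Mirror the image by reversing the pixel order within every row."""
    return [row[::-1] for row in image]

# toy grayscale image with 2 rows and 3 columns
image = [[0, 51, 255],
         [102, 204, 153]]

print(rescale(image))
print(horizontal_flip(image))
```
</pre>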



<p class="wp-block-paragraph">It is essential to mention that the input shape of the first layer of the neural network must correspond to the image shape of 128 x 128. The reason is that each pixel becomes an input to a neuron.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set the dimensions to which we will convert the images
img_width, img_height = 128, 128
target_size = (img_width, img_height)
batch_size = 32
rescale=1.0/255

# configure the train data generator
print('Train data:')
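# the augmentation options belong to the ImageDataGenerator itself;
# passed to flow_from_dataframe, they would not be applied
train_datagen = ImageDataGenerator(
    rescale=rescale,
    shear_range=0.2, # random shearing
    zoom_range=0.2, # random zoom
    horizontal_flip=True) # random horizontal flip
train_generator = train_datagen.flow_from_dataframe(
    train_df, 
    train_path,
    shuffle=True, # shuffle the image data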
    x_col='filename', y_col='category',
    classes=['dog', 'cat'],
    target_size=target_size,
    batch_size=batch_size,
    color_mode=&quot;rgb&quot;,
    class_mode='binary')

# configure test data generator
# only rescaling
print('Test data:')
validation_datagen = ImageDataGenerator(rescale=rescale)
validation_generator = validation_datagen.flow_from_dataframe(
    validate_df, 
    train_path,    
    shuffle=True,
    x_col='filename', y_col='category',
    classes=['dog', 'cat'],
    target_size=target_size,
    batch_size=batch_size,
    color_mode=&quot;rgb&quot;,
    class_mode='binary')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Train data:
Found 20000 validated image filenames belonging to 2 classes.
Test data:
Found 5000 validated image filenames belonging to 2 classes.</pre></div>
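<p class="wp-block-paragraph">The batch size also determines how many batches make up one pass over the data. As a small worked example, assuming the 20,000 training and 5,000 validation images reported above and a batch size of 32 (the helper name <em>batches_per_epoch</em> is made up):</p>

<pre class="wp-block-preformatted">
```python
import math

def batches_per_epoch(n_images, batch_size):
    """Number of batches needed to pass over all images once."""
    return math.ceil(n_images / batch_size)

train_steps = batches_per_epoch(20000, 32)
validation_steps = batches_per_epoch(5000, 32)
print(train_steps, validation_steps)  # prints 625 157
```
</pre>

<p class="wp-block-paragraph">The last validation batch is smaller than the others because 5,000 is not a multiple of 32.</p>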



<p class="wp-block-paragraph">At this point, we have already completed the data preprocessing part. The next step is to define and compile the convolutional neural network.</p>



<h3 class="wp-block-heading" id="h-step-6-define-and-compile-the-convolutional-neural-network">Step #6 Define and Compile the Convolutional Neural Network</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The architecture of our image classification CNN is inspired by the famous VGGNet. In this section, we will define and compile our CNN model. We do this by defining multiple layers and stacking them on top of each other. However, to lower the amount of time needed to train the network, I reduced the number of layers.</p>



<p class="wp-block-paragraph">The first layer of our network is the input layer, which receives the preprocessed images. As already noted, the shape of the input layer needs to match the shape of our images. Considering how we have defined the format of the images in our data generators, the input shape is defined as 128 x 128 x 3. </p>



<p class="wp-block-paragraph">The subsequent layers are four convolutional layers. Each of these layers is followed by a pooling layer. In addition, we define a dropout rate of 20% for each convolutional block. </p>



<p class="wp-block-paragraph">Finally, a fully connected layer with 128 neurons and an output layer with a single sigmoid neuron for the binary prediction complete the structure of the CNN.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2608" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-8-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-8.png" data-orig-size="539,493" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-8.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-8.png" alt="3-dimensional Input Shape of our Neural Network " class="wp-image-2608" width="423" height="386" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-8.png 539w, https://www.relataly.com/wp-content/uploads/2020/12/image-8.png 300w" sizes="(max-width: 423px) 100vw, 423px" /><figcaption class="wp-element-caption">3-dimensional Input Shape of our Neural Network </figcaption></figure>



</div>
</div>
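<p class="wp-block-paragraph">To get a feeling for the size of such a network, the number of trainable parameters can be estimated by hand. The sketch below assumes the 128 x 128 x 3 input described above, 3 x 3 kernels with &#8220;same&#8221; padding, 2 x 2 max pooling after each convolution, and the filter counts 32, 64, 64, and 128 of the model in this tutorial. The function names are made up for illustration.</p>

<pre class="wp-block-preformatted">
```python
def conv_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter of a 2D convolution."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit of a fully connected layer."""
    return (n_inputs + 1) * n_units

def count_parameters(size=128, channels=3, conv_filters=(32, 64, 64, 128), dense_units=128):
    """Walk through the layer stack and sum up the trainable parameters."""
    total = 0
    for filters in conv_filters:
        total += conv_params(3, channels, filters)  # 3 x 3 kernel, padding 'same'
        channels = filters
        size = size // 2                            # 2 x 2 max pooling halves width and height
    flattened = size * size * channels
    total += dense_params(flattened, dense_units)   # hidden dense layer
    total += dense_params(dense_units, 1)           # single sigmoid output neuron
    return size, flattened, total

print(count_parameters())
```
</pre>

<p class="wp-block-paragraph">The function returns the final feature map size, the length of the flattened vector, and the total parameter count. Dropout, pooling, and flatten layers do not add parameters, and most of the weights sit in the first dense layer.</p>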



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-kadence-infobox kt-info-box_4a47ba-1f"><span class="kt-blocks-info-box-link-wrap info-box-link kt-blocks-info-box-media-align-top kt-info-halign-left"><div class="kt-blocks-info-box-media-container"><div class="kt-blocks-info-box-media kt-info-media-animate-drawborder"><div class="kadence-info-box-icon-container kt-info-icon-animate-drawborder"><div class="kadence-info-box-icon-inner-container"><span class="kb-svg-icon-wrap kb-svg-icon-fe_cpu kt-info-svg-icon"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="1" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><rect x="4" y="4" width="16" height="16" rx="2" ry="2"/><rect x="9" y="9" width="6" height="6"/><line x1="9" y1="1" x2="9" y2="4"/><line x1="15" y1="1" x2="15" y2="4"/><line x1="9" y1="20" x2="9" y2="23"/><line x1="15" y1="20" x2="15" y2="23"/><line x1="20" y1="9" x2="23" y2="9"/><line x1="20" y1="14" x2="23" y2="14"/><line x1="1" y1="9" x2="4" y2="9"/><line x1="1" y1="14" x2="4" y2="14"/></svg></span></div></div></div></div><div class="kt-infobox-textcontent"><h4 class="kt-blocks-info-box-title">Additional Info</h4><p class="kt-blocks-info-box-text"><strong><em>Loss function</em>:</strong> measures how far the model&#8217;s predictions deviate from the true labels during training. We try to minimize this function to &#8220;steer&#8221; the model in the right direction. We use binary_crossentropy.<br/><strong><em>Optimizer</em>:</strong> defines how the model weights are updated based on the data it sees and its loss function.<br/><strong><em>Metrics</em></strong> are<strong> </strong>used to monitor the steps during training and testing. The following example uses <em>accuracy</em>, which is the fraction of the correctly classified images.</p></div></span></div>
</div>
</div>
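<p class="wp-block-paragraph">To illustrate the difference between the loss and the metric, here is a minimal plain-Python sketch of binary cross-entropy and accuracy on a handful of made-up predictions. It follows the textbook formulas and is not the Keras implementation; the clipping constant <em>eps</em> and the threshold of 0.5 are assumptions.</p>

<pre class="wp-block-preformatted">
```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if (p > threshold) == (t == 1))
    return hits / len(y_true)

labels = [1, 0, 1, 0]                 # made-up true classes
probabilities = [0.9, 0.2, 0.4, 0.1]  # made-up sigmoid outputs

print(binary_crossentropy(labels, probabilities))
print(accuracy(labels, probabilities))
```
</pre>

<p class="wp-block-paragraph">The third prediction is on the wrong side of the threshold, so it lowers the accuracy and contributes the largest share of the loss. Unlike accuracy, the loss also rewards confident correct predictions, which is why it is the quantity being minimized.</p>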



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># define the input format of the model
input_shape = (img_width, img_height, 3)
print(input_shape)

# define  model
model = Sequential()
model.add(Conv2D(32, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=input_shape))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Conv2D(128, (3, 3),  strides=(1, 1),activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.20))
model.add(Flatten())
model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# compile the model and print its architecture
opt = SGD(learning_rate=0.001, momentum=0.9)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.summary()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">input_shape: (100, 100, 3)
Model: &quot;sequential&quot;
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 100, 100, 32)      896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 50, 50, 32)        0         
_________________________________________________________________
dropout (Dropout)            (None, 50, 50, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 50, 50, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 25, 25, 64)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 25, 25, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 25, 25, 64)        36928     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 64)        0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 12, 12, 64)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 12, 12, 128)       73856     
_________________________________________________________________
...
Trainable params: 720,257
Non-trainable params: 0
_________________________________________________________________
None</pre></div>
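<p class="wp-block-paragraph">The parameter counts and output shapes in the summary can be checked by hand. The following sketch is plain Python and does not call Keras; the helper functions are hypothetical names introduced for this illustration. It assumes 3 x 3 kernels with one bias per filter, &#8220;same&#8221; padding (so only the pooling layers change the spatial size), and the two dense layers from the model definition.</p>

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter of a 2D convolution layer."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit of a fully connected layer."""
    return (inputs + 1) * units

size, channels, total = 100, 3, 0   # 100 x 100 RGB input
for filters in (32, 64, 64, 128):
    params = conv2d_params(3, 3, channels, filters)
    total += params
    size //= 2                      # each 2 x 2 max pooling halves the size, rounding down
    channels = filters
    print(f"Conv2D({filters}): {params} params, pooled output {size} x {size} x {channels}")

flat = size * size * channels       # Flatten: 6 * 6 * 128 values
total += dense_params(flat, 128) + dense_params(128, 1)
print("total trainable params:", total)
```

<p class="wp-block-paragraph">The total matches the 720,257 trainable parameters reported in the summary. Most of them sit in the first dense layer, because it connects all 4,608 flattened values to 128 units. The dropout and pooling layers have no trainable parameters.</p>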



<p class="wp-block-paragraph">At this point, we have defined and assembled our convolutional neural network. Next, it is time to train the model.</p>



<h3 class="wp-block-heading" id="h-step-7-train-the-model">Step #7 Train the Model</h3>



<p class="wp-block-paragraph">Before we train the image classifier, we still have to choose the number of epochs. More epochs can improve model performance, but they also lengthen training and increase the risk that the model overfits. Finding the optimal number of epochs is difficult and often requires trial and error. I typically start with a small number, such as five epochs, and then increase it until additional epochs no longer lead to significant improvements.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># train the model
epochs = 40
early_stop = EarlyStopping(monitor='loss', patience=6, verbose=1)

history = model.fit(
    train_generator,
    epochs=epochs,
    callbacks=[early_stop],
    steps_per_epoch=len(train_generator),
    verbose=1,
    validation_data=validation_generator,
    validation_steps=len(validation_generator))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Epoch 1/35
625/625 [==============================] - 121s 194ms/step - loss: 0.7050 - accuracy: 0.5282 - val_loss: 0.6902 - val_accuracy: 0.5824
Epoch 2/35
625/625 [==============================] - 115s 183ms/step - loss: 0.6853 - accuracy: 0.5469 - val_loss: 0.6856 - val_accuracy: 0.5806
Epoch 3/35
625/625 [==============================] - 115s 184ms/step - loss: 0.6744 - accuracy: 0.5752 - val_loss: 0.6746 - val_accuracy: 0.5806
Epoch 4/35
625/625 [==============================] - 112s 180ms/step - loss: 0.6569 - accuracy: 0.5987 - val_loss: 0.6593 - val_accuracy: 0.6110
Epoch 5/35
625/625 [==============================] - 115s 185ms/step - loss: 0.6423 - accuracy: 0.6194 - val_loss: 0.6474 - val_accuracy: 0.6134
Epoch 6/35
625/625 [==============================] - 116s 185ms/step - loss: 0.6309 - accuracy: 0.6370 - val_loss: 0.6386 - val_accuracy: 0.6260
Epoch 7/35
625/625 [==============================] - 115s 183ms/step - loss: 0.6139 - accuracy: 0.6539 - val_loss: 0.6082 - val_accuracy: 0.6682</pre></div>
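<p class="wp-block-paragraph">The <code>patience</code> argument of the early stopping callback is easier to reason about with a small example. The function below is a simplified re-implementation in plain Python, written for illustration only: it ignores options of the real Keras callback such as <code>min_delta</code> and <code>restore_best_weights</code>, and the function name is hypothetical. The training code above monitors the training loss; the sketch works on any sequence of loss values, including the validation loss, which is a common choice when the goal is to detect overfitting.</p>

```python
def early_stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the monitored loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss >= best:
            wait += 1              # no improvement in this epoch
            if wait >= patience:
                return epoch
        else:
            best, wait = loss, 0   # improvement: remember it and reset the counter
    return None

# validation losses of the first seven epochs from the training log
val_losses = [0.6902, 0.6856, 0.6746, 0.6593, 0.6474, 0.6386, 0.6082]
print(early_stopping_epoch(val_losses, patience=6))   # None: the loss improves in every epoch

# hypothetical loss curve that stalls after the third epoch
stalled = [0.70, 0.68, 0.67, 0.69, 0.69, 0.70]
print(early_stopping_epoch(stalled, patience=3))      # 6
```

<p class="wp-block-paragraph">A larger patience tolerates longer plateaus before stopping, at the cost of a few extra epochs once the model has stopped improving.</p>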



<p class="wp-block-paragraph">A quick comment on the time required to train the model: although the model is not overly complex and the dataset is still moderate in size, training can take some time. I made two training runs &#8211; the first on my GPU (Nvidia GeForce RTX 3080) and the second on my CPU (AMD Ryzen 7 3700X). On the GPU, training took approximately 10 minutes. On the CPU, it took about 30 minutes, roughly three times as long.</p>



<p class="wp-block-paragraph">After training, you may want to save the classification model and load it at a later time. You can do this with the code below, which stores only the model weights. Before loading the weights, we therefore need to define the model with exactly the same architecture as during training.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Save the weights
model.save_weights('cats-and-dogs-weights-v1.h5')

# Define model as during training
# model architecture

# Load the weights
model.load_weights('cats-and-dogs-weights-v1.h5')</pre></div>



<h3 class="wp-block-heading" id="h-step-8-visualize-model-performance">Step #8 Visualize Model Performance</h3>



<p class="wp-block-paragraph">After training the model, we want to check the performance of our image classification model. For this purpose, we can apply the same performance measures as in traditional classification projects. The code below plots the loss and accuracy of our image classifier on the training and validation datasets over the training epochs.</p>



<p class="wp-block-paragraph">To learn more about measuring model performance, check out my <a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">previous post on Measuring Model Performance</a>. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def plot_loss(history, value1, value2, title, ylabel=&quot;Loss&quot;):
    # plot a training metric (blue) against its validation counterpart (red)
    fig, ax = plt.subplots(figsize=(15, 5))
    plt.plot(history.history[value1], 'b')
    plt.plot(history.history[value2], 'r')
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel(&quot;Epoch&quot;)
    ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
    plt.legend([&quot;Train&quot;, &quot;Validation&quot;], loc=&quot;upper left&quot;)
    plt.grid()
    plt.show()

# plot training &amp; validation loss values
plot_loss(history, &quot;loss&quot;, &quot;val_loss&quot;, &quot;Model loss&quot;)
# plot training &amp; validation accuracy values
plot_loss(history, &quot;accuracy&quot;, &quot;val_accuracy&quot;, &quot;Model accuracy&quot;, ylabel=&quot;Accuracy&quot;)
</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2725" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-25-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-25.png" data-orig-size="894,333" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-25" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-25.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-25.png" alt="" class="wp-image-2725" width="865" height="322" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-25.png 894w, https://www.relataly.com/wp-content/uploads/2020/12/image-25.png 300w, https://www.relataly.com/wp-content/uploads/2020/12/image-25.png 768w" sizes="(max-width: 865px) 100vw, 865px" /><figcaption class="wp-element-caption"><img decoding="async" 
src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA34AAAFNCAYAAABfWL0+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAABxYUlEQVR4nO3deZyN5f/H8ddl7Ft2kbWItFhTiGiRSlQIbbRY2jctolX7vpNKRUmifFu0R+uvkKyhJNmyC4Nhluv3x+cMY8yMWc5yz8z7+Xicx1nu5bzPPefMnM9c131dznuPiIiIiIiIFFxFYh1AREREREREIkuFn4iIiIiISAGnwk9ERERERKSAU+EnIiIiIiJSwKnwExERERERKeBU+ImIiIiIiBRwKvxERKRQcM7Vc85551zRbKzb3zn3QzRyiYiIRIMKPxERCRzn3HLn3B7nXJV0j88JFW/1YhRNREQkX1LhJyIiQfU30Df1jnPuWKBU7OIEQ3ZaLEVERNJT4SciIkE1Drg0zf1+wNi0KzjnDnHOjXXObXDO/eOcG+6cKxJaFuece8I5t9E5tww4O4NtX3PO/eucW+2ce8A5F5edYM6595xza51zW51z3znnjk6zrJRz7slQnq3OuR+cc6VCy05yzv3knPvPObfSOdc/9Ph059yVafaxX1fTUCvnNc65P4E/Q489G9rHNufcr8659mnWj3PO3emc+8s5tz20vLZz7kXn3JPpXstHzrkbs/O6RUQk/1LhJyIiQfUzUN45d1SoIOsNvJVuneeBQ4DDgZOxQvGy0LIBQFegOdAK6Jlu2zeBJKBBaJ3OwJVkz6dAQ6AaMBt4O82yJ4CWQFugEnAbkOKcqxPa7nmgKtAMmJPN5wM4FzgBaBK6PzO0j0rAeOA951zJ0LKbsdbSs4DywOXATuw1901THFcBTgXeyUEOERHJh1T4iYhIkKW2+p0OLAZWpy5IUwwO9d5v994vB54ELgmtcgHwjPd+pfd+M/Bwmm2rA2cCN3rvd3jv1wNPA32yE8p7Pyb0nLuBe4GmoRbEIliRdYP3frX3Ptl7/1NovYuAr7z373jvE733m7z3c3JwLB723m/23u8KZXgrtI8k7/2TQAmgUWjdK4Hh3vsl3swNrTsD2IoVe4Re73Tv/boc5BARkXxI5wmIiEiQjQO+A+qTrpsnUAUoDvyT5rF/gMNCt2sCK9MtS1UXKAb865xLfaxIuvUzFCo4HwR6YS13KWnylABKAn9lsGntTB7Prv2yOeduwQq8moDHWvZSB8PJ6rneBC4GvgxdP5uHTCIikk+oxU9ERALLe/8PNsjLWcD76RZvBBKxIi5VHfa1Cv6LFUBpl6VaCewGqnjvK4Qu5b33R3NwFwLdgdOwbqb1Qo+7UKYE4IgMtluZyeMAO4DSae4fmsE6PvVG6Hy+27FWzYre+wpYS15qFZvVc70FdHfONQWOAqZksp6IiBQgKvxERCTorgBO8d7vSPug9z4ZmAg86Jwr55yri53blnoe4ETgeudcLedcReCONNv+C3wBPOmcK++cK+KcO8I5d3I28pTDisZNWLH2UJr9pgBjgKecczVDg6y0cc6VwM4DPM05d4FzrqhzrrJzrllo0znA+c650s65BqHXfLAMScAGoKhz7m6sxS/Vq8AI51xDZ45zzlUOZVyFnR84Dpic2nVUREQKNhV+IiISaN77v7z3szJZfB3WWrYM+AEb5GRMaNkrwOfAXGwAlvQthpdiXUV/B7YAk4Aa2Yg0Fus2ujq07c/plg8B5mPF1WbgUaCI934F1nJ5S+jxOUDT0DZPA3uAdVhXzLfJ2ufYQDF/hLIksH9X0KewwvcLYBvwGvtPhfEmcCxW/ImISCHgvPcHX0tEREQKDOdcB6xltF6olVJERAo4tfiJiIgUIs65YsANwKsq+kRECg8VfiIiIoWEc+4o4D+sS+szMQ0jIiJRpa6eIiIiIiIiB
Zxa/ERERERERAo4FX4iIiIiIiIFXNFYBwinKlWq+Hr16u29v2PHDsqUKRO7QAHKEYQMQckRhAxByRGEDMoRvAxByRGEDEHJEYQMyhG8DEHJEYQMQckRhAzKEbwM0c7x66+/bvTeVz1ggfe+wFxatmzp05o2bZoPgiDkCEIG74ORIwgZvA9GjiBk8F45gpbB+2DkCEIG74ORIwgZvFeOoGXwPhg5gpDB+2DkCEIG75UjaBm8j24OYJbPoFZSV08REREREZECToWfiIiIiIhIAafCT0REREREpIArUIO7ZCQxMZFVq1aRkJAQswyHHHIIixYtitnzhzNDyZIlqVWrFsWKFQtDKhERERERiYYCX/itWrWKcuXKUa9ePZxzMcmwfft2ypUrF5PnDmcG7z2bNm1i1apV1K9fP0zJREREREQk0gp8V8+EhAQqV64cs6KvIHHOUbly5Zi2noqIiIiISM4V+MIPUNEXRjqWIiIiIiL5T6Eo/GJl06ZNNGvWjHbt2nHooYdy2GGH0axZM5o1a8aePXuy3HbWrFlcf/31UUoqIiIiIiIFWYE/xy+WKleuzJw5c9i+fTtPPvkkZcuWZciQIXuXJyUlUbRoxj+CVq1a0apVq2hFFRERERGRAiyiLX7OuS7OuSXOuaXOuTsyWH6Ic+4j59xc59xC59xlaZYtd87Nd87Ncc7NimTOaOrfvz8333wznTp14vbbb2fGjBm0bduW5s2b07ZtW5YsWQLA9OnT6dq1KwD33nsvl19+OR07duTwww/nueeei+VLEBEREREpVBIS4J9/4Jdf4MMPYfRoWL061qlyJmItfs65OOBF4HRgFTDTOfeh9/73NKtdA/zuvT/HOVcVWOKce9t7n9oPspP3fmOkMsbKH3/8wVdffUVcXBzbtm3ju+++o2jRonz11VfceeedTJ48+YBtFi9ezLRp09i+fTuNGjXiqquu0pQKIiIiIiK5lJgIGzbA2rV2Wbdu3+3097duPXD7jz+Gww6Lfu7cimRXz9bAUu/9MgDn3ASgO5C28PNAOWcjhpQFNgNJkQp0440wZ05499msGTzzTM626dWrF3FxcQBs3bqVfv368eeff+KcIzExMcNtzj77bEqUKEGJEiWoVq0a69ato1atWnkLLyIiIiJSwCQlwapVsHw5fPFFdWbO3L+IS729MZPmpfLl4dBDoXp1OO446Nx53/1DD913u3r1qL6sPItk4XcYsDLN/VXACenWeQH4EFgDlAN6e+9TQss88IVzzgMve+9HRzBrVJUpU2bv7bvuuotOnTrxwQcfsHz5cjp27JjhNiVKlNh7Oy4ujqSkiNXHIiIiIiL78x5WroQ6dWKdhJQUK9z+/tsuy5fvf3vFCkhOTl37KABKldpXtDVoACeddGAxl3q/VKkYvbAIi2Thl9G4/z7d/TOAOcApwBHAl865773324B23vs1zrlqoccXe++/O+BJnBsIDASoXr0606dP37ssPj6eQw45hO3btwMwYkSeX1OGQrvPVHJyMrt376ZYsWIkJiaya9euvZk2bdpEpUqV2L59Oy+//DLee7Zv387OnTtJSkpi+/bte7dN3SYlJYX4+Pi997MjOTk5R+tnJSEhYb/jnBPx8fG53jZcgpAhKDmCkEE5gpchKDmCkCEoOYKQQTmClyEoOYKQISg5gpAhIjm858gnn6TmJ5/wz8UX8/dll0GRgw8Vktsc3sPWrcVYu7Yk//5bkrVrS4Zul9p7OzFx/+evVGk3NWokUL9+Am3bJlCjRgKHHppA2bKbqVWrKKVLJ5PVrGQJCVY0Ll+e47jZEoT3RiQLv1VA7TT3a2Ete2ldBjzivffAUufc30BjYIb3fg2A9369c+4DrOvoAYVfqCVwNECrVq182haz6dOnU7JkScqVKxe2F5Ub27dv39tNs1ixYpQqVWpvpjvvvJN+/foxcuRITjnlFJxzlCtXjtKlS1O0aFHKlSu3d9vUbYoUKULZsmVz9Lq2b
98etuNQsmRJmjdvnqttp0+fnmmrZrQEIUNQcgQhg3IEL0NQcgQhQ1ByBCGDcgQvQ1ByBCFDUHIEIUNEctx5J3zyCTRvTt233qJuQgK8+SaULp3rHNu372ulW7bswNa7HTv2X79yZahfH044wa7r14d69ey6bl0oVaoEUAI4JIMM7XP7ysMmCO+NSBZ+M4GGzrn6wGqgD3BhunVWAKcC3zvnqgONgGXOuTJAEe/99tDtzsD9Ecwacffee2+Gj7dp04Y//vhj7/0RoWbJjh077n1zpN92wYIFkYgoIiIiIrK/J5+Ehx+GQYNg5Eh4+mkYMsQqtP/9D2rWzHCzxERYs6YkX321f3GXep3+/Lpy5ayIO+IIOO20fUVdaoEX43acAiFihZ/3Psk5dy3wORAHjPHeL3TODQ4tHwWMAN5wzs3Huobe7r3f6Jw7HPjAxnyhKDDee/9ZpLKKiIiIiEg6b7xhRV6vXvDii+Ac3HwzNGyI79uXlFatWfjwRywo1vyAlrsVKyAl5cS9uypadF8x16OHXR9++L7irlIlsuyKKXkX0QncvfdTganpHhuV5vYarDUv/XbLgKaRzCYiIiIiIvskJMCWLfDff8CH/6PRnVey9pjTmdJmHJseimPDBmvoW7bsHMom/8jEf8/hiP4ncTdv8z/O5dBDrYhr1w4uvhj27FnMmWc25vDDbdqD0KD2EiMRLfxERERERCQ6vIf4eFi/vgTz5lkBt2XL/pf0j6W9n5Bg++nAt3xOb2bSklMXvM+Om210+XLlrNWuQQOof3pTvqwyg/PHdueDP88n8b6HKT78tv2a7aZPX0vHjo2jfRgkEyr8RERERERiLDERtm2zicK3bt3/dvr7md3evt2mOoA2GT6Hc3DIIVChAlSsaJejjtr/fsP43zj3mXPYXeVwir46ldn1ylKhgq1TvHj6PR4KN0+Hyy6j+N13wLLF8PLLGa0oAaDCT0REREQiY88emD8fZs2CmTNhzRq4/35o1SrWyaJm1y473y11xMrUyz//wObN+wq3XbsOvq/ixa1wS72UL2+DoaTeTn187dolnHhio73FXMWKVriVL3+Q7pZ//AEnnQFVK1Lsxy9oWavywUOVKgXvvAONG8N999mJfpMnQ5Uq2To+Ej0q/EREREQk75KS4PffrchLLfTmzbPiD2w8/iJF4OST4d13oWvX2OYNk9279xV2n39egy++2DclwfLlNtF4WsWK2fQDdevaXOjpi7a0t9PfL1Eie5mmT/+Xjh0b5eyFrF4NnUNDb3z5JdSqlf1tnYN774Ujj4TLL7c5Fz7+OGfPLxGnwi/COnbsyA033MB5552397FnnnmGP/74g5deeinD9Z944glatWrFWWedxfjx46lQocJ+69x7772ULVuWIUOGZPq8U6ZM4cgjj6RJkyYAPPDAA5x++umcdtpp4XlhIiIiUnilpFjr0MyZNJgyBYYNg99+29dsVb68terdeKNdH3+8VTrr1lnB1707vPACXHVVLF9FtuzZAytX7l/Mpb29Zr9ZqhtRtKgVdPXqwVln7RvJsl49u9SoEcBBTjZvtqJv82aYNs0KuNy48EJ7seeeC23aUHH4cIjl3HWLFnHI3LnQoUO2Jpwv6FT4RVjfvn2ZPHnyfoXfhAkTePzxxw+67dSpUw+6TmamTJlC165d9xZ+w4cPj/lE9iIiIpIPeW/d91Jb8WbNgtmz7YQyoEbJklbcDR5s161a2egfGX3RPvRQ+PZb6NMHrr7aKqeHH475l/KdO+Gvv+yydOn+l5UrU8+bM3FxULu2FXGdO+9f2P377//Ro0cbiuanb9jx8Vah/vUXfPYZtGyZt/21aQMzZkDXrhx3++1Qtqy9N6IlIQEmTbJzDX/4geZg/2S4+mq47DLr81pI5ae3Zb7Us2dPhg0bxu7duylRogTLly9nzZo1jB8/nptuuoldu3bRs2dP7rvvvgO2rVevHrNmzaJKlSo8+OCDjB07ltq1a1O1a
lVahj6Ur7zyCqNHj2bPnj00aNCAcePGMWfOHD788EO+/fZbHnjgASZPnszdd9/NeeedR8+ePfn6668ZMmQISUlJHH/88YwcOZISJUpQr149+vXrx0cffURiYiLvvfcejRtrJCYREZFCZdMmK85Su2zOmmVDPoL1NWzWDC69dG9L3vdr19Lx1FMz3d2//8JPP8GPP9rpfuXLl+HQmh/Qr9X1tH7sMVb8sILl975B1VolqF7dvpdHog7cti3jwm7p0vStdtYrtUEDOOmk/eeaq1fPpiUoVizj55g+fXf+Kvr27LFJ9WbOtPPywtU6V7cu/PQTmzt3pvJVV8GiRTYRfCQPzuLFMHo0vPmmtVw2bAiPP87vW7bQ5Ntvbf7B4cNtnolrroHjjotcloDKT2/NfKly5cq0bNmSzz77jO7duzNhwgR69+7N0KFDqVSpEsnJyZx66qnMmzeP4zJ5A/76669MmDCB3377jaSkJFq0aLG38Dv//PMZMGAAYK16r732Gtdddx3dunWja9eu9OzZc799JSQk0L9/f77++muOPPJILr30UkaOHMmNN94IQJUqVZg9ezYvvfQSTzzxBK+++mrkDo6IiIgEw5498Omn9qX5449tiMmiReHYY23y7tSWvGOOObDq2bBh783kZFiwwIq8n36yy99/27KSJW3zNWvgu/VFGbnxRW6hPo//dBvLO6+mHVPYQiWKFoWqVaFatexdSpfeF2Xz5gOLutRib/36/WMfeqgVd6efbteplyOOsMFQCrzkZCvgv/gCxoyx7pnhVK4c8x94gI6ffAJPPw1//gkTJlg34HDZvdsK1pdfhu++s/fmeefBoEHQqRM4x/rp02ny4IPWFfnFF2HsWCsQO3SwAvC88zKv5AuYwlX43XgjzJkT3n02awbPPJPlKj179mTChAl7C78xY8YwceJERo8eTVJSEv/++y+///57poXf999/z3nnnUfp0G+2bt267V22YMEChg8fzn///Ud8fDxnnHFGllmWLFlC/fr1OTLUd7tfv368+OKLewu/888/H4CWLVvy/vvvZ+MAiIiISL7kvXXZfPNNG5Vx40arpK691oq95s2tWsvCtm0wc2ZFpk+3Yu/nn63nIFhh1a4dXHcdtG1ru0s7yn9SkmPTpltZMbYOJw27lOVV2jL5yk9Zmlyf9evZe/nrL7tO3W96ZcpYobhpU7vU3qd71aplxVy3bvsXd4cfbnPSFVre28/53Xfh8cetC2QkxMXBU09Bo0ZWZLVtCx99ZM2nefHHH1a8vfGGtVAffjg88oi9jmrVMt6meXN49VV47DF4/XUrAnv3hpo1rVAcONDetAVY4Sr8YqRr164MGzaM2bNns2vXLipWrMgTTzzBzJkzqVixIv379ychdcbMTLg0k2Gm1b9/f6ZMmULTpk154403mD59epb78d5nubxEaLiouLg4kpKSslxXRERE8qE1a+Ctt6zlY+FCq8a6d4d+/eCMMzLtjue9td6ldtv86Sfruul9U4oUscbBSy+17/bt2llvv0y+vgD2NNWrA7f2hhNrUr57dy57+URrcTz++APW37nTGhfXrWO/wjD1sn37ejp0OGy/4q5UqTAds4Lmnntg1Ci4/XbIYrDAsBk0yJpSe/WyET+nTLE3Sk7s2QMffGCte9Om2Ruoe3fb96mnZr9/cKVKcMst1iD02Wd2/t8998ADD0DPnlYQt2mT9Zs3nypchd9BWuYipWzZsnTs2JHLL7+cvn37sm3bNsqUKcMhhxzCunXr+PTTT+mYRZ/qDh060L9/f+644w6SkpL46KOPGDRoEADbt2+nRo0aJCYm8vbbb3PYYYcBUK5cOban/7cX0LhxY5YvX87SpUv3nhN48sknR+R1i4iISEDs3An/+5+17n35pY1W0qaNffm/4IIM+zbu3m2949J220ydmqBcOdv8/POhbNm5DBjQNG89+Nq3tyc46yw7z+ydd6yZLo3SpfdNg5CR6dP/pGPHw
/IQopB49lkYMQKuuMIG1omW006zJuGuXa0b5pgxcNFFB99u6VJ45RVrpduwwU60fOgha93LSwtdXBycfbZd/vwTRo60TO+8Y62D11wDffvu35c4nytchV8M9e3bl/PPP58JEybQuHFjmjdvztFHH83hhx9Ou3btsty2RYsW9O7dm2bNmlG3bl3at2+/d9mIESM44YQTqFu3Lscee+zeYq9Pnz4MGDCA5557jkmTJu1dv2TJkrz++uv06tVr7+Aug6M50pKIiIhEh/fwww+kvP4mbtJ7uO3b2FOjDusvHco/HS5l3SFHsn07bH/bBuhMe1m61MZ02b3bdnX44fa9vV07a6g5+uh9UxJMn74lPKdtNW4M//d/cM45dt7Vc8/Zl28Jn7fespau88+3oj/arVqNGlnx16OHDbKyeLFN+p6+tW7PHvtHxejR8NVX9mbr1s1a904/Pfyj/zRsaF1SR4yAt9+2VsArr4Rbb7UC+aqr7EOQz6nwi5Lzzjtvv26Wb7zxRobrpe2quXz58r23hw0bxrBhww5Y/6qrruKqDObAadeuHb///vve+6NGjdo7ncOpp57Kb7/9dsA2aZ+vVatWB+02KiIiItG1e7c1TixebKc5zZlzBG+nK9zKb1xG57Vj6b51LHVT/mYnZZhET96kH9/+ezL+jSLwxoH7jouzlrzy5W3kymuvtSKvbdsonvpUvbp14+vb1wL884+du6U52PLuk0+gf3845RQrbmI1/GjlyjagzFVXWffKxYutJbp0aZs2JLV1b906mxBxxAibFL5mzchnK1PGzvUbMAC+/94KwKefthFJzzrL3pOdO+fb96MKPxEREZGA2bjRvg+nv/z99/5zypUsWZMKFaBG6a2cl/Qe3baOpenW70nBseSwU3jnuPtY3uJ8SlYuw0XlYHA5K+4yupQsGZDTmsqUsXO5brjBBh755x8rDA4y0Ixk4fvv7fy15s3t/LpYH8vixW2glaOOgttuszd2akFYpIh1Bx00yM45jcVs987ZqJ8dOsDq1dby+PLLcOaZdgLpNddYEZ3P5gRU4SciIiISA8nJ9n03owJv06Z965UoYT3kWra0U6IaN7bLkUck89eoJzluzhwrlBIS4Mgj4bYHKXLxxRxVpw5HxezV5VFcHDz/vI3+OGSIDUgzZYoVB5Izc+da99m6dWHq1OAMZ+qc/WyPPBIuvNDOM733XutaWatWrNPtc9hh1h112DCbOuKFF+Cmm+z+pElWDOYTKvxEREREIig+HpYsObC4++MPO5UpVbVqVtD16LGvuGvc2Hq7HdDosWwZtDyD45YutS/Ml11mo3K2bh2QZrswcM5GX6xTBy65xPqcfvppgTjXKmr++stazcqVs9a0qlVjnehA3bpZYV+mTGxa97KreHHrgty3r4169NJLNrdlPlIoCj/vfabTIUjOHGw6CBERkXwnKckGEkktoHJpzx4r5ubP3//yzz/71omLs1HtGze2U4ZSi7tGjWyU+WxZvtxGRYyPZ+Hdd3P0nXdas2BB1asX1KhhQ/efGJruoXXrWKcKvn//tYFQkpLsvMk6dWKdKHPhnNQ9Gpo3t3MR85kCX/iVLFmSTZs2UblyZRV/eeS9Z9OmTZSMdb9wERGRcFm2zFqTfvrJ7q9ZY124suA9rF1bgo8/3r/AW7IEEhNtnaJFrZhr08bGiTjqKLscccT+k5jn2D//WNG3fTt8/TUbtm4t2EVfqpNOsp/RmWfum+6he/dYpwqsotu3W0vf+vVW9B2Vbzv9ShgV+MKvVq1arFq1ig0bNsQsQ0JCQsyLpXBlKFmyJLWC1O9aREQkN7yHceNslD7n7PYXX8Dw4bBjBzz4IDjHli0HtuAtWADbtrXZu6s6dWzy8q5d7frYY63oy1OBl5GVK21Exi1b4OuvrdWhMI3AnToVQNrpHq69NtapgmfnTo4dOtSGf/3kEzj++FgnkoAo8IVfsWLFqF+/fkwzTJ8+nebNmxf6DCIiI
kHgN2/BDx5Mkfcmkty2PfEjx5FYsy4rj7qQ0qtK0+jhh5ny9g6uSXyGNf/u6y1UoYIVdRdfDCVK/EGPHkdyzDFwyCFRCL16tRV9GzfaBOwtW0bhSQOoWjVrwbrwQrjuOuv2+thj+XZ4/RzbuRPWrrVunJldr1hB+S1bYOJEm3xRJKTAF34iIiKSv3kPixbZuB7ffgurVjWlbFk7dSkx0a6ze2m3Zxqvp1zKoaxlOA/y6E+3k9I0dUCJIsBInilSmhtWPE2FBjuZ9cgojmkaxzHH2OB+qWeNTJ++hnbtjozOAfj3Xyv61q2zVsnCfn5b6dI2uuJNN9n8aitWwNixsU6VeykpNozrwQq6tWth27YDty9SxOY/rFHDLi1asKB+fY7t0SP6r0UCTYWfiIiIBE58vPVm/PRTu6xYYY83bgzFizuKFYNSpexcuswuxYrtu13C7eHM/7uLDr88zpZKDRh7wf9Rvm4rHk2zfrVqcOyxjoYNnoT7y9DxgQfoOG8n3PJm7Ca7XrvWir41a+Dzz21wE7FRcp59FurVs5E/V6+m2JAhsU6VPQkJMH48vPGGnWO6bp39VyK9smXh0EOtmGvaFLp02Xc/7XWVKgeMhrmpMHUBlmxT4SciIiIx5z38/vu+Qu/77601r2xZ6602bJh9761TB6ZPn0PHjh2zv/PFi20CvNmzYeBAKj/1FFeUKZPFBg5GjLDh5YcOhV27bDCRaA+isn49nHqqVb2ffWbTGcg+zsHNN9ub4uKLObFPHxtq/6qrgjmtxZo1NgXAyy9bl90mTWzUzfTFXOrtsmVjnVgKGBV+IiIiEhOZteodcwzceKMN4NiuXR4GSfEeRo2yFqHSpW0C8JyMBHnHHbbdDTfAuefC++9bM2M0bNhgRd/ff9uk2+3bR+d586OePeHoo1l7xx0cNnkyvPkmNGtmBeCFF8a+gJoxw1onJ06E5GQbnOaGG2x01qAVp1KgFZIzYUVERCTWvIeFC+GJJ6ymqVTJ6qm334YWLawhZMUKGznzscfse3Gui771621i6Kuvhg4dbKe5Gf7/+uvh1Vetm+VZZ9k0CpG2aZM1cy5danPW5aR1s7A66ij+vOkma1UbNcrebIMGQc2acM019vOPpsREmDDB5vM44QT46CPL8eef8L//WfddFX0SZWrxExERKYz++QdGjrTWkJ497eS5CAhNN8enn1pvxbC36mVk6lSbiH3rVmtpufbavI36eMUV1tJ36aXQubO9mEjZvNmKviVLrFg45ZTIPVdBVK6cFXwDB9rUD6NGwWuvWRfLdu1g8GB7v0domq1iW7fCQw/Z861eDQ0a2Huwf//8N0m5FDgq/ERERAqTpUvh4Yf3jYKYlAR33WXnG/XsCT162JwFuWyN2LMH5s610TczOldv+HA7V6927TC+plS7dsGtt8KLL9pr+Ooruw6HCy+04q93bzjlFIrdc0949pvWli1WWP7+u7UKnX56+J+jsHDOWtvatIGnnrLun6NGwSWX2H8cLr/cisMGDcLzfPPnw7PPcuK4cfYhOO00e76zzio8U01I4EX0neic6+KcW+KcW+qcuyOD5Yc45z5yzs11zi10zl2W3W1FREQkB37/3Saga9TIRhS86iobUXDVKnj+eRvS8oEHbPTAI4+089tmzbIuc5nw3nquvfWW9Yg88URrcGnd2uqv9evtO/Y331jvxQ8+gAEDIlT0zZkDrVpZ0XfTTXZeVbiKvlTnnQcffgiLFtHshhtsiP1w2boVzjgD5s2zA9WlS/j2XdhVrmyDwCxebP8M6NjRisGGDe2Yf/BBxqNqHkxy8r5um8cdB2+/zbrOnWHBAptrsWtXFX0SKBFr8XPOxQEvAqcDq4CZzrkPvfe/p1ntGuB37/05zrmqwBLn3NtAcja2FRERkYP57Td48EGb96xMGRvo5OabbdTAVNdea5f1620AlMmT7US8Rx+FunXh/POhZ0+2bIrj44+tpkq9bNliuyhd2uqu6
6+3U5pOPBFq1YrC60tJsS/xw4bZF/zPP7dWs0jp0gU+/ZSSZ55p5w5+/bWNKpkX27bZfufMsWN/1llhiSrpFCliJ5eeeqqdC/jaazB6tL2/a9a0/0oMGGATNmZl61Z4/XX7h8myZfZGf/hhGDCAP+bPp+bRR0fn9YjkUCT/DdEaWOq9X+a93wNMANKfVe2Bcs45B5QFNgNJ2dxWREREMvPzz9bi0KKFtXLcdZed1/fYY/sXfWlVq2bd3z7/nJ3L1/PH0NdZVvoYEp99Edq1o3XPPvx9znX88MB01v+bTM+e8Mor1ki1dat173z8cesxGpWib/VqK/JuvRXOPtuCRLLoS9WxI3OffNJG3mzf3rrP5tb27Xai46xZNurjOeeEL6dkrmZN+0z8/be12h13HNx/v/2j47zz4Isv7J8Kaf35p/1no1Yta1U+9FB4910r/u64w/7xIBJgkTzH7zBgZZr7q4AT0q3zAvAhsAYoB/T23qc457KzrYiIiKTlPXz3nc1B9/XX9kX0gQdsNMEKFTLdLDkZFi2CX37Z15I3f34lkpP7A/05pvZWrqz5CSevf4Or17zGdbtfgH+rgjsP6vSAxp2gaLFovUozaZIVqbt3W/V5xRVRHSVxW5MmMG2aFZodOlhx3aRJznYSH2+te7/8YgXEuedGJKtkoWhRG/21Wzcr4EaPhjFjrOX7iCNsoJgmTWwgpKlTbf3evW06hlatYp1eJEecz6Lvfp527Fwv4Azv/ZWh+5cArb3316VZpyfQDrgZOAL4EmgKnHGwbdPsYyAwEKB69eotJ0yYsHdZfHw8ZWM9d0tAcgQhQ1ByBCFDUHIEIYNyBC9DUHIEIUNQchw0g/dUnDmTum+9RYX589lTsSIr+vTh33POITk071xyMmzeXJwNG0qyfn2J0KUkf/1VhiVLyrFrl/0vuGzZRBo33s5RR22jcePtNG68jUqVEvfmKB8XR+UZM6j67bdU+vlniu7aRWK5cmxs146NHTqwuWVLfFiH6Nxf3M6d1H36aep89RXbGjdm0bBh7IpK8+L+Un8mpZcvp+ktt+CSk5n3+OPEN2yYre2L7NrFcUOHcsj8+fw+fDgbOnXKU45YCkKGcOZwe/ZQ9fvvqfnhh1SYNw+APRUrsuacc1jTrRt7smjZK2jHoiDkCEKGaOfo1KnTr977A/4zEcnCrw1wr/f+jND9oQDe+4fTrPMJ8Ij3/vvQ/W+AO4C4g22bkVatWvlZs2btvT99+nQ6BmDumyDkCEKGoOQIQoag5AhCBuUIXoag5AhChqDkyDRDSgp89BH+gQdws2axp3ptFnW7nR8bXc7ydaVYuZK9lzVrDhy/okwZOPpoG4zlhBPsukGDzMejOCBHQoJ1iZs0yQY92brVRnc55xw4+WRrgUxMtEtS0r7bWV0Ott6KFfj163FDh8I990CxKLc0ZnQsli6188a2bbM5K044SCelnTvtGE2fbiPj9O0bnhwxEoQMEcuxcKH9fM84I1tTQBToY5FPcwQhQ7RzOOcyLPwi2dVzJtDQOVcfWA30AS5Mt84K4FTge+dcdaARsAz4LxvbioiIFArx8XHMm7eviFv1TzKH/jiZM2c/yBE75vE3h/MQrzB23aUkvmKtbSVK2KlItWtbDVa79oGXChXy2DuyZMl93eT27LHhOydNsm5y48dnvl1cnBVsqZeiRfe/n9GlZEkrKmvVYs7JJ9P8+uvzEDzMGjSwLrannmrD+H/8sR30jOzaZRPJT5tmU2rkoeiTKDj6aLuIFAARK/y890nOuWuBz7EWvDHe+4XOucGh5aOAEcAbzrn5gANu995vBMho20hlFRERCZrERGtEGzUKvvqqPQBFSaQv73AnD9GYJSwv1ZgXTxzHynZ9aFq3KBPTFHVVq0b1lDebgb1LF7uMGmVNjBkVdEWL5nmI+63Tp4cnczjVrWuTFp52mh2DKVOslSithAQbOOTrr21UyIsvjklUESmcIjqBu/d+KjA13
WOj0txeA2Q4/FZG24qIiBR0//xjY5W89hqsXWtF3GUXLuHqst9yzMePUHLN3/imTWH4e9Q77zyuiYuLdeQDFS2a9ykO8qMaNaz7ZufO1gqadsCW3buhRw+bbuK116Bfv1gmFZFCKKKFn4iIiBxccrINGPjyy3YNNjvB4MHQpc7vJJ7akZIbNthJeKOexXXtGuXmPMm2qlWtG+eZZ9q8FuPGWcHXq9e+H/Lll8c6pYgUQir8REREYiR1DulXXrFz9w491OYhv/JK6zlISgp0GEjK7t02iMppp6ngyw8qVLCfV7ducNFFNsH8rFnw0ks2BYWISAyo8BMREYmilBSb8m3UKDuHLzkZTj8dnnnGBnrcb5DKsWPhxx/569ZbaXz66bGKLLlRrhx88om19n32GTz/PFx1VaxTiUghpsJPREQkCtavt/E8Ro+2eaKrVIFbboEBA2xQyANs3gy33gpt27K2SxcaRz2x5Fnp0lbdL1sGjRrFOo2IFHIq/ERERCLEexvlf9QomDzZRurs0AEeeADOP9+mXMjUnXfCli0wcqQVgZI/FSumok9EAkGFn4iISJht3my9NF9+GRYvtlO+rr7aTu9q0iQbO/jlF2savPFGOO44GylSREQkD1T4iYiIhIH38PPP1ro3caJN2Xbiida984ILrNdftiQn27lgNWrAvfdGMrKIiBQiKvxERETSSUy0VrvNm2H+/PJs22a3N23a93jq7bTXO3ZA2bLQvz8MGgTNmuXiyV96CX77zeaAK18+zK9MREQKKxV+IiJS4HlvhdnSpfDXXzbQSkbFW+rt7dvTbt1iv33FxUGlSnapXBlq1bLemJUrWzfO3r1tQMdc+fdfGD7chvns1Su3L1dEROQAKvxERKRA8B42bLDibulS+PPPfbeXLoX//tt//SJFoGJFK9gqVbKelUcfva+gS71euXIunTo13ftY+fIRnEpvyBDrI/rii5qvT0REwkqFn4iI5BveW2td2qIu7e1t2/atW6QI1KtnUyVceCE0bGi3jzjCirzy5W2dg5k+fQutWkXsJe3zzTcwfjzcfbeFFRERCSMVfiIiEjgJCTB//iEsW3ZggRcfv2+9uDgr7ho2hLZtrbBLLfDq1YPixWP1CnJozx4b9vPww+GOO2KdRkRECiAVfiIiEgjx8TB1Krz/PnzyCcTHNwegaFGoX9+KuQ4d7Dq1wKtb16ZJy/eefBKWLLEDUKpUrNOIiEgBpMJPRERiZssW+Ogjm9z8889h926oVs26ZtatO5/evY+lbl0r/gqs5cthxAib0f3MM2OdRkRECqiC/KdUREQCaN06mDLFWva++QaSkqB2bRg82Gqfdu2sC+f06Zs44ohYp42CG26wkw2feSbWSUREpABT4SciIhG3cqUVepMnww8/2CAtDRrALbdAjx7QqlW6QSw/+YTDx42DNm2gRImY5Y64Dz+0y2OPWfUrIiISISr8REQkIv780wq999+HmTPtsWOPhXvusZa9Y47JYMaCtWutBWziROqAzafw4otRTh4lO3fC9dfb5H833hjrNCIiUsCp8BMRkbDwHubP39eyt2CBPX788fDII1bsZTpLQUoKvPoq3HabDek5YgQr586l9ksvWd/PCy+M2uuImgcfhH/+gW+/LSAj1IiISJCp8BMRkVzz3lrzUlv2li61Vrz27eHZZ+Hcc6FOnYPsZNEiGDjQ+oB27AgvvwxHHsmyr76i9tq1MGAANGtmLWMFxeLF8PjjcOmlNlSpiIhIhKnwExGRbEtKgrlz4fvvrU774QcbrKVoUTjlFLj1VujeHapXz8bOEhLg4YftUq4cvP469Ou3t/+nL1oU3n0Xmje3EwFnzLD18jvvbc6+MmWs+BMREYkCFX4iIpKpHTvgl1/2FXn/93/7JlCvXx86d4bTToNzzoGKFXOw42+/hUGDbO66iy6Cp56yeRzSq1kTJkywJxkwAN55J4MTA/OZd96BadNg5MiMX7OIiEgEqPATE
ZG9Nm60Au+dd47g9tth9mxr5XMOjjvOGuTat7fT7mrVysUTbNli5/G9+qpVjp99BmeckfU2nTrBAw/AnXfCSSfBtdfm6rUFwtatcPPNduLjgAGxTiMiIoWICj8RkULKe/j7732ted9/b6eeARQrdhgnnmg12kkn2awKFSrk8cnefddG7Ny0yXZ8zz1QunT2tr/9dvjpJyuaWrWCE0/MQ5gYuusuWL8ePvnEJisUERGJEhV+IiJBkJgI27fj9uyJ2FMkJ9uom6lF3g8/wJo1tqxCBWvF69/fCr0dO76nc+eTw/PEy5fDVVdZ616rVvD55zZYS04UKQJjx0KLFnDBBdYUWaVKePJFy+zZNjXF1VdDy5axTiMiIoWMCj8Rkdzw3k6A274988u2bVkvT3tJSACgTYUKVhi1bh2WmLt3wwcfwLhxVuht22aP16oFJ59s3TZPOgmOPtpqq1TTp/u8P3lSkg3teffd1lf0mWesm2ZuW7oqVoRJk6BtWzsvcOrU/NNqlpJixW+VKtZtVUREJMpU+ImI5MSePTbQyA8/WPGXHWXL2miU5cvbdblyULfuvtuplzJlSH7iCRse84MP4PTTcx1z8WJ45RV4803rWVmvnk2Fd9JJVuwddIqFvPr1VzuH7bffoGtXa+kKx5O2bAnPP28DwzzwgHUXzQ9efdVGJR03Lo99ZkVERHInooWfc64L8CwQB7zqvX8k3fJbgYvSZDkKqOq93+ycWw5sB5KBJO99q0hmFRHJltdft36S11yTcfGWQTG3X1PaQfxWty5t778fzj7bujb26ZPtbXftsvn0Ro+2iEWL2jx6AwfCqafmKEbuxcdbC9+zz9qIle+9Z1MxhHMkzgEDrPC+7z471+9gg8PE2oYNcMcd1sR60UUHX19ERCQCIlb4OefigBeB04FVwEzn3Ife+99T1/HePw48Hlr/HOAm7/3mNLvp5L3fGKmMIiI5sns3PPigjXTy/PMRmVZgT+XKNtVBt27WRLdx40FHsVywwFr3xo2zQTMbNIBHH7UROLM1n164fPKJnb+2YgUMHmzz80Widcs5GDXKWhMvusiua9cO//OEy+23W3fel17K/1NRiIhIvhXJ//+2BpZ675d57/cAE4DuWazfF3gngnlERPLmtddg5Uq4//7IfoFPPc/vnHPguuusBS1dt9KdO+GNN+x0t2OPtTrojDPgm29sarzbboti0bd2LfTubV06y5Sx5saRIyPbpbF0aWve3LMHevWy6yD64QdrJb7lFmjSJNZpRESkEItk4XcYsDLN/VWhxw7gnCsNdAEmp3nYA1845351zg2MWEoRkexISICHHrKT5E49NfLPV6qUFTaXXw4jRtjAIMnJzJ1rDYA1a8Jll8HmzfDEE7Bqlc0L3qlTlLp0ghWjr74KRx0FU6ZYQfzbb3aMouHII2HMGJthfsiQ6DxnTiQm2s+tTh2bxkFERCSGnM/u4AQ53bFzvYAzvPdXhu5fArT23l+Xwbq9gYu99+ekeaym936Nc64a8CVwnff+uwy2HQgMBKhevXrLCRMm7F0WHx9P2bJlw/zKci4IOYKQISg5gpAhKDmCkCG/5Djs/fdp+PzzzHnySf5r0SJ6Gbyn9sjXOOK9t/m8XDe6b3+XlGLFOfnkDXTtuobjjtsakcbHg/1MSqxdS6MnnqDSr7/yX9OmLLn5ZnaFecSY7L4vjnjxRWpPmsTCu+5iwymnhDVDTnKkV2viRBqMHMn8ESPYlMdiOD98RgpbjiBkCEqOIGQISo4gZFCO4GWIdo5OnTr9muH4KN77iFyANsDnae4PBYZmsu4HwIVZ7OteYMjBnrNly5Y+rWnTpvkgCEKOIGTwPhg5gpDB+2DkCEIG7/NBjp07va9Rw/uTT/Y+JSVqGX791ftBg7wvW9b7G3nKe/ArGnbym/7eGtEM6XPsJyXF+1GjLFSZMt6/9JL3ycnRzZDenj3et21rmRYtil2OtFautDxdu4blPRP4z0iUB
SFHEDJ4H4wcQcjgfTByBCGD98oRtAzeRzcHMMtnUCtFskPQTKChc66+c6440Af4MP1KzrlDgJOB/6V5rIxzrlzqbaAzsCCCWUVEMjd6NPz7r40iGeHBOXbsiOPll23WgpYtbTqGHj2g14834ceOo/bf31Pp/I6wbl1Ec2To779tKovBg22ewQULrCtj1PqWZqJYMZg40brH9uhhI4vG2k032TyGzz2nAV1ERCQQIjaqp/c+yTl3LfA5Np3DGO/9Qufc4NDyUaFVzwO+8N7vSLN5deADZ38siwLjvfefRSqriEimdu600SlPOcWG4w+j3bth0SKYP3/fZfr0tiQk2IAtzz9vg1ZWrBjaoO3FUKWyFTft2sEXX8Dhh4c1U4ZSUmz0mNtusyLm5ZdtSoUgFTSHHQbjx0PnzlaYjhsXu3yffWYTzT/wANSvH5sMIiIi6UR0Hj/v/VRgarrHRqW7/wbwRrrHlgFNI5lNRCRbRo2y1rVJk3K9i5QUWL58/wJv/nz44w9ITrZ1ihe3MVJOO20dw4fXpHXrTOqWM8+Er7+2ef7atbMio2kEf10uWwZXXAHTp9uE8q+8YvMXBtFpp9kAM3fdZcfmqquinyEhwUbfOfLIYA44IyIihVZECz8RkXxtxw545BErKLI5OMeGDQcWeAsX2q5S1a9vLXrnn2/Xxx4LDRtaj8Xp0//ghBNqZv0kbdrYNAFnnAEdOsBHH9l1OKWkwAsv2Bx0cXFW8F1xRbBa+TJy553w009w443QqhUcf3x0n//RR+Gvv+DLL6FEieg+t4iISBZU+ImIZOall6ySu+++Axbt3Am//35gkZf21LsqVayou/zyfQXe0UdDuXJhyNakCfz4oxV/nTvDu+9C96ymSs2Bv/6i2U03wbx5tv/Ro21KgvygSBHr5tmiBfTsCbNnQ+XK0XnupUutW3CfPvbPAhERkQBR4ScikpH4eHjsMSt82rYFrLvm+PFWY82fv29O9ZIlraA788x9Bd6xx9oE6hFtIKtTxyZLP/tsaz585RWrMnMrtZVv6FDKOmcT1l92WfBb+dKrXNm65p50ElxyCXz8ceQGoElMhG++scFl3n/f+uw++WRknktERCQPVPiJiGTkhRdg40a23nwf40fC229bAxtYPXH33fsKvCOOsN6QMVGlip3z17OndcVcv966Z+a0WPvzT9v+++/hzDOZedlltOnVKzKZo+H44+GZZ+Dqq+Ghh2D48PDtOykJvv3W/gPw/vuwaZM14557rp3fV/MgXXVFRERiQIWfiEg6O9duI+7Bx5lX7Szann0CSUnWs/Khh6BvX6hXL9YJ0ylbFj78EPr3h6FDrfh74onstXIlJ9uUA8OGWWvV669Dv37s/vbbiMeOuMGD7VzIu++GE0/MW/fL5GQriidOhMmT7RiXKQPdukHv3tYyXLJk+LKLiIiEmQo/ERGsEWfGjIqMGQNHvPs89+zZzL2l7uXGG21KhaZNA97jsXhxeOstqFoVnn7azk0cM8ZGjMnMH39YV86ffrLuoi+/bNMiFBTO2fmJc+bAhRfCb7/l7PWlpMBPP9Hguees4l+71uYK7NrVir2zzrL7IiIi+YAKPxEptLyHmTOtG+e778K6dU2pXX4rI3mSjW3O4cPvj49dF87cKFLEujdWr24teJs2wXvvWctUWsnJtt7w4dZKNXYsXHxxwCvbXCpTxlroWrWCCy6waSmyKoa9h19+sTfEe+/B6tXUKF7cir0LLrDr9MdTREQkH1DhJyKFzp9/WrE3frzdTv1e37TpAoYmvU+xEVso88K9kJ+KvlTO2ZQGVataV8fTTrPBTVJHtly82Fr5fv4ZzjnHWvlq1Iht5khr3NgGqunTxyahf/rp/Zd7D7NmWTfOiRNhxQp7U3TpAo89xk8VKtD+rLNik11ERCRMVPiJSKGwbp014rz1lrXyOQcdO9o4KD16QIUK8MPHyyl28VM2SEeLFjFOnEcDBtjAL337Qvv2MHWqt
WDddReULm0H4sILC2YrX0Z697bReZ55xiZ379HDuoCmFnvLlkHRojY1xogRdu5ehQoAJE+fHsPgIiIi4aHCT0QKrO3bYcoUa9376ivr4di0qc3S0Lcv1Kq1//q13nsPtm6Fe++NRdzwO+88+PxzK2IaNrQTGc89F0aOhEMPjXW66HviCZgxw1o877zTmnvj4qxVdNgwOzaVKsU6pYiISESo8BORAsF7WLnSvtf/8otdz5gBCQlQt6718LvoIptvL0ObN1Nr8mRrCWraNKrZI+rkk23qgSFDbLqGPn0KTytfesWLW6vnqafaHIi33mrFcZUqsU4mIiIScSr8RCRf+u8/67KZWuD98ot15wT7ft+8OQwaZNPbtW2bjZkNnnqKojt2wD33RDp69DVrZk2eArVr22imIiIihYwKPxEJvN27Yd68/VvylizZt7xRI5tGrXVruzRtasVftm3aBM8+y/qTT6basceGPb+IiIhIrKnwE5GDS0mxwUGKF7cqq3bt7E0Ongve26lXaVvy5syBPXtsefXqcMIJcMkldt2q1d4xOHLvySdhxw6W9+tHtTzuSkRERCSIVPiJSNY2bIB+/eDTT/c9VqqUDRbSuLEVgmkv5crlaPfbtsH//V9lvv7aCr2ZM2HLFltWurQVdjfcsK81r3btMJ+itmEDPPcc9O7Nzvr1w7hjERERkeBQ4Scimfv+exv+cuNGeP55OO4462O5eLFdz54NkyZZi2CqGjUyLgjr1iXtbOjx8fDsszbQ4n//HUuRInDMMXZOXmqR16SJjbAfUU88Abt22bl9a9dG+MlEREREYkOFn4gcKCUFHnkE7r4bDj/cJvtu1syWdeiw/7q7d8Nff1khmPby7rv7mu4ASpSAhg1JbtCIWdsbMfaXRsyKb0SXLo1o3Xk5Awc2o0yZqL1Cs349vPCCFbeNG6vwExERkQJLhZ+I7G/9ejuB7osvrCB6+eWsu2+WKGFNc02a7P+499ZSGCoEkxcuZuVXS0j+eD4tk6ZwAsm23mfw3+pjKdPjYyhTJ3KvKyOPPWbzPdx9d3SfV0RERCTKVPiJyD7Tp8OFF1pL3ejRcOWVuT+hzjmoWpXkSlV566+TuG8K/P23Ta3w0H2JnFx7mXUZnT+fsg8/DC1awPjx0LlzOF9R5tauhZdegosvhiOPjM5zioiIiMRIZIblE5H8JTkZ7r/fJrYuX96G0hwwIE+jqKSk2FzZxxwD/ftDxYo2MOgPP8DJpxWz8/66d4fhw/l11Cg7N7BLF8uR9pzBSHn0URsq9K67Iv9cIiIiIjGmwk+ksFu71ibBu+cea+2bNcsGcckl7+Hjj6FlS7jgApv1YdIk2+2ZZ2ZcS+6qXdvOI7z4YsvRtavNrRcpa9bAqFFw6aXQoEHknkdEREQkIFT4iRRiFX791QZt+ekneO01GDsWypbN9f6+/tq6cp5zjk3TMG6cTbzeo0c2Gg/LlIE337SC7OuvrXKcNSvXWbL0yCOQlATDh0dm/yIiIiIBo8JPpDBKToZ77qHprbdCpUo2ed7ll+e6a+f//R+ccgqcdhqsWmXjwSxebA14aWZwODjnYNAg6w/qPbRrZzvzPle5MrRqlZ2/2K+fjVgqIiIiUgio8BMpbNassQrt/vtZe8YZVvQdfXSudvXbb9Yrs21bWLgQnnkG/vwTBg6EYsXykPH4422OwFNOgcGD7STBnTvzsMM0Hn7YCl+19omIiEghosJPpDD54gvr2jljBrzxBktuv53cTJ63aBH06mUDcf74Izz0kE3ld8MNULJkmLJWrgyffAL33Wd9Rk880arKvFixAl591Vo369ULS0wRERGR/ECFn0hhkJQEw4bZqJnVqlkrX79+Od7NsmW22THHwGef2YCYf/8NQ4fm6dTAzBUpYnPsffoprF4NrVrBBx/kfn8PPWTdRocNC19GERERkXxAhZ9IQbdqlXWZfOgha+maMePAydYz4b0Vdu+8Y5s2agQTJ8LNN1sRe
P/9UKFCZOMDNurob79B48Zw/vlw661WzObE8uUwZozNTVgnyhPFi4iIiMRYRCdwd851AZ4F4oBXvfePpFt+K3BRmixHAVW995sPtq2IZMOnn8Ill0BCArz1Flx0UZarb9tmjYG//GKzK/z8M2zYYMtKl7Zz94YNg5o1o5A9vTp14LvvrOp84gkrYCdMsPn/suPBB23wmDvvjGxOERERkQCKWOHnnIsDXgROB1YBM51zH3rvf09dx3v/OPB4aP1zgJtCRd9BtxWRLCQm2uAljz1mc/JNnGjNdWkkJ8OyZWX48899hd7vv+8bQLNxYzjrLDu17oQTrHtnngZsCYcSJeDFF200mYED7STDd9+FDh2y3u7vv+GNN2ygmFq1ohJVREREJEgi2eLXGljqvV8G4JybAHQHMive+gLv5HJbEUm1YgX07Wtz8w0aBE8/DaVKsXatFXipRd7MmRAffzxgMzqccIJNuH7iiTaoZsWKMX4dWbnoImja1CYIPOUUm5fvllsyn47igQdsXomhQ6ObU0RERCQgIln4HQasTHN/FXBCRis650oDXYBrc7qtiISkpFh3zptuwu/Zw5/3vcMn5frwy2VW6P3zj61WtKgN7NmvHxxyyCL69z+KBg1yPYVf7BxzzL75B2+91Qrd11+HQw7Zf72lS21i+GuvjVEfVREREZHYcz6cEyOn3bFzvYAzvPdXhu5fArT23l+Xwbq9gYu99+fkYtuBwECA6tWrt5wwYcLeZfHx8ZSNyFCDOROEHEHIEJQcQcgQ7hwVZ82izgujqfjPn8wt2ZI+iW+zONm6dlavnsBRR23jqKO20aTJNho2jKdEiZSwZ8iLPOXwnlqTJnHEqFHsqlGDhffdx44jjti7uPEjj1B12jR+GT+ePZUrRy5HmAQhQ1ByBCFDUHIEIYNyBC9DUHIEIUNQcgQhg3IEL0O0c3Tq1OlX732rAxZ47yNyAdoAn6e5PxQYmsm6HwAX5mbbtJeWLVv6tKZNm+aDIAg5gpDB+2DkCEIG78OTY/uPc/3KY87wHvwy6vnevONPapvsb7/d+w8+8H7NmshnCIew5PjuO+9r1PC+VCnvx461x5Ys8b5IEe9vvjl6OfIoCBm8D0aOIGTwPhg5gpDBe+UIWgbvg5EjCBm8D0aOIGTwXjmClsH76OYAZvkMaqVIdvWcCTR0ztUHVgN9gAvTr+ScOwQ4Gbg4p9uKFEbJyfD9O6tw99xF+2VvsocKPFzlSdy11/Bo/xLUrRvrhDHSvj3Mnm3nN156qc0s/99/NiDMbbfFOp2IiIhITEWs8PPeJznnrgU+x6ZkGOO9X+icGxxaPiq06nnAF977HQfbNlJZRfKDBQtg4itbqfLaowzY8TQOz9fNhlDh0aHccXrF/HeOXiQceih8+aWNaProo/bYkCFQvXpsc4mIiIjEWETn8fPeTwWmpntsVLr7bwBvZGdbkcJmwwYYPx7Gv7GH1nNe5m7upyobWdH+Iqq/+iCnH1lYm/eyULSojfLZpo1N2K7WPhEREZHsFX7OuTLALu99inPuSKAx8Kn3PjGi6UQKod274aOPYOxY+HSqp3vyZN4rMZQ6LGVP+1Pgmcep06JFrGMGX/fudhERERERimRzve+Aks65w4CvgcvIoJVORHLHe5ty4aqroEYN6NUL3E8/srR6WybRizoNS8LUqRT/9iubtFxEREREJAey29XTee93OueuAJ733j/mnPstksFECoN//oFx46x1788/oVQpuOa0JQzZPJTqP35g88699ppNuhcXF+u4IiIiIpJPZbvwc861AS4CrsjhtiKSRmIivPsuPPlkU+bMscc6doT7rl5HjwX3UfyN0VC6NDz4INx4o90WEREREcmD7BZvN2Jz6X0QGpnzcGBaxFKJFEC7dlnj3eOPw4oVUKtWCUaMgEt77KDOpKfgrscgIQEGD4a774Zq1WIdWUREREQKiGwVft77b4FvAZxzRYCN3vvrIxlMpKDYuhVee
gmeftpG6WzXzu6XLvF/dFr+N5x6N/z7L/ToAQ89BEceGevIIiIiIlLAZGtwF+fceOdc+dDonr8DS5xzt0Y2mkj+tm4dDB0KderAnXdCy5bw3Xfwww9wdqlvaD3gShgwAOrXt8nGJ01S0SciIiIiEZHdUT2beO+3Aedic+vVAS6JVCiR/Gz5crj2WqhXz+YQP+MMmD0bPv0U2rcHfvkFzjwTl5gIkydbJdi2bYxTi4iIiEhBlt3Cr5hzrhhW+P0vNH+fj1gqkXzo99/h0kuhQQMYPRouuggWL4aJE6F589BK69ZZl87DDmP2Sy/B+eeDczHNLSIiIiIFX3YHd3kZWA7MBb5zztUFtkUqlEh+MmMGPPwwTJliA3Bedx3ccgvUqpVuxcRE6N0bNm+Gn34i6b//YpBWRERERAqjbLX4ee+f894f5r0/y5t/gE4RziYSWN7D11/DaafBCSfA9Ok2EOc//9ggLgcUfQC33QbffguvvALNmkU5sYiIiIgUZtlq8XPOHQLcA3QIPfQtcD+wNUK5RAIpJQU+/NBa+GbMgBo1bHqGQYOgXLksNhw/Hp55Bm64wfqAioiIiIhEUXbP8RsDbAcuCF22Aa9HKpRI0CQmwtixcOyxcN55sHEjjBoFy5bBkCEHKfrmzoUrr4QOHaxKFBERERGJsuye43eE975Hmvv3OefmRCCPSKDs2gVjxli99s8/VviNHw+9ekHR7Hx6Nm+2SrFiRRvlpVixiGcWEREREUkvu4XfLufcSd77HwCcc+2AXZGLJRJ7U6fC5ZfbQJxt28ILL8DZZ+dgEM7kZOvWuWqVTeBXvXpE84qIiIiIZCa7hd9gYGzoXD+ALUC/yEQSia2UFLj/frscd5w11LVvn4tZF+65Bz77DF5+GU48MSJZRURERESyI1uFn/d+LtDUOVc+dH+bc+5GYF4Es4lE3ZYtcPHF1tp3ySV2Hl/p0rnY0ZQp8OCDcMUVMGBAuGOKiIiIiORIdgd3Aazg896nzt93cwTyiMTMnDnQqhV8+SW89BK8+WYui77Fi20m9+OPt/6hmqBdRERERGIsR4VfOvo2K8GQkmKjsOTBuHHQpg0kJNhUe1ddlct6bft2G8ylZEmYPNmuRURERERiLC+Fnw9bCpHcSkiAzp3hsMOs0MqhPXvgmmusge7EE2H2bCsAc8V76N8f/vzTTgysXTuXOxIRERERCa8sCz/n3Hbn3LYMLtuBmlHKKJKxxES44AL4+msbMbNnTzufbseObG2+ejWcfLJ167zlFuvimaeBNx95BN5/3+Z+6NgxDzsSEREREQmvLAs/73057335DC7lvPfZHRFUJPxSUqx17aOP4MUXYd48GDoUXnsNWrSAX3/NcvNvv7XV5s+3xrknnsjmvHyZ+eILGDYM+vSBG2/Mw45ERERERMIvL109RWLDe7j2WptJ/aGH4OqrbWL0hx6Cb76BnTutv+Zjj1mBmG7TiRNrceqpNqf6jBk2GXue/P23FXzHHguvvqrBXEREREQkcFT4Sf4zbBiMHAm33QZ33LH/so4dYe5c6N4dbr8dTjvNJlAH4uOtPhs5sgHdu1vR16RJHrPs3Annn28V5fvvQ5kyedyhiIiIiEj4qfCT/OXRR+Hhh2HQIDunLqPWtUqVrP/ma69ZdXfccax+/n1at4ZJk2DgwL+YNAnKl89jFu8tx9y51vp4xBF53KGIiIiISGSo8JP8Y9Qoa+Hr29fO68uqS6VzcPnl8NtvbKl0BIdd34Ohywbw1f920LfvyvD0xnz+eXjrLbj/fjjzzDDsUEREREQkMlT4Sf4wfrydy3f22TazelzcQTdJSoKhYxpS/a8feePQO7h4z2t0uqUFZZcsyXue776zoUC7dYM778z7/kREREREIiiihZ9zrotzbolzbqlz7o5M1unonJvjnFvonPs2zePLnXPzQ8tmRTKnBNxHH9lEex06wHvv2UAuB7FhA3TpYr1BLx9UnL7LH8Z9/TXs2EGLa
6/NcOCXbFu92kaEOfxwGDsWiuj/JyIiIiISbBH7xuqciwNeBM4EmgB9nXNN0q1TAXgJ6Oa9PxpIP75iJ+99M+99q0jllICbNs2KrBYt4MMPoVSpg24yYwa0bAk//ABjxlgP0RIlgE6dYN48NrVtawO/nH66FXE5sXu3zRe4cyd88AEcckjuXpeIiIiISBRFsqmiNbDUe7/Me78HmAB0T7fOhcD73vsVAN779RHMI/nNjBnWlfKII+DTTw86Gov3MHo0tG9vjXA//QSXXZZupUqVWHjvvTbtws8/w3HHWQGXXTfcYNu98UYYhgQVEREREYmOSBZ+hwEr09xfFXosrSOBis656c65X51zl6ZZ5oEvQo8PjGBOCaIFC2zAlKpV4csvoXLlLFdPSIArr7RBNjt1svnbW7TIZGXn4Ior4LffoH59m45h4EDYsSPrTK+9Bi+/bAPM9OiRu9clIiIiIhIDznsfmR071ws4w3t/Zej+JUBr7/11adZ5AWgFnAqUAv4PONt7/4dzrqb3fo1zrhrwJXCd9/67DJ5nIDAQoHr16i0nTJiwd1l8fDxly5aNyOvLiSDkCEKG7OYouXo1zW+4AZzjt2efJaFmzSzXX7u2BPfccwx//FGOSy5ZTr9+y7Mc+yVtBpeYSP3XX6f2hAnsqlWL34cNI75RowO2KbdoEc1vuIH/mjZl3iOPZGtwmYMJws8kCBmUI3gZgpIjCBmCkiMIGZQjeBmCkiMIGYKSIwgZlCN4GaKdo1OnTr9meKqc9z4iF6AN8Hma+0OBoenWuQO4N83914BeGezrXmDIwZ6zZcuWPq1p06b5IAhCjiBk8D4bOVat8r5ePe8rV/Z+4cKD7m/RIu8PO8z78uW9/9//8pDhm29sR8WKef/oo94nJ+9btm6d97VqWa6NG7P3JLnNEWVByOC9cgQtg/fByBGEDN4HI0cQMnivHEHL4H0wcgQhg/fByBGEDN4rR9AyeB/dHMAsn0GtFMmunjOBhs65+s654kAf4MN06/wPaO+cK+qcKw2cACxyzpVxzpUDcM6VAToDCyKYVYJg40YbcGXTJvjss4OeQzdvHpx8MiQm2kAu3brl4bk7dbKJ2M85xwZ+6dzZBn5JSoLevS3b++8ftMupiIiIiEgQFY3Ujr33Sc65a4HPgThgjPd+oXNucGj5KO/9IufcZ8A8IAV41Xu/wDl3OPCBs1m2iwLjvfefRSqrBMC2bTb/wt9/W9HXKuuBXGfOhDPOgNKl4euvIYPemTlXuTJMmmTn8t1wgw380r49TJ9u0zY0bx6GJxERERERib6IFX4A3vupwNR0j41Kd/9x4PF0jy0DmkYymwTIrl3W0jZ3LkyZYs14WfjhBzjrLKhSxYq++vXDmMU5GyWmfXu46CL43//guuvgkkvC+CQiIiIiItEV0cJP5KD27LF58b7/HsaPh7PPznL1r76yLp116tjtWrUilKtRI5sPYto0OOWUCD2JiIiIiEh0RPIcP5GsJSfDpZfC1Kk2y3qfPlmu/tFH0LUrNGgA334bwaIvVfHi1p+0WLEIP5GIiIiISGSp8JPY8B6uugrefRcee8zm0cvCxIk23d5xx9kpd9WrRyemiIiIiEhBoMJPos97uO02eOUVGDYMbr01y9XffBP69oUTT7TunZUqRSmniIiIiEgBocJPou/hh+GJJ+Caa2DEiCxXHTkS+ve30+w++wzKl49ORBERERGRgkSFn0TVYR98YK18F18Mzz1no2hm4skn4eqrbcDPjz6CMmWiGFREREREpABR4SfRM24cDZ97Drp3h9dfhyIZv/28h/vvhyFD4IILYPJkKFkyyllFRERERAoQFX4SHT/9BJdfzpbmzWHCBCia8Uwi3sMdd8A990C/fjbDgwbVFBERERHJGxV+EnkbN0Lv3lCnDgvvvz/T5ruUFJsr/bHHbMDPMWMgLi7KWUVERERECiBN4C6RlZICl1wC69fD//0fSdu2ZbhacjIMGGA9Q
IcMseIvi9P/REREREQkB9TiJ5H16KM2HOczz0CLFhmukpgIF11kRd8996joExEREREJN7X4SeR8+y0MHw59+sDgwRmukpBgvUA//NAKvoNM6SciIiIiIrmgwk8iY906K/gaNIDRozNswtu5E849F778El54wab1ExERERGR8FPhJ+GXnAwXXgj//QdffAHlyh2wyvbt0LUr/PCDDeJy2WXRjykiIiIiUlio8JPwu/9++OYbq+iOPfaAxVu2QJcuMHu2TdfQu3cMMoqIiIiIFCIq/CS8vvwSRoywSfgyaMbbsqUYnTrBokU2MXu3bjHIKCIiIiJSyKjwk/BZvdqG52zSBF58McPFN97YjA0b4KOPoHPnGGQUERERESmENJ2DhEdSkg3msnMnvPcelCmzd5H3Nmpn27awYUMJPvtMRZ+IiIiISDSp8JPwGD7cRmoZPRqOOmrvw4sXw5lnQvfuULYsPP30XDp0iGFOEREREZFCSIWf5N0nn9hE7YMG2WiewNatcMstNrbLzz/b/O1z5kCjRttjGlVEREREpDDSOX6SN//8A5dcAs2awTPPkJICb74Jd9wBGzbAlVfCgw9C1aqxDioiIiIiUnip8JPc27PH5mJISoL33uOXuSW5/nqYMQPatIGpU6Fly1iHFBERERERdfWU3Lv9dvjlF7Y8NYbLHmzAiSfCypUwbhz8+KOKPhERERGRoFCLn+TO++/DM88w+6Tr6XhzTxISrA4cNgzKlYt1OBERERERSUuFn+TcX3+ReOnl/F6yNSf+8Didz4ann4aGDWMdTEREREREMqLCT3Lkr4UJ+HYXUHmH48Z67/LBC8U5++xYpxIRERERkazoHD/Jlvh4uPNO+Oq4m2mwdTbT+r3J50vqqegTEREREckHIlr4Oee6OOeWOOeWOufuyGSdjs65Oc65hc65b3OyrUSe9zB+PDRqBH8//A6DUkYSf9WtnP9GN4oXj3U6ERERERHJjogVfs65OOBF4EygCdDXOdck3ToVgJeAbt77o4Fe2d1WIu+336B9e7joIjix4hLeKj0Q2rWj7LMPxjqaiIiIiIjkQCRb/FoDS733y7z3e4AJQPd061wIvO+9XwHgvV+fg20lQjZsgEGDbDqGP/6A11/cyaQivYgrXRImTIBixWIdUUREREREcsB57yOzY+d6Al2891eG7l8CnOC9vzbNOs8AxYCjgXLAs977sdnZNs0+BgIDAapXr95ywoQJe5fFx8dTtmzZiLy+nAhCjuxmmDPnEO666xh27izK+eevol+/f2j50kMc+tlnzHvkEba0bh2VHJEUhAxByRGEDMoRvAxByRGEDEHJEYQMyhG8DEHJEYQMQckRhAzKEbwM0c7RqVOnX733rQ5Y4L2PyAXrtvlqmvuXAM+nW+cF4GegDFAF+BM4MjvbZnRp2bKlT2vatGk+CIKQIzsZVqzwvkoV7xs18n7hwtCDr7/uPXg/fHjUckRaEDJ4H4wcQcjgvXIELYP3wcgRhAzeByNHEDJ4rxxBy+B9MHIEIYP3wcgRhAzeK0fQMngf3RzALJ9BrRTJ6RxWAbXT3K8FrMlgnY3e+x3ADufcd0DTbG4rYZSQAD16wO7dMGUKNG4MLFgAV18NnTrBvffGOKGIiIiIiORWJM/xmwk0dM7Vd84VB/oAH6Zb539Ae+dcUedcaeAEYFE2t5Uwuv56mDkT3nwzVPTFx0OvXlC+vA3rGRcX64giIiIiIpJLEWvx894nOeeuBT4H4oAx3vuFzrnBoeWjvPeLnHOfAfOAFKx75wKAjLaNVNbC7pVX7DJ0KJx3HjaHw6BBNrLLV1/BoYfGOqKIiIiIiORBJLt64r2fCkxN99iodPcfBx7PzrYSfjNmwLXXwumnw4gRoQdfecVa+UaMsG6eIiIiIiKSr0V0AncJtvXr7by+GjXgnXdCvTkXLLB+n2ecAXfeGeuIIiIiIiISBhFt8ZPgSkqC3r1h40b48UeoXDm04J57oGRJG
DcOiuj/AiIiIiIiBYG+2RdSd9wB06fDqFHQokXowd9/h/ffh+uug6pVYxlPRERERETCSIVfITRxIjz5JFxzDfTrl2bBww9D6dJwww0xyyYiIiIiIuGnwq+QWbAALr8c2raFp55Ks2DZMjvRb/BgqFIlZvlERERERCT8VPgVIv/9B+efD+XKwXvvQfHiaRY+9piN7nLLLbGKJyIiIiIiEaLBXQqJlBS49FL4+2/45huoWTPNwtWr4fXX4bLL0i0QEREREZGCQIVfIfHWW3X56CN47jlo3z7dwqeeguRkuP32mGQTEREREZHIUlfPQuDTT+GNN+px8cU2Wft+Nm60oT0vvBDq149JPhERERERiSwVfgXcX39ZTXf44Tt4+WVwLt0Kzz4LO3fa/A4iIiIiIlIgqfArwHbutMFcnIP7719A6dLpVti2DZ5/3lZq0iQmGUVEREREJPJ0jl8B5T0MGADz58PUqVCyZMKBK730EmzdCnfeGf2AIiIiIiISNWrxK6Cefx7Gj4f774cuXTJYYedOG9TljDOgZcuo5xMRERERkehR4VcAffedTcfXrVsWjXmvvQYbNsCwYVHNJiIiIiIi0afCr4BZvRouuMAG6Bw7Fopk9BPes8cmbG/fPoO5HUREREREpKDROX4FyJ490LMnxMfD11/DIYdksuK4cbBqFbzySlTziYiIiIhIbKjwK0Buugl+/hkmToSjj85kpeRkeOQRO6/vjDOimk9ERERERGJDhV8B8cYbNkjnrbdCr15ZrPjee7B0KUyenMGkfiIiIiIiUhDpHL8CYPZsGDwYTjkFHnooixVTUmyFo46Cc8+NVjwREREREYkxtfjlcxs32vzr1arBhAlQNKuf6Mcf28R+mY76IiIiIiIiBZEKv3wsORn69oV//4UffoCqVbNY2Xt48EEb7rNv36hlFBERERGR2FPhl48NHw5ffQWvvgrHH5/1uhVmz4YZM2DUqIM0C4qIiIiISEGj/n751Pvv2+CcAwfCFVccfP26b78NNWpAv36RDyciIiIiIoGipp98aOFCq99at4bnnsvGBv/3f1T87Td48kkoWTLi+UREREREJFjU4pfPbN4M3btD2bLW6leiRDY2eughEsuXt+ZBEREREREpdFT45SNJSTYuy4oVNg3fYYdlY6O5c+Hjj1nVo4dViyIiIiIiUuhEtPBzznVxzi1xzi11zt2RwfKOzrmtzrk5ocvdaZYtd87NDz0+K5I584uhQ+GLL2yi9rZts7nRww9DuXKsPu+8iGYTEREREZHgitg5fs65OOBF4HRgFTDTOfeh9/73dKt+773vmsluOnnvN0YqY37y9tvwxBNw9dVw5ZXZ3OiPP2DiRLj9dpLKlYtoPhERERERCa5Itvi1BpZ675d57/cAE4DuEXy+AuvXX63Y69ABnnkmBxs+8oidBHjjjRFKJiIiIiIi+UEkC7/DgJVp7q8KPZZeG+fcXOfcp865o9M87oEvnHO/OucK7agk69fDeedBtWrw3ntQrFg2N1yxAsaNgwEDoHr1iGYUEREREZFgc977yOzYuV7AGd77K0P3LwFae++vS7NOeSDFex/vnDsLeNZ73zC0rKb3fo1zrhrwJXCd9/67DJ5nIDAQoHr16i0nTJiwd1l8fDxlAzCgSW5zJCY6hgxpypIl5Xj++d9o2DA+29s2eO45an70Eb+8/Ta7q1XL98eioGUISo4gZFCO4GUISo4gZAhKjiBkUI7gZQhKjiBkCEqOIGRQjuBliHaOTp06/eq9b3XAAu99RC5AG+DzNPeHAkMPss1yoEoGj98LDDnYc7Zs2dKnNW3aNB8Euc1x1VXeg/fjx+dww7VrvS9Z0vsrrshzhnALQo4gZPA+GDmCkMF75QhaBu+DkSMIGbwPRo4gZPBeOYKWwftg5AhCBu+DkSMIGbxXjqBl8D66OYBZPoNaKZJdPWcCDZ1z9Z1zxYE+wIdpV3DOHeqcc6HbrbGup5ucc2Wcc+VCj
5cBOgMLIpg1cF55BUaOhNtusykccuTpp2HPHrj99ohkExERERGR/CVio3p675Occ9cCnwNxwBjv/ULn3ODQ8lFAT+Aq51wSsAvo4733zrnqwAehmrAoMN57/1mksgbNjz/CNddAly7w0EM53HjLFpvv4YILoGHDiOQTEREREZH8JWKFH4D3fiowNd1jo9LcfgF4IYPtlgFNI5ktqFatgh49oG5dGD8e4uJyuIPnn4ft223SPxERERERESJc+EnO7NplI3ju2AHffAMVK+ZwB/Hx8OyzcM45cNxxEckoIiIiIiL5jwq/gPAeBg2CWbNgyhRo0iQXO3n5Zdi8GYYNC3c8ERERERHJxyI5uIvkwDPP2LR7990H3XMzzX1CAjzxBJx6KpxwQrjjiYiIiIhIPqYWvwD46isYMsS6eQ4fnsudvP46rF0Lb78d1mwiIiIiIpL/qcUvxpYtg9694aij4M03oUhufiKJifDYY3DiidCpU9gzioiIiIhI/qYWvxiKj7dund7D//4H5crlckfvvAPLl9uInjYFhoiIiIiIyF4q/GLEe+jfH37/HT77DI44Ipc7SkmBhx+2UTzPPjucEUVEREREpIBQ4RcjDz4IkyfDk0/C6afnYUcffACLF8OECWrtExERERGRDOkcvxj46CO46y64+GK46aY87Mh7qyCPPBJ69gxbPhERERERKVjU4hdlixbBRRdBq1YwenQeG+k++wx++w3GjIG4uLBlFBERERGRgkUtflH03382mEupUvD++3adJw89BLVrWyUpIiIiIiKSCbX4RUlyMlx4oQ2++c03Vq/lyXffwQ8/2EiexYuHI6KIiIiIiBRQavGLkuHD4dNPrU476aQ87uzzz+GKK6BaNbsWERERERHJggq/KPjmm6o88ggMGmSXXFu0CM46C7p0sYFdJkwIQ39REREREREp6FT4RdicOfDYY4056SR47rlc7mTTJrjuOjj2WPjpJ3jiCVi4EDp1CmdUEREREREpoHSOXwRt2ADnngvlyycyaVJczk/F27MHXnwR7r8ftm2DwYPh3nuhatUIpBURERERkYJKhV8E7doFNWpA//4LqV69ZfY39B4+/hhuuQX+/BM6d4annoKjj45cWBERERERKbDU1TOC6tSxnpmNGm3P/kbz5sHpp0O3bjY33yef2Hx9KvpERERERCSXVPhFWLYnaF+3DgYOhObNbVL255+3IvCss/I4y7uIiIiIiBR26uoZawkJ8Oyz8OCD1jf0+uvhrrugUqVYJxMRERERkQJChV+seA+TJ8Ntt8Hff8M558Djj0OjRrFOJiIiIiIiBYy6esbCr7/CySdDr15Qpgx8+SV8+KGKPhERERERiQgVftG0Zg307w/HHw+LF8OoUXY+32mnxTqZiIiIiIgUYOrqGQVFEhJgxAh45BFISoJbb4U774RDDol1NBERERERKQRU+EVSSgq88w6tb7rJZnPv0QMeewwOPzzWyUREREREpBBR4RdJ8+bBxReT2LAhJSdNgg4dYp1IREREREQKIRV+kdSsGUybxq8pKXRU0SciIiIiIjES0cFdnHNdnHNLnHNLnXN3ZLC8o3Nuq3NuTuhyd3a3zTc6doQiGkNHRERERERiJ2Itfs65OOBF4HRgFTDTOfeh9/73dKt+773vmsttRURERERE5CAi2RTVGljqvV/mvd8DTAC6R2FbERERERERSSOShd9hwMo091eFHkuvjXNurnPuU+fc0TncVkRERERERA7Cee8js2PnegFneO+vDN2/BGjtvb8uzTrlgRTvfbxz7izgWe99w+xsm2YfA4GBANWrV285YcKEvcvi4+MpW7ZsRF5fTgQhRxAyBCVHEDIEJUcQMihH8DIEJUcQMgQlRxAyKEfwMgQlRxAyBCVHEDIoR/AyRDtHp06dfvXetzpggfc+IhegDfB5mvtDgaEH2WY5UCU323rvadmypU9r2rRpPgiCkCMIGbwPRo4gZPA+GDmCkMF75
QhaBu+DkSMIGbwPRo4gZPBeOYKWwftg5AhCBu+DkSMIGbxXjqBl8D66OYBZPoNaKZJdPWcCDZ1z9Z1zxYE+wIdpV3DOHeqcc6HbrbGup5uys62IiIiIiIhkT8RG9fTeJznnrgU+B+KAMd77hc65waHlo4CewFXOuSRgF9AnVKVmuG2ksoqIiIiIiBRkEZ3A3Xs/FZia7rFRaW6/ALyQ3W1FREREREQk5zSzuIiIiIiISAGnwk9ERERERKSAi9h0DrHgnNsA/JPmoSrAxhjFSSsIOYKQAYKRIwgZIBg5gpABlCNoGSAYOYKQAYKRIwgZQDmClgGCkSMIGSAYOYKQAZQjaBkgujnqeu+rpn+wQBV+6TnnZvmM5rAohDmCkCEoOYKQISg5gpBBOYKXISg5gpAhKDmCkEE5gpchKDmCkCEoOYKQQTmClyEoOdTVU0REREREpIBT4SciIiIiIlLAFfTCb3SsA4QEIUcQMkAwcgQhAwQjRxAygHKkFYQMEIwcQcgAwcgRhAygHGkFIQMEI0cQMkAwcgQhAyhHWkHIAAHIUaDP8RMREREREZGC3+InIiIiIiJS6BWows8518s5t9A5l+Kcy3TUHOfccufcfOfcHOfcrDA9dxfn3BLn3FLn3B0ZLHfOuedCy+c551qE43nTPccY59x659yCTJZ3dM5tDb3uOc65uyOQoaRzboZzbm7oZ3FfButE/Fikea4459xvzrmPM1gW8eMRep4KzrlJzrnFzrlFzrk26ZZH9Hg45xqleY1znHPbnHM3plsnWsfiBufcgtB748YMlkfkWGT02XDOVXLOfemc+zN0XTGTbcPy+yKTDI+H3hfznHMfOOcqZLJtlr9fwpBjRCjDHOfcF865mplsG8ljca9zbnWa9+BZmWwb0WMRevy60HMsdM49lsm2kTwWzZxzP6fu2znXOpNtI/2+aOqc+7/Q6/zIOVc+k23DdSxqO+emhX5PLnTO3RB6PLt/28NyPLLIkd3Pa56PR2YZ0iwf4pzzzrkqmWwf6WOR3c9rxI6Fc+7dNM+/3Dk3J5PtI30ssvt5DdfnJMPvWS77f9PyfDyyyBDV7+FZ5Mju37SIHYs0yw/2WQ17TZIl732BuQBHAY2A6UCrLNZbDlQJ4/PGAX8BhwPFgblAk3TrnAV8CjjgROCXCLz+DkALYEEmyzsCH0f4Z+CAsqHbxYBfgBOjfSzSPNfNwPiMXnc0jkfoed4ErgzdLg5UiOHxiAPWYvO7RPu9cQywACgNFAW+AhpG41hk9NkAHgPuCN2+A3g0k23D8vsikwydgaKh249mlCE7v1/CkKN8mtvXA6NicCzuBYZk4/0b6WPRKfTeLBG6Xy0Gx+IL4MzQ7bOA6TE6FjOBk0O3LwdGRPhY1ABahG6XA/4AmpCNv+3hPB5Z5Djo5zVcxyOzDKH7tYHPsXmLD3ieKB2Lg35eo3Es0qzzJHB3jI7FQT+vYf6cZPg9i2z8TQvX8cgiQ1S/h2eR46B/0yJ9LEL3s/yshvNYZPdSoFr8vPeLvPdLYvDUrYGl3vtl3vs9wASge7p1ugNjvfkZqOCcqxHOEN7774DN4dxnLjJ473186G6x0CX9iaQRPxYAzrlawNnAq+Hedw4ylMe+UL0G4L3f473/L91qUTkeIacCf3nv/4nQ/rNyFPCz936n9z4J+BY4L906ETkWmXw2umNFOaHrc/P6PDnN4L3/InQsAH4GamWwaXZ+v+Q1x7Y0d8tw4Gc2rPLwuyrixwK4CnjEe787tM763O4/Dxk8kNq6dgiwJoNNo3EsGgHfhW5/CfTI7f6zmeFf7/3s0O3twCLgsGz+bQ/b8cgiR3Y+r2GRWYbQ4qeB28j8cxrxY5GbfeXWwTI45xxwAfBOBptH41hk5/MaNll8z8rO37SwHI/MMkT7e3gWObLzNy2ixyJ0/2Cf1agrUIVfDnjgC+fcr865gWHY32HAyjT3V3HgL8bsrBMNbULN0Z86546OxBM46145B
1gPfOm9/yXdKtE6Fs9gH7iULNaJ9PE4HNgAvO6sy+mrzrky6daJ5nujDxn/cYTIH4sFQAfnXGXnXGnsP6O1060TzWNR3Xv/L9gfdKBaJuuF+/dFZi7HWjvTi8oxcc496JxbCVwEZNbVN9LH4tpQ95wxmXRTisaxOBJo75z7xTn3rXPu+EzWi+SxuBF4PPTzeAIYmsE60TgWC4Buodu9OPDzmirsx8I5Vw9ojv33PDsicjyyyJHZ5xXCfDzSZnDOdQNWe+/nZrFJtI7FwT6vEMFjkebh9sA67/2fGWwSjWNxIwf/vEIYj0Um37Oy8zctbMcjG9/1shLpY5Gdv2kRPRbZ/KxC9L5jAPmw8HPOfeXsHKH0l5xU6e289y2AM4FrnHMd8horg8fSV/fZWSfSZmNd/JoCzwNTIvEk3vtk730z7L+hrZ1zx6RbJeLHwjnXFVjvvf81i9WicTyKYt2nRnrvmwM7sC4Y+8XNYLuwvzecc8WxL3HvZbA44sfCe78I6x71JfAZ1q0iKd1qQficpBfu3xcHcM4Nw47F2xktzuCxsB8T7/0w733tUIZrM1ktksdiJHAE0Az4F+u6lV40jkVRoCLWXehWYGKoRSG9SB6Lq4CbQj+Pmwj1GEgnGsficuy1/Yp1bduTyXphPRbOubLAZODGdP+5z3KzDB7L0/HILMdBPq8QxuORNkPoOYeR+T9m9m6WwWPhPhbZ+bxChI5FuvdFXzL/h2Y0jkV2Pq8QxmORje9ZmcbPaHdRzgBROBbZ+JsWyWNxHNn7rEIUvmOkle8KP+/9ad77YzK4/C8H+1gTul4PfIA19+bFKvb/T2gtDmzqz846EeW935baHO29nwoUy+xk0zA9339YP+8u6RZF41i0A7o555ZjzfenOOfeSpcvGsdjFbAqzX/CJmGFYPp1ovHeOBOY7b1fl35BtN4b3vvXvPctvPcdsG5l6f9DG83PybrUbqSh6wy79EXg98V+nHP9gK7ARd77jP7oRPt3x3gy6dIXyWPhvV8X+uOZArySyb6jcSxWAe+Huu/MwHoMHPBZiPD7oh/wfuj2e5nsO+LHwnu/2Hvf2XvfEvti/Vcm64XtWDjnimFfqt/23r9/sPXTCOvxyCxHNj6vYTseGWQ4AqgPzA39basFzHbOHZpu04gfi2x+XiN5LFIfLwqcD7ybyabReF9k5/Makd8Z6b5nZedvWth/b2TxXS+rbSJ9LNLK7G9aJI9Fd7L3WY34d4z08l3hl1fOuTLOuXKpt7GTtTMcBTMHZgINnXP1Q60qfYAP063zIXCpMycCW1Ob5KPFOXdo6n+unY06VQTYFObnqOpCI50550oBpwGL060W8WPhvR/qva/lva+H/Ty+8d5fnC5rxI+H934tsNI51yj00KnA7+lWi9Z7I9P/ikbjWIT2XS10XQf7Y50+TzQ/Jx9if7AJXR/wz6MI/b5Iu/8uwO1AN+/9zkxWy87vl7zmaJjmbjcO/MxG41ikPZfzvEz2HfFjgbV2nxLKdCR20v/GdFkjeiywLx8nh26fwoH/IIHovC9SP69FgOHAqAzWCduxCP0Oeg1Y5L1/Koebh+14ZJYjO5/XcB2PjDJ47+d776t57+uF/ratwgYbWZtu82gci4N+XiN5LNI4DVjsvV+VyeYRPxZk4/Ma5s9JZt+zDvo3jTAdj2x+18ts24gfi+z8TSOyx+K37HxWo/C35EA+SqPIROOC/fJZBewG1gGfhx6vCUwN3T4c62I2F1gIDAvTc5+FjfD0V+o+gcHAYL9v1J8XQ8vnk8VoR3nI8A7W5SIxdByuSJfh2tBrnoudmN42AhmOA34D5mFv3rtjcSzSZepIaMTKaB+P0PM0A2aFjskUrBtZtN8bpbFC7pA0j8XiWHyPFb5zgVOj9d7I5LNRGfga+yP9NVAptG5Efl9kkmEpdo7BnNBlVPoMofsH/H4Jc47Joc/rPOAjbCCLaB+LcaGf+Tzsj2+NGB2L4sBboeMxGzglB
sfiJODX0P5/AVrG6FjcENr/H8AjgIvwsTgJ62o1L81n4iyy8bc9nMcjixwH/byG63hkliHdOssJjQYYg2Nx0M9rNI4F8Aahvx9p1o/2sTjo5zVcxyK0r8y+Zx30b1q4jkcWGaL6PTyLHAf9mxbpY5Gdz2o4j0V2L6m/xEVERERERKSAKnRdPUVERERERAobFX4iIiIiIiIFnAo/ERERERGRAk6Fn4iIiIiISAGnwk9ERERERKSAU+EnIiKSjnMu2Tk3J83ljjDuu55zLrJzNYmIiKRTNNYBREREAmiX975ZrEOIiIiEi1r8REREssk5t9w596hzbkbo0iD0eF3n3NfOuXmh6zqhx6s75z5wzs0NXdqGdhXnnHvFObfQOfeFc65UzF6UiIgUCir8REREDlQqXVfP3mmWbfPetwZeAJ4JPfYCMNZ7fxzwNvBc6PHngG+9902BFsDC0OMNgRe990cD/wE9IvpqRESk0HPe+1hnEBERCRTnXLz3vmwGjy8HTvHeL3POFQPWeu8rO+c2AjW894mhx//13ldxzm0Aannvd6fZRz3gS+99w9D924Fi3vsHovDSRESkkFKLn4iISM74TG5ntk5Gdqe5nYzOuRcRkQhT4SciIpIzvdNc/1/o9k9An9Dti4AfQre/Bq4CcM7FOefKRyukiIhIWvoPo4iIyIFKOefmpLn/mfc+dUqHEs65X7B/nvYNPXY9MMY5dyuwAbgs9PgNwGjn3BVYy95VwL+RDi8iIpKezvETERHJptA5fq289xtjnUVERCQn1NVTRERERESkgFOLn4iIiIiISAGnFj8REREREZECToWfiIiIiIhIAafCT0REREREpIBT4SciIiIiIlLAqfATEREREREp4FT4iYiIiIiIFHD/D9ULPX7n9fTTAAAAAElFTkSuQmCC"></figcaption></figure>



<p class="wp-block-paragraph">Next, let&#8217;s print the accuracy and a confusion matrix on the predictions from the validation dataset.  </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># function that returns the label for a given probability
def getLabel(prob):
    if(prob &gt; .5):
               return 'dog'
    else:
               return 'cat'

# get the predictions for the validation data
val_df = validate_df.copy()
val_df['pred'] = &quot;&quot;
val_pred_prob = model.predict(validation_generator)

# assign the labels positionally; this avoids pandas chained assignment, which may silently fail to write back
val_df['pred'] = [getLabel(p) for p in val_pred_prob]
          
# create a confusion matrix
y_val = val_df['category']
y_pred = val_df['pred']

print('Accuracy: {:.2f}'.format(accuracy_score(y_val, y_pred)))
cnf_matrix = confusion_matrix(y_val, y_pred)

# plot the confusion matrix in form of a heatmap

%matplotlib inline
class_names = ['cat', 'dog'] # confusion_matrix orders string labels alphabetically
fig, ax = plt.subplots(figsize=(8, 8))
# label rows and columns with the class names so that the heatmap ticks are meaningful
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names), annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')</pre></div>



<pre class="wp-block-preformatted">Accuracy: 0.82</pre>



<figure class="wp-block-image size-large"><img decoding="async" width="479" height="496" data-attachment-id="2716" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/image-24-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/12/image-24.png" data-orig-size="479,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-24" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/12/image-24.png" src="https://www.relataly.com/wp-content/uploads/2020/12/image-24.png" alt="confusion matrix for an image classification model" class="wp-image-2716" srcset="https://www.relataly.com/wp-content/uploads/2020/12/image-24.png 479w, https://www.relataly.com/wp-content/uploads/2020/12/image-24.png 290w" sizes="(max-width: 479px) 100vw, 479px" /></figure>
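


<p class="wp-block-paragraph">To make the link between the confusion matrix and the accuracy score more tangible, here is a minimal sketch that derives accuracy, precision, and recall from the four cells of a two-class matrix. The counts and the helper name summarize are made up for illustration and are not the results of the model above.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># hypothetical 2x2 confusion matrix (rows: actual, columns: predicted),
# ordered as ['cat', 'dog']; the counts are invented for illustration
cnf = [[410, 90],
       [90, 410]]

def summarize(matrix):
    # returns accuracy plus precision and recall for the second class ('dog')
    tn, fp = matrix[0]
    fn, tp = matrix[1]
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

accuracy, precision, recall = summarize(cnf)
print('accuracy={:.2f} precision={:.2f} recall={:.2f}'.format(accuracy, precision, recall))</pre></div>



<p class="wp-block-paragraph">The diagonal cells hold the correct predictions, so accuracy is their sum divided by the total number of images. Precision and recall look at one class at a time, which helps when the classes are not equally frequent.</p>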



<h3 class="wp-block-heading" id="h-step-9-image-classification-on-sample-images">Step #9 Image Classification on Sample Images</h3>



<p class="wp-block-paragraph">Now that we have trained the model, I bet you can&#8217;t wait to test the image classifier on some sample data. For this purpose, ensure that you have some sample images in the &#8220;sample&#8221; folder. Running the code below feeds these sample images to the image classifier, which then predicts a label for each of them. Finally, the code prints the images in a grid together with their predicted labels.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set the path to the sample images
sample_path = &quot;data/images/cats-and-dogs/sample/&quot;
sample_df = createImageDf(sample_path)
sample_df['category'] = sample_df['category'].replace({0:'cat',1:'dog'})
sample_df['pred'] = &quot;&quot;

# create an image data generator for the sample images - we will only rescale the images
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_dataframe(
    sample_df, 
    sample_path,    
    shuffle=False,
    x_col='filename', y_col='category',
    target_size=target_size)

# make the predictions 
pred_prob = model.predict(test_generator)
image_number = pred_prob.shape[0]

# assign the predicted labels positionally; this avoids pandas chained assignment
sample_df['pred'] = [getLabel(p) for p in pred_prob]
    
print('Accuracy: {:.2f}'.format(accuracy_score(sample_df['category'], sample_df['pred'])))

# define the plot grid; round the number of columns up so that every image gets a cell
nrows = 6
ncols = int(np.ceil(image_number / nrows))
fig, axs = plt.subplots(nrows, ncols, figsize=(15, 15))
for i, ax in enumerate(fig.axes):
    if i &lt; sample_df.shape[0]:
        filepath = sample_path + sample_df.at[i ,'filename']
        img = Image.open(filepath).resize(target_size)
        ax.imshow(img)
        ax.set_title(sample_df.at[i ,'filename'] + '\n' + ' predicted: '  + str(sample_df.at[i ,'pred']))
        # show below the image whether the prediction matches the true label
        result = [sample_df.at[i ,'pred'] == sample_df.at[i ,'category']]
        ax.set_xlabel(str(result))
        ax.set_xticks([]); ax.set_yticks([])</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="862" height="864" data-attachment-id="6516" data-permalink="https://www.relataly.com/image-classification-with-deep-learning/2485/output-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output.png" data-orig-size="862,864" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output.png" alt="image classification - the image shows several dogs and cats" class="wp-image-6516" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output.png 862w, https://www.relataly.com/wp-content/uploads/2022/04/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/output.png 150w, https://www.relataly.com/wp-content/uploads/2022/04/output.png 768w" sizes="(max-width: 862px) 100vw, 862px" /></figure>
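


<p class="wp-block-paragraph">When laying out images in a grid with a fixed number of rows, the number of columns has to be rounded up, otherwise some images may not get a cell. The following sketch illustrates this with a hypothetical helper called grid_shape; the name and the image counts are assumptions chosen for this example.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import math

def grid_shape(image_count, nrows=6):
    # hypothetical helper: round the column count up so that no image is left out
    ncols = math.ceil(image_count / nrows)
    return nrows, ncols

for count in (18, 20, 25):
    rows, cols = grid_shape(count)
    print(count, 'images fit into a', rows, 'x', cols, 'grid with', rows * cols - count, 'empty cells')</pre></div>



<p class="wp-block-paragraph">With 20 images and six rows, plain rounding would give three columns and only 18 cells, whereas rounding up gives four columns and leaves four cells empty.</p>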



<p class="wp-block-paragraph">Our image classifier achieves an accuracy of around 82% on the validation set. The model is not perfect, but it should have labeled most of the sample images correctly. With deeper architectures, more data, and more training runs, you can build classification models that reach accuracies above 95%.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">In this tutorial, you learned how to train an image classification model. We prepared a dataset and performed several transformations to get the data into shape for training. Finally, we trained a convolutional neural network to distinguish between dogs and cats. You can now use this knowledge to train image classification models that recognize other objects.</p>



<p class="wp-block-paragraph">There are many other cool things that you can do with CNNs. Examples include object localization in images and videos, and even stock market prediction. But these are topics for further articles.</p>



<p class="wp-block-paragraph">I am always happy to receive feedback. I hope you enjoyed the article and would be happy if you left a comment. Cheers</p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list"><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov Machine Learning Engineering</a></li><li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li><li><a href="https://amzn.to/3MyU6Tj" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2018) Neural Networks and Deep Learning</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li><li>[1] D. H. Hubel and T. N. Wiesel &#8211; Receptive Fields of Neurons in the Cat&#8217;s Striate Cortex, The Journal of physiology (1959)</li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/image-classification-with-deep-learning/2485/">Image Classification with Convolutional Neural Networks &#8211; Classifying Cats and Dogs in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/image-classification-with-deep-learning/2485/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2485</post-id>	</item>
		<item>
		<title>Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python</title>
		<link>https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/</link>
					<comments>https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 02 Aug 2020 13:24:28 +0000</pubDate>
				<category><![CDATA[Churn Prediction]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Sources]]></category>
		<category><![CDATA[Feature Permutation Importance]]></category>
		<category><![CDATA[Hyperparameter Tuning]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[AI in Marketing]]></category>
		<category><![CDATA[Digital Transformation]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Model Interpretation]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Permutation Feature Importance]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Two-Label Classification]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2378</guid>

					<description><![CDATA[<p>Customer retention is a prime objective for service companies, and understanding the patterns that lead to customer churn can be the key to maintaining long-lasting client relationships. Businesses incur significant costs when customers discontinue their services, hence it&#8217;s vital to identify potential churn risks and take preemptive actions to retain these customers. Machine Learning models ... <a title="Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python" class="read-more" href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/" aria-label="Read more about Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/">Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Customer retention is a prime objective for service companies, and understanding the patterns that lead to customer churn can be the key to maintaining long-lasting client relationships. Businesses incur significant costs when customers discontinue their services, so it is vital to identify potential churn risks and take preemptive action to retain these customers. Machine learning models can be instrumental in identifying these patterns and providing valuable insights into customer behavior.</p>



<p class="wp-block-paragraph">An intriguing technique, Permutation Feature Importance, allows us to discern the significance of different features of our machine learning model, thereby shedding light on their influence on customer churn. This tutorial guides you through the intricacies of this technique and its implementation.</p>



<p class="wp-block-paragraph">The structure of this tutorial is as follows:</p>



<ul class="wp-block-list">
<li>We begin by discussing the business problem of customer churn and its implications.</li>



<li>We introduce the concept of Permutation Feature Importance, a powerful tool to identify essential features in our machine learning model.</li>



<li>We transition into the hands-on coding segment, where we build a churn prediction model using Python.</li>



<li>Our model undergoes a classification process and hyperparameter tuning to select the most effective parameters.</li>



<li>Utilizing the trained model, we predict the churn probabilities for a test set of customers.</li>



<li>Finally, we create a feature ranking based on their impact on the model&#8217;s performance.</li>
</ul>



<p class="wp-block-paragraph">By employing permutation feature importance, this tutorial offers a deep-dive into the correlation between input variables and model predictions, providing actionable insights for effective customer churn management.</p>



<p class="wp-block-paragraph">Also: </p>



<ul class="wp-block-list">
<li><a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">Using Fairlearn to Build Fair Machine Machine Learning Models with Python: Step-by-Step Towards More Responsible AI</a></li>



<li><a href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/" target="_blank" rel="noreferrer noopener">How to Use Hierarchical Clustering For Customer Segmentation in Python</a></li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="2402" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/image-47/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/08/image.png" data-orig-size="448,173" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/08/image.png" src="https://www.relataly.com/wp-content/uploads/2020/08/image.png" alt="machine learning. It is particularly effective when combined with feature permutation importance" class="wp-image-2402" width="324" height="127"/><figcaption class="wp-element-caption">Customer churn prediction is a compelling use case for machine learning. It is particularly effective when combined with feature permutation importance.</figcaption></figure>
</div></div>
</div>



<div style="height:26px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-what-is-churn-prediction">What is Churn Prediction?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">A company&#8217;s effort to persuade a new customer to sign a contract is many times higher than the costs incurred in retaining existing customers. According to industry experts, winning a new customer is four times more expensive than keeping an existing one. Providers that can identify churn candidates and manage to retain them can significantly reduce costs. </p>



<p class="wp-block-paragraph">A crucial point is whether the provider succeeds in getting the churn candidates to stay. Sometimes it may be enough to contact the churn candidate and inquire about customer satisfaction. In other cases, this may not be enough, and the provider needs to increase the service value, for example, by offering free services or a discount. However, actions should be well thought out, as they can also backfire. For instance, if a customer hardly ever uses their contract, a call from the provider may even increase the desire to cancel it. Machine learning can help assess cases individually and identify a suitable anti-churn action.</p>



<h2 class="wp-block-heading" id="h-about-permutation-feature-importance">About Permutation Feature Importance</h2>



<p class="wp-block-paragraph">Feature importance is a helpful technique for understanding the contribution of input variables (features) to a predictive model. The results from this technique can be as valuable as the predictions themselves, as they can help us understand the business context better. For example, let&#8217;s say we have trained a model that predicts which of our customers will likely churn. Wouldn&#8217;t it be interesting to know why specific customers are more likely to churn than others? Permutation feature importance can help us answer this question by ranking the input variables of our model by how much the model&#8217;s performance depends on them. The ranking can validate assumptions about the business context and point to relationships in the data that are worth investigating further, although a high importance score alone does not prove a causal relation.</p>



<p class="wp-block-paragraph">Compared to neural networks, one of the most significant advantages of traditional prediction models, such as a decision tree, is their interpretability. Neural networks are often regarded as black boxes because it is hard to understand the relationship between input and model predictions. In traditional models, on the other hand, we can calculate the importance of the features and use it to interpret the model and optimize its performance, for example, by removing features that are not important. We, therefore, start with a simple model first and move on to more complex models once we understand the data.</p>
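
<p class="wp-block-paragraph">To make the idea more concrete, here is a minimal sketch of how a permutation importance score can be computed by hand: shuffle one feature column at a time and measure how much the accuracy drops. The helper name <em>manual_permutation_importance</em> and the synthetic data are assumptions for illustration only; the tutorial itself relies on Scikit-learn&#8217;s built-in function further below.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling one column breaks its link with the target
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(baseline - model.score(X_shuffled, y))
        importances[col] = np.mean(drops)
    return importances


# Synthetic stand-in data (not the churn dataset); with shuffle=False,
# the two informative features are the first two columns
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

scores = manual_permutation_importance(model, X_test, y_test)
print(scores)
```

<p class="wp-block-paragraph">Features whose shuffling barely changes the score contribute little to the predictions, while a large drop indicates that the model relies on that feature.</p>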
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h2 class="wp-block-heading" id="h-implementing-a-customer-churn-prediction-model-in-python">Implementing a Customer Churn Prediction Model in Python</h2>



<p class="wp-block-paragraph">In the following, we will implement a customer churn prediction model. We will train a decision forest model on a data set from Kaggle and optimize it using <a aria-label="undefined (opens in a new tab)" href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">grid search</a>. The data contains customer-level information for a telecom provider and a binary prediction label that indicates which customers canceled their contracts and which did not. Finally, we will calculate the permutation feature importance to understand which features the model relies on.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_bddeda-14"><a class="kb-button kt-button button kb-btn_b5bf96-e2 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/017%20Permutation%20Feature%20Importance%20-%20Customer%20Churn%20Prediction%20using%20Random%20Decision%20Forest.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_8e2f54-ca kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, you can follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Make sure you install all required packages. In this tutorial, we will be working with the following packages:&nbsp;</p>



<ul class="wp-block-list">
<li>Pandas</li>



<li>NumPy</li>



<li>Matplotlib</li>



<li>Seaborn</li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the machine learning library <strong><em>Scikit-learn</em></strong>.</p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
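
<p class="wp-block-paragraph">If you are unsure whether everything is installed, a quick check along the following lines can help. This is only a sketch: the helper <em>find_missing</em> is a hypothetical name, and the list uses import names (note that Scikit-learn is imported as <em>sklearn</em> but installed as <em>scikit-learn</em>).</p>

```python
import importlib.util

# Import names of the packages used in this tutorial
required = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]


def find_missing(packages):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(required)
print("Missing packages:", missing or "none")
```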



<h3 class="wp-block-heading" id="h-step-1-loading-the-customer-churn-data">Step #1 Loading the Customer Churn Data</h3>



<p class="wp-block-paragraph">We begin by loading a customer churn <a href="https://www.kaggle.com/barun2104/telecom-churn" target="_blank" rel="noreferrer noopener">dataset from Kaggle</a>. If you work with the Kaggle Python environment, you can directly save the dataset into your Kaggle project. After completing the download, put the dataset under the file path of your choice, but don&#8217;t forget to adjust the file path variable in the code. </p>



<p class="wp-block-paragraph">The dataset contains 3333 records and the following attributes.</p>



<ul class="wp-block-list">
<li><strong>Churn</strong>: The prediction label: 1 if the customer canceled service, 0 if not.</li>



<li><strong>AccountWeeks</strong>: number of weeks the customer has had an active account</li>



<li><strong>ContractRenewal</strong>: 1 if customer recently renewed contract, 0 if not</li>



<li><strong>DataPlan</strong>: 1 if the customer has a data plan, 0 if not</li>



<li><strong>DataUsage</strong>: gigabytes of monthly data usage</li>



<li><strong>CustServCalls</strong>: number of calls into customer service</li>



<li><strong>DayMins</strong>: average daytime minutes per month</li>



<li><strong>DayCalls</strong>: average number of daytime calls</li>



<li><strong>MonthlyCharge</strong>: average monthly bill</li>



<li><strong>OverageFee</strong>: largest overage fee in the last 12 months</li>



<li><strong>RoamMins</strong>: average number of roaming minutes</li>
</ul>



<p class="wp-block-paragraph">The following code will load the data from your local folder into your anaconda Python project:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np 
import pandas as pd 
import math
from pandas.plotting import register_matplotlib_converters
import matplotlib.pyplot as plt 
import matplotlib.colors as mcolors
import matplotlib.dates as mdates 

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.inspection import permutation_importance
import seaborn as sns


# set file path
filepath = &quot;data/Churn-prediction/&quot;

# Load the dataset (it is split into train and test data later)
train_df = pd.read_csv(filepath + 'telecom_churn.csv')
train_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Churn	AccountWeeks	ContractRenewal	DataPlan	DataUsage	CustServCalls	DayMins	DayCalls	MonthlyCharge	OverageFee	RoamMins
0	0		128				1				1			2.7			1				265.1		110		89.0			9.87		10.0
1	0		107				1				1			3.7			1				161.6		123		82.0			9.78		13.7
2	0		137				1				0			0.0			0				243.4		114		52.0			6.06		12.2
3	0		84				0				0			0.0			2				299.4		71		57.0			3.10		6.6
4	0		75				0				0			0.0			3				166.7		113		41.0			7.42		10.1</pre></div>



<h3 class="wp-block-heading" id="h-step-2-exploring-the-data">Step #2 Exploring the Data</h3>



<p class="wp-block-paragraph">Before we begin with the preprocessing, we will quickly explore the data. For this purpose, we will create a pair plot that shows the distribution of each attribute and the pairwise relationships between attributes, colored by the prediction label.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Work on a copy so the original DataFrame stays untouched
df_plot = train_df.copy()

# Create a pair plot of all columns, colored by the prediction label 'Churn'
sns.pairplot(df_plot, hue=&quot;Churn&quot;, height=2.5, palette='muted')</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="990" data-attachment-id="6808" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/pairplots-churn-prediction/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png" data-orig-size="1828,1768" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="pairplots-churn-prediction" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png" src="https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction-1024x990.png" alt="" class="wp-image-6808" srcset="https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/pairplots-churn-prediction.png 1828w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Histograms of the churn prediction dataset separated by prediction label (red=churn, blue= no churn)</figcaption></figure>



<p class="wp-block-paragraph">We can see that the distribution of several attributes resembles a normal distribution, for example, OverageFee, DayMins, and DayCalls. However, the distribution of the prediction label is unbalanced. This is because more customers remain with their contract (prediction label class = 0) than cancel it (prediction label class = 1).</p>
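
<p class="wp-block-paragraph">The class imbalance can also be quantified directly. The following sketch uses a small made-up label series in place of <em>train_df['Churn']</em>, so the printed shares are illustrative and not the actual values of the dataset.</p>

```python
import pandas as pd

# Toy labels standing in for train_df['Churn'] (values are made up)
churn = pd.Series([0, 0, 0, 0, 0, 0, 1, 0, 0, 1], name="Churn")

# normalize=True returns the share of each class instead of raw counts
class_share = churn.value_counts(normalize=True)
print(class_share)
```

<p class="wp-block-paragraph">On the real data, you would call <em>value_counts</em> on the label column of the loaded DataFrame instead.</p>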



<h3 class="wp-block-heading" id="h-step-3-data-preprocessing">Step #3 Data Preprocessing</h3>



<p class="wp-block-paragraph">The next step is to preprocess the data. I have reduced this part to a minimum to keep this tutorial simple. For example, apart from setting class_weight to 'balanced' in the classifier later on, I do not treat the unbalanced label classes (for instance, by resampling). However, this would be appropriate to improve the model performance in a real business context. The imbalanced data is also why I chose a decision forest as a model type. Decision forests can handle unbalanced data relatively well compared to traditional models such as logistic regression.</p>



<p class="wp-block-paragraph">The following code splits the data into the train (x_train) and test data (x_test) and creates the respective datasets, which only contain the label class (y_train, y_test). The ratio is 0.7, resulting in 2333 records in the training dataset and 1000 in the test dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create Training Dataset
x_df = train_df[train_df.columns[train_df.columns.isin(['AccountWeeks', 'ContractRenewal', 'DataPlan','DataUsage', 'CustServCalls', 'DayCalls', 'MonthlyCharge', 'OverageFee', 'RoamMins'])]].copy()
y_df = train_df['Churn'].copy()

# Split the features and labels into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		AccountWeeks	ContractRenewal	DataPlan	DataUsage	CustServCalls	DayCalls	MonthlyCharge	OverageFee	RoamMins
2918	58				1				0			0.00		4				112			53.0			13.29		0.0
1884	51				0				1			3.32		2				60			74.2			10.03		12.3
2823	87				1				0			0.00		2				80			50.0			9.35		16.6
2319	83				1				1			2.35		3				105			91.5			12.65		8.7
2980	84				1				0			0.00		3				86			62.0			13.78		14.3
...		...				...				...			...			...				...			...				...			...
835		27				1				0			0.00		1				75			31.0			10.43		9.9
3264	89				1				1			1.59		0				98			50.9			10.36		5.9
1653	93				0				0			0.00		1				78			42.0			10.99		11.1
2607	91				1				0			0.00		3				100			53.0			11.97		9.9
2732	130				0				0			0.00		5				106			68.0			18.19		16.9</pre></div>
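
<p class="wp-block-paragraph">The split sizes mentioned above can be checked with a small sketch on synthetic data that has the same number of rows as the churn dataset. The sketch additionally passes <em>stratify</em>, which is not used in this tutorial but keeps the share of churners similar in both subsets; treat it as an optional variation.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the churn dataset
n_records = 3333
X = np.arange(n_records).reshape(-1, 1)
y = (np.arange(n_records) % 7 == 0).astype(int)  # made-up labels, about 14 % positives

# train_size=0.7 as in the tutorial; stratify=y is an optional addition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```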



<h3 class="wp-block-heading" id="h-step-4-fit-an-optimized-decision-forest-model-for-churn-prediction-using-grid-search">Step #4 Fit an Optimized Decision Forest Model for Churn Prediction using Grid Search</h3>



<p class="wp-block-paragraph">Now comes the exciting part. We will train a series of 36 decision forests and then choose the best-performing model. The technique used in this process is called hyperparameter tuning (more specifically, grid search), and I have recently published <a aria-label="undefined (opens in a new tab)" href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">a separate article on this topic</a>.</p>
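
<p class="wp-block-paragraph">The number 36 is simply the product of the number of values tested per hyperparameter (4 x 3 x 3 for the grid used in this tutorial). As a quick sanity check, Scikit-learn&#8217;s <em>ParameterGrid</em> can enumerate the combinations; the variable names in this sketch are illustrative.</p>

```python
from sklearn.model_selection import ParameterGrid

# Same value ranges as the grid used in this tutorial
param_grid = {
    "max_depth": [2, 4, 8, 16],
    "n_estimators": [64, 128, 256],
    "min_samples_split": [5, 20, 30],
}

n_candidates = len(ParameterGrid(param_grid))
cv_folds = 5  # number of cross-validation folds used in the grid search

print(n_candidates)             # number of hyperparameter combinations
print(n_candidates * cv_folds)  # number of model fits during cross-validation
```

<p class="wp-block-paragraph">With five-fold cross-validation, each of the 36 candidates is fitted five times, so the search trains 180 forests before refitting the best candidate on the full training set.</p>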



<p class="wp-block-paragraph">The following code defines the parameters the grid search will test (max_depth, n_estimators, and min_samples_split). Then the code runs the grid search and trains the decision forests. Finally, we print out the model ranking along with model parameters. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define parameters
max_depth=[2, 4, 8, 16]
n_estimators = [64, 128, 256]
min_samples_split = [5, 20, 30]

param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, min_samples_split=min_samples_split)

# Build the gridsearch
# The grid search sets max_depth, n_estimators, and min_samples_split from param_grid
dfrst = RandomForestClassifier(class_weight='balanced')
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv = 5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
results_df = pd.DataFrame(grid_results.cv_results_)
results_df.sort_values(by=['rank_test_score'], ascending=True, inplace=True)

# Reduce the results to selected columns
results_filtered = results_df[results_df.columns[results_df.columns.isin(['param_max_depth', 'param_min_samples_split', 'param_n_estimators','std_fit_time', 'rank_test_score', 'std_test_score', 'mean_test_score'])]].copy()
results_filtered</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">std_fit_time	param_max_depth	param_min_samples_split	param_n_estimators	mean_test_score	std_test_score	rank_test_score
28				0.004742		16						5					128	0.931415	0.006950		1
27				0.002620		16						5					64	0.925848	0.008177		2
29				0.015711		16						5					256	0.925846	0.006156		3
20				0.006258		8						5					256	0.923704	0.007961		4
19				0.001816		8						5					128	0.921988	0.006458		5
18				0.002161		8						5					64	0.919847	0.007716		6
31				0.003728		16						20					128	0.902690	0.011642		7
30				0.002057		16						20					64	0.901836	0.009789		8
32				0.004940		16						20					256	0.899691	0.009813		9
21				0.001994		8						20					64	0.898408	0.008710		10
22				0.003761		8						20					128	0.897121	0.007529		11
23				0.003828		8						20					256	0.895833	0.009159		12
33				0.003798		16						30					64	0.885546	0.010394		13
26				0.005560		8						30					256	0.885541	0.014937		14
...</pre></div>



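<p class="wp-block-paragraph">In the run shown above, the best-performing model is model number 28, which reaches a mean cross-validation score of about 93.1 %. Because the random forest is not seeded, your ranking may differ slightly. Its hyperparameters are as follows:</p>



<ul class="wp-block-list">
<li>max_depth = 16</li>



<li>min_samples_split = 5</li>



<li>n_estimators = 128</li>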
</ul>



<p class="wp-block-paragraph">We will proceed with this model. So what does this model tell us?</p>



<p class="wp-block-paragraph">We can gain an overview of the distributions of our customers according to their churn probability. Just use the following code:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Predicting Probabilities
best_clf = grid_results.best_estimator_  # best model found by the grid search
y_pred_prob = best_clf.predict_proba(x_test) 
churnproba = y_pred_prob[:,1]  # probability of class 1 (churn)

# Plot a histogram of the predicted churn probabilities
sns.histplot(data=churnproba)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="517" height="324" data-attachment-id="6810" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/image-12-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-12.png" data-orig-size="517,324" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-12.png" alt="Customer Base According to their Churn Rate" class="wp-image-6810" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-12.png 517w, https://www.relataly.com/wp-content/uploads/2022/04/image-12.png 300w" sizes="(max-width: 517px) 100vw, 517px" /><figcaption class="wp-element-caption">Customer Base According to their Churn Rate</figcaption></figure>



<p class="wp-block-paragraph">Customers whom the model classifies as churn candidates have a predicted churn probability greater than 0.5 and appear on the right side of the diagram. Customers on the far left (probability close to 0) are the least likely to churn and need less attention.</p>
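
<p class="wp-block-paragraph">Turning probabilities into a list of churn candidates comes down to applying a threshold. The sketch below uses made-up probabilities in place of <em>churnproba</em>, and the threshold variable is an assumption that you can adjust to your retention budget.</p>

```python
import numpy as np

# Made-up churn probabilities, standing in for churnproba
churn_scores = np.array([0.05, 0.48, 0.51, 0.92, 0.30])

threshold = 0.5  # hypothetical cut-off; lower it to flag more customers
is_candidate = churn_scores > threshold

print(is_candidate)                  # boolean flag per customer
print(np.flatnonzero(is_candidate))  # positions of the flagged customers
```

<p class="wp-block-paragraph">A lower threshold catches more actual churners at the price of contacting more customers who would have stayed anyway.</p>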



<h3 class="wp-block-heading" id="h-step-5-best-model-performance-insights">Step #5 Best Model Performance Insights</h3>



<p class="wp-block-paragraph">Let&#8217;s take a more detailed look at the performance of the best model. We do this by calculating the confusion matrix. </p>



<p class="wp-block-paragraph">If you want to learn more about measuring the performance of classification models, check out <a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">this tutorial</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
class_names=[False, True] 
tick_marks = [0.5, 1.5]
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap=&quot;Blues&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label'); plt.xlabel('Predicted label')
plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="486" height="452" data-attachment-id="2387" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/image-14-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/07/image-14.png" data-orig-size="486,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-14" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/07/image-14.png" src="https://www.relataly.com/wp-content/uploads/2020/07/image-14.png" alt="Confusion matrix on churn probabilities calculated with feature permutation importance" class="wp-image-2387" srcset="https://www.relataly.com/wp-content/uploads/2020/07/image-14.png 486w, https://www.relataly.com/wp-content/uploads/2020/07/image-14.png 300w" sizes="(max-width: 486px) 100vw, 486px" /></figure>



<p class="wp-block-paragraph">Of the 1,000 customers in the test dataset, our model correctly classified 100 as churn candidates. For 832 customers, the model correctly predicted that they are unlikely to churn. In 30 cases, the model falsely classified customers as churn candidates, and 38 churners were missed and falsely classified as non-churn candidates. The result is a model accuracy of 93.2% (based on a 0.5 threshold).</p>
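<p class="wp-block-paragraph">As a quick check, the accuracy can be recomputed from the four counts above. The short sketch below does this and also derives precision, recall, and the F1 score for the churn class. The variable names are chosen for illustration only.</p>

```python
# Counts taken from the confusion matrix described above
tp, tn, fp, fn = 100, 832, 30, 38

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted churners that really churned
recall = tp / (tp + fn)     # share of actual churners that the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")   # 0.932
print(f"precision: {precision:.3f}")  # 0.769
print(f"recall:    {recall:.3f}")     # 0.725
print(f"f1 score:  {f1:.3f}")         # 0.746
```

<p class="wp-block-paragraph">The numbers show why accuracy alone can be misleading on an imbalanced dataset: although 93.2% of all predictions are correct, the model finds only about 72% of the customers who actually churn.</p>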



<h3 class="wp-block-heading" id="h-step-6-permutation-feature-importance">Step #6 Permutation Feature Importance</h3>



<p class="wp-block-paragraph">Now that we have trained a model that gives good results, we want to understand the importance of the model&#8217;s features. With the following code, we calculate the Feature Importance score. Then we visualize the results in a barplot.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Calculate the permutation feature importance on the test set
r = permutation_importance(best_clf, x_test, y_test, n_repeats=30, random_state=0)

# Collect the mean importance per feature and sort in descending order
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
data_im['feature_names'] = x_test.columns
data_im = data_im.sort_values('feature_permuation_score', ascending=False)

# Plot the barchart (the colors come from the seaborn palette)
fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y=data_im['feature_names'], x=&quot;feature_permuation_score&quot;, data=data_im, palette='nipy_spectral')
ax.set_title(&quot;Random Forest Feature Importances&quot;)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="1013" height="334" data-attachment-id="6801" data-permalink="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/output-2-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output-2.png" data-orig-size="1013,334" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output-2.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output-2.png" alt="" class="wp-image-6801" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output-2.png 1013w, https://www.relataly.com/wp-content/uploads/2022/04/output-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/output-2.png 768w" sizes="(max-width: 1013px) 100vw, 1013px" /></figure>



<p class="wp-block-paragraph">The feature ranking can provide the starting point for deeper analysis. As we can see, the most important features are the monthly fee, data usage, and customer service calls (CustServCalls). Of particular interest is the importance of customer service calls, as this could indicate that customers who contact customer service have had negative experiences.</p>
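<p class="wp-block-paragraph">To make the idea behind these scores more tangible, here is a minimal sketch of permutation importance in plain Python. It is not the scikit-learn implementation: the helper functions, the toy data, and the toy model are all hypothetical and only illustrate the principle. We shuffle one feature column at a time and measure how much the accuracy drops compared to the unshuffled data.</p>

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the model output matches the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance_sketch(predict, rows, labels, n_repeats=30, seed=0):
    """Mean drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: the label depends only on feature 0, feature 1 is noise
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [row[0] > 0.5 for row in rows]

def toy_model(row):
    # Stand-in for a trained classifier; it only looks at feature 0
    return row[0] > 0.5

scores = permutation_importance_sketch(toy_model, rows, labels)
print(scores)
```

<p class="wp-block-paragraph">In this toy setup, shuffling feature 0 destroys the relationship between input and label, so its score is clearly positive. Shuffling feature 1 changes nothing because the model never uses it, so its score is exactly zero.</p>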



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">This article has shown how to implement a churn prediction model using Python and scikit-learn. We have calculated the permutation feature importance to analyze which features contribute to the performance of our model. You have learned that permutation feature importance can provide data scientists with new insights into the context of a prediction model. Therefore, the technique is often a good starting point for further investigations.</p>



<p class="wp-block-paragraph">I am always interested in improving my articles and learning from my audience. If you liked this article, show your appreciation by leaving a comment. And if you didn&#8217;t, let me know too. Cheers </p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>



<li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li>



<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph">And if you are interested in text mining and customer satisfaction, consider taking a look at my recent blog about sentiment analysis:</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-relataly-com wp-block-embed-relataly-com"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="HQ0lUMzbZR"><a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/">Sentiment Analysis with Naive Bayes and Logistic Regression in Python</a></blockquote><iframe loading="lazy" class="wp-embedded-content" sandbox="allow-scripts" security="restricted"  title="&#8220;Sentiment Analysis with Naive Bayes and Logistic Regression in Python&#8221; &#8212; relataly.com" src="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/embed/#?secret=WMWtohaT3c#?secret=HQ0lUMzbZR" data-secret="HQ0lUMzbZR" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div></figure>
<p>The post <a href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/">Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance using Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2378</post-id>	</item>
		<item>
		<title>Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python</title>
		<link>https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/</link>
					<comments>https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 06 Jul 2020 21:16:52 +0000</pubDate>
				<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Hyperparameter Tuning]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Logistics]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Risk Management]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Random Forest Classification]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Titanic Dataset]]></category>
		<category><![CDATA[Tuning Random Decision Forests]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2261</guid>

					<description><![CDATA[<p>Are you struggling to find the best hyperparameters for your machine learning model? With Python&#8217;s Scikit-learn library, you can use grid search to fine-tune your model and improve its performance. In this article, we&#8217;ll guide you through the process of hyperparameter tuning for a classification model, using a random decision forest that predicts the survival ... <a title="Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python" class="read-more" href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" aria-label="Read more about Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/">Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Are you struggling to find the best hyperparameters for your machine learning model? With Python&#8217;s Scikit-learn library, you can use grid search to fine-tune your model and improve its performance. In this article, we&#8217;ll guide you through the process of hyperparameter tuning for a classification model, using a random decision forest that predicts the survival of Titanic passengers as an example.</p>



<p class="wp-block-paragraph">We&#8217;ll start by explaining the concept of grid search and how it works. Then, we&#8217;ll dive into the development and optimization of the random decision forest using Python. By defining a parameter grid and feeding it to the grid search algorithm, we can explore all possible hyperparameter combinations and find the optimal configuration for our model.</p>



<p class="wp-block-paragraph">Finally, we&#8217;ll compare the performance of different model configurations to determine the best one for our classification task. Whether you&#8217;re new to machine learning or looking to boost the performance of an existing model, this step-by-step guide to hyperparameter tuning with grid search will help you achieve better results. Let&#8217;s get started!</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/" target="_blank" rel="noreferrer noopener">Multivariate Anomaly Detection on Time-Series Data in Python</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2353" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/image-7-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/07/image-7.png" data-orig-size="837,539" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-7" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/07/image-7.png" src="https://www.relataly.com/wp-content/uploads/2020/07/image-7.png" alt="Grid Search - parameter grid for hyperparameter tuning" class="wp-image-2353" width="386" height="248" srcset="https://www.relataly.com/wp-content/uploads/2020/07/image-7.png 837w, https://www.relataly.com/wp-content/uploads/2020/07/image-7.png 300w, https://www.relataly.com/wp-content/uploads/2020/07/image-7.png 768w, https://www.relataly.com/wp-content/uploads/2020/07/image-7.png 80w" sizes="(max-width: 386px) 100vw, 386px" /><figcaption class="wp-element-caption">Exemplary parameter grid for the tuning of a random decision forest with four hyperparameters</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What are Hyperparameters?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Hyperparameters play a crucial role in the performance of a machine learning model. They are adjustable parameters that influence the model training process and control how a machine learning algorithm learns and how it behaves. </p>



<p class="wp-block-paragraph">Unlike the internal parameters (coefficients, etc.) that the algorithm automatically optimizes during model training, hyperparameters are model characteristics (e.g., the number of estimators for an ensemble model) that we must set in advance. </p>



<p class="wp-block-paragraph">Which hyperparameters are available depends on the algorithm. For example, a random decision forest model may have hyperparameters such as the number of trees and tree depth, while a neural network model may have hyperparameters such as the number of hidden layers and nodes in each layer. Finding the optimal configuration of hyperparameters can be a challenging task, as there is often no way to know in advance what the ideal values should be. </p>



<p class="wp-block-paragraph">This requires experimentation with different hyperparameter settings, which can be time-consuming if done manually. Grid search is a useful tool for automating this process and efficiently finding the best hyperparameter configuration for a given model.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9871" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/picture2-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png" data-orig-size="696,472" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Picture2-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png" src="https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png" alt="" class="wp-image-9871" width="385" height="260" srcset="https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png 696w, https://www.relataly.com/wp-content/uploads/2022/10/Picture2-min.png 300w" sizes="(max-width: 385px) 100vw, 385px" /><figcaption class="wp-element-caption">Hyperparameters are the little screws that we can adjust to tune a predictive model.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading" id="h-efficient-hyperparameter-tuning-with-exhaustive-grid-search">Efficient Hyperparameter Tuning with Exhaustive&nbsp;Grid Search</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">When we train a machine learning model, it is usually unclear which hyperparameters lead to good results. While there are estimates and rules of thumb, there is often no way to avoid trying out hyperparameters in experiments. However, machine learning models often have several hyperparameters that affect the model&#8217;s performance in a nonlinear way.</p>



<p class="wp-block-paragraph">We can use grid search to automate searching for optimal model hyperparameters. The grid search algorithm exhaustively trains and evaluates one model for every combination of values in a parameter grid. Let&#8217;s take a look at how this works.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Hyperparameter Tuning with Grid Search: How it Works</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The idea behind the grid search technique is quite simple. We have a model with hyperparameters, and the challenge is to test various configurations until we are satisfied with the result. Grid search is exhaustive in that it tests every combination of values in a parameter grid. The number of model variants is the product of the number of candidate values specified for each hyperparameter.</p>



<p class="wp-block-paragraph">The grid search algorithm requires us to provide the following information:</p>



<ul class="wp-block-list">
<li>The hyperparameters that we want to configure (e.g., tree depth)</li>



<li>For each hyperparameter, a range of values (e.g., [50, 100, 150])</li>



<li>A performance metric so that the algorithm knows how to measure performance (e.g., accuracy for a classification model)</li>
</ul>



<p class="wp-block-paragraph">For example, imagine we have the values [16, 32, 64] for n_estimators and [8, 16, 32] for max_depth. Then, grid search will test 3 &times; 3 = 9 different parameter configurations.</p>
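<p class="wp-block-paragraph">The following sketch enumerates this example grid with the Python standard library to show where the nine configurations come from. The dictionary layout mirrors the way parameter grids are commonly written for scikit-learn, but the code itself is only an illustration and does not train any model.</p>

```python
from itertools import product

# Example grid from the text: two hyperparameters with three values each
param_grid = {
    "n_estimators": [16, 32, 64],
    "max_depth": [8, 16, 32],
}

# Build one dictionary per combination of values
names = list(param_grid)
configurations = [dict(zip(names, values))
                  for values in product(*param_grid.values())]

print(len(configurations))  # 9
for config in configurations[:3]:
    print(config)
```

<p class="wp-block-paragraph">Adding a third hyperparameter with three values would already triple the number of configurations to 27.</p>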



<h3 class="wp-block-heading">Early Stopping</h3>



<p class="wp-block-paragraph">Running parameter optimization against an entire grid can be time-consuming, but there are ways to shorten the process. Depending on how much time you want to invest in the search process, you can test all combinations exhaustively or shorten the process with an early stopping logic. A stopping logic defines that the search ends early when a specific criterion is met. Such a criterion could be, for example, that newly trained models underperform the average performance of previously trained models by a certain value. In this case, the search stops and returns the best models found up to that point. When you define a large search grid with many parameters, defining an early stopping logic is recommended. </p>
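<p class="wp-block-paragraph">A stopping rule of this kind could look roughly like the sketch below. It is a hand-written loop under simplifying assumptions, not a scikit-learn feature: the function name, the <em>tolerance</em> and <em>min_trials</em> parameters, and the hard-coded toy scores are hypothetical. In a real project, the <em>evaluate</em> function would train a model and return its validation score.</p>

```python
from itertools import product

def grid_search_with_early_stopping(param_grid, evaluate, tolerance=0.05, min_trials=3):
    """Evaluate configurations in grid order and stop as soon as a new score
    falls more than `tolerance` below the average of all previous scores."""
    names = list(param_grid)
    previous_scores = []
    best_config, best_score = None, float("-inf")
    trials = 0
    for values in product(*param_grid.values()):
        config = dict(zip(names, values))
        score = evaluate(config)
        trials += 1
        if score > best_score:
            best_config, best_score = config, score
        if len(previous_scores) >= min_trials:
            average = sum(previous_scores) / len(previous_scores)
            if score < average - tolerance:
                break  # stopping criterion met: keep the best result so far
        previous_scores.append(score)
    return best_config, best_score, trials

# Hypothetical validation scores that stand in for real model training
toy_scores = {2: 0.70, 4: 0.78, 8: 0.82, 16: 0.80, 32: 0.65, 64: 0.60}

def toy_evaluate(config):
    return toy_scores[config["max_depth"]]

best_config, best_score, trials = grid_search_with_early_stopping(
    {"max_depth": [2, 4, 8, 16, 32, 64]}, toy_evaluate
)
print(best_config, best_score, trials)  # {'max_depth': 8} 0.82 5
```

<p class="wp-block-paragraph">Here the search ends after five of six configurations, because the fifth score (0.65) is more than 0.05 below the average of the first four (0.775). Keep in mind that such a rule depends on the order in which the grid is traversed, so it can miss good configurations that would have come later.</p>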



<h3 class="wp-block-heading" id="h-strengths-and-weaknesses-of-grid-search">Strengths and Weaknesses of Grid Search</h3>



<p class="wp-block-paragraph">The advantage of grid search is that the algorithm automatically identifies the best parameter configuration within the parameter grid. However, the number of possible configurations is the product of the number of values per hyperparameter, so it grows exponentially with the number of hyperparameters in the grid. So, in practice, defining a sparse parameter grid or defining stopping criteria is essential.</p>



<p class="wp-block-paragraph">Grid search is only one of several techniques that can be used to tune the hyperparameters of a predictive model. Alternative techniques include <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" target="_blank" rel="noreferrer noopener">Random Search</a>. In contrast to grid search, random search is a non-exhaustive hyperparameter-tuning technique that randomly selects and tests configurations from a predefined search space. Further optimization techniques are Bayesian search and gradient-based optimization.</p>
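<p class="wp-block-paragraph">The difference between the two approaches can be sketched in a few lines. Instead of evaluating all nine configurations of the example grid, a random search draws only a fixed number of them. The budget of four configurations and the fixed seed are arbitrary choices for this illustration.</p>

```python
import random
from itertools import product

param_grid = {
    "n_estimators": [16, 32, 64],
    "max_depth": [8, 16, 32],
}

# All 9 configurations that an exhaustive grid search would evaluate
names = list(param_grid)
all_configs = [dict(zip(names, values))
               for values in product(*param_grid.values())]

# Random search: draw 4 distinct configurations from the same search space
rng = random.Random(0)  # fixed seed so the draw is reproducible
sampled_configs = rng.sample(all_configs, k=4)

for config in sampled_configs:
    print(config)
```

<p class="wp-block-paragraph">The trade-off is that the runtime is now controlled by the budget rather than by the grid size, at the price of possibly missing the best configuration in the grid.</p>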
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" data-attachment-id="2354" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/image-8-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/07/image-8.png" data-orig-size="384,196" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/07/image-8.png" src="https://www.relataly.com/wp-content/uploads/2020/07/image-8.png" alt="Grid Search - A search grid with two hyperparameters and three hyperparameter values" class="wp-image-2354" width="397" height="201"/><figcaption class="wp-element-caption">A parameter grid with two hyperparameters and respectively three hyperparameter values</figcaption></figure>
</div></div>
</div>



<h2 class="wp-block-heading">Evaluation Metrics</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The question of which metric to optimize against inevitably arises when we talk about optimization. Generally, all common metrics available for classification or regression come into question.</p>



<p class="wp-block-paragraph">Metrics for regression (<a href="https://www.relataly.com/regression-error-metrics-python/923/" target="_blank" rel="noreferrer noopener">more detailed description</a>)</p>



<ul class="wp-block-list">
<li>Mean Absolute Error (MAE) </li>



<li>Root Mean Squared Error (RMSE)</li>



<li>Relative Squared Error (RSE)</li>
</ul>
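<p class="wp-block-paragraph">As a small worked example, the following sketch computes the mean absolute error, the root mean squared error, and the relative squared error in plain Python. The function names and the four sample values are made up for illustration. In practice, you would typically rely on the metric functions that ship with scikit-learn.</p>

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

def relative_squared_error(y_true, y_pred):
    """Squared error relative to always predicting the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    baseline = sum((t - mean_true) ** 2 for t in y_true)
    return squared / baseline

# Hypothetical actual values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
rse = relative_squared_error(y_true, y_pred)

print(f"MAE:  {mae:.3f}")   # 1.000
print(f"RMSE: {rmse:.3f}")  # 1.225
print(f"RSE:  {rse:.3f}")   # 0.300
```

<p class="wp-block-paragraph">For all three metrics, lower values indicate a better fit, so a search that optimizes against them must minimize rather than maximize the score.</p>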



<p class="wp-block-paragraph">Metrics for classification (<a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">more detailed description</a>)</p>



<ul class="wp-block-list">
<li>Accuracy</li>



<li>Precision</li>



<li>F1 Score</li>



<li>Recall</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Tuning the Hyperparameters of a Random Decision Forest Classifier in Python using Grid Search</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">Now that we have familiarized ourselves with the basic concept of hyperparameter tuning, let&#8217;s move on to the Python hands-on part! In this part, we will develop a classification model based on the Titanic dataset and apply the grid search optimization technique to it.</p>



<p class="wp-block-paragraph">The sinking of the Titanic was one of the most catastrophic maritime disasters, leading to more than 1500 casualties (the exact number is unknown because several passengers were unregistered). The Titanic dataset contains a list of passengers with passenger information such as age, gender, cabin, ticket cost, etc., and whether they survived the Titanic sinking. The information about the passengers shows certain patterns that allow conclusions about the likelihood of the passengers surviving the accident. These data can be used to train a predictive model.</p>



<p class="wp-block-paragraph">In the following, we will use the survival flag as a label and passenger information as input for a classification model. The goal is to predict whether a passenger will survive the Titanic sinking or not. The algorithm will be a random decision forest algorithm that classifies the passengers into two groups, survivors and non-survivors. Once we have trained a baseline model, we will apply grid search to optimize the hyperparameters of this model and select the best model.</p>
</div>
</div>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_9836db-d0"><a class="kb-button kt-button button kb-btn_12cb2e-6a kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/016%20Hyperparameter%20Tuning%20of%20Random%20Decision%20Forests%20using%20Grid%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_f9b732-8b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://cdn.britannica.com/s:800x1000/72/153172-050-EB2F2D95/Titanic.jpg" alt="Operated by the White Star Line, RMS Titanic was the largest and most luxurious ocean liner of her time." width="357" height="227"/><figcaption class="wp-element-caption">Source: <em>The National Archives/Heritage-Images/Imagestate</em></figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have a Python environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the Python Machine Learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> to implement the random forest and the grid search technique. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-about-the-titanic-dataset">About the Titanic Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this article, we will be working with the popular titanic dataset for classification. The Titanic dataset is a well-known dataset that contains information about the passengers on the Titanic, a British passenger liner that sank in the North Atlantic Ocean in 1912 after colliding with an iceberg. The dataset includes variables such as the passenger&#8217;s name, age, fare, and class, as well as whether or not the passenger survived.</p>



<p class="wp-block-paragraph">The Titanic dataset contains the following information on the passengers:</p>



<ul class="wp-block-list">
<li><strong>Survived</strong>: Survival, 0 = No, 1 = Yes (prediction label)</li>



<li><strong>Pclass</strong>: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd</li>



<li><strong>Sex</strong>: Sex</li>



<li><strong>Age</strong>: Age in years</li>



<li><strong>SibSp</strong>: # of siblings/spouses aboard the Titanic</li>



<li><strong>Parch</strong>: # of parents/children aboard the Titanic</li>



<li><strong>Ticket</strong>: Ticket number</li>



<li><strong>Fare</strong>: Passenger fare</li>



<li><strong>Cabin</strong>: Cabin number</li>



<li><strong>Embarked</strong>: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton</li>
</ul>



<p class="wp-block-paragraph">The Survived column contains the prediction label, which states whether a passenger survived the sinking of the Titanic or not.</p>



<p class="wp-block-paragraph">You can download the Titanic dataset from <a href="https://www.kaggle.com/c/titanic" target="_blank" rel="noreferrer noopener">the Kaggle website</a>. Once you have completed the download, you can place the dataset in the file path of your choice. If you are using the Kaggle Python environment, you can save the dataset directly into your Kaggle project.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7036" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/picture28/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png" data-orig-size="693,720" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Picture28" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png" src="https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png" alt="We can assume that the cabin location of the passengers had an impact on their chance to survive the sinking. Developing a machine learning model for prediction of titanic passenger survival and optimizing its hyperparameters using grid search" class="wp-image-7036" width="377" height="391" srcset="https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png 693w, https://www.relataly.com/wp-content/uploads/2022/04/Picture28.png 289w" sizes="(max-width: 377px) 100vw, 377px" /><figcaption class="wp-element-caption">We can assume that the cabin location of the passengers had an impact on their chance of surviving the sinking. </figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Step #1 Load the Titanic Data</h2>



<p class="wp-block-paragraph">The following code will load the Titanic data into our Python project. If you have placed the data outside the path shown below, don&#8217;t forget to adjust the file path in the code.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

# set file path
filepath = &quot;data/titanic-grid-search/&quot;

# Load train and test datasets
titanic_train_df = pd.read_csv(filepath + 'titanic-train.csv')
titanic_test_df = pd.read_csv(filepath + 'titanic-test.csv')
titanic_train_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	PassengerId	Survived	Pclass	Name							Sex	Age	SibSp	Parch	Ticket				Fare	Cabin	Embarked
0	1			0			3		Braund, Mr. Owen Harris			male	22.0	1	0	A/5 21171			7.2500	NaN		S
1	2			1			1		Cumings, Mrs. John Bradley ...	female	38.0	1	0	PC 17599			71.2833	C85		C
2	3			1			3		Heikkinen, Miss. Laina			female	26.0	0	0	STON/O2. 3101282	7.9250	NaN		S
3	4			1			1		Futrelle, Mrs. Jacques ...		female	35.0	1	0	113803				53.1000	C123	S
4	5			0			3		Allen, Mr. William Henry		male	35.0	0	0	373450				8.0500	NaN		S</pre></div>



<h3 class="wp-block-heading" id="h-step-2-preprocessing-and-exploring-the-data">Step #2 Preprocessing and Exploring the Data</h3>



<p class="wp-block-paragraph">Before we can train a model, we preprocess the data: </p>



<ul class="wp-block-list">
<li>First, we replace missing numeric values with the column mean and missing <em>Embarked</em> values with the port code C. </li>



<li>Second, we transform categorical features (<em>Embarked </em>and <em>Sex</em>) into numeric values. In addition, we will delete some columns to reduce model complexity. </li>



<li>Finally, we delete the prediction label from the training dataset and place it into a separate dataset named y_df.</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define a function for preprocessing the train and test data 
def preprocess(df):
    
    # Delete some columns that we will not use
    new_df = df[df.columns[~df.columns.isin(['Cabin', 'PassengerId', 'Name', 'Ticket'])]].copy()
    
    # Replace missing numeric values with the column mean
    for col in new_df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns:
        new_df[col] = new_df[col].fillna(new_df[col].mean())
    # Replace missing ports of embarkation with 'C'
    new_df['Embarked'] = new_df['Embarked'].fillna('C')
    
    # Encode categorical values as integer values
    new_df_b = new_df.copy()
    new_df_b['Sex'] = np.where(new_df_b['Sex'] == 'male', 0, 1)
    new_df_b['Embarked'] = new_df_b['Embarked'].map({'S': 1, 'Q': 2, 'C': 3})
    
    # Separate the features from the prediction label
    x = new_df_b.drop(columns=['Survived'])
    y = new_df_b['Survived']  
    
    return x, y

# Create the feature dataset x_df and the label dataset y_df
x_df, y_df = preprocess(titanic_train_df)
x_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		Pclass	Sex	Age		SibSp	Parch	Fare	Embarked
0		3		0	22.0	1		0		7.2500	1
1		1		1	38.0	1		0		71.2833	3
2		3		1	26.0	0		0		7.9250	1
3		1		1	35.0	1		0		53.1000	1
4		3		0	35.0	0		0		8.0500	1</pre></div>
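<p class="wp-block-paragraph">To see what the imputation and encoding steps do in isolation, the following minimal sketch applies them to three made-up rows. The toy values are assumptions chosen for illustration and are not taken from the Titanic data.</p>

```python
import numpy as np
import pandas as pd

# Made-up sample rows that mimic the Titanic columns (not real passengers)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})

# Mean imputation: the missing age becomes (20 + 40) / 2 = 30
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Missing ports are filled with 'C', as in the preprocess function
toy["Embarked"] = toy["Embarked"].fillna("C")

# Encode the categories with the same integer codes as in preprocess
toy["Sex"] = np.where(toy["Sex"] == "male", 0, 1)
toy["Embarked"] = toy["Embarked"].map({"S": 1, "Q": 2, "C": 3})

print(toy)
```

<p class="wp-block-paragraph">After these steps, the toy table contains only numeric values and no missing entries, which is what the random forest expects as input.</p>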



<p class="wp-block-paragraph">Let&#8217;s take a quick look at the data by creating paired plots for the columns of our data set. Pair plots help us to understand the relationships between pairs of variables in a dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create pair plots for the columns, colored by the prediction label
df_plot = titanic_train_df.copy()

sns.pairplot(df_plot, hue=&quot;Survived&quot;, height=2.5, palette='muted')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="6803" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/hyperparameter-tuning-random-decision-forests/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png" data-orig-size="1124,1062" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="hyperparameter-tuning-random-decision-forests" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png" src="https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests-1024x968.png" alt="paired plot created with seaborn" class="wp-image-6803" width="768" height="726" srcset="https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/hyperparameter-tuning-random-decision-forests.png 1124w" sizes="(max-width: 768px) 100vw, 768px" /></figure>



<p class="wp-block-paragraph">The pair plots tell us various things. For example, most passengers were between 25 and 35 years old. In addition, we can see that most passengers had low-fare tickets, while some passengers had significantly more expensive tickets. </p>



<h3 class="wp-block-heading" id="h-step-3-splitting-the-data">Step #3 Splitting the Data</h3>



<p class="wp-block-paragraph">Next, we will split the data set into training data (x_train, y_train) and test data (x_test, y_test) using a split ratio of 70/30.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)</pre></div>
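<p class="wp-block-paragraph">The effect of the 70/30 split on the row counts can be checked with placeholder data. The sketch below assumes 891 rows, which is the size of the Kaggle training file; the arrays themselves are dummies and carry no Titanic information.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with as many rows as the Kaggle training file (assumed: 891)
features = np.arange(891).reshape(-1, 1)
labels = np.zeros(891, dtype=int)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, train_size=0.7, random_state=0
)

print(len(x_tr), len(x_te))  # 623 268
```

<p class="wp-block-paragraph">Under this assumption, 623 passengers are used for training and 268 are held back for testing.</p>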



<h3 class="wp-block-heading" id="h-step-4-building-a-single-random-forest-model">Step #4 Building a Single Random Forest Model</h3>



<p class="wp-block-paragraph">After completing the preprocessing, we can train the first model. The model uses the random forest algorithm, which has a large number of hyperparameters.</p>



<h4 class="wp-block-heading" id="h-4-1-about-the-random-forest-algorithm">4.1 About the Random Forest Algorithm</h4>



<p class="wp-block-paragraph">A random forest is a robust predictive algorithm that can handle classification and regression tasks. As a so-called ensemble model, the random forest considers predictions from a group of several independent estimators. </p>
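<p class="wp-block-paragraph">The ensemble idea can be made visible in scikit-learn. The following sketch uses synthetic data from <em>make_classification</em> (an assumption for illustration, not the Titanic data) and shows that the class probabilities of the forest are the average of the probabilities predicted by its individual trees.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in forest.estimators_
tree_probas = np.array([tree.predict_proba(X) for tree in forest.estimators_])

# The class probabilities of the forest are the average over all trees
averaged = tree_probas.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X)))
```

<p class="wp-block-paragraph">The predicted class is then the one with the highest averaged probability, so a single unusual tree has only a limited effect on the final prediction.</p>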



<p class="wp-block-paragraph">Random decision forests have several hyperparameters that we can use to influence their behavior. However, not all of these hyperparameters have the same influence on model performance. Limiting the number of models by defining a sparse parameter grid is essential to reduce the amount of time needed to test the hyperparameters. </p>



<p class="wp-block-paragraph">Therefore, we restrict the hyperparameters optimized by the grid search approach to the following two:</p>



<ul class="wp-block-list">
<li><strong>n_estimators</strong> determines the number of decision trees in the forest</li>



<li><strong>max_depth</strong> defines the maximum depth of each decision tree, that is, the longest allowed path from the root to a leaf</li>
</ul>



<p class="wp-block-paragraph">In the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier" target="_blank" rel="noreferrer noopener">scikit-learn documentation</a>, you also find a full list of available hyperparameters. For the rest of these hyperparameters, we will use the default value defined by scikit-learn.</p>
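<p class="wp-block-paragraph">What the two hyperparameters control can be verified on a fitted model. This is a small sketch on synthetic data; the values 25 and 4 are arbitrary choices for illustration.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X, y)

# n_estimators controls how many trees are grown
n_trees = len(forest.estimators_)

# max_depth caps the depth of every individual tree
deepest = max(tree.get_depth() for tree in forest.estimators_)

print(n_trees, deepest)
```

<p class="wp-block-paragraph">The forest contains exactly 25 trees, and none of them is deeper than four levels. More trees generally increase training time, while deeper trees can fit the training data more closely.</p>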



<h4 class="wp-block-heading" id="h-4-2-implementing-a-random-forest-model">4.2 Implementing a Random Forest Model</h4>



<p class="wp-block-paragraph">We train a simple baseline model and make a test prediction with the x_test dataset. Then we visualize the performance of the baseline model in a confusion matrix:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a single random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators = 100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="486" height="452" data-attachment-id="8471" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/output-1-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/output-1.png" data-orig-size="486,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/output-1.png" src="https://www.relataly.com/wp-content/uploads/2022/05/output-1.png" alt="Confusion matrix of the best-guess random forest model before hyperparameter tuning" class="wp-image-8471" srcset="https://www.relataly.com/wp-content/uploads/2022/05/output-1.png 486w, https://www.relataly.com/wp-content/uploads/2022/05/output-1.png 300w" sizes="(max-width: 486px) 100vw, 486px" /><figcaption class="wp-element-caption">Confusion matrix of the best-guess random forest model</figcaption></figure>



<p class="wp-block-paragraph">Our best-guess model correctly predicted that 151 passengers would not survive. The dark-blue number in the top-left is the group of Titanic passengers that did not survive the sinking and that our model correctly classified as non-survivors. The green area in the bottom-right shows the passengers who survived the sinking and were correctly classified. The other two sections show the number of times our model was wrong. </p>



<p class="wp-block-paragraph">In total, these results correspond to a model accuracy of 80%. Considering that this was a best-guess model, these results are pretty good. However, we can further optimize these results by using the grid search approach for hyperparameter tuning.</p>
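<p class="wp-block-paragraph">The accuracy follows directly from the confusion matrix: it is the share of predictions on the diagonal. The sketch below illustrates this with ten made-up labels, which are not the results of our model.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for ten passengers (0 = did not survive, 1 = survived)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)

# Correct predictions sit on the diagonal of the matrix
accuracy = np.trace(cm) / cm.sum()

print(cm)
print(accuracy, accuracy_score(y_true, y_hat))
```

<p class="wp-block-paragraph">In this toy example, eight of ten predictions are on the diagonal, which gives an accuracy of 0.8. The manual calculation agrees with <em>accuracy_score</em> from scikit-learn.</p>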



<h3 class="wp-block-heading" id="h-step-5-hyperparameter-tuning-a-classification-model-using-the-grid-search-technique">Step #5 Hyperparameter Tuning a Classification Model using the Grid Search Technique</h3>



<p class="wp-block-paragraph">By comparing the performance of different model configurations, we can find the best set of hyperparameters that yields the highest accuracy. This approach is a powerful tool for fine-tuning machine learning models and improving their performance. So let&#8217;s get started and see if we can beat the results of our best-guess model using the grid search technique! </p>



<h4 class="wp-block-heading">Training and Tuning the Model</h4>



<p class="wp-block-paragraph">Next, we will use the grid search technique to optimize a random decision forest model that predicts the survival of Titanic passengers. We&#8217;ll define a grid of hyperparameter values in Python and then use the Scikit-learn library to train and test the model with different hyperparameter configurations. First, we will define a parameter range:</p>



<ul class="wp-block-list">
<li>max_depth = [2, 8, 16]</li>



<li>n_estimators = [64, 128, 256]</li>
</ul>



<p class="wp-block-paragraph">We leave the other parameters at their default value. In addition, we need to define against which metric we want the grid search algorithm to evaluate the model performance. Since we have no personal preference and our dataset is well-balanced, we keep the default scoring of the classifier, which is accuracy, and compare the configurations by their mean test score across the cross-validation folds. Then we run the grid search algorithm. </p>
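<p class="wp-block-paragraph">Before running the search, it can be useful to estimate how many models will be trained. The following sketch enumerates the grid with <em>ParameterGrid</em> from scikit-learn and assumes five cross-validation folds, as in the grid search code of this tutorial.</p>

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"max_depth": [2, 8, 16], "n_estimators": [64, 128, 256]}
cv_folds = 5  # number of cross-validation folds passed to GridSearchCV

# ParameterGrid enumerates every combination that the grid search will try
combinations = list(ParameterGrid(param_grid))
n_fits = len(combinations) * cv_folds

print(len(combinations), n_fits)  # 9 45
```

<p class="wp-block-paragraph">Nine configurations with five folds each result in 45 model fits, plus one final refit of the best configuration on the full training data. Every additional value in the grid multiplies this number, which is why a sparse grid keeps the runtime manageable.</p>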



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define Parameters
max_depth=[2, 8, 16]
n_estimators = [64, 128, 256]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# Build the grid search
dfrst = RandomForestClassifier()  # max_depth and n_estimators are set by the grid search
grid = GridSearchCV(estimator=dfrst, param_grid=param_grid, cv = 5)
grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print(&quot;Best: {0}, using {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)
results_df</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Best: [0.79611613 0.78005161 0.79290323 0.81387097 0.82187097 0.81867097 
 0.78818065 0.78816774 0.78498065], using {'max_depth': 8, 'n_estimators': 128}
 
 	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_max_depth	param_n_estimators	params	split0_test_score				split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.057045		0.001108		0.005001		0.000001		2				64					{'max_depth': 2, 'n_estimators': 64}	0.824				0.800				0.784				0.774194			0.798387		0.796116	0.016883	4
1	0.112051		0.002088		0.009490		0.000775		2				128					{'max_depth': 2, 'n_estimators': 128}	0.760				0.824				0.784				0.750000			0.782258		0.780052	0.025523	9
2	0.221600		0.003740		0.016487		0.000448		2				256					{'max_depth': 2, 'n_estimators': 256}	0.792				0.824				0.784				0.774194			0.790323		0.792903	0.016756	5
3	0.061998		0.001410		0.005801		0.000400		8				64					{'max_depth': 8, 'n_estimators': 64}	0.784				0.824				0.792				0.806452			0.862903		0.813871	0.028044	3
4	0.122886		0.002652		0.009587		0.000480		8				128					{'max_depth': 8, 'n_estimators': 128}	0.784				0.848				0.808				0.806452			0.862903		0.821871	0.029089	1
5	0.250295		0.007654		0.018557		0.000836		8				256					{'max_depth': 8, 'n_estimators': 256}	0.800				0.824				0.800				0.806452			0.862903		0.818671	0.023797	2
6	0.065602		0.000505		0.005800		0.000399		16				64					{'max_depth': 16, 'n_estimators': 64}	0.736				0.808				0.784				0.766129			0.846774		0.788181	0.037557	6
7	0.127662		0.003297		0.008600		0.004080		16				128					{'max_depth': 16, 'n_estimators': 128}	0.752				0.800				0.784				0.758065			0.846774		0.788168	0.034078	7
8	0.259617		0.003121		0.018873		0.000537		16				256					{'max_depth': 16, 'n_estimators': 256}	0.752				0.784				0.776				0.766129			0.846774		0.784981	0.032690	8</pre></div>



<p class="wp-block-paragraph">The table above is an overview of the tested model configurations. The column rank_test_score ranks them by their mean test score. The fifth configuration (index 4) achieved the best result with a mean test score of about 0.822. The parameters of this model are a maximum depth of 8 and 128 estimators. </p>
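<p class="wp-block-paragraph">The results table is easier to read when it is sorted by rank. The sketch below shows one way to do this, together with the attributes <em>best_score_</em> and <em>best_params_</em>. It uses synthetic data and a deliberately tiny grid so that it runs quickly; both are assumptions for illustration and will not reproduce the Titanic scores.</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately tiny grid, used only for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
small_grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X, y)

# Sort the configurations so that the best-ranked one comes first
ranked = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(ranked[["params", "mean_test_score", "rank_test_score"]])

# The winning score and parameters are also available directly
print(search.best_score_, search.best_params_)
```

<p class="wp-block-paragraph">Here, <em>best_score_</em> is the highest mean test score in the table, and <em>best_params_</em> is the configuration that achieved it.</p>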



<h4 class="wp-block-heading">Model Evaluation</h4>



<p class="wp-block-paragraph">We select the best model and use it to predict the test data set. We visualize the results in another confusion matrix. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Extract the best decision forest 
best_clf = grid_results.best_estimator_
y_pred = best_clf.predict(x_test)

# Create a confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Create heatmap from the confusion matrix
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
tick_marks = [0.5, 1.5]
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2297" data-permalink="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/image-5-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/07/image-5.png" data-orig-size="486,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/07/image-5.png" src="https://www.relataly.com/wp-content/uploads/2020/07/image-5.png" alt="confusion matrix on the best model returned by the grid search hyperparameter tuning approach in Python" class="wp-image-2297" width="529" height="491" srcset="https://www.relataly.com/wp-content/uploads/2020/07/image-5.png 486w, https://www.relataly.com/wp-content/uploads/2020/07/image-5.png 300w" sizes="(max-width: 529px) 100vw, 529px" /><figcaption class="wp-element-caption">Confusion matrix of the best grid search model</figcaption></figure>



<p class="wp-block-paragraph">The confusion matrix shows the results of the best model from the grid search technique. The result is an overall model accuracy of roughly 83.5%, which shows that the best grid search model outperforms our initial best-guess model. This model correctly predicted that 148 passengers would not survive and that 76 passengers would survive. In 44 cases, the model was wrong.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">This article has shown how we can use grid search in Python to systematically search for the optimal hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how to use grid search to try out all combinations of a predefined parameter grid. </p>



<p class="wp-block-paragraph">In the hands-on part of this article, we developed a random decision forest that predicts the survival of Titanic passengers using Python and scikit-learn. First, we developed a baseline model with best-guess parameters. Subsequently, we defined a parameter grid and used the grid search technique to tune the hyperparameters of the random decision forest. In this way, we quickly identified a configuration that outperforms our initial baseline model and demonstrated how grid search can help optimize the parameters of a classification model. The technique applies not only to classification models but can also be used to optimize the performance of regression models. </p>



<p class="wp-block-paragraph">I hope this article was helpful. I am always interested to learn and improve. So, if you have any questions or suggestions, please write them in the comments. </p>



<p>The post <a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/">Tuning Model Hyperparameters with Grid Search at the Example of Training a Random Forest Classifier in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2261</post-id>	</item>
		<item>
		<title>Classifying Purchase Intention of Online Shoppers with Python</title>
		<link>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/</link>
					<comments>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 11 May 2020 21:42:35 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Sources]]></category>
		<category><![CDATA[Feature Permutation Importance]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Marketing Automation]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[AI in Marketing]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Classification Error Metrics]]></category>
		<category><![CDATA[Confusion Matrix]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Whisker Plots]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=982</guid>

					<description><![CDATA[<p>Online shopping has become a part of our daily lives, and online stores are continually seeking to improve their sales. One way to achieve this is by using machine learning to predict customers&#8217; purchase intentions. This innovative process can help businesses understand their customers&#8217; behavior and tailor their marketing strategies accordingly. In this article, we ... <a title="Classifying Purchase Intention of Online Shoppers with Python" class="read-more" href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/" aria-label="Read more about Classifying Purchase Intention of Online Shoppers with Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/">Classifying Purchase Intention of Online Shoppers with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Online shopping has become a part of our daily lives, and online stores are continually seeking to improve their sales. One way to achieve this is by using machine learning to predict customers&#8217; purchase intentions. This innovative process can help businesses understand their customers&#8217; behavior and tailor their marketing strategies accordingly.</p>



<p class="wp-block-paragraph">In this article, we will explore the practical side of purchase intention prediction. Our focus is on developing a classification model that predicts whether a visitor will make a purchase or not. We&#8217;ll use Scikit-Learn&#8217;s machine learning library to train a Logistic Regression algorithm, and evaluate the model&#8217;s performance. Our ultimate goal is to provide insights into the circumstances under which customers make purchase decisions.</p>



<p class="wp-block-paragraph">Predicting purchase intentions can offer significant benefits to online stores, such as identifying potential customers who are most likely to buy and targeting their marketing efforts accordingly. By understanding the practical application of machine learning for purchase intention prediction, online businesses can gain a competitive edge and increase their revenue.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/" target="_blank" rel="noreferrer noopener">Sentiment Analysis with Naive Bayes and Logistic Regression in Python</a></p>



<h2 class="wp-block-heading">About Modeling Customer Purchase Intentions</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Customer purchase intention prediction is the process of using machine learning algorithms to predict the likelihood that a particular customer will make a purchase. This can be useful for various applications, such as identifying potential customers most likely interested in a particular product or service and targeting marketing and sales efforts accordingly.</p>



<p class="wp-block-paragraph">To make accurate predictions about customer purchase intentions, it is important to have access to high-quality data about the customer, such as their demographic information, purchasing history, and other relevant factors. By analyzing this data and applying appropriate machine learning algorithms, it is possible to identify patterns and trends that can predict the likelihood that a particular customer will make a purchase.</p>



<p class="wp-block-paragraph">There are many different approaches to customer purchase intention prediction, and the specific methods used can vary depending on the application and the data available. Some common techniques for predicting customer purchase intentions include using regression analysis to model the relationship between purchase intentions and other variables and using classification algorithms to classify customers as likely or unlikely to make a purchase. By using these techniques, it is possible to make more accurate and useful predictions about customer purchase intentions.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/" target="_blank" rel="noreferrer noopener">Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="478" height="500" data-attachment-id="12685" data-permalink="https://www.relataly.com/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" data-orig-size="478,500" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="men and woman doing groceries machine learning customer purchase intention prediction relataly midjourney-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" alt="Customer purchase intentions sometimes follow patterns that can be used for predictive purposes. Image created with Midjourney." 
class="wp-image-12685" srcset="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png 478w, https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png 287w" sizes="(max-width: 478px) 100vw, 478px" /><figcaption class="wp-element-caption">Customer purchase intentions sometimes follow patterns that can be used for predictive purposes. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h2 class="wp-block-heading">How Modeling Purchase Intentions can Lead to a Better Customer Understanding</h2>



<p class="wp-block-paragraph">Predicting the purchase intentions of online shoppers can help online stores understand their customers better. Predictive models make it possible to draw conclusions about the factors that influence customers&#8217; buying behavior. At what time of day are customers most inclined to buy? For which products do customers often abandon the purchase process? Such questions are highly relevant for marketing departments. Once answered, they enable marketers to optimize the buying experience and achieve a higher conversion rate. In this way, intention prediction can help online stores target customers with the right products at the right time and thus take a step toward marketing automation.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="6828" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-13-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png" data-orig-size="1846,861" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Classifying Purchase Intentions of Online Shoppers with Python" data-image-description="&lt;p&gt;Classifying Purchase Intentions of Online Shoppers with Python&lt;/p&gt;
" data-image-caption="&lt;p&gt;Classifying Purchase Intentions of Online Shoppers with Python&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-13-1024x478.png" alt="A classification model that predicts the buying intention of online shoppers" class="wp-image-6828" width="760" height="355" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1846w" sizes="(max-width: 760px) 100vw, 760px" /></figure>



<h2 class="wp-block-heading" id="h-implementing-a-prediction-model-for-purchase-intentions-with-python">Implementing a Prediction Model for Purchase Intentions with Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Logistic regression is a widely-used algorithm in machine learning that is particularly useful for solving two-class classification problems. One of the primary benefits of using logistic regression models is that they can help us understand the factors that influence the predictions made by the model. This interpretability is a key advantage of logistic regression, making it a popular choice in many real-world applications.</p>



<p class="wp-block-paragraph">In the next steps, we will develop a two-class classification model that uses the logistic regression algorithm to predict the purchase intentions of online shoppers. The model analyzes a set of session features that are likely to influence a shopper&#8217;s decision, such as the number and type of pages visited, the time spent on them, and Google Analytics metrics, and estimates the likelihood that the session ends in a purchase. Logistic regression is particularly useful in this case because its coefficients indicate which features are the strongest predictors of purchase intention.</p>
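<p class="wp-block-paragraph">To illustrate how the coefficients of a logistic regression hint at feature relevance, here is a minimal sketch on synthetic data. The feature names <em>page_value</em> and <em>noise</em> and all numbers are assumptions made for this illustration and are not taken from the shopping dataset. Both features are scaled to the same range so that their coefficients can be compared.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for session data (hypothetical features, not the real dataset)
rng = np.random.default_rng(0)
n = 1000
page_value = rng.exponential(scale=5.0, size=n)  # informative feature
noise = rng.normal(size=n)                       # uninformative feature

# By construction, the purchase probability depends on page_value only
prob = 1.0 / (1.0 + np.exp(-(0.5 * page_value - 2.0)))
y = rng.binomial(1, prob)

# Scale both features to [0, 1] so the coefficient magnitudes are comparable
X = MinMaxScaler().fit_transform(np.column_stack([page_value, noise]))
model = LogisticRegression(max_iter=1000).fit(X, y)

for name, coef in zip(["page_value", "noise"], model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

<p class="wp-block-paragraph">In this sketch, the informative feature should receive a clearly larger positive coefficient than the noise feature. On real data, correlated features and different scales make such a reading less clear-cut.</p>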



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_d5d832-9e"><a class="kb-button kt-button button kb-btn_7d1c88-9e kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/019%20%20Classifying%20Shopper%20Buying%20Intention%20using%20Logistic%20Regression.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_040040-16 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and the required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>. Make sure to install the following packages:</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> and <a data-type="URL" data-id="https://seaborn.pydata.org/" href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn</a> for visualization. You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
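<p class="wp-block-paragraph">If you are unsure whether everything is installed, a short check like the following can help. It is only a sketch: the helper name <em>find_missing</em> is made up for this example, and the list contains the import names of the packages mentioned above (scikit-learn is imported as <em>sklearn</em>).</p>

```python
import importlib.util

# Import names of the packages used in this tutorial
# (note: scikit-learn is imported as "sklearn")
required = ["pandas", "numpy", "matplotlib", "sklearn", "seaborn"]

def find_missing(packages):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = find_missing(required)
print("Missing packages:", missing if missing else "none")
```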



<h3 class="wp-block-heading" id="h-about-the-dataset">About the Dataset</h3>



<p class="wp-block-paragraph">In this tutorial, we will be working with a public dataset from <a href="https://www.kaggle.com/roshansharma/online-shoppers-intention" target="_blank" rel="noreferrer noopener">Kaggle.com</a>. The data contains 12,330 shopping sessions, each described by 18 attributes (17 features plus the class label). You can download the data via the link below:</p>



<div class="wp-block-file"><a id="wp-block-file--media-3f304c01-ab35-4462-bda0-88dce356d27e" href="https://www.relataly.com/wp-content/uploads/2020/05/online_shoppers_intention.csv">online_shoppers_intention.csv</a><a href="https://www.relataly.com/wp-content/uploads/2020/05/online_shoppers_intention.csv" class="wp-block-file__button wp-element-button" download aria-describedby="wp-block-file--media-3f304c01-ab35-4462-bda0-88dce356d27e">Download</a></div>



<p class="wp-block-paragraph">The data stems from a large shopping website that recorded sessions over a period of one year. Each record belongs to a separate shopping session and user. This design is meant to avoid a bias toward a specific period, user, or day.</p>



<p class="wp-block-paragraph">Below you will find an overview of the features contained in the data (Source: Kaggle.com): </p>



<ul class="wp-block-list">
<li>&#8220;Administrative,&#8221; &#8220;Administrative Duration,&#8221; &#8220;Informational,&#8221; &#8220;Informational Duration,&#8221; &#8220;Product Related,&#8221; and &#8220;Product-Related Duration&#8221; represent the number of different types of pages visited by the visitor in that session and the total time spent in each of these page categories.&nbsp;</li>



<li>The &#8220;Bounce Rate,&#8221; &#8220;Exit Rate,&#8221; and &#8220;Page Value&#8221; features represent the metrics measured by &#8220;Google Analytics&#8221; for each page on the e-commerce site. </li>



<li>The &#8220;Special Day&#8221; feature indicates the closeness of the site visiting time to a specific special day (e.g., Mother&#8217;s Day, Valentine&#8217;s Day).</li>



<li>The dataset also includes an operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.</li>
</ul>



<p class="wp-block-paragraph">The &#8216;Revenue&#8217; attribute is the class label, also called the &#8220;prediction label.&#8221; It indicates whether a session ended in a purchase.</p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by loading the shopping dataset into a Pandas DataFrame. Afterward, we will print a brief overview of the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import calendar
import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib import cm
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load train data
filepath = &quot;data/classification-online-shopping/&quot;
df_shopping_base = pd.read_csv(filepath + 'online_shoppers_intention.csv') 
df_shopping_base</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType			Weekend	Revenue
0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			Feb		1					1		1		1			Returning_Visitor	False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			Feb		2					2		1		2			Returning_Visitor	False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			Feb		4					1		9		3			Returning_Visitor	False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			Feb		3					2		2		4			Returning_Visitor	False	False
4	0.0				0.0						0.0				0.0						10.0			627.500000				0.02		0.05		0.0			0.0			Feb		3					3		1		4			Returning_Visitor	True	False</pre></div>



<h3 class="wp-block-heading" id="h-step-2-cleaning-the-data">Step #2 Cleaning the Data</h3>



<p class="wp-block-paragraph">Before we can start training our prediction model, we&#8217;ll do some cleanups (handling missing data, data type conversions, treating outliers, and so on).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Encode the VisitorType categories as integers
print(df_shopping_base['VisitorType'].unique())
df_shop = df_shopping_base.replace({'VisitorType' : { 'New_Visitor' : 0, 'Returning_Visitor' : 1, 'Other' : 2 }})

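# Convert the month column to numeric values ('June' is spelled out in the data, so we shorten it first)
# Names that do not match an abbreviation become NaN and are dropped below
month_to_number = {name: number for number, name in enumerate(calendar.month_abbr) if name}
df_shop['Month'] = df_shop['Month'].replace('June', 'Jun').map(month_to_number)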

# Delete records with NAs
df_shop.dropna(inplace=True)

df_shop.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">['Returning_Visitor' 'New_Visitor' 'Other']
	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType	Weekend	Revenue
  0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			2		1					1		1		1			1			False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			2		2					2		1		2			1			False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			2		4					1		9		3			1			False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			2		3					2		2		4			1			False	False
4	0.0				0.0						0.0				0.0						10.0			627.50</pre></div>
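<p class="wp-block-paragraph">The month conversion above relies on <em>calendar.month_abbr</em> from the Python standard library. The following sketch shows the idea in isolation on three toy values. It assumes an English (or default C) locale, because the abbreviations depend on the active locale. The name <em>month_to_number</em> is chosen for this example only.</p>

```python
import calendar
import pandas as pd

# calendar.month_abbr is indexable: position 0 is an empty string,
# positions 1..12 hold 'Jan'..'Dec' (assuming an English or C locale)
month_to_number = {name: number for number, name in enumerate(calendar.month_abbr) if name}

# Toy values; 'June' is spelled out, as in the shopping dataset
months = pd.Series(["Feb", "June", "Dec"])
numbers = months.replace("June", "Jun").map(month_to_number)
print(numbers.tolist())  # [2, 6, 12]
```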



<h3 class="wp-block-heading" id="h-step-3-exploring-the-data">Step #3 Exploring the Data</h3>



<p class="wp-block-paragraph">Next, we will familiarize ourselves with the data. </p>



<h4 class="wp-block-heading" id="h-3-1-class-labels">3.1 Class Labels</h4>



<p class="wp-block-paragraph">First, we take a look at the class labels to see how balanced they are. If class labels are balanced, it means that each class has an approximately equal number of examples in the training data. This is important because it helps ensure that the trained model will be able to make accurate predictions on new data. If the class labels are unbalanced, then the model is more likely to be biased towards the more common classes, which can lead to poor performance on less common classes. Additionally, unbalanced class labels can make it more difficult to evaluate the performance of a machine learning model, because the model&#8217;s accuracy may not be an accurate reflection of its ability to generalize to new data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Checking the balance of prediction labels
plt.figure(figsize=(16,2))
fig = sns.countplot(y=&quot;Revenue&quot;, data=df_shop, palette=&quot;muted&quot;)
plt.show()</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="6830" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/output-3-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" data-orig-size="953,154" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" alt="" class="wp-image-6830" width="946" height="153" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 953w, https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 768w" sizes="(max-width: 946px) 100vw, 946px" /></figure>



<p class="wp-block-paragraph">Our class labels are imbalanced: there are many more sessions labeled &#8220;False&#8221; than &#8220;True.&#8221; The reason is that most visitors do not buy anything. Imbalanced data can affect the performance of classification models. Now that we are aware of the imbalance, we can choose appropriate evaluation metrics later.</p>
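<p class="wp-block-paragraph">A quick way to see why plain accuracy can mislead on imbalanced labels is to compute the accuracy of a trivial model that always predicts the majority class. The sketch below uses toy labels with an assumed 85/15 split. These numbers are hypothetical and are not the actual class shares of the shopping data.</p>

```python
import pandas as pd

# Toy labels standing in for the 'Revenue' column (hypothetical 85/15 split)
revenue = pd.Series([False] * 85 + [True] * 15, name="Revenue")

# Relative class frequencies
shares = revenue.value_counts(normalize=True)
print(shares)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = shares.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

<p class="wp-block-paragraph">A classifier is only useful if it clearly beats this baseline, which is why metrics such as precision, recall, and the F1 score are more informative here than accuracy alone.</p>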



<h4 class="wp-block-heading" id="h-3-2-feature-correlation">3.2 Feature Correlation</h4>



<p class="wp-block-paragraph">When developing classification models, features are usually not equally useful. Strongly correlated features provide redundant information: if two or more features are highly correlated, they convey largely the same information to the model. For a logistic regression, this can make the estimated coefficients unstable. It also becomes harder to interpret the model&#8217;s predictions, because it is unclear which of the correlated features is actually driving the decision.</p>
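<p class="wp-block-paragraph">As a minimal sketch of how correlated feature pairs can be found, the example below builds a small synthetic DataFrame and lists all pairs whose absolute Pearson correlation exceeds a threshold. The column names, the generated values, and the threshold of 0.8 are assumptions made for this illustration.</p>

```python
import numpy as np
import pandas as pd

# Synthetic features (hypothetical names): 'exit_rate' is built to track 'bounce_rate'
rng = np.random.default_rng(42)
bounce_rate = rng.uniform(0.0, 0.2, size=500)
df_demo = pd.DataFrame({
    "bounce_rate": bounce_rate,
    "exit_rate": bounce_rate + rng.normal(0.0, 0.01, size=500),
    "page_value": rng.exponential(5.0, size=500),
})

# Pairwise Pearson correlations between all numeric columns
corr = df_demo.corr()

# Keep each pair once (upper triangle without the diagonal)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Flag strongly correlated pairs; 0.8 is an arbitrary cut-off for illustration
threshold = 0.8
strong_pairs = pairs[pairs.abs().ge(threshold)]
print(strong_pairs)
```

<p class="wp-block-paragraph">For a pair flagged this way, one option is to keep only one of the two features. Whether that is appropriate depends on the model and on how the features are interpreted.</p>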



<p class="wp-block-paragraph">Before we look at correlations, let&#8217;s get a better idea of how the data is distributed. First, we will create a series of whisker plots (box plots) for the features in our dataset. They help us identify potential outliers and show the range of each feature.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Whiskerplots
df_shop.drop('Revenue', axis=1).plot(kind='box', 
                                subplots=True, layout=(4,4), 
                                sharex=False, sharey=False, 
                                figsize=(14,14), 
                                title='Whisker plot for input variables')
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="986" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-35-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" data-orig-size="821,893" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-35" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" src="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" alt="Purchase Intention Prediction, Feature Permutation Importance, Feature Correlation plot" class="wp-image-986" width="664" height="721" srcset="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 821w, https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 276w, https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 768w" sizes="(max-width: 664px) 100vw, 664px" /><figcaption class="wp-element-caption">Feature Whiskerplots</figcaption></figure>



<p class="wp-block-paragraph">The whisker plots show that there are a couple of outliers in the data. However, they are not severe enough to be a concern.</p>
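<p class="wp-block-paragraph">Box plots conventionally draw points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below counts such points for a single feature. The helper name count_iqr_outliers and the sample values are assumptions made for illustration only.</p>

```python
import numpy as np

def count_iqr_outliers(values, whis=1.5):
    """Count the points that a standard box plot would draw beyond its whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    outside = (values < lower) | (values > upper)
    return int(outside.sum())

# Made-up session durations with one extreme value (illustrative only)
durations = [10, 12, 11, 13, 12, 14, 11, 300]
print(count_iqr_outliers(durations))
```

<p class="wp-block-paragraph">Applied to each numeric column of the dataset, a count like this gives a number to compare against the visual impression from the plots.</p>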



<p class="wp-block-paragraph">Pair plots are another way of visualizing the data. They show the distribution of each feature on the diagonal and the pairwise relationships between the features in the remaining cells, separated by the prediction label. To create the pair plots, run the code below.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create pair plots for the feature columns, colored by the prediction label
df_plot = df_shop.copy()

# The 'Revenue' column holds the class label
sns.pairplot(df_plot, hue=&quot;Revenue&quot;, height=2.5)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="6829" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/shopper-buying-intention/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png" data-orig-size="2560,2485" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Shopper-Buying-Intention pair plots with seaborn" data-image-description="&lt;p&gt;Shopper-Buying-Intention pair plots with seaborn&lt;/p&gt;
" data-image-caption="&lt;p&gt;Shopper-Buying-Intention pair plots with seaborn&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png" src="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention-1024x994.png" alt="Purchase Intention Prediction, Feature Permutation Importance, Feature Correlation plot" class="wp-image-6829" width="1117" height="1085" srcset="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 2048w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 2475w" sizes="(max-width: 1117px) 100vw, 1117px" /></figure>



<p class="wp-block-paragraph">Finally, we create a correlation matrix and visualize it as a heat map. The matrix provides a quick overview of which features are correlated and which are not.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Feature correlation
plt.figure(figsize=(15,4))
f_cor = df_shop.corr()
sns.heatmap(f_cor, cmap=&quot;Blues_r&quot;)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4662" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-50-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" data-orig-size="899,367" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-50" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4662" width="674" height="275" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 899w, https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 768w" sizes="(max-width: 674px) 100vw, 674px" /></figure>



<p class="wp-block-paragraph">The correlation plot shows that the following feature pairs are highly correlated:</p>



<ul class="wp-block-list">
<li>ProductRelated and ProductRelated_Duration</li>



<li>BounceRates and ExitRates</li>
</ul>
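<p class="wp-block-paragraph">Instead of reading such pairs off the heat map, they can also be extracted from the correlation matrix programmatically. The following sketch runs on a small synthetic DataFrame (df_demo, not the real shopping data), and the threshold of 0.8 is an assumed cut-off for what counts as highly correlated.</p>

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_shop (illustrative values, not the real dataset)
rng = np.random.default_rng(0)
pages = rng.integers(1, 50, size=200)
df_demo = pd.DataFrame({
    "ProductRelated": pages,
    # The duration grows with the number of visited pages, plus some noise
    "ProductRelated_Duration": pages * 30 + rng.normal(0, 60, size=200),
    # An unrelated feature
    "PageValues": rng.normal(5, 2, size=200),
})

threshold = 0.8  # assumed cut-off for "highly correlated"
corr = df_demo.corr().abs()

# Keep only the upper triangle so that each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]

print(high_pairs)
```

<p class="wp-block-paragraph">Each entry of high_pairs is indexed by the two feature names, so the result can be used directly to decide which feature of a pair to drop.</p>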



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">plt.figure(figsize=(8,5))
sns.scatterplot(x= 'BounceRates',y='ExitRates',data=df_shop,hue='Revenue')
plt.title('Bounce Rate vs. Exit Rate', fontweight='bold', fontsize=15)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4674" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-51-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" data-orig-size="510,335" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-51" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4674" width="537" height="352" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png 510w, https://www.relataly.com/wp-content/uploads/2021/06/image-51.png 300w" sizes="(max-width: 537px) 100vw, 537px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">plt.figure(figsize=(8,5))
sns.scatterplot(x= 'ProductRelated',y='ProductRelated_Duration',data=df_shop,hue='Revenue')
plt.title('ProductRelated vs. ProductRelated_Duration', fontweight='bold', fontsize=15)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4675" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-52-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" data-orig-size="514,335" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-52" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4675" width="528" height="343" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png 514w, https://www.relataly.com/wp-content/uploads/2021/06/image-52.png 300w" sizes="(max-width: 528px) 100vw, 528px" /></figure>



<p class="wp-block-paragraph">When we train our model, we will use only one feature from each of the two pairs.</p>



<h3 class="wp-block-heading" id="h-step-4-data-preprocessing">Step #4 Data Preprocessing </h3>



<p class="wp-block-paragraph">Now that we are familiar with the data, we can prepare it for training the purchase intention classification model. First, we select a subset of the features from the original shopping dataset. Second, we split the data into two separate datasets, train and test, with 70% of the records going to the training set. X_train and X_test contain the features, while y_train and y_test contain the respective prediction labels. Third, we use the MinMaxScaler to scale the features to a range between 0 and 1. Scaling puts all features on a comparable scale, which helps the algorithm converge and can improve classification performance.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Separate labels from training data
features = ['Administrative', 'Administrative_Duration', 'Informational', 
            'Informational_Duration', 'ProductRelated', 'BounceRates', 'PageValues', 
            'Month', 'Region', 'TrafficType', 'VisitorType']
X = df_shop[features]  # Feature columns
y = df_shop['Revenue']  # Prediction label

# Split the data into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Scale the numeric values
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)</pre></div>
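<p class="wp-block-paragraph">To illustrate what the scaler does, here is a minimal sketch on toy numbers (the values are made up). The MinMaxScaler learns the minimum and maximum of each feature from the training data and then applies (x - min) / (max - min). Because the test data reuses the training minimum and maximum, test values outside the training range end up outside the interval from 0 to 1.</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (illustrative values only)
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30 from the training data
test_scaled = scaler.transform(test)        # reuses the training min and max

# The same transformation written out by hand: (x - min) / (max - min)
manual = (test - train.min()) / (train.max() - train.min())

print(train_scaled.ravel().tolist())
print(test_scaled.ravel().tolist())
print(manual.ravel().tolist())
```

<p class="wp-block-paragraph">The test value 40 lies above the training maximum of 30 and is therefore mapped to 1.5. Fitting the scaler on the training data only, as in the code above, avoids leaking information from the test set into the model.</p>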



<h3 class="wp-block-heading" id="h-step-5-train-a-purchase-intention-classifier">Step #5 Train a Purchase Intention Classifier</h3>



<p class="wp-block-paragraph">Next, it is time to train our prediction model. Various classification algorithms could be used to solve this problem, for example, decision trees, random forests, neural networks, or support-vector machines. We will use the logistic regression algorithm, a common choice for simple two-class prediction problems. </p>



<p class="wp-block-paragraph">We start the training process using the &#8220;fit&#8221; method of the logistic regression algorithm. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Training a classification model using logistic regression 
logreg = LogisticRegression(solver='lbfgs')
score = logreg.fit(X_train, y_train).decision_function(X_test)</pre></div>



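<p class="wp-block-paragraph">The decision_function call returns a confidence score for each session in the test dataset: positive scores correspond to the predicted label &#8220;True&#8221; and negative scores to the label &#8220;False.&#8221; These scores are not yet a measure of model performance, which we evaluate in the next step.</p>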
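<p class="wp-block-paragraph">The relationship between the outputs of a fitted logistic regression model can be checked on synthetic data. In the sketch below, X_demo, y_demo, and clf are illustrative names and the data comes from make_classification rather than the shopping dataset.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data as a stand-in for the scaled shopping features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(solver="lbfgs").fit(X_demo, y_demo)

scores = clf.decision_function(X_demo)   # one signed score per sample
proba = clf.predict_proba(X_demo)[:, 1]  # probability of the positive class
labels = clf.predict(X_demo)             # hard class labels (0 or 1)

# The sigmoid of the decision score is the positive-class probability,
# and a positive score corresponds to predicting class 1
sigmoid = 1.0 / (1.0 + np.exp(-scores))
print(np.allclose(sigmoid, proba))
print(np.array_equal(labels, (scores > 0).astype(int)))
```

<p class="wp-block-paragraph">In other words, the decision scores, the predicted probabilities, and the predicted labels are three views of the same model output.</p>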



<h3 class="wp-block-heading" id="h-step-6-evaluate-model-performance">Step #6 Evaluate Model Performance</h3>



<p class="wp-block-paragraph">Finally, we will evaluate the performance of our classification model. For this purpose, we first create a confusion matrix. Then we calculate and compare different error metrics.</p>



<h4 class="wp-block-heading" id="h-6-1-confusion-matrix">6.1 Confusion Matrix</h4>



<p class="wp-block-paragraph">The confusion matrix is a compact way to illustrate the results of a classification model. It differentiates between predicted labels and actual labels. For a binary classification model, the matrix consists of 2&#215;2 cells (quadrants), each showing the number of cases that fall into it.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

# create heatmap
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="990" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-39-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" data-orig-size="492,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-39" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" src="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" alt="confusion matrix on the results of our classification model that predicts purchase intentions, purchase intention prediction model" class="wp-image-990" width="374" height="344" srcset="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png 492w, https://www.relataly.com/wp-content/uploads/2020/05/image-39.png 300w" sizes="(max-width: 374px) 100vw, 374px" /></figure>



<p class="wp-block-paragraph">In the upper left (0,0), we see that the model correctly predicted for 3102 online shopping sessions that they would not lead to a purchase (True Negatives). In 30 cases, the model wrongly predicted a purchase that did not happen (False Positives). For 412 sessions, the model predicted that the visitors would not buy anything, even though they did (False Negatives). In the lower right corner, we see that buyers were correctly identified as such in only 151 cases (True Positives).</p>
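<p class="wp-block-paragraph">The four cells can also be read out programmatically. The sketch below rebuilds label arrays that reproduce the counts reported above (the ordering of the sessions is made up, only the counts matter) and unpacks the matrix with ravel().</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Counts reported for the test set: TN, FP, FN, TP
tn_n, fp_n, fn_n, tp_n = 3102, 30, 412, 151

# Actual labels: first all non-buyers, then all buyers
y_true = np.array([False] * (tn_n + fp_n) + [True] * (fn_n + tp_n))
# Predicted labels, arranged so that they match the counts above
y_hat = np.array([False] * tn_n + [True] * fp_n + [False] * fn_n + [True] * tp_n)

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)
```

<p class="wp-block-paragraph">Rows of the matrix correspond to the actual labels and columns to the predicted labels, which is why the False Positives appear in the upper right cell.</p>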



<h4 class="wp-block-heading" id="h-6-2-performance-metrics-for-classification-models">6.2 Performance Metrics for Classification Models</h4>



<p class="wp-block-paragraph">Next, let&#8217;s take a brief look at the performance metrics. Four standard metrics that measure the performance of classification models are Accuracy, Precision, Recall, and F1-Score.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))</pre></div>



<h5 class="wp-block-heading" id="h-accuracy"><strong>Accuracy</strong></h5>



<p class="wp-block-paragraph">The accuracy on the test set shows that 88% of the online shopper sessions were correctly classified. However, our data is imbalanced. That is to say, most labels have the value &#8220;False,&#8221; and only a few target labels are &#8220;True.&#8221; A model that labels almost every session as &#8220;False&#8221; would therefore still reach a high accuracy. Consequently, we must ensure that our model does not classify all online shoppers as &#8220;non-buyers&#8221; (label: False) but also correctly predicts the buyers (label: True).</p>
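<p class="wp-block-paragraph">To put the 88% into perspective, it helps to compare it with a trivial baseline that predicts &#8220;False&#8221; for every session. The following short calculation uses the counts from the confusion matrix in section 6.1; the baseline itself is an illustration and not part of the trained model.</p>

```python
# Counts taken from the confusion matrix in section 6.1
tn, fp, fn, tp = 3102, 30, 412, 151
total = tn + fp + fn + tp

# Accuracy of the trained model: all correct predictions divided by all sessions
model_accuracy = (tn + tp) / total

# Accuracy of a trivial baseline that labels every session as "False":
# it gets all actual non-buyers right and all actual buyers wrong
baseline_accuracy = (tn + fp) / total

print(f"Model accuracy:    {model_accuracy:.2f}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

<p class="wp-block-paragraph">The baseline already reaches about 85%, so the accuracy of 88% is only a modest improvement over not identifying any buyers at all.</p>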



<h5 class="wp-block-heading" id="h-precision"><strong>Precision</strong></h5>



<p class="wp-block-paragraph">We calculate the Precision as the number of True Positives divided by the sum of the True Positives and the False Positives. Precision tells us how many of the sessions predicted as purchases actually ended in a purchase, but it ignores the buyers that the model missed (False Negatives). Therefore, on its own, it does not say much about our model. At 83%, the precision score of our model is just a little lower than the accuracy.</p>



<h5 class="wp-block-heading" id="h-recall"><strong>Recall</strong></h5>



<p class="wp-block-paragraph">We calculate the Recall by dividing the number of True Positives by the sum of the True Positives and the False Negatives. The Recall of our model is 27%, which is significantly below accuracy and precision. In our case, the Recall is more meaningful than accuracy and precision because it penalizes the high number of buyers that the model failed to identify (False Negatives).</p>



<h5 class="wp-block-heading" id="h-f1-score"><strong>F1-Score</strong></h5>



<p class="wp-block-paragraph">The formula for the F1-Score is 2*((precision*recall)/(precision+recall)). Because the F1-Score is the harmonic mean of Precision and Recall, the low Recall pulls it down to only 41% for our model. If we want to optimize our classification model further, we should keep an eye on both the F1-Score and the Recall.</p>
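<p class="wp-block-paragraph">As a worked example, Precision, Recall, and F1-Score can be recomputed by hand from the four counts in the confusion matrix. The comparison with the arithmetic mean is added here only to illustrate how strongly the harmonic mean reacts to the lower of the two values.</p>

```python
# Counts taken from the confusion matrix in section 6.1
tn, fp, fn, tp = 3102, 30, 412, 151

precision = tp / (tp + fp)  # share of predicted buyers that really bought
recall = tp / (tp + fn)     # share of actual buyers that the model found
f1 = 2 * (precision * recall) / (precision + recall)

# For comparison: the arithmetic mean of precision and recall
arithmetic_mean = (precision + recall) / 2

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {f1:.2f}")
print(f"Arithmetic mean of precision and recall: {arithmetic_mean:.2f}")
```

<p class="wp-block-paragraph">The results match the values printed by the scikit-learn functions above: 0.83, 0.27, and 0.41. The arithmetic mean would be 0.55, which shows that the F1-Score stays much closer to the weaker of the two metrics.</p>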



<h4 class="wp-block-heading" id="h-6-3-interpretation">6.3 Interpretation</h4>



<p class="wp-block-paragraph">Metrics for classification models can be misleading. We should thus choose them carefully. Depending on which use case we are dealing with, False-negative and False-positive predictions can have different costs. Therefore, model evaluation is not always about exactness (precision and accuracy). Instead, the choice of performance metrics depends on what we want to achieve.</p>



<p class="wp-block-paragraph">The challenge for our model is to correctly classify the smaller group of buyers (True Positives). So, optimizing our model means striking a balance: raising the F1-Score and Recall while keeping the accuracy at a good level.</p>
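<p class="wp-block-paragraph">One possible lever for this trade-off, which is not used in this article, is the decision threshold. By default, logistic regression predicts the positive class when the predicted probability is at least 0.5. The sketch below, which runs on synthetic imbalanced data with illustrative names (X_demo, clf, results), shows how a lower threshold affects Precision and Recall. The threshold of 0.3 is an arbitrary assumption.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% positives) as a stand-in for the shopping data
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7,
                                          random_state=0, stratify=y_demo)

clf = LogisticRegression(solver="lbfgs").fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Compare the default threshold of 0.5 with a lower, assumed threshold of 0.3
results = {}
for threshold in (0.5, 0.3):
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat, zero_division=0),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

<p class="wp-block-paragraph">Lowering the threshold can only add positive predictions, so the Recall never decreases, while the Precision typically drops. Which balance is acceptable depends on the relative cost of False Positives and False Negatives in the use case.</p>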



<h3 class="wp-block-heading" id="h-step-7-insights-on-customer-purchase-intentions">Step #7 Insights on Customer Purchase Intentions</h3>



<p class="wp-block-paragraph">Finally, we will use permutation feature importance to gain additional insights into our prediction model&#8217;s features. Permutation feature importance is a technique that measures how much the score of our model drops when the values of a single feature are randomly shuffled. Features with a high score substantially impact the predictions. In contrast, features with scores close to zero play a lesser role in the predictions.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Compute the permutation feature importance of the trained model on the test data
r = permutation_importance(logreg, X_test, y_test, n_repeats=30, random_state=0)

# Plot the bar chart
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permutation_score'])
data_im['feature_names'] = X.columns
data_im = data_im.sort_values('feature_permutation_score', ascending=False)

fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y=data_im['feature_names'], x=&quot;feature_permutation_score&quot;, data=data_im, palette='nipy_spectral')
ax.set_title(&quot;Logistic Regression Feature Importances&quot;)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="326" data-attachment-id="4684" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-56-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png" data-orig-size="1050,334" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-56" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-56-1024x326.png" alt="online purchase intention prediction - results of the feature permutation importance technique" class="wp-image-4684" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 1050w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">We can see that the three features with the highest impact are PageValues, BounceRates, and Administrative_Duration.</p>



<ul class="wp-block-list">
<li>The higher the page&#8217;s value, the higher the customer&#8217;s chance to make a purchase. </li>



<li>The higher the average bounce rate of the pages that the customer visits, the higher the chance that the customer makes a purchase.</li>



<li>In contrast, the more time a customer spends on administrative settings, the lower the chance the customer completes the purchase.</li>
</ul>
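<p class="wp-block-paragraph">To make the mechanics behind the permutation scores more tangible, the following sketch implements the idea by hand on synthetic data: shuffle one column at a time and record how much the accuracy drops. All names (X_demo, clf, importances) and the data are illustrative assumptions. The sketch is a simplified version of what scikit-learn&#8217;s permutation_importance does, not a replacement for it.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: the label depends only on the first column, the second is pure noise
signal = rng.normal(size=500)
noise = rng.normal(size=500)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

clf = LogisticRegression(solver="lbfgs").fit(X_demo, y_demo)
baseline = clf.score(X_demo, y_demo)  # accuracy with intact features

importances = []
for col in range(X_demo.shape[1]):
    drops = []
    for _ in range(10):  # repeat the shuffle to average out randomness
        X_perm = X_demo.copy()
        # Shuffling breaks the link between this feature and the label
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - clf.score(X_perm, y_demo))
    importances.append(float(np.mean(drops)))

print(dict(zip(["signal", "noise"], np.round(importances, 3))))
```

<p class="wp-block-paragraph">Shuffling the informative column makes the accuracy collapse, whereas shuffling the noise column changes almost nothing. The same logic underlies the bar chart above, where the scores are averaged over 30 repetitions on the test data.</p>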



<p class="wp-block-paragraph">These were just a few sample findings. There is much more to explore in the data, and deeper analysis can uncover much more about the customers&#8217; buying decisions.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">This article has presented customer purchase prediction as an interesting use case for machine learning in e-commerce. After discussing the use case, we have developed a classification model that predicts the purchase intentions of online shoppers. You have learned to preprocess the data, train a logistic regression model and evaluate the model&#8217;s performance. Classifying purchase intentions can help online shops understand their customers better and automate certain online marketing activities. The previous section showed how marketers could use this to gain further insights into their customers&#8217; behavior.</p>



<p class="wp-block-paragraph">Thanks for reading and if you have any questions, let me know in the comments. </p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>






<div style="display: inline-block;">
  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1999579577&amp;asins=1999579577&amp;linkId=91d862698bf9010ff4c09539e4c49bf4&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1839217715&amp;asins=1839217715&amp;linkId=356ba074068849ff54393f527190825d&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/">Classifying Purchase Intention of Online Shoppers with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">982</post-id>	</item>
	</channel>
</rss>
