<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Scikit-Learn Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/tech/scikit-learn-stack/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/tech/scikit-learn-stack/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Sat, 27 May 2023 10:16:26 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Scikit-Learn Archives - relataly.com</title>
	<link>https://www.relataly.com/category/tech/scikit-learn-stack/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python</title>
		<link>https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/</link>
					<comments>https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 08 Jan 2023 20:34:44 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Cross-Validation]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Exploratory Data Analysis (EDA)]]></category>
		<category><![CDATA[Gradient Boosting]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Manufacturing]]></category>
		<category><![CDATA[Plotly]]></category>
		<category><![CDATA[Predictive Maintenance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in Manufacturing]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=10618</guid>

					<description><![CDATA[<p>Predictive maintenance is a game-changer for the modern industry. Still, it is based on a simple idea: By using machine learning algorithms, businesses can predict equipment failures before they happen. This approach can help businesses improve their operations by reducing the need for reactive, unplanned maintenance and by enabling them to schedule maintenance activities during ... <a title="Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python" class="read-more" href="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/" aria-label="Read more about Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/">Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Predictive maintenance is a game-changer for the modern industry. Still, it is based on a simple idea: By using machine learning algorithms, businesses can predict equipment failures before they happen. This approach can help businesses improve their operations by reducing the need for reactive, unplanned maintenance and by enabling them to schedule maintenance activities during planned downtime. In this article, we&#8217;ll explore the use of machine learning algorithms to predict machine failures using the robust XGBoost algorithm in Python. By the end of this tutorial, you&#8217;ll have the knowledge and skills to start implementing predictive maintenance in your organization. So, let&#8217;s get started!</p>



<p>We begin by discussing the concept of predictive maintenance and show different ways to implement it. Then we turn to the coding part in Python and implement a prediction model based on machine sensor data. We train a classification model that predicts different types of machine failure using XGBoost.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="509" height="467" data-attachment-id="12909" data-permalink="https://www.relataly.com/robot-factory-machine-learning-predictive-maintenance-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png" data-orig-size="509,467" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="robot factory machine learning predictive maintenance-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png" alt="Predictive maintenance is a game-changer for the modern industry. Image generated with Midjourney." class="wp-image-12909" srcset="https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png 509w, https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png 300w" sizes="(max-width: 509px) 100vw, 509px" /><figcaption class="wp-element-caption">Predictive maintenance is a game-changer for the modern industry. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Predictive Maintenance?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Predictive maintenance is a data-driven approach that uses predictive modeling to assess the state of equipment and determine the optimal timing for maintenance activities. This technique is particularly beneficial in industries that heavily rely on equipment for their operations, such as manufacturing, transportation, energy, and healthcare. Depending on the requirements and challenges of an organization, predictive maintenance may contribute to one or several of the following goals:</p>



<ul class="wp-block-list">
<li><strong>Improve equipment reliability</strong>: By proactively identifying and addressing potential problems with equipment, predictive maintenance can help improve the reliability of the equipment, reducing the risk of unexpected downtime or failure.</li>



<li><strong>Increase efficiency</strong>: Predictive maintenance can help improve the efficiency of equipment by identifying and fixing problems before they cause equipment failure or downtime. This can help reduce maintenance costs and increase productivity.</li>



<li><strong>Improve safety:</strong> Predictive maintenance can help improve safety by identifying and addressing potential problems with equipment before they occur. This can help prevent accidents and injuries caused by equipment failure.</li>



<li><strong>Reduce maintenance costs</strong>: By proactively identifying and fixing potential problems with equipment, predictive maintenance can help reduce the overall cost of maintenance by minimizing the need for unscheduled downtime.</li>



<li><strong>Improve asset management</strong>: Predictive maintenance can help improve asset management by providing data and insights into the condition and performance of equipment. This can help organizations decide when to replace or upgrade equipment.</li>
</ul>



<p>Next, we look at the different ways organizations can implement predictive maintenance.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="511" height="510" data-attachment-id="12380" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/monitoring-predictive-maintenance-safety-manufacturing-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png" data-orig-size="511,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="monitoring-predictive-maintenance-safety-manufacturing-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png" alt="" class="wp-image-12380" srcset="https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png 511w, https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png 140w" sizes="(max-width: 511px) 100vw, 511px" /><figcaption class="wp-element-caption">Utilities and manufacturing are only two of the many industries that use predictive maintenance. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>



<p></p>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading">Approaches to Predictive Maintenance</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>There are several approaches to implementing a predictive maintenance solution, depending on the type of equipment being monitored and the resources available. These approaches include:</p>



<ul class="wp-block-list">
<li><strong>Condition-based monitoring:</strong> This involves continuously monitoring the condition of the equipment using sensors. When certain thresholds or conditions are met, an alert is triggered, or corrective measures are launched. The goal is to reduce the risk of failure. For example, if the temperature of a motor exceeds a certain level, this may indicate that the motor is about to fail.</li>



<li><strong>Predictive modeling:</strong> This approach involves using machine learning algorithms to analyze historical lifetime data about the equipment to identify patterns that may indicate an impending failure. This can be done using data from sensors, as well as operational data and maintenance records. When historical or failure data is not available, a degradation model can be created to estimate failure times based on a threshold value. This approach is often used when there is limited data available.</li>



<li><strong>Prognostic algorithms: </strong>By using data from sensors and other sources, prognostic algorithms can predict the remaining useful life of a piece of equipment. This information can help organizations determine the likelihood of a breakdown and plan for replacements or maintenance activities. By understanding the equipment better, organizations can potentially extend maintenance cycles, which can reduce costs for replacements and maintenance.</li>
</ul>
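<p>As a quick illustration of the condition-based approach, the following minimal Python sketch raises an alert when sensor readings cross a fixed threshold. The readings and threshold are made-up values, not part of the tutorial dataset:</p>

```python
# Minimal condition-based monitoring sketch: flag sensor readings that
# exceed a fixed threshold (all values here are illustrative).
def check_condition(readings, threshold):
    """Return the readings that exceed the threshold."""
    return [r for r in readings if r > threshold]

# hypothetical motor temperature samples in Kelvin
motor_temps_kelvin = [301.2, 303.8, 310.5, 299.9]
alerts = check_condition(motor_temps_kelvin, threshold=305.0)
if alerts:
    print(f"Alert: {len(alerts)} reading(s) above threshold: {alerts}")
```

In a real deployment, the threshold would come from equipment specifications or historical baselines, and the alert would feed into a maintenance ticketing system.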



<p>It is important to choose an approach that is appropriate for the specific equipment and maintenance challenges faced by the organization. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Data Requirements</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>When implementing predictive maintenance, it is important to consider that each approach comes with its own set of data requirements. Types of data include the following:</p>



<ul class="wp-block-list">
<li><strong>Current condition data</strong> includes information about the state of the equipment, such as its temperature, pressure, vibration, and other physical parameters.</li>



<li><strong>Operating data </strong>includes information about how the equipment is being used, such as its load, speed, and other operating parameters.</li>



<li><strong>Maintenance history data</strong> includes information about past maintenance activities that have been performed on the equipment.</li>



<li><strong>Failure history data</strong> includes information about past equipment failures, such as the date of the failure, the cause of the failure, and the impact on operations.</li>
</ul>



<p>Collecting these data requires investing in sensors and other data collection infrastructure and ensuring that data collection is accurate and storage is proper. By combining various data types, organizations can create a comprehensive view of equipment condition and performance and use it to predict maintenance requirements.</p>



<p>The specific types of data needed will depend on the implementation approach. Organizations must ensure they have access to the necessary data to implement the selected approach effectively. Some specific data requirements for each approach include the following:</p>



<figure class="wp-block-table"><table><thead><tr><th>Approach</th><th>Data Requirements</th></tr></thead><tbody><tr><td>Condition-based monitoring</td><td>Sensor data from the equipment being monitored. </td></tr><tr><td>Predictive modeling</td><td>A combination of sensor data, operational data, and maintenance records. </td></tr><tr><td>Prognostic algorithms</td><td>Sensor data, as well as data about past failures and maintenance events. </td></tr></tbody></table><figcaption class="wp-element-caption">Data requirements per implementation approach</figcaption></figure>
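<p>To illustrate how these data types can be combined, here is a small, hypothetical pandas sketch that joins sensor readings with maintenance records on a machine ID. All column names and values are invented for illustration:</p>

```python
import pandas as pd

# Hypothetical sensor readings, one row per machine
sensor_df = pd.DataFrame({
    "machine_id": [1, 2, 3],
    "temperature_k": [298.1, 301.4, 305.2],
})

# Hypothetical maintenance history; machine 2 has no service record
maintenance_df = pd.DataFrame({
    "machine_id": [1, 3],
    "last_service": ["2022-11-02", "2023-01-15"],
})

# A left join keeps every machine and leaves NaN where no record exists
combined = sensor_df.merge(maintenance_df, on="machine_id", how="left")
print(combined)
```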
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="1018" height="856" data-attachment-id="12379" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png" data-orig-size="1018,856" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png" alt="Predictive maintenance - Machine learning can make maintenance cycles more cost-efficient. 
Image generated using Midjourney" class="wp-image-12379" srcset="https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png 1018w, https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png 768w" sizes="(max-width: 1018px) 100vw, 1018px" /><figcaption class="wp-element-caption">Predictive maintenance &#8211; Machine learning can make maintenance cycles more cost-efficient. Image generated using&nbsp;<a href="http://www.Midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Predicting Failures in Milling Machines using XGBoost in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Now that we have a basic understanding of predictive maintenance, it&#8217;s time to get hands-on with Python. We will use sensor data and machine learning to predict failures in milling machines. But why do these machines break down in the first place? Milling machines have many moving parts that can suffer from wear and tear over time, leading to failures. Additionally, improper maintenance can cause issues with machine operation and lead to costly damage. Efficient maintenance can be challenging due to the varying loads that milling machines are subjected to. However, by implementing a predictive maintenance solution with Python, we can proactively identify and address issues to prevent costly downtime and ensure the smooth operation of our milling machines. Our goal is to predict one of five failure types, which corresponds to a predictive modeling approach. Let&#8217;s get started on building our predictive maintenance solution.</p>



<p>The code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_c6038a-a9"><a class="kb-button kt-button button kb-btn_614436-7b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/022%20Predicting%20Machine%20Malfunction%20of%20Milling%20Machines%20in%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_a119d2-89 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="12384" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/cnc_milling_machine_cyberpunk/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png" data-orig-size="253,253" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cnc_milling_machine_cyberpunk" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png" src="https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png" alt="Image of a CNC milling machine. Image created with Midjourney" class="wp-image-12384" width="375" height="375" srcset="https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png 253w, https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png 140w" sizes="(max-width: 375px) 100vw, 375px" /><figcaption class="wp-element-caption">Image of a CNC milling machine. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">Prerequisites</h3>



<p>Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. </p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p><strong>Python Environment</strong></p>



<p>Before diving into the tutorial, it is important to ensure that your Python environment is properly set up and that you have all the required packages installed. This will ensure a seamless learning experience and prevent roadblocks caused by an improperly configured environment.</p>



<p>If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p><strong>Python Packages</strong></p>



<p>Make sure you install all required packages. In this tutorial, we will be working with the following packages:&nbsp;</p>



<ul class="wp-block-list">
<li>Pandas</li>



<li>NumPy</li>



<li>Matplotlib</li>



<li>Seaborn</li>



<li>Plotly</li>
</ul>



<p>In addition, we will be using the machine learning library <strong><em>Scikit-learn</em></strong> and <strong><em>XGBoost</em></strong>, a popular library for training gradient-boosting models.</p>



<p>You can install packages using console commands:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">pip install &lt;package name&gt;
conda install &lt;package name&gt; (if you are using the anaconda package manager)</pre></div>
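<p>For example, the packages listed above could be installed in one go (command shown as a suggestion; adjust it to your environment):</p>

```shell
# install all packages used in this tutorial via pip
pip install pandas numpy matplotlib seaborn plotly scikit-learn xgboost
```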
</div>
</div>



<h3 class="wp-block-heading">About the Sensor Dataset</h3>



<p>In this tutorial, we will work with a synthetic sensor dataset from the <a href="https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset" target="_blank" rel="noreferrer noopener">UCI ML repository</a> that simulates the typical life cycle of a milling machine.</p>



<p>The dataset consists of 10,000 data points stored as rows, with 14 features in columns:</p>



<ul class="wp-block-list">
<li>UID: unique identifier ranging from 1 to 10000</li>



<li>productID: consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number</li>



<li>air temperature [K]</li>



<li>process temperature [K]</li>



<li>rotational speed [rpm]</li>



<li>torque [Nm]</li>



<li>tool wear [min]</li>



<li>machine failure. A label that indicates whether the machine has failed or not</li>



<li>Failure type (prediction label). The label contains five failure types: tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), random failures (RNF)</li>
</ul>
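<p>Since several of these failure types are rare, it can be useful to inspect the label distribution before training. The following sketch demonstrates the idea on a tiny stand-in DataFrame rather than the actual tutorial dataset:</p>

```python
import pandas as pd

# Tiny stand-in frame: the real dataset has 10,000 rows, most of them
# labeled "No Failure", so the classes are heavily imbalanced.
df = pd.DataFrame({
    "Failure Type": ["No Failure"] * 6
    + ["Heat Dissipation Failure", "Power Failure"]
})

# Relative frequency of each label
counts = df["Failure Type"].value_counts(normalize=True)
print(counts)
```

A strong class imbalance like this is one reason the tutorial later uses sample weighting during model training.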



<p>Source: <a href="https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset" target="_blank" rel="noreferrer noopener">UCI ML Repository</a></p>



<p>You can download the dataset from <a href="https://www.kaggle.com/code/potongpasir/predicting-machine-malfunction/data" target="_blank" rel="noreferrer noopener">Kaggle.com</a>. Unzip the archive and save the file predictive_maintenance.csv under the following path: &#8220;/data/iot/classification/&#8221;</p>



<h3 class="wp-block-heading">Step #1 Load the Data</h3>



<p>We begin by importing the required libraries, including XGBoost. Next, we load the dataset using the pandas library and define our target variable as Failure Type. The dataset contains a second target column (&#8220;Target&#8221;), which only holds the binary machine-failure flag. We drop this column, as our goal is to predict the specific type of failure. Finally, we print the first three rows of the loaded dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com
# Tested with Python 3.9.13, Matplotlib 3.6.2, Scikit-learn 1.2, Seaborn 0.12.1, numpy 1.21.5, xgboost 1.7.2

import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np
import seaborn as sns
import plotly.express as px
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support as score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from sklearn.utils import compute_sample_weight
from xgboost import XGBClassifier

# load the train data
path = '/data/iot/classification/'
df = pd.read_csv(path + &quot;predictive_maintenance.csv&quot;) 

# define the target
target_name='Failure Type'

# drop the redundant binary target column
df.drop(columns=['Target'], inplace=True)

# print a summary of the train data
print(df.shape[0])
df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	UDI	Product ID	Type	Air temperature [K]	Process temperature [K]	Rotational speed [rpm]	Torque [Nm]	Tool wear [min]	Failure Type
0	1	M14860		M		298.1				308.6				1551						42.8		0				No Failure
1	2	L47181		L		298.2				308.7				1408						46.3		3				No Failure
2	3	L47182		L		298.1				308.5				1498						49.4		5				No Failure</pre></div>



<h3 class="wp-block-heading">Step #2 Clean the Data</h3>



<p>Next, we quickly check the data quality of our dataset. The following code block checks if there are any missing values in our dataset. If there are missing values, it creates a barplot showing the number of missing values for each column, along with the percentage of missing values. If there are no missing values, it prints a message saying &#8220;no missing values.&#8221;</p>



<p>The code then drops any column with more than 5% missing values from the DataFrame and prints the names of the remaining columns. These steps help identify and handle missing values in a dataset before applying machine learning algorithms to it.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check for missing values and visualize them if any exist
def print_missing_values(df):
    null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
    fig, ax = plt.subplots(figsize=(16, 6))
    sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue', ax=ax)
    pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
    ax.set_title('Overview of missing values')
    ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)

if df.isna().sum().sum() &gt; 0:
    print_missing_values(df)
else:
    print('no missing values')

# drop all columns with more than 5% missing values
for col_name in df.columns:
    if df[col_name].isna().sum()/df.shape[0] &gt; 0.05:
        df.drop(columns=[col_name], inplace=True)

df.columns</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">no missing values
Index(['UDI', 'Product ID', 'Type', 'Air temperature [K]',
       'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
       'Tool wear [min]', 'Failure Type'],
      dtype='object')</pre></div>



<p>Next, we will drop the two ID columns, which carry no predictive information, and rename the remaining columns to make them easier to work with. The original column names are quite long and contain special characters such as brackets, which XGBoost does not accept in feature names. Once the columns are renamed, we print the updated DataFrame to verify the changes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># drop id columns
df_base = df.drop(columns=['Product ID', 'UDI'])

# adjust column names
df_base.rename(columns={'Air temperature [K]': 'air_temperature', 
                        'Process temperature [K]': 'process_temperature', 
                        'Rotational speed [rpm]':'rotational_speed', 
                        'Torque [Nm]': 'torque', 
                        'Tool wear [min]': 'tool_wear'}, inplace=True)
df_base.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Type	air_temperature	process_temperature	rotational_speed	torque	tool_wear	Failure Type
0	M		298.1			308.6				1551				42.8	0			No Failure
1	L		298.2			308.7				1408				46.3	3			No Failure
2	L		298.1			308.5				1498				49.4	5			No Failure
3	L		298.2			308.6				1433				39.5	7			No Failure
4	L		298.2			308.7				1408				40.0	9			No Failure</pre></div>



<p>Everything looks as expected: Our dataset contains six features and the target column with the five failure types.</p>



<h3 class="wp-block-heading" id="h-step-3-explore-the-data">Step #3 Explore the Data</h3>



<p>Next, let&#8217;s explore the dataset. </p>



<h4 class="wp-block-heading">Target Class Distribution</h4>



<p>The following code uses the plotly express library to create a histogram showing the class distribution of the &#8220;Failure Type&#8221; column in a DataFrame called &#8220;df_base.&#8221; The histogram will have one bar for each unique value in the &#8220;Failure Type&#8221; column, and the height of each bar will represent the number of occurrences of that value in the column. This can be useful for understanding the imbalance in the distribution of classes in a classification problem.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># display class distribution of the target variable
px.histogram(df_base, y=&quot;Failure Type&quot;, color=&quot;Failure Type&quot;) </pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11828" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot.png" data-orig-size="2042,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1024x226.png" alt="Target class distribution in our predictive maintenance dataset" class="wp-image-11828" width="1115" height="246" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 1536w, https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 2042w" sizes="(max-width: 1115px) 100vw, 1115px" /></figure>



<p>Our dataset is highly imbalanced: the vast majority of records carry the &#8220;No Failure&#8221; label. Such an imbalance can bias a machine learning model toward the majority class, so that it performs poorly on the rare failure classes we actually care about. To mitigate this, we will later adjust the model training accordingly.</p>
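<p>To quantify the imbalance before deciding on countermeasures, it helps to look at the raw class counts. The following snippet is a minimal sketch on a hypothetical label column (the class names and counts are illustrative, not taken from the actual dataset); with the real data you would call <code>value_counts</code> on <code>df_base['Failure Type']</code> instead:</p>

```python
import pandas as pd

# hypothetical, illustrative label column mimicking an imbalanced failure-type target
labels = pd.Series(
    ['No Failure'] * 96 + ['Power Failure'] * 3 + ['Tool Wear Failure'] * 1,
    name='Failure Type',
)

# absolute counts per class, sorted from most to least frequent
counts = labels.value_counts()

# relative frequencies; normalize=True returns fractions of the total
shares = labels.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

<p>A majority class above 90% of the data is a strong signal that accuracy alone will be a misleading metric and that class weighting or resampling is worth considering.</p>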



<h4 class="wp-block-heading">Feature Pairplots</h4>



<p>Next, let&#8217;s construct pair plots to explore feature relations with the target variable. Pair plots, also known as scatter plots, are a type of plot that shows the relationship between two variables. In the context of a predictive maintenance dataset, pair plots can be useful for exploring the relationships between different features and the target variable (e.g., the likelihood of a machine failure). By creating pair plots and visualizing the relationships between different features and the target variable, you can gain insights into which features might be most useful for building a predictive model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># pairplots on failure type
sns.pairplot(df_base, height=2.5, hue='Failure Type')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11829" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/image-3-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/image-3.png" data-orig-size="1476,1226" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/image-3.png" src="https://www.relataly.com/wp-content/uploads/2023/01/image-3-1024x851.png" alt="feature plot for our predictive maintenance dataset" class="wp-image-11829" width="874" height="726" srcset="https://www.relataly.com/wp-content/uploads/2023/01/image-3.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/image-3.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/image-3.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/image-3.png 1476w" sizes="(max-width: 874px) 100vw, 874px" /></figure>



<p>The pair plots reveal valuable patterns in our features that can inform the predictions of our model. For instance, Power Failures tend to occur at torque values near the minimum or maximum of the observed range. Such patterns should allow our predictive model to make solid predictions.</p>



<h4 class="wp-block-heading">Feature Correlation</h4>



<p>Next, we will look at feature correlation. The following code block creates a heatmap using the seaborn library that shows the correlation between all pairs of columns in a DataFrame called &#8220;df_base&#8221;. The heatmap is plotted using a color scale, with warmer colors indicating stronger correlations and cooler colors indicating weaker correlations. The correlation values are also displayed in the cells of the heatmap, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). By creating a heatmap, you can quickly see which variables are positively or negatively correlated with each other, and to what degree. This can be helpful for identifying which features might be most useful for building a predictive model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># correlation plot
plt.figure(figsize=(6,4))
sns.heatmap(df_base.corr(), cbar=True, fmt='.1f', vmax=0.8, annot=True, cmap='Blues')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11830" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/image-4-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/image-4.png" data-orig-size="649,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/image-4.png" src="https://www.relataly.com/wp-content/uploads/2023/01/image-4.png" alt="Feature correlation for our predictive maintenance dataset" class="wp-image-11830" width="689" height="537" srcset="https://www.relataly.com/wp-content/uploads/2023/01/image-4.png 649w, https://www.relataly.com/wp-content/uploads/2023/01/image-4.png 300w" sizes="(max-width: 689px) 100vw, 689px" /></figure>



<p>From the heatmap, it looks like there is a strong positive correlation between &#8220;air_temperature&#8221; and &#8220;process_temperature&#8221; (0.87). This makes sense, since a high process temperature will naturally also heat up the air around the machine. In addition, there is a strong negative correlation between &#8220;rotational_speed&#8221; and &#8220;torque&#8221; (-0.87). The remaining correlations are closer to 0, indicating weaker relationships.</p>



<p>Understanding the correlations between different variables in a dataset can be helpful for building predictive models, as it can give you an idea of which features might be most important for predicting a given target. It can also help you identify any redundant features that might not add much value to your model. Since our dataset only contains six features, we will keep all of them. </p>
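<p>If a larger dataset did contain redundant features, a simple way to flag them is to scan the upper triangle of the absolute correlation matrix for pairs above a threshold. The sketch below uses a small hypothetical DataFrame and an assumed cutoff of 0.8; both the data and the threshold are illustrative:</p>

```python
import numpy as np
import pandas as pd

# hypothetical numeric frame with one strongly correlated pair
df_demo = pd.DataFrame({
    'air_temperature':     [298.1, 298.2, 298.4, 298.9, 299.3],
    'process_temperature': [308.6, 308.7, 308.9, 309.5, 309.8],
    'torque':              [42.8, 46.3, 39.5, 41.0, 40.0],
})

# absolute pairwise correlations
corr = df_demo.corr().abs()

# keep only the strict upper triangle to skip the diagonal and duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# flag feature pairs whose absolute correlation exceeds the assumed 0.8 cutoff
redundant_pairs = [(row, col)
                   for col in upper.columns
                   for row in upper.index
                   if upper.loc[row, col] > 0.8]

print(redundant_pairs)
```

<p>For each flagged pair, you would typically keep the feature that is easier to measure or more interpretable and drop the other.</p>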



<h4 class="wp-block-heading">Feature Boxplots</h4>



<p>Box plots are a useful visualization tool for understanding the distribution of values in a dataset. They show the minimum, first quartile, median, third quartile, and maximum values for each group, as well as any outliers. By creating box plots separated by a categorical variable, you can compare the distributions of values between different groups and see if there are any significant differences. This can be useful for identifying trends or patterns in the data that might be useful for building a predictive model.</p>



<p>If there are significant differences between the boxplots for different categories, it could be a good sign for building a predictive model. For example, if the boxplots for one category tend to have higher values for a particular feature than the boxplots for another category, it could indicate that the feature is related to the target variable and could be useful for making predictions.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create box plots for a feature column, separated by the target column
def create_histogram(column_name):
    # note: despite its name, this function returns an interactive plotly box plot
    return px.box(data_frame=df_base, y=column_name, color='Failure Type', points=&quot;all&quot;, width=1200)

create_histogram('air_temperature')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11831" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: air temperature" class="wp-image-11831" width="1078" height="405" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png 1200w" sizes="(max-width: 1078px) 100vw, 1078px" /></figure>



<p>Feature boxplot for process_temperature.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">create_histogram('process_temperature')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11832" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-2-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: process temperature" class="wp-image-11832" width="1087" height="408" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png 1200w" sizes="(max-width: 1087px) 100vw, 1087px" /><figcaption class="wp-element-caption">Feature boxplot for process temperature.</figcaption></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">create_histogram('rotational_speed')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11833" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-3-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: rotational speed" class="wp-image-11833" width="1110" height="417" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png 1200w" sizes="(max-width: 1110px) 100vw, 1110px" /></figure>



<p>Feature boxplot for torque.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">create_histogram('torque')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11834" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-4-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: torque" class="wp-image-11834" width="1082" height="406" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png 1200w" sizes="(max-width: 1082px) 100vw, 1082px" /></figure>



<p>Feature boxplot for tool wear.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">create_histogram('tool_wear')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11835" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-5-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: tool wear" class="wp-image-11835" width="1097" height="411" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png 1200w" sizes="(max-width: 1097px) 100vw, 1097px" /></figure>



<p>Now that we have a good understanding of our dataset, we can prepare the data for model training. </p>



<h3 class="wp-block-heading" id="h-step-4-data-preparation">Step #4 Data Preparation</h3>



<p>To prepare the data for model training, we will need to split our dataset and make additional modifications. </p>



<p>The following code block contains a reusable function called data_preparation. The purpose of this function is to prepare the data in a way that is suitable for building and evaluating machine learning models. It performs several preprocessing steps, such as encoding categorical variables and splitting the data into training and test sets. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def data_preparation(df_base, target_name):
    # copy after dropna to avoid pandas SettingWithCopyWarning below
    df = df_base.dropna().copy()

    # encode the target labels and the categorical 'Type' feature as integers
    df['target_name_encoded'] = df[target_name].replace({'No Failure': 0, 'Power Failure': 1, 'Tool Wear Failure': 2, 'Overstrain Failure': 3, 'Random Failures': 4, 'Heat Dissipation Failure': 5})
    df['Type'] = df['Type'].replace({'L': 0, 'M': 1, 'H': 2})
    X = df.drop(columns=[target_name, 'target_name_encoded'])
    y = df['target_name_encoded'] # prediction label

    # split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

    # print the shapes: (rows, features) for X and (rows,) for y
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test

# prepare the data and split it into train and test sets
X, y, X_train, X_test, y_train, y_test = data_preparation(df_base, target_name)</pre></div>



<h3 class="wp-block-heading" id="h-step-5-model-training">Step #5 Model Training</h3>



<p>Now that we have prepared the dataset, we can train the XGBoost classification model. The basic idea behind XGBoost is to train a series of weak models, such as decision trees, and then combine their predictions using gradient boosting. During training, XGBoost uses an optimization algorithm to adjust the weight of each model in the ensemble in order to improve the overall prediction accuracy. XGBoost also includes a number of additional features and techniques that help to improve the performance of the model, such as regularization, feature selection, and handling missing values.</p>
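<p>To make the boosting idea more concrete, the following toy sketch fits a sequence of shallow regression trees, each one trained on the residuals that the ensemble so far still gets wrong. This illustrates the general gradient boosting principle on synthetic data; it is not a model of XGBoost&#8217;s internals, which add regularization, second-order gradients, and more:</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data: a sine wave plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3        # shrinks each tree's contribution
prediction = np.zeros_like(y_toy)
trees = []

for _ in range(50):
    residuals = y_toy - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)   # add the new tree's shrunken correction
    trees.append(tree)

# the training error shrinks as trees are added
mse = np.mean((y_toy - prediction) ** 2)
print(f'training MSE after {len(trees)} rounds: {mse:.4f}')
```

<p>Each round reduces the residual error a little; XGBoost follows the same additive scheme but optimizes a regularized objective using gradient information.</p>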



<p>XGBoost provides several configuration options that we can use to fine-tune performance and adapt the training process to our dataset. For a complete list of hyperparameters, please see the <a href="https://xgboost.readthedocs.io/en/stable/python/index.html" target="_blank" rel="noreferrer noopener">library documentation</a>.</p>



<p>Remember that our class labels are imbalanced. Therefore, we will provide the model with sample weights. The following code creates a weight array for the training and test sets using the &#8220;compute_sample_weight&#8221; function from scikit-learn. We calculate the weight array based on the &#8220;balanced&#8221; mode. This means that the weights are calculated such that the class distribution in the sample is balanced. This can be useful when working with imbalanced datasets, as it helps to mitigate the effects of class imbalance on the model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">weight_train = compute_sample_weight('balanced', y_train)
weight_test = compute_sample_weight('balanced', y_test)

xgb_clf = XGBClassifier(booster='gbtree', 
                        tree_method='gpu_hist', 
                        sampling_method='gradient_based', 
                        eval_metric='aucpr', 
                        objective='multi:softmax', 
                        num_class=6)
# fit the model to the data
xgb_clf.fit(X_train, y_train.ravel(), sample_weight=weight_train)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="842" height="270" data-attachment-id="11836" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/image-5-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/image-5.png" data-orig-size="842,270" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/image-5.png" src="https://www.relataly.com/wp-content/uploads/2023/01/image-5.png" alt="summary of our XGBoost classifier of our predictive maintenance solution" class="wp-image-11836" srcset="https://www.relataly.com/wp-content/uploads/2023/01/image-5.png 842w, https://www.relataly.com/wp-content/uploads/2023/01/image-5.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/image-5.png 768w" sizes="(max-width: 842px) 100vw, 842px" /></figure>



<p>We can see that the blue box summarizes the configuration of our model and indicates that the training process has been successful. Now that we have the classifier, we can use it to make predictions on new data.</p>
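<p>As a small, self-contained illustration of scoring new data (using synthetic data, and scikit-learn&#8217;s <code>GradientBoostingClassifier</code> as a stand-in for <code>XGBClassifier</code>, which exposes the same <code>predict</code>/<code>predict_proba</code> interface):</p>

```python
# Hedged sketch: classifying a "new" sensor reading with a trained
# gradient-boosting classifier (stand-in for XGBClassifier).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

new_reading = X[:1]                        # pretend this row arrived from a sensor
label = clf.predict(new_reading)[0]        # predicted failure class
proba = clf.predict_proba(new_reading)[0]  # probability for each class
print(label, proba.round(3))
```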



<h3 class="wp-block-heading" id="h-step-6-model-evaluation">Step #6 Model Evaluation</h3>



<p>Finally, we will evaluate the model&#8217;s performance. This will involve three steps:</p>



<ul class="wp-block-list">
<li>Model scoring</li>



<li>Cross-validation</li>



<li>Confusion matrix</li>
</ul>



<h4 class="wp-block-heading">Model Scoring</h4>



<p>First, we calculate the accuracy of the classifier on the test set using the &#8220;score&#8221; method. To account for the imbalance of class labels, we pass in the weight array for the test set as an additional parameter. This returns the fraction of correct predictions made by the classifier. Next, the code uses the classifier to make predictions on the test set using the &#8220;predict&#8221; method. It then generates a classification report using the &#8220;classification_report&#8221; function from scikit-learn. The report displays a summary of the model&#8217;s performance in terms of various evaluation metrics such as precision, recall, and f1-score.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># score the model with the test dataset
score = xgb_clf.score(X_test, y_test.ravel(), sample_weight=weight_test)

# predict on the test dataset
y_pred = xgb_clf.predict(X_test)

# print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">precision    recall  f1-score   support

           0       0.99      0.98      0.99      2903
           1       0.64      0.88      0.74        24
           2       0.04      0.08      0.06        12
           3       0.77      0.89      0.83        27
           4       0.00      0.00      0.00         4
           5       0.76      0.97      0.85        30

    accuracy                           0.98      3000
   macro avg       0.53      0.63      0.58      3000
weighted avg       0.98      0.98      0.98      3000</pre></div>



<p>The classification report shows the performance of our XGBoost classifier on the test dataset. The model appears to perform well, with a high accuracy of 0.98 and a high weighted average f1-score of 0.98. </p>



<p>However, there are a few classes where the model&#8217;s performance is not as strong. Class 1 has a relatively low precision of 0.64 and a low f1-score of 0.74, while class 2 has a very low precision of 0.04 and a low f1-score of 0.06. Class 4 has a precision and f1-score of 0.00, which suggests that the model is not making any correct predictions for this class.</p>



<p>It is also worth noting that the support for some classes is much lower than for others: class 1 has a support of only 24, while class 0 has a support of 2903. With so few test instances of the rare failure classes, their metrics are both harder to trust statistically and harder for the model to optimize.</p>
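<p>To see what the &#8220;balanced&#8221; weighting from Step #5 does to such a skewed label distribution, consider a small, made-up example:</p>

```python
# Illustration: 'balanced' sample weights are inversely proportional to
# class frequency, so every class contributes the same total weight.
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])   # heavily skewed toy labels
w = compute_sample_weight('balanced', y)
print(w.round(3))   # majority class gets 0.5, the rarest class gets 3.0
```

<p>The weight for a class is <code>n_samples / (n_classes * count)</code>, so the six majority samples each receive 0.5 while the single class-2 sample receives 3.0; the weights always sum to the number of samples.</p>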



<h4 class="wp-block-heading">Confusion Matrix</h4>



<p>Next, we create a confusion matrix. We input the true labels of the test set (y_test) and the predicted labels produced by the model (y_pred) to generate the matrix. The matrix shows us the number of correct and incorrect predictions made by the model for each class.</p>



<p>We then create a DataFrame from the confusion matrix and use the seaborn library to visualize the matrix as a heatmap. The heatmap allows us to easily see which classes are being predicted correctly and which are being misclassified. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create predictions on the test dataset
y_pred = xgb_clf.predict(X_test)

# print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index=np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (8, 5))
sns.set(font_scale=1.1) #for label size
sns.heatmap(df_cm, cbar=True, cmap= &quot;inferno&quot;, annot=True, fmt='.0f') </pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11837" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/image-6-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/image-6.png" data-orig-size="668,456" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/image-6.png" src="https://www.relataly.com/wp-content/uploads/2023/01/image-6.png" alt="Evaluating the performance of our predictive maintenance solution using a confusion matrix" class="wp-image-11837" width="663" height="452" srcset="https://www.relataly.com/wp-content/uploads/2023/01/image-6.png 668w, https://www.relataly.com/wp-content/uploads/2023/01/image-6.png 300w" sizes="(max-width: 663px) 100vw, 663px" /></figure>



<p>The color scale of the heatmap indicates the magnitude of the values in the matrix. In this case, the darker the color, the higher the number of predictions. This visualization helps us to understand the performance of the model and identify areas for improvement. </p>



<p>Here are a few things that we can learn from this matrix:</p>



<ul class="wp-block-list">
<li>The model classified the vast majority of the 3,000 test samples correctly, with only a few dozen misclassifications.</li>



<li>For the &#8220;No Failure&#8221; class, the model made 2854 correct predictions and 29 incorrect predictions. The majority of the incorrect predictions were false negatives.</li>



<li>For the &#8220;Power Failure&#8221; class, the model made 21 correct predictions and 3 incorrect predictions. </li>



<li>For the &#8220;Tool Wear Failure&#8221; class, the model made 1 correct prediction and 1 incorrect prediction. </li>



<li>For the &#8220;Overstrain Failure&#8221; class, the model made 24 correct predictions and 2 incorrect predictions. </li>



<li>For the &#8220;Random Failures&#8221; class, the model made 29 correct predictions and 4 incorrect predictions. </li>



<li>For the &#8220;Heat Dissipation Failure&#8221; class, the model made 29 correct predictions and 1 incorrect prediction. </li>
</ul>



<p>Overall, the model performs reasonably well, but it still misses a substantial share of the instances in the rare failure classes. </p>
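<p>Per-class miss rates can be read directly off a confusion matrix: the diagonal holds the correct predictions, and dividing it by the row sums gives the recall of each class. A small, self-contained sketch (with made-up numbers, not the matrix above):</p>

```python
# Per-class recall from a confusion matrix: diagonal (correct
# predictions) divided by row sums (actual instances per class).
import numpy as np

cm = np.array([[95, 3, 2],    # rows = actual class,
               [ 4, 5, 1],    # columns = predicted class
               [ 2, 0, 8]])   # (hypothetical 3-class example)

recall_per_class = np.diag(cm) / cm.sum(axis=1)
print(recall_per_class.round(2))   # the rare middle class is missed half the time
```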



<h4 class="wp-block-heading">Cross Validation</h4>



<p>Finally, we perform cross-validation on the training set using the &#8220;cross_validate&#8221; function from scikit-learn. Cross-validation is a technique for evaluating the performance of a machine learning model by training it on different subsets of the data and evaluating it on the remaining data. </p>



<p>In this case, we will train and evaluate our model 10 times using different splits of the data (specified by the &#8220;cv&#8221; parameter). We also specify that the evaluation metric should be the weighted f1-score (specified by the &#8220;scoring&#8221; parameter). We then pass the weight array for the training set to the classifier.</p>



<p>The &#8220;cross_validate&#8221; function returns a dictionary containing various evaluation metrics for each fold of the cross-validation. We will convert the dictionary to a DataFrame and create a bar plot using the plotly express library to visualize the results. This helps us to understand the consistency and stability of the model&#8217;s performance.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># cross validation
scores  = cross_validate(xgb_clf, X_train, y_train, cv=10, scoring=&quot;f1_weighted&quot;, fit_params={ &quot;sample_weight&quot; :weight_train})
scores_df = pd.DataFrame(scores)
px.bar(x=scores_df.index, y=scores_df.test_score, width=800)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11838" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png" data-orig-size="800,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png" alt="Evaluation the performance of our predictive maintenance solution. cross validation scores for the XGBoost model. " class="wp-image-11838" width="644" height="362" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png 800w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png 768w" sizes="(max-width: 644px) 100vw, 644px" /></figure>



<p>The model performance remains consistent across all folds. </p>



<h2 class="wp-block-heading">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this article, we have presented the concept of predictive maintenance and demonstrated how organizations can use this approach to improve their maintenance cycles. The second part of the article provided a hands-on tutorial showing how to implement a predictive maintenance solution for predicting different failure types of a milling machine. We trained a classification model using the XGBoost algorithm and sensor data from the machine. </p>



<p>While the model demonstrated good performance overall, we observed that it was not able to predict all classes with the same level of accuracy. This suggests that there may be opportunities to improve the model&#8217;s performance. One potential approach is to balance the dataset by up- or down-sampling the data to achieve a more even distribution of classes. By doing so, we can mitigate the effects of class imbalance and potentially improve the model&#8217;s predictions for all classes.</p>
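<p>One simple way to implement the up-sampling idea is scikit-learn&#8217;s <code>resample</code> utility. The sketch below (with toy data, not the machine dataset) duplicates minority-class rows with replacement until every class matches the majority count:</p>

```python
# Naive up-sampling sketch: resample each class with replacement
# until all classes have as many rows as the majority class.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({'feature': range(10),
                   'target':  [0]*7 + [1]*2 + [2]*1})  # imbalanced toy data

majority_size = df['target'].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=majority_size, random_state=0)
    for _, group in df.groupby('target')
])
print(balanced['target'].value_counts().to_dict())   # every class now has 7 rows
```

<p>Note that up-sampling should be applied to the training split only; duplicating rows before the train/test split would leak copies of the same observations into the test set.</p>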



<p>By implementing such a predictive maintenance approach, organizations can improve their operational efficiency and ensure the smooth running of their machinery.</p>



<p>I hope this article was helpful. If you have any questions or feedback, let me know in the comments. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="497" height="493" data-attachment-id="12901" data-permalink="https://www.relataly.com/smart-factory-iot-sensors-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png" data-orig-size="497,493" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="smart factory iot sensors relataly midjourney-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png" alt="" class="wp-image-12901" srcset="https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png 497w, https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png 140w" sizes="(max-width: 497px) 100vw, 497px" /><figcaption class="wp-element-caption">Predictive maintenance also plays an essential role in a smart factory. Image created with Midjourney. </figcaption></figure>
</div>
</div>






<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p>There are many books available on the topics of IoT and predictive maintenance. Here are a few recommendations:</p>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3XgrX7L" target="_blank" rel="noreferrer noopener">An Introduction to Predictive Maintenance</a> by R Keith Mobley</li>



<li><a href="https://amzn.to/3CzYL3A" target="_blank" rel="noreferrer noopener">Predictive Analytics: The Secret to Predicting Future Events Using Big Data and Data Science Techniques Such as Data Mining, Predictive Modelling, Statistics, Data Analysis, and Machine</a> by Richard Hurley</li>



<li>Stephan Matzka, <a href="https://ieeexplore.ieee.org/document/9253083" target="_blank" rel="noreferrer noopener">Explainable Artificial Intelligence for Predictive Maintenance Applications</a>, Third International Conference on Artificial Intelligence for Industries (AI4I 2020)</li>



<li><a href="https://amzn.to/3TrBdDY" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li>ChatGPT was used to revise certain parts of this article</li>



<li>Images created using Midjourney and OpenAI Dall-E</li>
</ul>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/">Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">10618</post-id>	</item>
		<item>
		<title>How to Use Hierarchical Clustering For Customer Segmentation in Python</title>
		<link>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/</link>
					<comments>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Thu, 22 Dec 2022 18:50:14 +0000</pubDate>
				<category><![CDATA[Agglomerative Clustering]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Customer Segmentation]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Exploratory Data Analysis (EDA)]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Marketing Automation]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[AI in Insurance]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Digital Transformation]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=11335</guid>

					<description><![CDATA[<p>Have you ever found yourself wondering how you can better understand your customer base and target your marketing efforts more effectively? One solution is to use hierarchical clustering, a method of grouping customers into clusters based on their characteristics and behaviors. By dividing your customers into distinct groups, you can tailor your marketing campaigns and ... <a title="How to Use Hierarchical Clustering For Customer Segmentation in Python" class="read-more" href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/" aria-label="Read more about How to Use Hierarchical Clustering For Customer Segmentation in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/">How to Use Hierarchical Clustering For Customer Segmentation in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Have you ever found yourself wondering how you can better understand your customer base and target your marketing efforts more effectively? One solution is to use hierarchical clustering, a method of grouping customers into clusters based on their characteristics and behaviors. By dividing your customers into distinct groups, you can tailor your marketing campaigns and personalize your marketing efforts to meet the specific needs of each group. This can be especially useful for businesses with large customer bases, as it allows them to target their marketing efforts to specific segments rather than trying to appeal to everyone at once. Additionally, hierarchical clustering can help businesses identify common patterns and trends among their customers, which can be useful for targeting future marketing efforts and improving the overall customer experience. In this tutorial, we will use Python and the scikit-learn library to apply hierarchical (agglomerative) clustering to a dataset of customer data. </p>



<p>The rest of this tutorial proceeds in two parts. The first part will discuss hierarchical clustering and how we can use it to identify clusters in a set of customer data. The second part is a hands-on Python tutorial. We will explore customer health insurance data and apply an agglomerative clustering approach to group the customers into meaningful segments. Finally, we will use a tree-like diagram called a dendrogram, which is helpful for visualizing the structure of the data. The resulting segments could inform our marketing strategies and help us better understand our customers. So let&#8217;s get started!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="896" height="510" data-attachment-id="12402" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" data-orig-size="896,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" alt="isometric view of people customer segmentation using machine learning python tutorial" class="wp-image-12402" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 896w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 768w" 
sizes="(max-width: 896px) 100vw, 896px" /><figcaption class="wp-element-caption">Customer segmentation is a typical use case for clustering. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>. </figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Hierarchical Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>So what is hierarchical clustering? Hierarchical clustering is a method of cluster analysis that aims to build a hierarchy of clusters. It creates a tree-like diagram called a dendrogram, which shows the relationships between clusters. There are two main types of hierarchical clustering: agglomerative and divisive. </p>



<ol class="wp-block-list">
<li>Agglomerative hierarchical clustering: This is a bottom-up approach in which each data point is treated as a single cluster at the outset. The algorithm iteratively merges the most similar pairs of clusters until all data points are in a single cluster.</li>



<li>Divisive hierarchical clustering: This is a top-down approach in which all data points are treated as a single cluster at the outset. The algorithm iteratively splits the cluster into smaller and smaller subclusters until each data point is in its own cluster.</li>
</ol>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading">Agglomerative Clustering</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this article, we will apply the agglomerative clustering approach, which is a bottom-up approach to clustering. The idea is to initially treat each data point in a dataset as its own cluster and then repeatedly merge the most similar clusters as the algorithm progresses. The process of agglomerative clustering can be broken down into the following steps:</p>



<ol class="wp-block-list">
<li>Start with each data point in its own cluster.</li>



<li>Calculate the similarity between all pairs of clusters.</li>



<li>Merge the two most similar clusters.</li>



<li>Repeat steps 2 and 3 until all the data points are in a single cluster or until a predetermined number of clusters is reached.</li>
</ol>



<p>There are several ways to calculate the similarity between clusters, including using measures such as the Euclidean distance, cosine similarity, or the Jaccard index. The specific measure used can impact the results of the clustering algorithm.</p>



<p>For details on how the clustering approach works, see the&nbsp;<a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">Wikipedia page</a>.</p>
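<p>The steps above can be sketched in a few lines with scikit-learn&#8217;s <code>AgglomerativeClustering</code> (toy 2-D data, Euclidean distance, Ward linkage):</p>

```python
# Agglomerative clustering on toy 2-D data: starts with every point as
# its own cluster and merges the closest pairs until 2 clusters remain.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],    # one tight group
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])   # another group far away

labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(X)
print(labels)   # the two spatial groups land in different clusters
```

<p>Setting <code>n_clusters=None</code> together with a <code>distance_threshold</code> instead lets the merging stop at a chosen dissimilarity, which is how the cut height of a dendrogram translates into a flat clustering.</p>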
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="430" height="512" data-attachment-id="13027" data-permalink="https://www.relataly.com/mushrooms_and_fruits_pattern-min-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png" data-orig-size="506,602" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="mushrooms_and_fruits_pattern-min" data-image-description="&lt;p&gt;Hierarchical clustering is an unsupervised way to classify things. &lt;/p&gt;
" data-image-caption="&lt;p&gt;Hierarchical clustering is an unsupervised way to classify things. &lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min-430x512.png" alt="Hierarchical clustering is an unsupervised way to classify things. " class="wp-image-13027" srcset="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 430w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 252w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 506w" sizes="(max-width: 430px) 100vw, 430px" /><figcaption class="wp-element-caption">Hierarchical clustering is an unsupervised technique to classify things based on patterns in their data. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">Hierarchical Clustering vs. K-means</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In a previous article, we already discussed the popular <a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/" target="_blank" rel="noreferrer noopener">clustering approach k-means</a>. So how do k-means and hierarchical clustering differ? Both are clustering algorithms that group similar data points together, but there are several key differences between the two approaches:</p>



<ol class="wp-block-list">
<li><strong>The number of clusters:</strong> In k-means, the number of clusters must be specified in advance, whereas in hierarchical clustering, the number of clusters is not specified. Instead, hierarchical clustering creates a hierarchy of clusters, starting with each data point as its own cluster and then merging the most similar clusters until all data points are in a single cluster.</li>



<li><strong>Cluster shape:</strong> K-means produces clusters that are spherical, while hierarchical clustering produces clusters that can have any shape. This means that k-means is better suited for data that is well-separated into distinct, spherical clusters, while hierarchical clustering is more flexible and can handle more complex cluster shapes.</li>



<li><strong>Distance measure:</strong> K-means relies on the Euclidean distance to calculate the similarity between data points, while hierarchical clustering can work with a variety of distance measures, such as the Manhattan distance or cosine similarity. Because both approaches compare feature values directly when a measure like the Euclidean distance is used, both are sensitive to the scale of the features, which is why standardizing the data beforehand is usually a good idea.</li>



<li><strong>Computational complexity:</strong> K-means is generally faster than hierarchical clustering, especially for large datasets. K-means iteratively reassigns points to a fixed number of centroids, which scales roughly linearly with the number of samples, while agglomerative clustering must repeatedly compute and merge pairwise cluster distances, which typically scales quadratically or worse with the number of samples.</li>



<li><strong>Visualization: </strong>Hierarchical clustering produces a tree-like diagram called a &#8220;dendrogram,&#8221; which shows how clusters are nested and merged. This is useful for visualizing the structure of the data and understanding how the clusters relate to each other.</li>
</ol>
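<p>The difference in cluster shape is easy to see on a toy dataset. The sketch below (our own illustrative example, not part of the tutorial dataset) clusters two interleaved half-moons with both algorithms: k-means tends to cut them into two roughly spherical halves, while single-linkage agglomerative clustering can follow the curved shapes.</p>

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering

# two interleaved half-moon shapes: clearly non-spherical clusters
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# k-means assumes spherical clusters around centroids
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# single linkage chains nearby points and can follow arbitrary shapes
ac_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# agreement with the true moon membership (up to label permutation)
km_acc = max((km_labels == y_true).mean(), (km_labels != y_true).mean())
ac_acc = max((ac_labels == y_true).mean(), (ac_labels != y_true).mean())
print(f"k-means agreement: {km_acc:.2f}, single-linkage agreement: {ac_acc:.2f}")
```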



<p>Next, let&#8217;s look at how we can implement a hierarchical clustering model in Python. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<p></p>
</div>
</div>



<h2 class="wp-block-heading">Customer Segmentation using Hierarchical Clustering in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this comprehensive guide, we explore the application of hierarchical clustering for effective customer segmentation using a customer dataset. This data-driven segmentation method enables businesses to identify distinct customer clusters based on various factors, including demographics, behaviors, and preferences.</p>



<p>Customer segmentation is a strategic approach that splits a customer base into smaller, more manageable groups with similar characteristics. It aims to better understand the diverse needs and wants of different customer segments to enhance marketing strategies and product development.</p>



<p>Applying customer segmentation through hierarchical clustering allows businesses to personalize their marketing messages, design targeted campaigns, and tailor products to meet the unique needs of each segment. This proactive approach can stimulate increased customer loyalty and sales.</p>



<p>We begin by loading the customer data and selecting the relevant features we want to use for clustering. We then standardize the data using the StandardScaler from scikit-learn. Next, we apply hierarchical clustering using the AgglomerativeClustering method, specifying the number of clusters we want to create. Finally, we add the predictions to the original data as a new column and view the resulting segments by calculating the mean of each feature for each segment.</p>



<p>The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_bada6f-73"><a class="kb-button kt-button button kb-btn_43f94b-af kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/03%20Clustering/043%20Customer%20Segmentation%20using%20Hierarchical%20Clustering%20with%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_17702b-41 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="512" height="513" data-attachment-id="12366" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/the_future_of_the_healthcare_using_blockchain-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" data-orig-size="512,513" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="the_future_of_the_healthcare_using_blockchain-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" src="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" alt="In this machine learning tutorial, we will run a hierarchical clustering algorithm on health data." class="wp-image-12366" srcset="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 512w, https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 140w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">The future of healthcare will see a tight collaboration between humans and AI. Image generated using&nbsp;Midjourney</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">About the Customer Health Insurance Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this tutorial, we will work with a public dataset on US health insurance customers from kaggle.com. Download the <a href="https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset" target="_blank" rel="noreferrer noopener">CSV file from Kaggle</a> and copy it into the following path, starting from the folder that contains your Python notebook: data/customer/</p>



<p>The dataset is relatively simple and contains 1338 rows of insured customers. It includes the insurance charges, as well as demographic and personal information such as Age, Sex, BMI, Number of Children, Smoker, and Region. The dataset does not have any undefined or missing values.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before we start the coding part, ensure that you have set up your Python 3 environment and the required packages. If you don’t have an environment, follow&nbsp;this tutorial&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li>pandas</li>



<li>NumPy</li>



<li>matplotlib</li>



<li>scikit-learn</li>
</ul>



<p>You can install packages using console commands:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">pip install &lt;package name&gt; 
conda install &lt;package name&gt; (if you are using the anaconda packet manager)</pre></div>



<h3 class="wp-block-heading">Step #1 Load the Data</h3>



<p>To begin, we need to load the required packages and the data we want to cluster. We will load the data by reading the CSV file via the pandas library. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># import necessary libraries
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder
from pandas.api.types import is_string_dtype
import pandas as pd
import math
import seaborn as sns

# load customer data
customer_df = pd.read_csv(&quot;data/customer/customer_health_insurance.csv&quot;)
customer_df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	age	sex		bmi		children	smoker	region		charges
0	19	female	27.90	0			yes		southwest	16884.9240
1	18	male	33.77	1			no		southeast	1725.5523
2	28	male	33.00	3			no		southeast	4449.4620</pre></div>



<h3 class="wp-block-heading">Step #2 Explore the Data</h3>



<p>Next, it is a good idea to explore the data and get a sense of its structure and content. This can be done using a variety of methods, such as examining the shape of the dataframe, checking for missing values, and plotting some basic statistics. For example, the following plots will explore the relationships between some of the variables. We won&#8217;t go into too much detail here.</p>
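<p>The structure checks mentioned above (shape, missing values, basic statistics) boil down to a few pandas calls. Here is a minimal sketch on a small stand-in frame with made-up values; the real customer_df works the same way:</p>

```python
import pandas as pd

# tiny stand-in for customer_df (hypothetical values, same column layout)
df = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.90, 33.77, 33.00],
    "charges": [16884.92, 1725.55, 4449.46],
})

print(df.shape)                 # (3, 4)
print(df.isnull().sum().sum())  # total number of missing values
print(df.describe())            # basic statistics for the numeric columns
```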



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_kdeplot(df, column_name, target_name):
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()

# kde plot of charges, split by smoker status
make_kdeplot(customer_df, 'smoker', 'charges')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11363" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-17-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" data-orig-size="833,571" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" alt="" class="wp-image-11363" width="567" height="389" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 833w, https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 768w" sizes="(max-width: 567px) 100vw, 567px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># kde plot of charges, split by sex
make_kdeplot(customer_df, 'sex', 'charges')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11364" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-44-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" data-orig-size="846,571" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-44" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" alt="" class="wp-image-11364" width="572" height="386" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 846w, https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 768w" sizes="(max-width: 572px) 100vw, 572px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sns.lmplot(x=&quot;charges&quot;, y=&quot;age&quot;, hue=&quot;smoker&quot;, data=customer_df, aspect=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11365" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-45/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png" data-orig-size="1067,489" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-45" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-45-1024x469.png" alt="" class="wp-image-11365" width="700" height="321" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 1067w" sizes="(max-width: 700px) 100vw, 700px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_boxplot(customer_df, x,y,h):
    fig, ax = plt.subplots(figsize=(10,4))
    box = sns.boxplot(x=x, y=y, hue=h, data=customer_df)
    fig.subplots_adjust(bottom=0.2)
    plt.tight_layout()

make_boxplot(customer_df, &quot;smoker&quot;, &quot;charges&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11366" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-46/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-46" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" alt="" class="wp-image-11366" width="675" height="266" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 768w" sizes="(max-width: 675px) 100vw, 675px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_boxplot(customer_df, &quot;region&quot;, &quot;charges&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11367" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-47-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-47" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" alt="" class="wp-image-11367" width="693" height="273" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 768w" sizes="(max-width: 693px) 100vw, 693px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_boxplot(customer_df, &quot;children&quot;, &quot;bmi&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11368" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-48-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-48" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" alt="" class="wp-image-11368" width="705" height="278" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 768w" sizes="(max-width: 705px) 100vw, 705px" /></figure>



<p>Next, let&#8217;s prepare the data for model training. </p>



<h3 class="wp-block-heading" id="h-step-3-prepare-the-data">Step #3 Prepare the Data</h3>



<p>Before we can train a model on the data, we must prepare it for modeling. This typically involves selecting the relevant features, handling missing values, and scaling the data. However, we are using a very simple dataset that already has good data quality. Therefore we can limit our data preparation activities to encoding the labels and scaling the data. </p>



<p>To encode the categorical values, we will use the LabelEncoder class from the scikit-learn library.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># encode categorical features
label_encoder = LabelEncoder()

for col_name in customer_df.columns:
    if (is_string_dtype(customer_df[col_name])):
        customer_df[col_name] = label_encoder.fit_transform(customer_df[col_name])
customer_df.head(3)</pre></div>
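<p>To make the encoding step transparent, here is what LabelEncoder does to a single column in isolation, shown on a handful of made-up region values:</p>

```python
from sklearn.preprocessing import LabelEncoder

regions = ["southwest", "southeast", "southeast", "northwest"]
le = LabelEncoder()
encoded = le.fit_transform(regions)

# classes are sorted alphabetically and mapped to 0, 1, 2, ...
print(le.classes_.tolist())  # ['northwest', 'southeast', 'southwest']
print(encoded.tolist())      # [2, 1, 1, 0]
```

<p>Note that LabelEncoder fits one column at a time, which is why the tutorial code above loops over the string-typed columns and re-fits the encoder for each.</p>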



<p>Next, we will scale the numeric variables. Scaling is an essential preprocessing step for many machine learning algorithms, and hierarchical clustering is no exception when it uses a distance measure such as the Euclidean distance: features with larger numeric ranges would otherwise dominate the distance calculation. Standardizing the data ensures that all features contribute with equal weight to the cluster assignments.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># select features
X = customer_df # we will select all features

# standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled # note: X_scaled is a NumPy array, so .head() is not available</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">array([[-1.43876426, -1.0105187 , -0.45332   , ...,  1.34390459,
         0.2985838 ,  1.97058663],
       [-1.50996545,  0.98959079,  0.5096211 , ...,  0.43849455,
        -0.95368917, -0.5074631 ],
       [-0.79795355,  0.98959079,  0.38330685, ...,  0.43849455,
        -0.72867467, -0.5074631 ],
       ...,
       [-1.50996545, -1.0105187 ,  1.0148781 , ...,  0.43849455,
        -0.96159623, -0.5074631 ],
       [-1.29636188, -1.0105187 , -0.79781341, ...,  1.34390459,
        -0.93036151, -0.5074631 ],
       [ 1.55168573, -1.0105187 , -0.26138796, ..., -0.46691549,
         1.31105347,  1.97058663]])</pre></div>
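<p>As a quick sanity check (our own addition, using a small made-up numeric matrix in place of the full feature matrix), each standardized column should now have zero mean and unit variance:</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# small numeric example standing in for the customer feature matrix
X = np.array([[19.0, 27.90], [18.0, 33.77], [28.0, 33.00], [33.0, 22.70]])
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```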



<h3 class="wp-block-heading">Step #4 Train the Hierarchical Clustering Algorithm</h3>



<p>To train a hierarchical clustering model using scikit-learn, we can use the AgglomerativeClustering class. Its main parameters are:</p>



<ul class="wp-block-list">
<li><strong>n_clusters: </strong>The number of clusters to form. It defaults to 2 and must be set to None when a distance_threshold is specified.</li>



<li><strong>affinity: </strong>The distance measure used to calculate the similarity between pairs of samples, such as the Euclidean or cosine distance. (In recent scikit-learn versions, this parameter has been renamed to <code>metric</code>.)</li>



<li><strong>linkage: </strong>The method used to calculate the distance between clusters. This can be one of &#8220;ward,&#8221; &#8220;complete,&#8221; &#8220;average,&#8221; or &#8220;single.&#8221;</li>



<li><strong>distance_threshold:</strong> The linkage distance above which clusters will no longer be merged. When this parameter is set, <code>n_clusters</code> must be <code>None</code>.</li>
</ul>
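<p>As a side note on the distance_threshold parameter: instead of fixing the number of clusters, we can let the cut distance decide. A minimal sketch on synthetic two-group data (not the tutorial dataset):</p>

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# two well-separated synthetic groups (illustrative data only)
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(100, 0.5, (20, 2))])

# n_clusters must be None when distance_threshold is set;
# the threshold then determines how many clusters remain after the cut
model = AgglomerativeClustering(n_clusters=None, distance_threshold=25.0, linkage="ward")
labels = model.fit_predict(X)
print(model.n_clusters_)  # number of clusters found by the cut
```

<p>With these two far-apart groups, every within-group merge happens at a small linkage distance, while the final merge between the groups lies far above the threshold, so the cut yields two clusters.</p>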



<p>To train the model, we specify the desired parameters and call the fit_predict method, which fits the model to the data and generates cluster labels in one step.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># apply hierarchical clustering 
# compute_distances=True populates model.distances_, which the dendrogram needs later
model = AgglomerativeClustering(affinity='euclidean', compute_distances=True)
predicted_segments = model.fit_predict(X_scaled)</pre></div>



<p>We now have a trained clustering model and the predicted segments for our data.</p>



<h3 class="wp-block-heading">Step #5 Visualize the Results</h3>



<p>After the model is trained, we can visualize the results to get a better understanding of the clusters that were formed. There is a wide range of plots and tools to visualize clusters. In this tutorial, we will use a scatterplot and a dendrogram. </p>



<h4 class="wp-block-heading">5.1 Scatterplot</h4>



<p>For this, we can use Seaborn&#8217;s lmplot function, which creates a 2D scatterplot with an optional linear regression overlay. The regression line visualizes the relationship between two variables and can highlight differences between groups. In the following, we use it to show how our two cluster segments differ across the age and charges of the customers. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># add predictions to data as a new column
customer_df['segment'] = predicted_segments

# create a scatter plot of the first two features, colored by segment
sns.lmplot(x=&quot;charges&quot;, y=&quot;age&quot;, hue=&quot;segment&quot;, data=customer_df, aspect=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="470" data-attachment-id="11370" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-49-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png" data-orig-size="1065,489" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-49" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-49-1024x470.png" alt="" class="wp-image-11370" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 1065w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>We can see that our model has determined two clusters in our data. The clusters seem to correspond well with the smoker category, which indicates that this attribute is decisive in forming relevant groups.</p>
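<p>To verify such an interpretation numerically, we can profile the segments with group statistics. A minimal sketch with made-up values standing in for customer_df:</p>

```python
import pandas as pd

# illustrative frame standing in for customer_df with the new 'segment' column
customer_df = pd.DataFrame({
    "age": [19, 33, 54, 27, 45, 61],
    "charges": [16884, 4449, 10602, 2201, 39611, 47055],
    "smoker": [1, 0, 0, 0, 1, 1],
    "segment": [1, 0, 0, 0, 1, 1],
})

# mean feature values per segment show which attributes drive the split
profile = customer_df.groupby("segment").mean()
print(profile)
```

<p>If, as in this toy example, one segment consists entirely of smokers with much higher charges, the segment profile confirms what the scatterplot suggests.</p>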



<h4 class="wp-block-heading" id="h-5-2-dendrogram">5.2 Dendrogram</h4>



<p>The hierarchical clustering approach lets us visualize the relationships between different groups in our dataset in a dendrogram. A dendrogram is a graphical representation of a hierarchical structure. It is typically used in biology to show the relationships between species or taxonomic groups, but it can represent the hierarchical structure of any dataset.</p>



<p>In a dendrogram, the objects or groups being studied are represented as branches on a tree-like diagram. The branches are usually labeled with the names of the objects or groups, and the lengths of the branches represent the distances or dissimilarities between them. The branches are arranged hierarchically, with the most closely related objects or groups placed closer together and the more distantly related ones placed farther apart.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualize data similarity in a dendrogram
def plot_dendrogram(model, **kwargs):
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx &lt; n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, orientation='right',**kwargs)


plt.title(&quot;Hierarchical Clustering Dendrogram&quot;)
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode=&quot;level&quot;, p=4)
plt.xlabel(&quot;Euclidean Distance&quot;)
plt.ylabel(&quot;Number of points in node (or index of point if no parenthesis).&quot;)
plt.show()</pre></div>



<p>Source: This code block is based on code <a href="https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html" target="_blank" rel="noreferrer noopener">from the scikit-learn page</a></p>
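<p>As an alternative to reconstructing the linkage matrix from a fitted scikit-learn model, the matrix can be computed directly with SciPy. A minimal sketch on stand-in data (the variable name mirrors the tutorial, but the values are random):</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# stand-in for the scaled feature matrix used in the tutorial
rng = np.random.RandomState(0)
X_scaled = rng.normal(size=(10, 3))

# SciPy computes the full merge history (the linkage matrix) in one call
Z = linkage(X_scaled, method="ward")
print(Z.shape)  # one row per merge: (n_samples - 1, 4)

# the matrix can be cut into a fixed number of flat clusters ...
labels = fcluster(Z, t=2, criterion="maxclust")
# ... or passed straight to scipy.cluster.hierarchy.dendrogram(Z) for plotting
```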



<figure class="wp-block-image size-full"><img decoding="async" width="575" height="453" data-attachment-id="11396" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-53-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" data-orig-size="575,453" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-53" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" alt="" class="wp-image-11396" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png 575w, https://www.relataly.com/wp-content/uploads/2022/12/image-53.png 300w" sizes="(max-width: 575px) 100vw, 575px" /></figure>



<h2 class="wp-block-heading">Summary</h2>



<p>In conclusion, hierarchical clustering is a powerful tool for customer segmentation that can help businesses better understand their customer base and target their marketing efforts more effectively. By grouping customers into clusters based on their characteristics and behaviors, companies can create targeted campaigns and personalize their marketing efforts to better meet the needs of each group. Using Python and the scikit-learn library, we were able to apply an agglomerative clustering approach to a dataset of customer data and identify two distinct segments. We can then use these segments to inform our marketing strategies and get a better understanding of our customers.</p>



<p>By the way, customer segmentation is an area where real-world data can be prone to bias and unfairness. If you&#8217;re concerned about this, check out our latest article on <a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">addressing fairness in machine learning with fairlearn</a>.</p>



<p>I hope this article was useful. If you have any feedback, please write your thoughts in the comments. </p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p>Articles</p>



<ul class="wp-block-list">
<li><a href="https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html" target="_blank" rel="noreferrer noopener">https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html</a></li>



<li>Images generated with OpenAI Dall-E and Midjourney.</li>
</ul>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Clustering</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3Gb5kfj" target="_blank" rel="noreferrer noopener">&#8220;Data Clustering: Algorithms and Applications&#8221; by Charu C. Aggarwal</a>: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.</li>



<li><a href="https://amzn.to/3WmhGXB" target="_blank" rel="noreferrer noopener">&#8220;Data Mining: Practical Machine Learning Tools and Techniques&#8221; by Ian H. Witten and Eibe Frank</a>: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.</li>
</ul>



<div style="display: inline-block;">
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=0128042915&amp;asins=0128042915&amp;linkId=1e9fe160a76f7255e3eea8e0119ca74f&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>

<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=B00EYROAQU&amp;asins=B00EYROAQU&amp;linkId=ba1fcb8a59417e729afe33f6eceb2a9f&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Machine Learning</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning</a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>
</ul>



<div style="display: inline-block;">

  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>

<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>
</div></div>
</div>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p><strong>Relataly articles on clustering and machine learning</strong></p>



<ul class="wp-block-list">
<li><a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/" target="_blank" rel="noreferrer noopener">Simple Clustering using K-means in Python</a>: This article gives an overview of cluster analysis with k-means.</li>



<li><a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" target="_blank" rel="noreferrer noopener">Clustering crypto markets using affinity propagation in Python</a>: This article applies cluster analysis to crypto markets and creates a market map for various cryptocurrencies.</li>



<li><a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">Addressing fairness in machine learning with the fairlearn library</a></li>
</ul>



<p></p>
<p>The post <a href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/">How to Use Hierarchical Clustering For Customer Segmentation in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">11335</post-id>	</item>
		<item>
		<title>Feature Engineering and Selection for Regression Models with Python and Scikit-learn</title>
		<link>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/</link>
					<comments>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 26 Sep 2022 22:20:29 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Exploratory Data Analysis (EDA)]]></category>
		<category><![CDATA[Feature Engineering]]></category>
		<category><![CDATA[Feature Permutation Importance]]></category>
		<category><![CDATA[Linear Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Measuring Model Performance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Simple Regression]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Feature Engineering for Time Series Forecasting]]></category>
		<category><![CDATA[Feature Exploration]]></category>
		<category><![CDATA[Feature Selection]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Price Regression]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=8832</guid>

					<description><![CDATA[<p>Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients &#8211; in this case, carefully selected input features &#8211; you can create a model that&#8217;s both ... <a title="Feature Engineering and Selection for Regression Models with Python and Scikit-learn" class="read-more" href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/" aria-label="Read more about Feature Engineering and Selection for Regression Models with Python and Scikit-learn">Read more</a></p>
<p>The post <a href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/">Feature Engineering and Selection for Regression Models with Python and Scikit-learn</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients &#8211; in this case, carefully selected input features &#8211; you can create a model that&#8217;s both accurate and powerful. This is where feature engineering comes in. It&#8217;s the process of exploring, creating, and selecting the most relevant and useful features to use in your model. And just like a chef experimenting with different spices and flavors, the process of feature engineering is iterative and tailored to the problem at hand. In this guide, we&#8217;ll walk you through a step-by-step process using Python and Scikit-learn to create a strong set of features for a regression problem. By the end, you&#8217;ll have the skills to tackle any feature engineering challenge that comes your way.</p>



<p>The remainder of this article proceeds as follows: We begin with a brief intro to feature engineering and describe valuable techniques. We then turn to the hands-on part, in which we develop a regression model for car sales. We apply various techniques that show how to handle outliers and missing values, perform correlation analysis, and discover and manipulate features. You will also find information about common challenges and helpful sklearn functions. Finally, we will compare our regression model to a baseline model that uses the original dataset.</p>



<p>Also: <a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/" target="_blank" rel="noreferrer noopener">Sentiment Analysis with Naive Bayes and Logistic Regression in Python</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:31px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">What is Feature Engineering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Feature engineering is the process of using domain knowledge of the data to create features (variables) that make machine learning algorithms work. This is an important step in the machine learning pipeline because the choice of good features can greatly affect model performance. The goal is to identify candidate features, refine them, and narrow them down to a smaller subset of the most promising ones. We can break this process down into several action items. </p>



<p>Data Scientists can easily spend 70% to 80% of their time on feature engineering. The time is well spent, as changes to input data have a direct impact on performance. This process is often iterative and requires repeatedly revisiting the various tasks as understanding the data and the problem evolves. Knowing techniques and associated challenges helps in adequate feature engineering.</p>



<p>Also: <a href="https://www.relataly.com/mastering-prompt-engineering-for-chatgpt-a-practical-guide-for-businesses/13134/" target="_blank" rel="noreferrer noopener">Mastering Prompt Engineering for ChatGPT for Business Use</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="1024" data-attachment-id="12411" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/engineering-features-python-tutorial-machine-learning/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="engineering-features-python-tutorial-machine-learning" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png" src="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning-1024x1024.png" alt="Engineering features python tutorial machine learning. Image of an engineer working on a technical document. Midjourney. 
relataly.com" class="wp-image-12411" srcset="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 1024w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 140w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Feature engineering is about carefully choosing features instead of taking all the features at once. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p></p>



<h3 class="wp-block-heading">Core Tasks</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The goal of feature engineering is to create a set of features that are representative of the underlying data and that can be used by the machine learning algorithm to make accurate predictions. Several tasks are commonly performed as part of the feature engineering process, including:</p>



<ul class="wp-block-list">
<li><strong>Data discovery</strong>: To solve real-world problems with analytics, it is crucial to understand the data. Once you have gathered your data, describing and visualizing it helps you become familiar with it and develop a general feel for its properties. </li>



<li><strong>Data structuring:</strong> The data needs to be structured into a unified and usable format. Variables may have a wrong datatype, or the data is distributed across different data frames and must first be merged. In these cases, we first need to bring the data together and into the right shape.</li>



<li><strong>Data cleansing:</strong> Besides being structured, data needs to be cleaned. Records may be redundant or contaminated with errors and missing values that can hinder our model from learning effectively. The same goes for outliers that can distort statistics. </li>



<li><strong>Data transformation:</strong> We can increase the predictive power of our input features by transforming them. Activities may include applying mathematical functions, removing specific data, or grouping variables into bins. Or we create entirely new features out of several existing ones. </li>



<li><strong>Feature selection: </strong>Of the many available variables, only some contain valuable information. By discarding less relevant variables and selecting the most promising features, we can create models that are less complex and yield better results.</li>
</ul>
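<p>The selection step at the end of this list can be sketched with scikit-learn&#8217;s SelectKBest. The data here is synthetic (make_regression), not from a real use case:</p>

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic regression data: 8 candidate features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)

# keep the 3 features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 3)
print(selector.get_support())  # boolean mask over the original columns
```

<p>Univariate scoring like this is only one option; model-based approaches such as permutation importance, which we use later in this tutorial, consider feature interactions as well.</p>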
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading">Exploratory Feature Engineering Toolset</h3>



<p>Exploratory analysis offers several tools for identifying and assessing relevant features: </p>



<ul class="wp-block-list">
<li>Data Cleansing</li>



<li>Descriptive statistics</li>



<li>Univariate Analysis</li>



<li>Bivariate Analysis</li>



<li>Multivariate Analysis</li>
</ul>



<h2 class="wp-block-heading">Data Cleansing</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Educational datasets are often remarkably clean, without any errors or missing values. However, it is important to recognize that most real-world data has quality issues. Common reasons for data quality issues are </p>



<ul class="wp-block-list">
<li>Standardization issues, because the data was recorded by different people, sensor types, etc.</li>



<li>Sensor or system outages can lead to gaps in the data or create erroneous data points.</li>



<li>Human errors</li>
</ul>



<p>An important part of feature engineering is to inspect the data and ensure its quality before use. This is what we understand as &#8220;data cleansing.&#8221; It includes several tasks that aim to improve the data quality, remove erroneous data points and bring the data into a more useful form. </p>



<ul class="wp-block-list">
<li>Cleaning errors, missing values, and other issues.</li>



<li>Handling possible imbalanced data </li>



<li>Removing obvious outliers</li>



<li>Standardization, e.g., of dates or addresses </li>
</ul>



<p>Accomplishing these tasks requires a good understanding of the data. We, therefore, carry out data cleansing activities closely intertwined with other exploratory tasks, e.g., univariate and bivariate data analysis. Also, remember that visualizations can aid in the process, as they can greatly enhance your ability to analyze and understand the data. </p>
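<p>A minimal pandas sketch of these cleansing tasks, on a small hypothetical car-price frame (the column names and values are made up for illustration):</p>

```python
import numpy as np
import pandas as pd

# small illustrative frame with typical quality issues
df = pd.DataFrame({
    "price": [9500, 12000, np.nan, 11000, 950000, 12000],
    "mileage": [60000, 30000, 45000, 52000, 48000, 30000],
})

df = df.drop_duplicates()                               # redundant records
df["price"] = df["price"].fillna(df["price"].median())  # missing values

# flag values outside 1.5 * IQR as obvious outliers and drop them
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]
```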
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h4 class="wp-block-heading">Descriptive Statistics</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>One of the first steps in familiarizing oneself with a new dataset is to use descriptive statistics. Descriptive statistics help understand the data and how the sample represents the real-world population. We can use several statistical measures to analyze and describe a dataset, including the following:</p>



<ul class="wp-block-list">
<li><strong>Measures of Central Tendency</strong> represent a typical value of the data.
<ul class="wp-block-list">
<li><strong>The mean:</strong> The average; it adds together all values in the sample and divides by the number of samples.</li>



<li><strong>The median:</strong> The value that lies in the middle when all sample values are sorted.</li>



<li><strong>The mode: </strong>The most frequently occurring value in a sample (used for categorical variables).</li>
</ul>
</li>



<li><strong>Measures of Variability</strong> tell us something about the spread of the data.
<ul class="wp-block-list">
<li><strong>Range:</strong> The difference between the minimum and maximum value</li>



<li><strong>Variance:</strong> The average of the squared differences from the mean.</li>



<li><strong>Standard Deviation:</strong> The square root of the variance.</li>
</ul>
</li>



<li><strong>Measures of Frequency</strong> tell us how often a value occurs in the data, e.g., value counts.</li>
</ul>
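<p>These measures can be computed directly with pandas. A short sketch on a small hypothetical sample:</p>

```python
import pandas as pd

sample = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# measures of central tendency
print(sample.mean())     # 5.0
print(sample.median())   # 4.5
print(sample.mode()[0])  # 4 (most frequent value)

# measures of variability
print(sample.max() - sample.min())  # range: 7
print(sample.var(ddof=0))           # population variance: 4.0
print(sample.std(ddof=0))           # population standard deviation: 2.0
```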
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h4 class="wp-block-heading">Univariate Analysis</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>As &#8220;uni&#8221; suggests, univariate analysis focuses on a single variable. Rather than examining the relationships between variables, it employs descriptive statistics and visualizations to understand individual columns better.</p>



<p>Which illustrations and measures we use depends on the type of variable.</p>



<p><strong>Categorical variables (incl. binary)</strong></p>



<ul class="wp-block-list">
<li>Descriptive measures include counts in percent and absolute values</li>



<li>Visualizations include pie charts and bar charts (count plots).</li>
</ul>



<p><strong>Continuous variables</strong></p>



<ul class="wp-block-list">
<li>Descriptive measures include min, max, median, mean, variance, standard deviation, and quantiles.</li>



<li>Visualizations include box plots, line plots, and histograms.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="838" height="585" data-attachment-id="9261" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output.png" data-orig-size="838,585" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Normal distribution" data-image-description="&lt;p&gt;Normal distribution, univariate analysis&lt;/p&gt;
" data-image-caption="&lt;p&gt;Normal distribution, univariate analysis&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output.png" alt="" class="wp-image-9261" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output.png 838w, https://www.relataly.com/wp-content/uploads/2022/09/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output.png 768w" sizes="(max-width: 838px) 100vw, 838px" /><figcaption class="wp-element-caption">Normal distribution, univariate analysis</figcaption></figure>



<figure class="wp-block-image size-full"><img decoding="async" width="751" height="194" data-attachment-id="9293" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-12-15/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" data-orig-size="751,194" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" alt="" class="wp-image-9293" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png 751w, https://www.relataly.com/wp-content/uploads/2022/09/image-12.png 300w" sizes="(max-width: 751px) 100vw, 751px" /></figure>
</div>
</div>



<h4 class="wp-block-heading">Bivariate Analysis</h4>



<p>Bivariate (two-variable) analysis focuses on the relationship between two variables, for example, between a feature column and the target variable. In machine learning projects, bivariate analysis can help identify features that are potentially predictive of the label or the regression target. </p>



<p>Model performance will benefit from strong linear dependencies. In addition, we are also interested in examining the relationships among the features used to train the model. Different types of relations exist that can be examined using various plots and statistical measures:</p>



<h4 class="wp-block-heading">Numerical/Numerical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Both variables have numerical values. We can illustrate their relation using line plots or scatter plots, and we can examine it with <a href="https://www.relataly.com/category/data-science/pearson-correlation/" target="_blank" rel="noreferrer noopener">correlation analysis</a>.</p>



<p>The ideal feature subset contains features that are not correlated with each other but are heavily correlated with the target variable. We can use dimensionality reduction to reduce a dataset with many features to a lower-dimensional space in which the remaining features are less correlated.</p>



<p>Traditional correlation analysis (e.g., Pearson) cannot capture non-linear relations. We can identify such a relation manually by visualizing the data, for example, using line plots. Once we detect a non-linear relation, we can try applying mathematical transformations to one of the variables to make their relation more linear. </p>
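A small sketch of this idea with synthetic data: Pearson correlation on an exponential relation improves markedly after a log transform. The variables and noise model are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 500)
# Non-linear (exponential) relation with small multiplicative noise
y = np.exp(x / 25) * rng.uniform(0.9, 1.1, 500)

raw_corr = pd.Series(x).corr(pd.Series(y))           # Pearson on the raw values
log_corr = pd.Series(x).corr(pd.Series(np.log(y)))   # Pearson after a log transform
print(round(raw_corr, 3), round(log_corr, 3))        # log transform linearizes the relation
```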



<p>For pairwise analysis, we must understand which types of variables we are dealing with. We can differentiate between three combinations:</p>



<ul class="wp-block-list">
<li>Numerical/Categorical</li>



<li>Numerical/Numerical</li>



<li>Categorical/Categorical</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2022/09/image-2.png" alt="Heatmaps illustrate the relation between features and a target variable." class="wp-image-9269" width="372" height="328"/><figcaption class="wp-element-caption">Heatmaps illustrate the relation between features and a target variable.</figcaption></figure>
</div>
</div>



<div style="height:36px" aria-hidden="true" class="wp-block-spacer"></div>



<h4 class="wp-block-heading">Numerical/Categorical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Plots that visualize the relationship between a categorical and a numerical variable include barplots and lineplots. </p>



<p>Especially helpful are histograms (count plots). They can highlight differences in the distribution of the numerical variable for different categories.</p>



<p>A specific subcase is a numerical/date relation. Such relations are typically visualized using line plots. In addition, we want to look out for linear or non-linear dependencies. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9286" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-6-15/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" data-orig-size="764,406" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" alt="the lineplot is useful for feature exploration and engineering" class="wp-image-9286" width="379" height="201" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png 764w, https://www.relataly.com/wp-content/uploads/2022/09/image-6.png 300w" sizes="(max-width: 379px) 100vw, 379px" /><figcaption class="wp-element-caption">Line charts are useful when examining trends.</figcaption></figure>
</div>
</div>



<h4 class="wp-block-heading">Categorical/Categorical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The relation between two categorical variables can be studied using various plots, including density plots, histograms, and bar plots.</p>



<p>For example, with car body types (sedan, coupe) and colors (red, blue, yellow), we can use a bar plot to see whether sedans are more often red than coupes. Differences in the distribution of characteristics can be a starting point for attempts to manipulate the features and improve model performance. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9291" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-11-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" data-orig-size="765,396" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" alt="the barplot is useful for feature exploration and engineering" class="wp-image-9291" width="374" height="194" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png 765w, https://www.relataly.com/wp-content/uploads/2022/09/image-11.png 300w" sizes="(max-width: 374px) 100vw, 374px" /><figcaption class="wp-element-caption">Bar and column charts are a great way to compare numeric values for discrete categories visually.</figcaption></figure>
</div>
</div>
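The car-type/color comparison can be sketched with a row-normalized pandas crosstab (the data is invented):

```python
import pandas as pd

# Invented example: two categorical variables
df = pd.DataFrame({
    "body_type": ["sedan", "sedan", "coupe", "coupe", "sedan", "coupe"],
    "color":     ["red", "blue", "red", "yellow", "red", "blue"],
})

# Row-normalized crosstab: share of each color within each body type
shares = pd.crosstab(df["body_type"], df["color"], normalize="index")
print(shares)
```

A `shares.plot(kind="bar")` call would turn the same table into the grouped bar plot described above.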



<h4 class="wp-block-heading">Multivariate Analysis</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p><em>Multivariate</em> analysis encompasses the simultaneous analysis of more than two variables. The approach can uncover multi-dimensional dependencies and is often used in advanced feature engineering. For example, you may find that two variables are weakly correlated with the target variable, but when combined, their relation intensifies. So you might try to create a new feature that uses the two variables as input. Plots that can visualize relations between several variables include dot plots and violin plots.</p>



<p>In addition, multivariate analysis refers to techniques to reduce the dimensionality of a dataset. For example, principal component analysis (PCA) or factor analysis can condense the information in a data set into a smaller number of synthetic features.</p>
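A minimal PCA sketch with synthetic data, where five features carry only two underlying sources of variation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
base = rng.normal(size=(200, 2))
# Five features, but only two underlying sources of variation plus small noise
X = np.hstack([base, base @ rng.normal(size=(2, 3))]) + rng.normal(scale=0.01, size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Two components retain nearly all the variance of the five original features
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```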



<p>Now that we have a good understanding of the available exploratory analysis techniques, we can start the practical part and apply them.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9282" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-4-20/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" data-orig-size="738,409" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" alt="the scatterplot is useful for feature exploration and engineering" class="wp-image-9282" width="377" height="209" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png 738w, https://www.relataly.com/wp-content/uploads/2022/09/image-4.png 300w" sizes="(max-width: 377px) 100vw, 377px" /><figcaption class="wp-element-caption">Scatter charts are useful when you want to compare two numeric quantities and see a relationship or correlation between them.</figcaption></figure>



<figure class="wp-block-image size-full"><img decoding="async" width="743" height="405" data-attachment-id="9294" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-13-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" data-orig-size="743,405" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-13" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" alt="the violin plot is useful for feature exploration and engineering" class="wp-image-9294" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png 743w, https://www.relataly.com/wp-content/uploads/2022/09/image-13.png 300w" sizes="(max-width: 743px) 100vw, 743px" /></figure>
</div>
</div>



<div style="height:100px" aria-hidden="true" class="wp-block-spacer"></div>



<p>Also: <a href="https://www.relataly.com/cryptocurrency-price-charts-with-color-overlay-python/2820/" target="_blank" rel="noreferrer noopener">Color-Coded Cryptocurrency Price Charts in Python</a></p>



<h2 class="wp-block-heading" id="h-feature-engineering-for-car-price-regression-with-python-and-scikit-learn">Feature Engineering for Car Price Regression with Python and Scikit-learn</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The market value of a car depends on various factors. The distance traveled and the year of manufacture are obvious ones, but many other characteristics can be used to train a machine learning model that predicts selling prices on the used-car market. The following hands-on Python tutorial creates such a model from a dataset of used-car listings. For marketing, it is crucial to understand which car characteristics determine the price of a vehicle. Our goal is to model the car price from the available independent variables and to build a model that performs well on a small but powerful subset of the inputs. </p>



<p>Exploring and creating features varies between application domains. For example, feature engineering in computer vision differs greatly from feature engineering for regression, classification, or NLP models. The example provided in this article therefore applies to regression models.</p>



<p>We follow an exploratory process that includes the following steps:</p>



<ol class="wp-block-list">
<li>Loading the data</li>



<li>Cleaning the data</li>



<li>Univariate analysis</li>



<li>Bivariate analysis</li>



<li>Selecting features</li>



<li>Data preparation </li>



<li>Model training</li>



<li>Measuring performance</li>
</ol>
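Steps 6 through 8 can be sketched in a few lines. The data here is a synthetic stand-in for the prepared car features, not the tutorial&#8217;s actual dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4 numeric features, a mostly linear target with noise
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)

# Step 6/7: split the prepared data and train a regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = RandomForestRegressor(random_state=1).fit(X_train, y_train)

# Step 8: measure performance on the hold-out set
mae = mean_absolute_error(y_test, model.predict(X_test))
print(round(mae, 3))
```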



<p>Finally, we compare the performance of our model, which was trained on a minimal set of features, to a model that uses the original data.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="12810" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png" src="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney-512x512.png" alt="Yes, you can judge by the length of the beard that this guy is a legendary feature engineer. Image created with Midjourney." 
class="wp-image-12810" srcset="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Yes, you can judge by the length of the beard that this guy is a legendary feature engineer. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p>The Python code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_f9d778-26"><a class="kb-button kt-button button kb-btn_d0af05-38 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/015%20Hyperparameter%20Tuning%20of%20Regression%20Models%20using%20Random%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_7b2495-91 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before you proceed, ensure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python</a> environment (3.8 or higher) and the required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>



<li><em><a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">seaborn</a></em></li>



<li><em><a href="https://scikit-learn.org/" target="_blank" rel="noreferrer noopener">scikit-learn</a></em></li>
</ul>



<p>You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading">About the Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this tutorial, we will be working with a dataset containing listings for 111,763&nbsp;used cars. The data includes 13 variables, including the dependent target variable:</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<ul class="wp-block-list">
<li><strong>prod_year:</strong> The year of production</li>



<li><strong>maker: </strong>The manufacturer&#8217;s name</li>



<li><strong>model: </strong>The car edition</li>



<li><strong>trim: </strong>Different versions of the model</li>



<li><strong>body_type: </strong>The body style of a vehicle</li>



<li><strong>transmission_type: </strong>The way the power is brought to the wheels</li>



<li><strong>state</strong>: The state in which the car is auctioned</li>



<li><strong>condition</strong>: The condition of the cars</li>



<li><strong>odometer</strong>: The distance the car has traveled since it was manufactured</li>



<li><strong>exterior_color</strong>: Exterior color</li>



<li><strong>interior_color</strong>: Interior color</li>



<li><strong>sale_price (target variable):</strong> The price at which the car was sold </li>



<li><strong>sale_date: </strong>The date on which the car was sold</li>
</ul>
</div>
</div>



<p>The dataset is available for download from <a href="https://www.kaggle.com/datasets/lepchenkov/usedcarscatalog" target="_blank" rel="noreferrer noopener">Kaggle.com</a>, but you can execute the code below and load the data from the relataly GitHub repository.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="510" data-attachment-id="12429" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" data-orig-size="505,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" alt="Car price prediction machine learning python tutorial. Image of different cars cartoon style. Midjourney. 
relataly.com" class="wp-image-12429" srcset="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 297w, https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Car price prediction is a solid use case for machine learning. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p></p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p>We begin by importing the necessary libraries and downloading the dataset from the relataly GitHub repository. Next, we read the dataset into a pandas DataFrame. Our regression target is the &#8216;sellingprice&#8217; column of the initial dataset, which we will later rename to &#8216;sale_price.&#8217; The &#8220;.head()&#8221; function displays the first records of our DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5
import math
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from pandas.api.types import is_string_dtype, is_numeric_dtype 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score, train_test_split, ShuffleSplit
from sklearn.inspection import permutation_importance

# Original Data Source: 
# https://www.kaggle.com/datasets/tunguz/used-car-auction-prices

# Load the dataset
df = pd.read_csv(&quot;https://raw.githubusercontent.com/flo7up/relataly_data/main/car_prices2/car_prices.csv&quot;)
df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker			model		trim		body_type		transmission_type	state	condition	odometer	exterior_color	interior	sellingprice	date
0	2015		Kia				Sorento		LX			SUV				automatic			ca		5.0			16639.0		white			black		21500	2014-12-16
1	2015		Nissan			Altima		2.5 S		Sedan			automatic			ca		1.0			5554.0		gray			black		10900	2014-12-30
2	2014		Audi			A6			3.0T Prestige quattro	Sedan			automatic			ca		4.8			14414.0		black			black		49750	2014-12-16</pre></div>



<p>We now have a dataframe with 13 columns: 12 candidate feature columns plus the dependent target variable we want to predict. </p>



<h3 class="wp-block-heading" id="h-step-2-data-cleansing">Step #2 Data Cleansing</h3>



<p>Now that we have loaded the data, we begin with the exploratory analysis. First, we will put it into shape. </p>



<h4 class="wp-block-heading" id="h-2-1-check-names-and-datatypes">2.1 Check Names and Datatypes</h4>



<p>If the names in a dataset are not self-explanatory, it is easy to get confused with all the data. Therefore, we will rename some of the columns to give them clearer names. There is no default naming convention, but striving for consistency, simplicity, and understandability is generally a good idea. </p>



<p>The following code renames some of the columns. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># rename some columns for consistency
df.rename(columns={'exterior_color': 'ext_color', 
                   'interior': 'int_color', 
                   'sellingprice': 'sale_price'}, inplace=True)
df.head(1)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	transmission_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			automatic			ca		5.0			16639.0		white		black		21500		2014-12-16</pre></div>



<p>Next, we will check and remove possible duplicates.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check and remove duplicates
print(len(df))
df = df.drop_duplicates()
print(len(df))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">OUT: 111763, 111763</pre></div>



<p>There were no duplicates in the data, which is good.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check datatypes
df.dtypes</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year              int64
maker                 object
model                 object
trim                  object
body_type             object
transmission_type     object
state                 object
condition            float64
odometer             float64
ext_color             object
int_color             object
sale_price             int64
date                  object
dtype: object</pre></div>



<p>We compare the datatypes to the first records we printed in the previous section. Be aware that categorical variables (e.g., of type &#8220;string&#8221;) are shown as &#8220;objects.&#8221; The data types look as expected.</p>
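<p>One detail worth noting: the date column is still stored as a generic &#8220;object.&#8221; If we later want to derive time-based features, we can parse it into a proper datetime. A minimal sketch with a hypothetical two-row frame standing in for the car-price data:</p>

```python
import pandas as pd

# hypothetical mini-frame standing in for the car-price data
df = pd.DataFrame({"date": ["2014-12-16", "2014-12-30"],
                   "sale_price": [21500, 10900]})

# parse the 'object' column into a proper datetime dtype
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes["date"])  # datetime64[ns]
```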



<p>Finally, we define our target variable&#8217;s name, &#8220;sale_price.&#8221; The target variable will be our regression target, and we will use its name often. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># consistently define the target variable
target_name = 'sale_price'</pre></div>



<h4 class="wp-block-heading">2.2 Checking Missing Values</h4>



<p>Some machine learning algorithms are sensitive to missing values. Handling missing values is therefore a crucial step in exploratory feature engineering. </p>



<p>Let&#8217;s first gain an overview of null values. With a larger DataFrame, it would be inefficient to review all the rows and columns individually for missing values. Instead, we use the sum function and visualize the results to get a quick overview of missing data in the DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check for missing values
null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
fig = plt.subplots(figsize=(16, 6))
ax = sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue')
pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)
ax.set_title('Overview of missing values')</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="384" data-attachment-id="9365" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/missing-values-bar-chart-for-car-price-regression/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png" data-orig-size="1026,385" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="missing-values-bar-chart-for-car-price-regression" data-image-description="&lt;p&gt;overview of missing values in the car price regression dataset&lt;/p&gt;
" data-image-caption="&lt;p&gt;overview of missing values in the car price regression dataset&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png" src="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression-1024x384.png" alt="overview of missing values in the car price regression dataset" class="wp-image-9365" srcset="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 1026w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The bar chart shows that there are several variables with missing values. Variables with many missing values can negatively affect model performance, which is why we should try to treat them. </p>



<h4 class="wp-block-heading">2.3 Overview of Techniques for Handling Missing Values</h4>



<p> There are various ways to handle missing data. The most common options to handle missing values are:</p>



<ul class="wp-block-list">
<li><strong>Custom substitution value:</strong> Sometimes, the information that a value is missing can be important information to a predictive model. We can substitute missing values with a placeholder value such as &#8220;missing&#8221; or &#8220;unknown.&#8221; The approach works particularly well for variables with many missing values. </li>



<li><strong>Statistical filling: </strong>We can fill in a statistically chosen measure, such as the mean or median for numeric variables, or the mode for categorical variables.</li>



<li><strong>Replace using Probabilistic PCA:</strong> Probabilistic PCA fits a linear model to the observed data and uses it to reconstruct (estimate) the missing values.</li>



<li><strong>Remove entire rows:</strong> Sometimes we want to ensure that we only use data we know is correct. In that case, we can drop any row that contains a missing value. This solves the problem but comes at the cost of losing potentially important information &#8211; especially if the dataset is small.</li>



<li><strong>Remove the entire column:</strong> Another way of resolving missing values. This is typically the last resort, as we lose an entire feature. </li>
</ul>
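<p>Most of these options can be sketched with plain pandas. The frame below uses toy stand-in values, not the actual car-price data:</p>

```python
import numpy as np
import pandas as pd

# toy frame with gaps, standing in for the car-price data (hypothetical values)
df = pd.DataFrame({"odometer": [16639.0, np.nan, 14414.0],
                   "body_type": ["SUV", np.nan, "Sedan"]})

# custom substitution value for a categorical column
filled_const = df["body_type"].fillna("Unknown")

# statistical filling: median for numeric, mode for categorical
filled_median = df["odometer"].fillna(df["odometer"].median())
filled_mode = df["body_type"].fillna(df["body_type"].mode()[0])

# removing rows / columns that contain missing values
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)
```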



<p>How we handle missing values can dramatically affect our prediction results. To find the ideal method, it is often necessary to experiment with different techniques. Sometimes, the information that a value is missing can also be important. This occurs when the missing values are not randomly distributed in the data and show a pattern. In such a case, you should create an additional feature that states whether values are missing.</p>
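<p>The missing-indicator idea can be sketched in a few lines: record the missingness pattern as an extra binary feature before imputing. The odometer values below are hypothetical:</p>

```python
import numpy as np
import pandas as pd

# hypothetical odometer readings with a gap
df = pd.DataFrame({"odometer": [16639.0, np.nan, 14414.0]})

# capture where values were missing BEFORE imputing, as a binary feature
df["odometer_missing"] = df["odometer"].isna().astype(int)
df["odometer"] = df["odometer"].fillna(df["odometer"].median())
```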



<h4 class="wp-block-heading">2.4 Handle Missing Values</h4>



<p>In this example, we will use the median value to fill in the missing values of our numeric variables and the mode to replace the missing values of categorical variables. When we check again, we can see that odometer and condition have no more missing values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># fill missing values with the median for numeric columns
for col_name in df.columns:
    if (is_numeric_dtype(df[col_name])) and (df[col_name].isna().sum() &gt; 0):
        df[col_name].fillna(df[col_name].median(), inplace=True) # alternatively you could also drop the columns with missing values using .drop(columns=['engine_capacity']) 
print(df.isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year                0
maker                 2078
model                 2096
trim                  2157
body_type             2641
transmission_type    13135
state                    0
condition                0
odometer                 0
ext_color              173
int_color              173
sale_price               0
date                     0
dtype: int64</pre></div>



<p>Next, we handle the missing values of transmission_type by filling them with the mode.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check the distribution of missing values for transmission type
print(df['transmission_type'].value_counts())
# fill values with the mode
df['transmission_type'].fillna(df['transmission_type'].mode()[0], inplace=True)
print(df['transmission_type'].isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">automatic    108198
manual         3565
Name: transmission_type, dtype: int64
0</pre></div>



<p>We could handle body_type analogously to transmission_type and fill the missing values with the mode &#8211; the value that appears most often in the data. The mode of body_type is &#8220;Sedan.&#8221; However, this value is not that prevalent, as roughly half of the cars have other body types, e.g., &#8220;SUV.&#8221; Therefore, we will instead replace the missing values with &#8220;Unknown.&#8221;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check the distribution of missing values for body type
print(df['body_type'].value_counts())
# fill values with 'Unknown'
df['body_type'].fillna(&quot;Unknown&quot;, inplace=True)
print(df['body_type'].isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Sedan                 39955
SUV                   23836
sedan                  8377
suv                    4934
Hatchback              4241
                      ...  
cts-v coupe               2
Ram Van                   1
Transit Van               1
CTS Wagon                 1
beetle convertible        1
Name: body_type, Length: 74, dtype: int64
0</pre></div>



<p>Now we have handled most of the missing values in our data. However, some variables are still left, with a few missing values. We will make things easy and simply drop all remaining records with missing values. Considering that we have more than 100k records and only a few variables, we can afford to do this without fear of a severe impact on our model performance. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># remove all other records with missing values
df.dropna(inplace=True)
print(df.isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year            0
maker                0
model                0
trim                 0
body_type            0
transmission_type    0
state                0
condition            0
odometer             0
ext_color            0
int_color            0
sale_price           0
date                 0
dtype: int64</pre></div>



<p>Finally, we check again for missing values and see that everything has been filled. Now, we have a cleansed dataset with 13 columns. </p>



<h4 class="wp-block-heading">2.5 Save a Copy of the Cleaned Data</h4>



<p>Before exploring the features, let&#8217;s make a copy of the cleaned data. We will later use this &#8220;full&#8221; dataset to compare the performance of our model with a baseline model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a copy of the dataset with all features for comparison reasons
df_all = df.copy()</pre></div>



<h3 class="wp-block-heading">Step #3 Getting started with Statistical Univariate Analysis</h3>



<p>Now it&#8217;s time to analyze the data and explore potentially useful features for our subset. Although the process follows a linear flow in this example, in practice you will often go back and forth between the different steps of feature exploration and engineering. </p>



<p>First, we will look at the variance of the features in the initial dataset. Machine learning models can only learn from variables that have adequate variance. So, low-variance features are often candidates to exclude from the feature subset.</p>
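<p>A quick way to screen for such candidates is to check what share of rows the most frequent level occupies; a share near 1.0 signals low variance. The toy columns and the 95% threshold below are arbitrary illustrations, not part of the actual analysis:</p>

```python
import pandas as pd

# toy columns: one nearly constant, one well spread (hypothetical data)
df = pd.DataFrame({"transmission_type": ["automatic"] * 97 + ["manual"] * 3,
                   "condition": list(range(100))})

# flag columns where a single level dominates (here: > 95% of rows)
low_variance = [col for col in df.columns
                if df[col].value_counts(normalize=True).iloc[0] > 0.95]
print(low_variance)  # ['transmission_type']
```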



<p>We use the .describe() method to display univariate descriptive statistics about the numerical columns in our dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># show statistics for numeric variables
print(df.columns)
df.describe()</pre></div>



<p>Next, we check the variance of the individual variables. All variables seem to have adequate variance. We can measure variance with statistical metrics or inspect it visually using bar charts and scatterplots.</p>



<p>We can use histplots to visualize the distributions of the numeric variables. The example below shows the histplot for our target variable sale_price.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Explore the variance of the target variable
variable_name = 'sale_price'
fig, ax = plt.subplots(figsize=(14,5))
sns.histplot(data=df[[variable_name]].dropna(), ax=ax, color='royalblue', kde=True)
ax.get_legend().remove()
ax.set_title(variable_name + ' Distribution')
ax.set_xlim(0, df[variable_name].quantile(0.99))</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9395" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-23-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png" data-orig-size="1051,395" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-23" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-23-1024x385.png" alt="distribution of the target variable in sale price regression; example for feature exploration and preparation with python and sklearn" class="wp-image-9395" width="695" height="261" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 1051w" sizes="(max-width: 695px) 100vw, 695px" /></figure>



<p>The histplot shows that sale prices are skewed to the right: there are many cheap cars and fewer expensive ones, which makes sense.</p>
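<p>We can also quantify the asymmetry numerically: a distribution with many cheap cars and a long tail of expensive ones has positive skew, and a log transform is a common way to compress such a tail. The prices below are toy values, not taken from the dataset:</p>

```python
import numpy as np
import pandas as pd

# toy right-skewed prices: many cheap cars, one expensive outlier
prices = pd.Series([1000, 2000, 3000, 5000, 8000, 60000], dtype=float)

print(prices.skew())            # positive value => long right tail
log_prices = np.log1p(prices)   # log transform compresses the tail
print(log_prices.skew())        # smaller positive value
```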



<p>Next, we create box plots to inspect the spread of the numeric variables.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 3.2 Illustrate the Variance of Numeric Variables 
f_list_numeric = [x for x in df.columns if (is_numeric_dtype(df[x]) and df[x].nunique() &gt; 2)]
f_list_numeric
# box plot design
PROPS = {
    'boxprops':{'facecolor':'none', 'edgecolor':'royalblue'},
    'medianprops':{'color':'coral'},
    'whiskerprops':{'color':'royalblue'},
    'capprops':{'color':'royalblue'}
    }
sns.set_style('ticks', {'axes.edgecolor': 'grey',  
                        'xtick.color': '0',
                        'ytick.color': '0'})
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(len(f_list_numeric) / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(14, nrows*1))
for i, ax in enumerate(fig.axes):
    if i &lt; len(f_list_numeric):
        column_name = f_list_numeric[i]
        sns.boxplot(data=df[column_name], orient=&quot;h&quot;, ax = ax, color='royalblue', flierprops={&quot;marker&quot;: &quot;o&quot;}, **PROPS)
        ax.set(yticklabels=[column_name])
        fig.tight_layout()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9392" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/barplots-to-visualize-the-variance-of-categorical-variables/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png" data-orig-size="1434,425" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="barplots-to-visualize-the-variance-of-categorical-variables" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png" src="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables-1024x303.png" alt="" class="wp-image-9392" width="786" height="232" srcset="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 1434w" sizes="(max-width: 786px) 100vw, 786px" /></figure>



<p>We can observe two things: First, the variance of transmission type is low, as most cars have an automatic transmission. So transmission_type is the first variable that we exclude from our feature subset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Drop features with low variance
df = df.drop(columns=['transmission_type'])
df.head(2)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			ca		5.0			16639.0		white		black		21500		2014-12-16
1	2015		Nissan	Altima	2.5 S	Sedan		ca		1.0			5554.0		gray		black		10900		2014-12-30</pre></div>



<p>Second, int_color and ext_color have many categorical values. By grouping some of these values that hardly ever occur, we can help the model to focus on the most relevant patterns. However, before we do that, we need to take a closer look at how the target variable differs between the categories. </p>
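<p>To decide which categories to group, it helps to inspect their relative frequencies first. The following sketch illustrates the idea on a hypothetical toy frame; the color values and the 15% cut-off are illustrative assumptions, not values from the car dataset.</p>

```python
import pandas as pd

# Toy stand-in for the car data; values and the 15% cut-off are illustrative assumptions
df_demo = pd.DataFrame({"int_color": ["black"] * 5 + ["gray"] * 3 + ["lime", "teal"]})

# relative frequency of each category
freq = df_demo["int_color"].value_counts(normalize=True)

# categories below a 15% share become candidates for an "other" bucket
rare = freq[freq < 0.15].index.tolist()
df_demo["int_color"] = df_demo["int_color"].where(~df_demo["int_color"].isin(rare), "other")
```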



<h3 class="wp-block-heading">Step #4 Bivariate Analysis</h3>



<p>Now that we have a general understanding of our dataset&#8217;s individual variables, let&#8217;s look at pairwise dependencies. We are particularly interested in the relationship between the features and the target variable. Our goal is to keep features whose dependence on the target variable shows some pattern &#8211; linear or non-linear. On the other hand, we want to exclude features whose relationship with the target variable looks arbitrary.</p>



<p>Visualizations have to take the datatypes of our variables into account. To illustrate the relation between categorical features and the target, we create boxplots and kdeplots. For numeric (continuous) features, we use scatterplots.</p>



<h4 class="wp-block-heading">4.1 Analyzing the Relation between Features and the Target Variable</h4>



<p>We begin by taking a closer look at the int_color and ext_color. We use kdeplots to highlight the distribution of prices depending on different colors. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_kdeplot(column_name):
    fig, ax = plt.subplots(figsize=(20,8))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()
    
make_kdeplot('ext_color')
</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9418" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output-2-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png" data-orig-size="1168,507" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output-2-1024x444.png" alt="Density plots are useful during feature exloration and selection" class="wp-image-9418" width="637" height="275" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 768w" sizes="(max-width: 637px) 100vw, 637px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_kdeplot('int_color')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9419" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output2-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output2.png" data-orig-size="1168,507" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output2.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output2-1024x444.png" alt="Another density plot that shows the distribution of colors across our car dataset" class="wp-image-9419" width="655" height="283" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 1168w" sizes="(max-width: 655px) 100vw, 655px" /></figure>



<p>In both cases, a few colors are prevalent and account for most observations. Moreover, distributions of the car price differ for these prevalent colors. These differences look promising as they may help our model to differentiate cheaper cars from more expensive ones. To simplify things, we group the colors that hardly occur into a color category called &#8220;other.&#8221;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Binning features
df['int_color'] = [x if x in ['black', 'gray', 'white', 'silver', 'blue', 'red'] else 'other' for x in df['int_color']]
df['ext_color'] = [x if x in ['black', 'gray', 'white', 'silver', 'blue', 'red'] else 'other' for x in df['ext_color']]</pre></div>



<p>Next, we create plots for all remaining features. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualizing distributions of the remaining candidate features
f_list = [x for x in df.columns if ((is_numeric_dtype(df[x])) and x != target_name) or (df[x].nunique() &lt; 50)]
f_list_len = len(f_list)
print(f'features to plot: {f_list_len}')
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(f_list_len / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(18, nrows*5))
for i, ax in enumerate(fig.axes):
    if i &lt; f_list_len:
        column_name = f_list[i]
        print(column_name)
        # If a numeric variable has more than 100 unique values, draw a scatterplot, else draw a boxplot
        if df[column_name].nunique() &gt; 100 and is_numeric_dtype(df[column_name]):
            # Draw a scatterplot for each variable and target_name
            sns.scatterplot(data=df, y=target_name, x=column_name, ax = ax)
        else: 
            # Draw a vertical violinplot (or boxplot) grouped by a categorical variable:
            myorder = df.groupby(by=[column_name])[target_name].median().sort_values().index
            sns.boxplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
            #sns.violinplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
        ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
        ax.set_title(column_name)
    fig.tight_layout()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9397" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/boxplots/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png" data-orig-size="1289,2153" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="boxplots" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png" src="https://www.relataly.com/wp-content/uploads/2022/09/boxplots-613x1024.png" alt="boxplots and scatterplots help us to understand the relationship between our features and the target variable" class="wp-image-9397" width="725" height="1211" srcset="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 613w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 180w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 920w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 1226w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 1289w" sizes="(max-width: 725px) 100vw, 725px" /></figure>



<p>Again, for categorical variables, we want to see differences in the distribution across categories. Based on the boxplots&#8217; medians and quantiles, we can see that prod_year, int_color, and condition show adequate variance. The scatterplot for the odometer value also looks good, so we keep these features. In contrast, the differences across &#8220;state&#8221; and &#8220;ext_color&#8221; are rather weak. Therefore, we exclude these variables from our subset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># drop columns with low variance
df.drop(columns=['state', 'ext_color'], inplace=True)</pre></div>



<p>Finally, if you want to take a more detailed look at the numeric features, you can use jointplots. These are scatterplots with additional information about the distributions. The example below shows the jointplot for the odometer value vs price. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># detailed univariate and bivariate analysis of 'odometer' using a jointplot 
def make_jointplot(feature_name):
    p = sns.jointplot(data=df, y=feature_name, x=target_name, height=6, ratio=6, kind='reg', joint_kws={'line_kws':{'color':'coral'}})
    p.fig.suptitle(feature_name + ' Distribution')
    p.ax_joint.collections[0].set_alpha(0.3)
    p.ax_joint.set_ylim(df[feature_name].min(), df[feature_name].max())
    p.fig.tight_layout()
    p.fig.subplots_adjust(top=0.95)
make_jointplot('odometer')
# Alternatively you can use hex_binning
# def make_joint_hexplot(feature_name):
#     p = sns.jointplot(data=df, y=feature_name, x=target_name, height=10, ratio=1, kind=&quot;hex&quot;)
#     p.ax_joint.set_ylim(0, df[feature_name].quantile(0.999))
#     p.ax_joint.set_xlim(0, df[target_name].quantile(0.999))
#     p.fig.suptitle(feature_name + ' Distribution')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11491" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-8-10/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" data-orig-size="425,427" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" alt="" class="wp-image-11491" width="499" height="502" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png 425w, https://www.relataly.com/wp-content/uploads/2022/12/image-8.png 140w" sizes="(max-width: 499px) 100vw, 499px" /></figure>



<p>Here is another example of a jointplot for the variable &#8216;condition.&#8217;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># detailed univariate and bivariate analysis of 'condition' using a jointplot 
make_jointplot('condition')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9423" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/jointplot-condition/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" data-orig-size="425,427" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="jointplot-condition" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" src="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" alt="Dotplot that shows the relationship between two variables: car condition vs sale price" class="wp-image-9423" width="472" height="475" srcset="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png 425w, https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png 150w" sizes="(max-width: 472px) 100vw, 472px" /></figure>



<p>The graphs show an approximately linear relationship between the sale price and both the condition and the odometer value. </p>



<h4 class="wp-block-heading" id="h-4-2-correlation-matrix">4.2 Correlation Matrix</h4>



<p>Correlation analysis is a technique to quantify the dependency between numeric features and a target variable. Different ways exist to calculate the correlation coefficient. For example, we can use Pearson correlation (linear relation), Kendall correlation (ordinal association), or Spearman (monotonic dependence). </p>



<p>The example below uses Pearson correlation, which concentrates on the linear relationship between two variables. The Pearson correlation score lies between -1 and 1. General interpretations of the absolute value of the correlation coefficient&nbsp;are:</p>



<ul class="wp-block-list">
<li>.00-.19 &#8220;very weak&#8221;</li>



<li>.20-.39 &#8220;weak&#8221;</li>



<li>.40-.59 &#8220;moderate&#8221;</li>



<li>.60-.79 &#8220;strong&#8221;</li>



<li>.80-1.0 &#8220;very strong&#8221;</li>
</ul>
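<p>All three coefficients are available in pandas via the <code>method</code> argument of <code>corr()</code>. A small illustration on made-up numbers, where mileage rises while price falls strictly monotonically:</p>

```python
import pandas as pd

# Made-up numbers: odometer rises while price falls strictly monotonically
df_demo = pd.DataFrame({
    "odometer": [10, 20, 30, 40, 50],
    "price":    [50, 40, 35, 20, 10],
})

pearson  = df_demo["odometer"].corr(df_demo["price"], method="pearson")   # linear relation
spearman = df_demo["odometer"].corr(df_demo["price"], method="spearman")  # monotonic dependence
kendall  = df_demo["odometer"].corr(df_demo["price"], method="kendall")   # ordinal association
```

<p>Because the relation is perfectly monotonic but not perfectly linear, Spearman and Kendall reach -1 while Pearson stays slightly above it.</p>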



<p>More information on the Pearson correlation can be found <a href="https://www.relataly.com/category/data-science/pearson-correlation/" target="_blank" rel="noreferrer noopener">here</a> and in <a href="https://www.relataly.com/stock-market-correlation-matrix-in-python/103/" target="_blank" rel="noreferrer noopener">this article on the correlation between covid-19 and the stock market</a>.</p>



<p>We will calculate a correlation matrix that provides the correlation coefficient for all features in our subset, incl. sale_price.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 4.2 Correlation Matrix
# a correlation heatmap allows us to identify highly correlated explanatory variables and reduce collinearity
plt.figure(figsize=(9, 8))
plt.yticks(rotation=0)
correlation = df.corr(numeric_only=True)  # restrict to numeric columns (required in recent pandas versions)
ax = sns.heatmap(correlation, cmap='GnBu', square=True, linewidths=.1, cbar_kws={&quot;shrink&quot;: .82},
                 annot=True, fmt='.1f', annot_kws={&quot;size&quot;:10})
sns.set(font_scale=0.8)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9400" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-24-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" data-orig-size="646,549" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-24" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" alt="Heatmap in Python that shows the correlation between selected variables in our car dataset" class="wp-image-9400" width="554" height="471" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png 646w, https://www.relataly.com/wp-content/uploads/2022/09/image-24.png 300w" sizes="(max-width: 554px) 100vw, 554px" /></figure>



<p>All our remaining numeric features correlate strongly with price (positively or negatively). However, this is not all that matters. Ideally, our features should have a low correlation with each other. We can see that prod_year and condition are moderately correlated (coefficient: 0.5). Because prod_year correlates more strongly with price (coefficient: 0.6) than condition does (coefficient: 0.5), we drop the condition variable. </p>
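<p>For larger feature sets, scanning the heatmap by eye becomes tedious. The same decision can be scripted; the following helper is a sketch (not part of the original notebook) that lists all feature pairs above a chosen absolute Pearson correlation threshold:</p>

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return (feature_a, feature_b, |r|) for pairs above the given absolute Pearson correlation."""
    corr = df.corr().abs()
    # keep the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(float(upper.loc[a, b]), 2))
            for (a, b), v in upper.stack().items() if v > threshold]

# toy frame: 'b' is an exact multiple of 'a', 'c' is only weakly related
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
pairs = correlated_pairs(demo, threshold=0.9)
```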



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">df.drop(columns='condition', inplace=True)</pre></div>



<h3 class="wp-block-heading">Step #5 Data Preprocessing </h3>



<p>Now our subset contains the following variables:</p>



<ul class="wp-block-list">
<li>prod_year</li>



<li>maker</li>



<li>model</li>



<li>trim</li>



<li>body_type</li>



<li>odometer</li>



<li>int_color</li>



<li>sale_price</li>
</ul>



<p>Next, we prepare the data for use as input to train a regression model. Before we train the model, we need to make a few final preparations. For example, we use a label encoder to replace the string values of the categorical variables with numeric values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># encode categorical variables 
def encode_categorical_variables(df):
    # create a list of categorical variables that we want to encode
    categorical_list = [x for x in df.columns if is_string_dtype(df[x])]
    le = LabelEncoder()
    # apply the encoding to the categorical variables
    # because the apply() function has no inplace argument,  we use the following syntax to transform the df
    df[categorical_list] = df[categorical_list].apply(LabelEncoder().fit_transform)
    return df
df_final_subset = encode_categorical_variables(df)
df_all_ = encode_categorical_variables(df_all)
# create a copy of the dataframe but without the target variable
df_without_target = df.drop(columns=[target_name])
df_final_subset.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	odometer	int_color	sale_price	date
0	2015		23		594		794		31			16639.0		0			21500		8
1	2015		34		59		98		32			5554.0		0			10900		17
2	2014		2		46		180		32			14414.0		0			49750		8
3	2015		34		59		98		32			11398.0		0			14100		13
4	2015		7		325		789		32			14538.0		0			7200		158</pre></div>



<h3 class="wp-block-heading" id="h-step-6-splitting-the-data-and-training-the-model">Step #6 Splitting the Data and Training the Model</h3>



<p>To ensure that our regression model does not know the target variable, we separate car price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.</p>



<p>Once the split function has prepared the datasets, we train the regression model. Our model uses the Random Forest algorithm from the scikit-learn package. As a so-called ensemble model, the Random Forest is a robust Machine Learning algorithm. It combines the predictions from a set of multiple independent estimators. </p>
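<p>For a regression forest, this combination is a simple average: the forest&#8217;s prediction equals the mean of its individual trees&#8217; predictions. A minimal sketch on synthetic data (the dataset and the small <code>n_estimators=10</code> are illustrative choices, not the article&#8217;s configuration):</p>

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic toy data; n_estimators=10 keeps the sketch fast
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# the forest's regression prediction is the average of its trees' predictions
tree_preds = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
forest_pred = rf.predict(X[:5])
assert np.allclose(tree_preds.mean(axis=0), forest_pred)
```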



<p>The Random Forest algorithm has a wide range of hyperparameters. While we could optimize our model further by testing various configurations (hyperparameter tuning), this is not the focus of this article. Therefore, we will use the default hyperparameters for our model as defined by <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier" target="_blank" rel="noreferrer noopener">scikit-learn</a>. Please visit one of my recent articles on <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" target="_blank" rel="noreferrer noopener">hyperparameter tuning</a>, if you want to learn more about this topic.</p>



<p>For comparison, we train two models: one with our subset of selected features, and a second with all features, cleansed but without any further manipulations. </p>



<p>We use shuffled cross-validation (cv=5) to evaluate our model&#8217;s performance on different data folds.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def splitting(df, name):
    # separate labels from training data
    X = df.drop(columns=[target_name])
    y = df[target_name] #Prediction label
    # split the data into x_train and y_train data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
    # print the shapes: X is (rows, features), y is (rows,)
    print(name)
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test
# train the model
def train_model(X, y, X_train, y_train):
    estimator = RandomForestRegressor() 
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    scores = cross_val_score(estimator, X, y, cv=cv)
    estimator.fit(X_train, y_train)
    return scores, estimator
# train the model with the subset of selected features
X_sub, y_sub, X_train_sub, X_test_sub, y_train_sub, y_test_sub = splitting(df_final_subset, 'subset')
scores_sub, estimator_sub = train_model(X_sub, y_sub, X_train_sub, y_train_sub)
    
# train the model with all features
X_all, y_all, X_train_all, X_test_all, y_train_all, y_test_all = splitting(df_all_, 'fullset')
scores_all, estimator_all = train_model(X_all, y_all, X_train_all, y_train_all)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">subset
train:  (76592, 8) (76592,)
test:  (32826, 8) (32826,)</pre></div>



<h3 class="wp-block-heading" id="h-step-7-comparing-regression-models">Step #7 Comparing Regression Models</h3>



<p>Finally, we want to see how the model performs and how its performance compares against the model that uses all variables. </p>



<h4 class="wp-block-heading" id="h-7-1-model-scoring">7.1 Model Scoring</h4>



<p>We use different regression metrics to measure the performance. Then we create a barplot that compares the performance scores across the different validation folds (due to cross-validation). </p>
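<p>To make these metrics concrete: MAE is the mean absolute deviation in price units, while MAPE normalizes each error by the true value and is reported as a fraction. A small example with made-up prices:</p>

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# made-up sale prices (same units as the target variable)
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 360.0])

mae = mean_absolute_error(y_true, y_pred)              # mean(|y - yhat|), in price units
mape = mean_absolute_percentage_error(y_true, y_pred)  # mean(|y - yhat| / |y|), a fraction
```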



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 7.1 Model Scoring 
def create_metrics(scores, estimator, X_test, y_test, col_name):
    scores_df = pd.DataFrame({col_name:scores})
    # predict on the test set
    y_pred = estimator.predict(X_test)
    y_df = pd.DataFrame(y_test)
    y_df['PredictedPrice']=y_pred
    # Mean Absolute Error (MAE)
    MAE = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))
    # Mean Absolute Percentage Error (MAPE)
    MAPE = mean_absolute_percentage_error(y_test, y_pred)
    print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')
    
    # calculate the feature importance scores
    r = permutation_importance(estimator, X_test, y_test, n_repeats=30, random_state=0)
    data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
    data_im['feature_names'] = X_test.columns
    data_im = data_im.sort_values('feature_permuation_score', ascending=False)
    
    return scores_df, data_im
scores_df_sub, data_im_sub = create_metrics(scores_sub, estimator_sub, X_test_sub, y_test_sub, 'subset')
scores_df_all, data_im_all = create_metrics(scores_all, estimator_all, X_test_all, y_test_all, 'fullset')
scores_df = pd.concat([scores_df_sub, scores_df_all],  axis=1)
# visualize how the two models have performed in each fold
fig, ax = plt.subplots(figsize=(10, 6))
scores_df.plot(y=[&quot;subset&quot;, &quot;fullset&quot;], kind=&quot;bar&quot;, ax=ax)
ax.set_title('Cross validation scores')
ax.set(ylim=(0, 1))
ax.tick_params(axis=&quot;x&quot;, rotation=0, labelsize=10, length=0)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean Absolute Error (MAE): 1643.39
Mean Absolute Percentage Error (MAPE): 24.36 %
Mean Absolute Error (MAE): 1813.78
Mean Absolute Percentage Error (MAPE): 25.23 %</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9436" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-29-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" data-orig-size="746,468" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-29" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" alt="barplot that visualizes cross validation for a car price regression model" class="wp-image-9436" width="494" height="310" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png 746w, https://www.relataly.com/wp-content/uploads/2022/09/image-29.png 300w" sizes="(max-width: 494px) 100vw, 494px" /></figure>



<p>The subset model achieves a mean absolute percentage error of around 24%, which is a reasonable result. More importantly, our model performs better than the model that uses all features. At the same time, the subset model is less complex, as it uses only eight features instead of 12, which makes it easier to understand and less costly to train.</p>



<h4 class="wp-block-heading">7.2 Feature Permutation Importance Scores</h4>



<p>Next, we calculate feature importance scores. In this way, we can determine which features contribute the most to the predictive power of our model. Feature importance scores are a useful tool in the feature engineering process, as they provide insights into how the features in our subset contribute to the overall model performance. Features with low importance scores can be eliminated from the subset or replaced with other features.</p>
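<p>The idea behind permutation importance can be illustrated with a small, self-contained sketch. Note that the dataset and model below are synthetic stand-ins for illustration, not the tutorial&#8217;s car-price data:</p>

```python
# Minimal sketch of permutation importance with scikit-learn.
# The regression dataset and random forest below are illustrative only.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature column in turn and measure the drop in the test score;
# a large drop means the model relies heavily on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f'feature_{i}: {score:.3f}')
```

Features that the model never uses receive importance scores near zero, which is exactly the signal we use to prune the feature subset.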



<p>Again, we compare our subset model to the model that uses all available features from the initial dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># compare the feature importance scores of the subset model to the fullset model
fig, axs = plt.subplots(1, 2, figsize=(20, 8))
sns.barplot(data=data_im_sub, y='feature_names', x=&quot;feature_permuation_score&quot;, ax=axs[0])
axs[0].set_title(&quot;Feature importance scores of the subset model&quot;)
sns.barplot(data=data_im_all, y='feature_names', x=&quot;feature_permuation_score&quot;, ax=axs[1])
axs[1].set_title(&quot;Feature importance scores of the fullset model&quot;)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="421" data-attachment-id="9437" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/cross-validation-scores-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png" data-orig-size="1200,493" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cross-validation-scores-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png" src="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1-1024x421.png" alt="Barplots that compare feature importance between the full dataset model and the subset model" class="wp-image-9437" srcset="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 1200w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In the subset model, most features are relevant to the model&#8217;s performance. Only date and int_color do not seem to have a significant impact. For the fullset model, five out of 12 features hardly contribute to the model performance (date, int_color, ext_color, state, transmission_type). </p>



<p>Once you have a strong subset of features, you can automate the feature selection process using techniques such as forward or backward selection. Automated feature selection tests different model variants with varying feature combinations to determine the best input dataset. This step is often performed at the end of the feature engineering process. But that is a topic for another article. </p>
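<p>As a brief illustration of forward selection, here is a minimal sketch using scikit-learn&#8217;s SequentialFeatureSelector on synthetic data. The dataset and the choice of six target features are assumptions made for the example:</p>

```python
# Hedged sketch of automated forward feature selection with scikit-learn.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 12 features, of which only 4 are informative.
X, y = make_regression(n_samples=200, n_features=12, n_informative=4, random_state=1)

# Greedily add the feature that improves the cross-validated score the most,
# stopping once six features have been selected.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=6, direction='forward', cv=5
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected columns
```

Setting `direction='backward'` would instead start from all features and greedily remove the least useful ones.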



<h2 class="wp-block-heading" id="h-conclusions">Conclusions</h2>



<p>That&#8217;s it for now! This tutorial has presented an exploratory approach to feature exploration, engineering, and selection. You have gained an overview of tools and graphs that are useful in identifying and preparing features. The second part was a hands-on Python tutorial in which we followed an exploratory feature engineering process to build a regression model for car prices. We used various techniques to discover and sort features and to build a strong feature subset, including data cleansing, descriptive statistics, and univariate and bivariate analysis (incl. correlation). We also applied techniques for feature manipulation, such as binning. Finally, we compared our subset model to one that uses all available data. </p>



<p>If you take away one lesson from this article, remember that in machine learning, less is often more. Classic machine learning models trained on carefully curated feature subsets often outperform models that use all available information. </p>



<p>I hope this article was helpful. I am always trying to improve and learn from my audience. So, if you have any questions or suggestions, please write them in the comments. </p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3eD49Kv" target="_blank" rel="noreferrer noopener">Zheng and Casari (2018) Feature Engineering for Machine Learning</a></li>



<li><a href="https://amzn.to/3TrBdDY" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3T38bLe" target="_blank" rel="noreferrer noopener">Chip Huyen (2022) Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p>Stock-market prediction is a typical regression problem. To learn more about feature engineering for stock-market prediction, check out <a href="https://www.relataly.com/feature-engineering-for-multivariate-time-series-models-with-python/1813/" target="_blank" rel="noreferrer noopener">this article on multivariate stock-market forecasting</a>.</p>
<p>The post <a href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/">Feature Engineering and Selection for Regression Models with Python and Scikit-learn</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">8832</post-id>	</item>
		<item>
		<title>Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</title>
		<link>https://www.relataly.com/content-based-movie-recommender-using-python/4294/</link>
					<comments>https://www.relataly.com/content-based-movie-recommender-using-python/4294/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 25 Jul 2022 11:29:00 +0000</pubDate>
				<category><![CDATA[Content-based Filtering]]></category>
		<category><![CDATA[Correlation]]></category>
		<category><![CDATA[Cosine Similarity]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Recommender Systems]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=4294</guid>

					<description><![CDATA[<p>Content-based recommender systems are a popular type of machine learning algorithm that recommends relevant articles based on what a user has previously consumed or liked. This approach aims to identify items with certain keywords, understand what the customer likes, and then identify other items that are similar to items the user has previously consumed or ... <a title="Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python" class="read-more" href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/" aria-label="Read more about Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/">Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Content-based recommender systems are a popular type of machine learning algorithm that recommends relevant articles based on what a user has previously consumed or liked. This approach aims to identify items with certain keywords, understand what the customer likes, and then identify other items that are similar to items the user has previously consumed or rated. The recommendations are based on the similarity of the items, represented by similarity scores in a vector matrix. The attributes used to describe an item are called &#8220;content.&#8221; For example, in the case of movie recommendations, content could be the genre, actors, director, year of release, etc. A well-designed content-based recommendation service will suggest movies of the same genre, actors, or keywords. This tutorial will implement a content-based recommendation service for movies using Python and Scikit-learn. </p>



<p>The rest of this tutorial proceeds as follows: After a brief introduction to content-based recommenders, we will work with a database that contains several thousand <a href="https://www.imdb.com/" target="_blank" rel="noreferrer noopener">IMDB </a>movie titles and create a feature model that uses actors, release year, and a short description for each movie. You will also learn how to deal with some challenges of building a content-based recommender. For example, we will look at how to engineer features from textual data for a content-based model and how to reduce its dimensionality. Finally, we use our model to generate some sample predictions.</p>



<p>Note: Another popular type of recommender system that I have covered in a previous article is <a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/" target="_blank" rel="noreferrer noopener">collaborative filtering</a>.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="508" height="508" data-attachment-id="12673" data-permalink="https://www.relataly.com/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" data-orig-size="508,508" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="signpost decisions recommender system machine learning python tutorial midjourney (2)-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" alt="Recommendation systems can ease decision-making. Image created with Midjourney." 
class="wp-image-12673" srcset="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Recommendation systems can ease decision-making. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-what-is-content-based-filtering">What is Content-Based Filtering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The idea behind content-based recommenders is to generate recommendations based on a user&#8217;s preferences and tastes. These preferences revolve around past user choices, for example, the number of times a user has watched a movie, purchased an item, or clicked on a link. </p>



<p>Content-based filtering uses domain-specific item features to measure the similarity between items. Given the user preferences, the algorithm will recommend items similar to what the user has consumed or liked before. For movie recommendations, this content can be the genre, actors, release year, director, film length, or keywords used to describe the movies. This approach works particularly well for domains with a lot of textual metadata, such as movies and videos, books, or products.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-kadence-image kb-image_efc597-b6 size-large"><img decoding="async" width="512" height="500" data-attachment-id="12668" data-permalink="https://www.relataly.com/polaroid-film-movie-recommender-machine-learning-python-tutorial-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png" data-orig-size="518,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="polaroid film movie recommender machine learning python tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min-512x500.png" alt="" class="kb-img wp-image-12668" srcset="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 518w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption>Content-based movie recommendations will suggest more of the same, for example, actors, genres, stories, and directors.<br></figcaption></figure>
</div>
</div>



<p></p>



<h3 class="wp-block-heading">Basic Steps to Building a Content-based Recommender System</h3>



<p>The approach to building a content-based recommender involves four essential steps:</p>



<ol class="wp-block-list">
<li>The first step is to create a so-called &#8216;bag of words&#8217; model from the input data, which is a list of words used to characterize the items. This step involves selecting useful content for describing and differentiating the items. The more precise the information, the better the recommendations will be. </li>



<li>The next step is to turn the bag of words into a feature vector. Different algorithms can be used for this step, for example, the TF-IDF vectorizer or the count vectorizer. The result is a vector matrix with items as records and features as columns. This step often also includes applying techniques for dimensionality reduction. </li>



<li>Content-based recommendations rest on measuring item similarity. Similarity scores are assigned through pairwise comparison. Here again, we can choose between different measures, e.g., the dot product or cosine similarity. </li>



<li>Once you have the similarity scores, you can return the most similar items by sorting the data by similarity scores. Given user preferences (single or multiple items a user consumed or liked), the algorithm will then recommend the most similar items. </li>
</ol>
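<p>The four steps above can be sketched end-to-end on a tiny toy catalog. The movie titles and keyword bags below are invented purely for illustration:</p>

```python
# Toy end-to-end sketch of a content-based recommender.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: a 'bag of words' describing each item (invented sample data)
movies = pd.DataFrame({
    'title': ['Space Saga', 'Galaxy Wars', 'Romantic Paris', 'Love in Rome'],
    'bag_of_words': [
        'space adventure aliens hero',
        'space battle aliens empire',
        'romance paris love drama',
        'romance rome love comedy',
    ],
})

# Step 2: turn the bags of words into feature vectors
vectors = TfidfVectorizer().fit_transform(movies['bag_of_words'])

# Step 3: pairwise similarity scores between all items
sim = cosine_similarity(vectors)

# Step 4: recommend the items most similar to a given one
def recommend(title, n=2):
    idx = movies.index[movies['title'] == title][0]
    ranked = sim[idx].argsort()[::-1]         # most similar first
    ranked = [i for i in ranked if i != idx]  # drop the item itself
    return movies['title'].iloc[ranked[:n]].tolist()

print(recommend('Space Saga', 1))  # -> ['Galaxy Wars']
```

Because &#8220;Space Saga&#8221; and &#8220;Galaxy Wars&#8221; share the keywords &#8220;space&#8221; and &#8220;aliens&#8221;, their vectors are close, so one is recommended for the other.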



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8937" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-11-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png" data-orig-size="1911,1227" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png" src="https://www.relataly.com/wp-content/uploads/2022/07/image-11-1024x657.png" alt="" class="wp-image-8937" width="1169" height="751" srcset="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1536w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1911w" sizes="(max-width: 1169px) 100vw, 1169px" /><figcaption class="wp-element-caption">Approach to Building a Content-based Recommender System</figcaption></figure>



<h3 class="wp-block-heading">Similarity Scoring</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The quality of content-based recommendations is significantly influenced by how well the algorithm succeeds in measuring the similarity of the items. There are different techniques to calculate similarity, including Cosine Similarity, Pearson Similarity, Dot Product, and Euclidean Distance. They have in common that they use numerical characteristics of the text to calculate the distance between text vectors in an n-dimensional vector space.</p>



<p>It is worth noting that these techniques can only measure word-level similarity. This means the algorithms compare the items word for word without considering the semantic meaning of the sentences. In some instances, this can lead to errors. For example, how similar are <em>&#8220;now that they were sitting on a bank, he noticed she stole his heart, and he was in love&#8221;</em> and <em>&#8220;They are gangsters who love to steal from a large bank&#8221;</em>? Judging by words alone, the two sentences may appear similar simply because their words overlap.</p>
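<p>A quick sketch makes this pitfall concrete: vectorizing the two sentences above with a simple count vectorizer yields a non-zero similarity score purely from shared words such as &#8220;bank&#8221; and &#8220;love&#8221;, even though the sentences mean very different things:</p>

```python
# Word-level similarity ignores semantics: unrelated sentences that share
# words still receive a positive similarity score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    'now that they were sitting on a bank he noticed she stole his heart and he was in love',
    'they are gangsters who love to steal from a large bank',
]

vectors = CountVectorizer().fit_transform(sentences)
score = cosine_similarity(vectors)[0, 1]
print(f'word-level similarity: {score:.2f}')  # positive despite different meanings
```
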
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p></p>



<h3 class="wp-block-heading">Pros and Cons of Content-based Filtering</h3>



<p>Like most machine learning algorithms, content-based recommenders have their strengths and weaknesses.</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p><strong>Advantages</strong></p>



<ul class="wp-block-list">
<li>Content-based filtering is good at capturing a user&#8217;s specific interests and will recommend more of the same (for example, genre, actors, directors, etc.). It will also recommend niche items if they match the user preferences, even if these items draw little attention.</li>



<li>Another advantage is that the model can generate recommendations for a specific user without the knowledge of other users. This is particularly helpful if you want to generate predictions for many users.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p><strong>Disadvantages</strong></p>



<ul class="wp-block-list">
<li>On the other hand, there are also a couple of downsides. The feature representation of the items has to be done manually to a certain extent, and the prediction quality strongly depends on whether items are described in detail. Therefore, content-based filtering requires a lot of expertise.</li>



<li>Recommendations are based on the user&#8217;s previous interests and are unlikely to go beyond them into areas (e.g., genres) that are still unknown to the user. Content-based models thus tend to develop a form of tunnel vision, recommending more and more of the same.</li>
</ul>
</div>
</div>



<div style="height:54px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-implementing-a-content-based-movie-recommender-in-python">Implementing a Content-based Movie Recommender in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In the following, we will implement a content-based movie recommender using Python and Scikit-learn, carrying out all the steps necessary to create a content-based recommender. The data comes from an IMDB dataset containing more than 40k films released between 1996 and 2018. Based on the data, we define the features we want to use for recommending movies, such as the genre, director, main actors, plot keywords, and other metadata associated with the movies. Then we preprocess the data to extract these features and create a feature matrix. The feature matrix becomes the foundation for a similarity matrix that measures the similarity between the items based on their feature vectors. Finally, we use the similarity matrix to generate recommendations for a given item. </p>



<p>By the end of this Python tutorial, you will have learned how to implement a content-based recommendation system for movies using Python and Scikit-learn. This knowledge can be applied to other types of recommendations, such as articles, products, or songs.</p>



<p>The code is available in the GitHub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_c5cd86-dc"><a class="kb-button kt-button button kb-btn_a1ac8b-37 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/05%20Recommender%20Systems/032%20Movie%20Recommender%20using%20Content-based%20Filtering.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_f7ca5b-85 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2023/01/DALL·E-2023-01-12-19.26.12-A-plush-robot-watching-a-movie-on-tv-min.png" alt="This Robot doesn't know what to watch on tv. Let's build a recommender system for him! Image generated using DALL-E 2 by OpenAI." class="wp-image-11993"/><figcaption class="wp-element-caption">This Robot doesn&#8217;t know what to watch on tv. Let&#8217;s build a recommender system for him! Image generated using <a href="https://openai.com/dall-e-2/" target="_blank" rel="noreferrer noopener">DALL-E 2 by OpenAI</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before you start with the coding part, ensure you have set up your Python 3 environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. Follow <a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a> to set it up.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn </a>for visualization and the natural language processing library <a href="https://www.nltk.org/" target="_blank" rel="noreferrer noopener">nltk</a>. </p>



<p>You can install these packages by using one of the following commands: </p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-about-the-imdb-movies-dataset">About the IMDB Movies Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>We will train our movie recommender on a popular Movies Dataset (you can download it from <a href="https://grouplens.org/datasets/movielens/latest/" target="_blank" rel="noreferrer noopener">grouplens.org</a>). The MovieLens recommendation service collected the Dataset from 610 users between 1996 and 2018. Unpack the data into the working folder of your project.</p>



<p>The full Dataset contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users. The Dataset contains the following files (Source of the data description: <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset" target="_blank" rel="noreferrer noopener">Kaggle.com</a>):</p>



<ul class="wp-block-list">
<li><strong>movies_metadata.csv:</strong>&nbsp;The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens Dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.
</li>



<li><strong>ratings_small.csv:</strong>&nbsp;A subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a 5-star movie rating with half-star increments (0.5 &#8211; 5.0 stars).</li>



<li><strong>keywords.csv:</strong>&nbsp;Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.</li>



<li><strong>credits.csv:</strong>&nbsp;Consists of Cast and Crew Information for all our films. Available in the form of a stringified JSON Object.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7128" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/mdb-movie-database/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" data-orig-size="1024,537" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="recomender systems collaborative filtering imdb movies" data-image-description="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-image-caption="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" src="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" alt="MDB Movie Database Recommender Systems Collaborative Filtering " class="wp-image-7128" width="366" height="192" srcset="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 768w" sizes="(max-width: 366px) 100vw, 366px" /><figcaption class="wp-element-caption">IMDB Movie Database </figcaption></figure>
</div>
</div>



<p>Several other files are included that we won&#8217;t use, including ratings_small, links_small, and links.</p>



<p>You can download it <a href="https://grouplens.org/datasets/movielens/latest/" target="_blank" rel="noreferrer noopener">here</a> or from <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset" target="_blank" rel="noreferrer noopener">Kaggle</a>.</p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p>Our goal is to create a content-based recommender system for movie recommendations. In this case, the content will be meta information on movies, such as genres, actors, and the plot description.</p>



<p>We begin by making imports and loading the data from three files:</p>



<ul class="wp-block-list">
<li>movies_metadata.csv</li>



<li>credits.csv</li>



<li>keywords.csv</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords

# the IMDB movies data is available on Kaggle.com
# https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

# in case you have placed the files outside of your working directory, you need to specify the path
path = 'data/movie_recommendations/' 

# load the movie metadata
df_meta=pd.read_csv(path + 'movies_metadata.csv', low_memory=False, encoding='UTF-8') 

# some records have invalid ids, which is why we remove them
df_meta = df_meta.drop([19730, 29503, 35587])

# convert the id to type int and set id as index
df_meta = df_meta.set_index(df_meta['id'].str.strip().str.replace(',', '').astype(int))
pd.set_option('display.max_colwidth', 20)
df_meta.head(2)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		adult	belongs_to_collection			budget		genres				homepage			id		imdb_id		original_language	original_title	overview			...	release_date	revenue		runtime	spoken_languages	status		tagline	title	video	vote_average	vote_count
id																					
862		False	{'id': 10194, 'n...				30000000	[{'id': 16, 'nam...	http://toystory....	862		tt0114709	en					Toy Story		Led by Woody, An...	...	1995-10-30		373554033.0	81.0	[{'iso_639_1': '...	Released	NaN	Toy Story	False	7.7	5415.0
8844	False	NaN								65000000	[{'id': 12, 'nam...	NaN					8844	tt0113497	en				Jumanji				When siblings Ju...	...	1995-12-15		262797249.0	104.0	[{'iso_639_1': '...	Released	Roll the dice an...	Jumanji	False	6.9	2413.0</pre></div>



<p>After we have loaded credits and keywords, we will combine the data into a single dataframe. Now we have various input fields available. However, we will only use keywords, cast, year of release, genres, and overview. If you like, you can enhance the data with additional inputs, for example, budget, running time, or film language. </p>



<p>Once we have gathered our data in a single dataframe, we print out the first rows to gain an overview of the data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># load the movie credits
df_credits = pd.read_csv(path + 'credits.csv', encoding='UTF-8')
df_credits = df_credits.set_index('id')

# load the movie keywords
df_keywords=pd.read_csv(path + 'keywords.csv', low_memory=False, encoding='UTF-8') 
df_keywords = df_keywords.set_index('id')

# merge everything into a single dataframe 
df_k_c = df_keywords.merge(df_credits, left_index=True, right_on='id')
df = df_k_c.merge(df_meta[['release_date','genres','overview','title']], left_index=True, right_on='id')
df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		keywords			cast				crew				release_date	genres				overview			title
id							
862		[{'id': 931, 'na...	[{'cast_id': 14,...	[{'credit_id': '...	1995-10-30		[{'id': 16, 'nam...	Led by Woody, An...	Toy Story
8844	[{'id': 10090, '...	[{'cast_id': 1, ...	[{'credit_id': '...	1995-12-15		[{'id': 12, 'nam...	When siblings Ju...	Jumanji
15602	[{'id': 1495, 'n...	[{'cast_id': 2, ...	[{'credit_id': '...	1995-12-22		[{'id': 10749, '...	A family wedding...	Grumpier Old Men</pre></div>



<div style="height:15px" aria-hidden="true" class="wp-block-spacer"></div>



<p>We can see that keywords, cast, crew, and genres contain stringified lists of dictionaries. To create a cosine similarity matrix, we need to extract the relevant terms from these columns and gather them in a single column. This is what we will do in the next step.</p>
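The list-like columns store stringified Python literals. Here is a minimal, hypothetical sketch of the extraction, using `ast.literal_eval` as a safer alternative to `eval` for parsing such strings (the cell value below is made up, but shaped like an entry of the genres column):

```python
import ast

# a hypothetical cell value, shaped like one entry of the 'genres' column
cell = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# parse the stringified list of dicts and keep only the 'name' fields
names = [entry['name'] for entry in ast.literal_eval(cell)]
print(names)  # ['Animation', 'Comedy']
```

Applied per row with `DataFrame.apply`, this yields the plain-text terms that we will concatenate into a single column.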



<h3 class="wp-block-heading" id="h-step-2-feature-engineering-and-data-cleaning">Step #2: Feature Engineering and Data Cleaning</h3>



<p>A problem with modeling text is that machine learning algorithms have difficulty processing text directly. An essential step in creating content-based recommenders is bringing the text into a machine-readable form. This is what we call feature engineering. </p>



<h4 class="wp-block-heading">2.1 Creating a Bag-of-Words Model</h4>



<p>We begin the feature engineering by creating the bag of words. As mentioned, a bag of words is a list of words that describe the items in a dataset, such as films, and differentiate them from one another. Creating a bag of words removes stopwords but preserves multiplicity, so a word can occur multiple times in the concatenated text. Later, each word serves as a feature in calculating cosine similarities. </p>
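A minimal sketch of these two properties, stopword removal and preserved multiplicity, using a hand-picked toy stopword list (the tutorial itself uses nltk's English stopwords later):

```python
# toy stopword list for illustration only; later we use nltk's English stopwords
stop_words = {'a', 'the', 'by', 'an'}

text = "a toy story about a toy led by Woody"

# remove stopwords but keep duplicates: 'toy' stays in the bag twice
bag = [word for word in text.lower().split() if word not in stop_words]
print(bag)  # ['toy', 'story', 'about', 'toy', 'led', 'woody']
```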



<p>The input for a bag of words does not necessarily come from a single input column. We will use keywords, genres, cast, and overview and merge them into a new single column that we call tags. Make sure to preserve the nature of each text field: we keep first and last names together as single tokens rather than splitting them, unlike the words from the overview column. The result of this process is our bag.</p>



<p>In addition, we add the movie title and a new index (id), which will later ease working with the similarity matrix. Finally, we print the first rows of our feature dataframe. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create an empty DataFrame
df_movies = pd.DataFrame()

# extract the keywords
df_movies['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['keywords'] = df_movies['keywords'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# extract the overview
df_movies['overview'] = df['overview'].fillna('')

# extract the release year 
df_movies['release_date'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

# extract the actors
df_movies['cast'] = df['cast'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['cast'] = df_movies['cast'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# extract genres
df_movies['genres'] = df['genres'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['genres'] = df_movies['genres'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# add the title
df_movies['title'] = df['title']

# merge fields into a tag field
df_movies['tags'] = df_movies['keywords'] + ' ' + df_movies['cast'] + ' ' + df_movies['genres'] + ' ' + df_movies['release_date']

# drop records with empty tags and duplicates
df_movies.drop(df_movies[df_movies['tags']==''].index, inplace=True)
df_movies.drop_duplicates(inplace=True)

# add a fresh index to the dataframe, which we will later use when referring to items in a vector matrix
df_movies['new_id'] = range(0, len(df_movies))

# Reduce the data to relevant columns
df_movies = df_movies[['new_id', 'title', 'tags']]

# display the data
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', False)
print(df_movies.shape)
df_movies.head(5)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		new_id	title							tags
id			
862		0		Toy Story						jealousy toy boy friendship friends rivalry boynextdoor newtoy toycomestolifeTomHanks TimAllen DonRickles JimVarney WallaceShawn JohnRatzenberger AnniePotts JohnMorris ErikvonDetten LaurieMetcalf R.LeeErmey SarahFreeman PennJillette Animation Comedy Family 1995
8844	1		Jumanji							boardgame disappearance basedonchildren'sbook newhome recluse giantinsectRobinWilliams JonathanHyde KirstenDunst BradleyPierce BonnieHunt BebeNeuwirth DavidAlanGrier PatriciaClarkson AdamHann-Byrd LauraBellBundy JamesHandy GillianBarber BrandonObray CyrusThiedeke GaryJosephThorup LeonardZola LloydBerry MalcolmStewart AnnabelKershaw DarrylHenriques RobynDriscoll PeterBryant SarahGilson FloricaVlad JuneLion BrendaLockmuller Adventure Fantasy Family 1995
15602	2		Grumpier Old Men				fishing bestfriend duringcreditsstinger oldmenWalterMatthau JackLemmon Ann-Margret SophiaLoren DarylHannah BurgessMeredith KevinPollak Romance Comedy 1995
31357	3		Waiting to Exhale				basedonnovel interracialrelationship singlemother divorce chickflickWhitneyHouston AngelaBassett LorettaDevine LelaRochon GregoryHines DennisHaysbert MichaelBeach MykeltiWilliamson LamontJohnson WesleySnipes Comedy Drama Romance 1995
11862	4		Father of the Bride Part II		baby midlifecrisis confidence aging daughter motherdaughterrelationship pregnancy contraception gynecologistSteveMartin DianeKeaton MartinShort KimberlyWilliams-Paisley GeorgeNewbern KieranCulkin BDWong PeterMichaelGoetz KateMcGregor-Stewart JaneAdams EugeneLevy LoriAlan Comedy 1995</pre></div>



<h4 class="wp-block-heading" id="h-2-2-visualizing-text-length">2.2 Visualizing Text Length</h4>



<p>We can use a bar chart to illustrate each movie&#8217;s word bag length. This gives us an idea of how detailed the movie descriptions are. Items with short descriptions have, in principle, a lower probability of being recommended later. Recommenders produce better results if the length of the descriptions is somewhat balanced.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># add the tag length to the movies df
df_movies['tag_len'] = df_movies['tags'].apply(lambda x: len(x))

# illustrate the tag text length
sns.displot(data=df_movies.dropna(), bins=list(range(0, 2000, 25)), height=5, x='tag_len', aspect=3, kde=True)
plt.title('Distribution of tag text length')
plt.xlim([0, 2500])</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="346" data-attachment-id="8866" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/tags-text-length-visualization-python-content-based-recommender-system-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png" data-orig-size="1083,366" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="tags-text-length-visualization-python-content-based-recommender-system-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png" src="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2-1024x346.png" alt="content-based recommender system - illustration of the distribution of word length in our bag-of-word model" class="wp-image-8866" srcset="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 768w, 
https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 1083w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="h-step-3-vectorization-using-tfidfvectorizer">Step #3: Vectorization using TfidfVectorizer</h3>



<p>The next step is to create a vector matrix from the Bag of Words model. Each column of the matrix represents a word feature. This step is the basis for determining the similarity of the movies afterward. Before the vectorization, we will remove stop words from the text (e.g., and, it, that, or, why, where, etc.). In addition, we limit the number of features in the matrix to 5000 to reduce training time.</p>



<p>A simple vectorization approach is to determine the word frequency for each movie using a count vectorizer. However, a frequently mentioned disadvantage of this approach is that it treats all words equally, regardless of how common they are across the whole corpus. For example, some words may appear in almost all items, while others may be prevalent in a few items but rare in general. We can argue that observing rare words in an item is more informative than observing common words. Instead of a count vectorizer, we will therefore use a more practical approach called TfidfVectorizer from the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html" target="_blank" rel="noreferrer noopener">scikit-learn package</a>. </p>



<p>Tfidf stands for term frequency-inverse document frequency. Compared to a count vectorizer, the tf-idf vectorizer considers the overall word frequencies and weights the general importance of the words when spanning the vectors. This way, tf-idf can determine which words are more important than others, reducing the model&#8217;s complexity and improving performance. This <a href="https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a" target="_blank" rel="noreferrer noopener">medium article</a> explains the math behind tf-idf vectorization in more detail.</p>
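To illustrate the weighting effect on a toy corpus of three hypothetical one-line descriptions: a word that occurs in every document receives a lower tf-idf weight than a word unique to a single document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# three hypothetical mini-descriptions; 'robot' is ubiquitous, 'pirate' is rare
corpus = [
    "robot adventure space",
    "robot comedy family",
    "pirate robot drama",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)
vocab = tfidf.vocabulary_  # maps each word to its column index

# in the third document, the rare word 'pirate' outweighs the common 'robot'
row = matrix.toarray()[2]
print(row[vocab['pirate']] > row[vocab['robot']])  # True
```

A plain count vectorizer would assign both words the same weight of 1 here; tf-idf down-weights 'robot' because it carries no discriminating information.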



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set a custom stop list from nltk
stop = list(stopwords.words('english'))

# create the tfid vectorizer, alternatively you can also use countVectorizer
tfidf =  TfidfVectorizer(max_features=5000, analyzer = 'word', stop_words=set(stop))
vectorized_data = tfidf.fit_transform(df_movies['tags'])
count_matrix = pd.DataFrame(vectorized_data.toarray(), index=df_movies['tags'].index.tolist())
print(count_matrix)

</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">			0     1     2     3     4     5     6     7     8     9     ...  4990  4991  4992  4993  4994  4995  4996  4997  4998  4999
862      	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
8844     	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
15602    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
31357    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
11862    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
...      	...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
439050   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
111109   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
67758    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
227506   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
461257   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

[45432 rows x 5000 columns]</pre></div>



<p>The vectorization process results in a feature matrix in which each feature is a word from the text bag of words. </p>



<p>We can display features with the get_feature_names_out function from the tfidf vectorizer.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># print feature names
print(tfidf.get_feature_names_out()[940:990])</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">['climbing' 'clinteastwood' 'clinthoward' 'clive' 'cliveowen'
 'cliverevill' 'cliverussell' 'clone' 'clorisleachman' 'cloviscornillac'
 'clown' 'clugulager' 'clydekusatsu' 'co' 'coach' 'cobb' 'cocaine' 'code'
 'coffin' 'cohen' 'coldwar' 'cole' 'colehauser' 'coleman' 'colinfarrell'
 'colinfirth' 'colinhanks' 'colinkenny' 'colinsalmon' 'colleencamp'
 'college' 'colmfeore' 'colmmeaney' 'coma' 'combat' 'comedian' 'comedy'
 'comicbook' 'comingofage' 'comingout' 'common' 'communism' 'communist'
 'company' 'competition' 'composer' 'computer' 'con' 'concentrationcamp'
 'concert']</pre></div>



<p>As you can see, the features are specific words and names from our bag of words.</p>



<h3 class="wp-block-heading" id="h-step-4-dimensionality-reduction-and-calculate-consine-similarities">Step #4: Dimensionality Reduction and Calculating Cosine Similarities</h3>



<p>In the previous section, we created a vector matrix that contains movies and features. This matrix is the foundation for calculating similarity scores for all movies. Before we assign feature scores, we will apply dimensionality reduction.</p>



<h4 class="wp-block-heading">4.1 Dimensionality Reduction using SVD</h4>



<p>The matrix spans a high-dimensional vector space with 5000 feature columns. Do we need all of these features? Most likely not: many words in the matrix occur only once or twice, while others appear in almost all movies. How can we deal with this issue?</p>



<p>Because the vector space is so high-dimensional, reducing it to fewer, more essential features saves time when training our recommender model. We will use TruncatedSVD from the scikit-learn package, a popular algorithm for dimensionality reduction. The algorithm approximates the matrix in a lower-dimensional space, thereby reducing noise and model complexity.</p>



<p>This way, we will reduce the vector space from 5000 to 3000 features. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># reduce dimensionality for improved performance
svd = TruncatedSVD(n_components=3000)
reduced_data = svd.fit_transform(count_matrix)</pre></div>



<h4 class="wp-block-heading">4.2 Calculate Text Similarity Scores for all Movies</h4>



<p>Now that we have reduced the complexity of our vector matrix, we can calculate the similarity scores for all movies. In this process, we assign a similarity score to all item pairs that measure content closeness according to the position of the items in the vector space. </p>



<p>We use the cosine function to calculate the similarity value of the movies. The cosine similarity is a mathematical calculation to determine the mathematical similarity of two vectors. In our case, the vectors are the movie descriptions. The cosine similarity function uses these feature vectors to compare each movie to every other and assigns them a similarity value. </p>



<ul class="wp-block-list">
<li>A similarity value of -1 means that the two feature vectors point in opposite directions; the movies are entirely different. </li>



<li>A value of 1 means that the two movies are identical. </li>



<li>A value of 0 lies in between and means the feature vectors are orthogonal; the movies share no relevant content features.</li>
</ul>
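The calculation behind these scores is the dot product of two vectors divided by the product of their lengths. A minimal worked example with hand-made feature vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# three hypothetical 3-feature vectors
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])    # same direction as a, different length
c = np.array([[-1.0, -2.0, 0.0]])  # opposite direction to a

print(cosine_similarity(a, b))  # [[1.]]  same direction -> similarity 1
print(cosine_similarity(a, c))  # [[-1.]] opposite direction -> similarity -1
```

Note that only the direction of the vectors matters: b is twice as long as a, yet their similarity is still 1.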



<p>The cosine similarity function will calculate pairwise similarities for all movies in our vector matrix. We can approximate the number of pairwise comparisons with the formula k²/2, where k is the number of items in the vector matrix. In our case, k is about 45,000 movies, so the cosine similarity function must calculate roughly 1 billion similarity scores. So don&#8217;t worry if the process takes some time to complete. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># compute the cosine similarity matrix
similarity = cosine_similarity(reduced_data)
similarity</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">array([[ 1.00000000e+00,  9.75542082e-02,  6.00755620e-02, ...,
        -3.03965235e-04,  0.00000000e+00,  5.81243547e-05],
       [ 9.75542082e-02,  1.00000000e+00,  5.92929339e-02, ...,
        -2.97565163e-03,  0.00000000e+00,  4.57945869e-05],
       [ 6.00755620e-02,  5.92929339e-02,  1.00000000e+00, ...,
         9.40459504e-03,  0.00000000e+00, -2.22415551e-04],
       ...,
       [-3.03965235e-04, -2.97565163e-03,  9.40459504e-03, ...,
         1.00000000e+00,  0.00000000e+00, -2.60823346e-04],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 5.81243547e-05,  4.57945869e-05, -2.22415551e-04, ...,
        -2.60823346e-04,  0.00000000e+00,  1.00000000e+00]])</pre></div>



<div style="height:38px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading" id="h-step-5-generate-content-based-movie-recommendations">Step #5: Generate Content-based Movie Recommendations </h3>



<p>Once you have created the similarity matrix, it&#8217;s time to generate some recommendations. We begin by generating recommendations based on a single movie. In the cosine similarity matrix, the most similar movies have the highest similarity scores. Once we have identified the films with the highest scores, we can visualize the results in a bar chart that shows the cosine similarity scores. </p>



<p>The example below displays the results of the movie &#8220;The Matrix.&#8221; Oh, how I love this movie 🙂</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create a function that takes in movie title as input and returns a list of the most similar movies
def get_recommendations(title, n, cosine_sim=similarity):
    
    # get the index of the movie that matches the title
    movie_index = df_movies[df_movies.title==title].new_id.values[0]
    print(movie_index, title)
    
    # get the pairwise similarity scores of all movies with that movie and sort the movies based on the similarity scores
    sim_scores_all = sorted(list(enumerate(cosine_sim[movie_index])), key=lambda x: x[1], reverse=True)
    
    # checks if recommendations are limited
    if n &gt; 0:
        sim_scores_all = sim_scores_all[1:n+1]
        
    # get the movie indices of the top similar movies
    movie_indices = [i[0] for i in sim_scores_all]
    scores = [i[1] for i in sim_scores_all]
    
    # return the top n most similar movies from the movies df
    top_titles_df = pd.DataFrame(df_movies.iloc[movie_indices]['title'])
    top_titles_df['sim_scores'] = scores
    top_titles_df['ranking'] = range(1, len(top_titles_df) + 1)
    
    return top_titles_df, sim_scores_all

# generate a list of recommendations for a specific movie title
movie_name = 'The Matrix'
number_of_recommendations = 15
top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
 
# visualize the results
def show_results(movie_name, top_titles_df):
    fig, ax = plt.subplots(figsize=(11, 5))
    sns.barplot(data=top_titles_df, y='title', x= 'sim_scores', color='blue')
    plt.xlim((0,1))
    plt.title(f'Top 15 recommendations for {movie_name}')
    pct_values = ['{:.2}'.format(elm) for elm in list(top_titles_df['sim_scores'])]
    ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)

show_results(movie_name, top_titles_df)</pre></div>



<p>Here are examples for the movies &#8220;Spectre&#8221; and &#8220;The Lion King&#8221;:</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="830" height="329" data-attachment-id="9748" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-2-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" data-orig-size="830,329" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" src="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" alt="" class="wp-image-9748" srcset="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 830w, https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 768w" sizes="(max-width: 830px) 100vw, 830px" /></figure>
</div>
</div>



<h3 class="wp-block-heading">Step #6: Generate Content-based Movie Recommendations</h3>



<p>But what if you want to generate recommendations for a user who has already seen several movies? For this, we can aggregate the similarity scores of all films the user has watched into a new dataframe that sums up the scores per movie. To return the top-recommended movies, we sort this dataframe by the aggregated similarity scores and take the top elements. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># list of movies a user has seen
movie_list = ['The Lion King', 'Seven', 'RoboCop 3', 'Blade Runner', 'Quantum of Solace', 'Casino Royale', 'Skyfall']

# create a copy of the movie dataframe and add a column in which we aggregated the scores
user_scores = pd.DataFrame(df_movies['title'])
user_scores['sim_scores'] = 0.0

# top number of scores to be considered for each movie
number_of_recommendations = 10000
for movie_name in movie_list:
    top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
    # aggregate the scores
    user_scores = pd.concat([user_scores, top_titles_df[['title', 'sim_scores']]]).groupby(['title'], as_index=False).sum(numeric_only=True)
# sort and print the aggregated scores
user_scores.sort_values(by='sim_scores', ascending=False)[1:20]</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8888" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-8-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png" data-orig-size="1375,583" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png" src="https://www.relataly.com/wp-content/uploads/2022/07/image-8-1024x434.png" alt="Content-based movie recommdations for a user that has previously watched The Lion King, Seven, RoboCop 3, Blade Runner, Quantum of Solace, Casino Royale, Skyfall" class="wp-image-8888" width="805" height="341" srcset="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 1375w" sizes="(max-width: 805px) 100vw, 805px" /></figure>



<div style="height:65px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this tutorial, you have learned to implement a simple content-based recommender system for movie recommendations in Python.  We have used several movie-specific details to calculate a similarity matrix for all movies in our dataset. Finally, we have used this model to generate recommendations for two cases:</p>



<ul class="wp-block-list">
<li>Films that are similar to a specific movie</li>



<li>Films that are recommended based on the watchlist of a particular user.</li>
</ul>



<p>A downside of content-based recommenders is that you cannot test their performance unless you know how users perceived the recommendations. This is because content-based recommenders can only determine which items in a dataset are similar. To understand how well the suggestions work, you must include additional data about actual user preferences. </p>



<p>More advanced recommenders will combine content-based recommendations with user-item interactions (<a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/" target="_blank" rel="noreferrer noopener">e.g., collaborative filtering</a>). Such models are called hybrid recommenders, but this is something for another article.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="472" data-attachment-id="12683" data-permalink="https://www.relataly.com/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png" data-orig-size="532,490" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="dog with popcorn machine learning movie recommender python tutorial relataly midjourney ai-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min-512x472.png" alt="" class="wp-image-12683" srcset="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 532w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Image created with <a 
href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<p></p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p>Below are some resources for further reading on recommender systems and content-based models.</p>



<p><strong>Books</strong></p>



<ol class="wp-block-list"><li><a href="https://amzn.to/3T3Sl2V" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2016) Recommender Systems: The Textbook</a></li><li><a href="https://amzn.to/3D0P8eX" target="_blank" rel="noreferrer noopener">Kin Falk (2019) Practical Recommender Systems</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p><strong>Articles</strong></p>



<ol class="wp-block-list"><li><a href="https://surprise.readthedocs.io/en/stable/getting_started.html" target="_blank" rel="noreferrer noopener">Getting started with Suprise</a></li><li><a href="https://onespire.hu/sap-news-en/history-of-recommender-systems/" target="_blank" rel="noreferrer noopener">About the history of Recommender Systems</a></li><li><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener"></a><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener">Singular value decomposition vs. matrix factorization</a></li><li><a href="https://proceedings.neurips.cc/paper/2007/file/d7322ed717dedf1eb4e6e52a37ea7bcd-Paper.pdf" target="_blank" rel="noreferrer noopener">Probabilistic Matrix Factorization</a></li></ol>
<p>The post <a href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/">Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/content-based-movie-recommender-using-python/4294/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4294</post-id>	</item>
		<item>
		<title>Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</title>
		<link>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/</link>
					<comments>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 02 May 2022 18:34:02 +0000</pubDate>
				<category><![CDATA[Affinity Propagation (Clustering)]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Coinmarketcap API]]></category>
		<category><![CDATA[Correlation]]></category>
		<category><![CDATA[Covariance]]></category>
		<category><![CDATA[Crypto Exchange APIs]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Dimensionality Reduction]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Stock Market Forecasting]]></category>
		<category><![CDATA[Time Series Forecasting]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Cryptocurrencies]]></category>
		<category><![CDATA[Financial Analysis]]></category>
		<category><![CDATA[Stock Market Cluster Analysis]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=8114</guid>

					<description><![CDATA[<p>Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos. By analyzing historical price data, affinity propagation groups coins into clusters based on their past ... <a title="Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python" class="read-more" href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" aria-label="Read more about Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/">Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos.</p>



<p>By analyzing historical price data, affinity propagation groups coins into clusters based on their past price fluctuations. Such a cluster analysis enables crypto investors to identify promising entry and exit points, ultimately helping them make smarter investment decisions.</p>



<p>To use this technique effectively, it&#8217;s important to understand essential concepts such as covariance, lasso regression, and affinity propagation. Once you understand these concepts, you can apply them to analyze price time series data and identify hidden patterns.</p>



<p>Finally, visualizing the results in two and three dimensions can help you better understand the relationships between coins and their respective clusters. The resulting crypto market map can be a powerful tool for investors to gain insight into the market&#8217;s structure and make informed investment decisions.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<div class="wp-block-kadence-infobox kt-info-box_317393-a1"><span class="kt-blocks-info-box-link-wrap info-box-link kt-blocks-info-box-media-align-top kt-info-halign-left"><div class="kt-infobox-textcontent"><h2 class="kt-blocks-info-box-title">Disclaimer</h2><p class="kt-blocks-info-box-text">This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.</p></div></span></div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">What is Stock Market Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Clustering stock markets refers to grouping stocks based on their similarities or common characteristics. This can be done using various clustering algorithms, which analyze the data and assign each stock market to a cluster based on its similarity to other stock markets in the same cluster. In this article, we will run a cluster analysis on historical time series data. This approach involves grouping stocks into clusters based on their historical performance over a certain period of time. </p>



<p>Clustering stock market data can be useful for a variety of purposes, such as identifying patterns or trends in the data, comparing the performance of different stocks or sectors, or generating investment recommendations. It can also reveal potential risks or correlations between markets. However, keep in mind that clustering is just one tool among many for analyzing stock market data, and a range of factors should inform any investment decision.</p>



<p>Also: <a href="https://www.relataly.com/cryptocurrency-price-charts-with-color-overlay-python/2820/" target="_blank" rel="noreferrer noopener">Color-Coded Cryptocurrency Price Charts in Python</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="509" height="506" data-attachment-id="12694" data-permalink="https://www.relataly.com/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" data-orig-size="509,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="neural network machine learning python affinity propagation midjourney relataly crypto-min" data-image-description="&lt;p&gt;neural network machine learning python affinity propagation midjourney relataly crypto-min&lt;/p&gt;
" data-image-caption="&lt;p&gt;neural network machine learning python affinity propagation midjourney relataly crypto-min&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" alt="neural network machine learning python midjourney relataly crypto market map" class="wp-image-12694" srcset="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 509w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 140w" sizes="(max-width: 509px) 100vw, 509px" /><figcaption class="wp-element-caption">We can use a crypto market map to illustrate the price correlation between cryptocurrencies.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What&#8217;s the Problem with Prototype-based Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Clustering is an unsupervised learning technique that groups similar objects into clusters and separates them from dissimilar ones. One of the most popular clustering techniques is <a href="https://www.relataly.com/category/machine-learning-algorithms/k-means/" target="_blank" rel="noreferrer noopener">k-means</a>. K-means belongs to the so-called prototype-based clustering techniques, which divide data points into a predefined number of groups (in the case of k-means, the groups are of equal variance). </p>



<p>The prototype-based clustering approach works great if the number of clusters in a dataset is known and the clusters have similar dispersion. However, when we deal with real-world problems, we often encounter more complex data for which the optimal number of clusters is unknown and difficult or even impossible to guess. In such cases, affinity propagation has a significant advantage because it can automatically estimate the number of clusters. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>
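<p>The contrast is easy to see in scikit-learn: k-means requires <code>n_clusters</code> up front, whereas <code>AffinityPropagation</code> infers the number of clusters from the data. A minimal sketch on synthetic blobs (not the crypto data used later in this tutorial):</p>

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs

# three synthetic blobs stand in for groups of co-moving assets
X, _ = make_blobs(n_samples=60, centers=3, random_state=42)

# k-means needs the cluster count up front ...
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# ... while affinity propagation estimates it from the data
ap = AffinityPropagation(random_state=42).fit(X)
n_found = len(ap.cluster_centers_indices_)
print(n_found)
```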



<h2 class="wp-block-heading">Affinity Propagation: What it is and How it Works </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>The idea of affinity propagation is to identify clusters by measuring the similarity of data points relative to one another. The algorithm chooses data points as cluster centers that best represent other data points near them. </p>



<p>We can imagine the process of identifying these representative data points as an election. Each data point (i) is a voter who casts votes and a candidate (k) who can receive votes from other voters. Votes are a measure of the similarity of data points. A voter who gives many votes to a candidate expresses that this data point is similar to it and therefore suitable to represent it as a cluster center. The voting process continues until the algorithm reaches a consensus and settles on a final set of cluster centers (the so-called exemplars).</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="383" data-attachment-id="8208" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png" data-orig-size="1584,592" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-5-1024x383.png" alt="Affinity Propagation: Data points cast votes for candidates and receive votes from other data points " class="wp-image-8208" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1536w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1584w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Affinity Propagation: Data points cast votes for candidates and receive votes from other data points </figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>The clustering process involves many separate steps (<a href="https://towardsdatascience.com/unsupervised-machine-learning-affinity-propagation-algorithm-explained-d1fef85f22c8" target="_blank" rel="noreferrer noopener">This article</a> provides a detailed description of the steps involved) and works with several matrices: </p>



<ul class="wp-block-list">
<li>The similarity matrix assesses the suitability of data points (candidates) to act as cluster centers.</li>



<li>The responsibility matrix collects the support each data point sends to the candidates (potential cluster centers), while the availability matrix accumulates the evidence for how appropriate it would be for each data point to choose a candidate as its representative.</li>



<li>The criterion matrix sums up the results and defines the clusters. Data points with equal scores in the criterion matrix are considered part of the same cluster.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="691" height="334" data-attachment-id="8274" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" data-orig-size="691,334" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-9" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" alt="Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster" class="wp-image-8274" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png 691w, https://www.relataly.com/wp-content/uploads/2022/05/image-9.png 300w" sizes="(max-width: 691px) 100vw, 691px" /><figcaption class="wp-element-caption">Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster</figcaption></figure>
</div>
</div>
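<p>To make the interplay of the three matrices concrete, the following sketch implements the message-passing updates in plain NumPy on a toy one-dimensional dataset. This is an illustrative reimplementation with hypothetical data, not the scikit-learn code used later in this tutorial; the damping factor and iteration count are chosen only for the example.</p>

```python
import numpy as np

# Toy data: two obvious groups of 1-D points
x = np.array([1.0, 1.2, 5.0, 5.1])
n = len(x)

# Similarity matrix: negative squared distances; the diagonal (the "preference")
# controls how willing each point is to become a cluster center
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, np.median(S[~np.eye(n, dtype=bool)]))

R = np.zeros((n, n))  # responsibility matrix
A = np.zeros((n, n))  # availability matrix
damping = 0.5

for _ in range(100):
    # Responsibility: evidence that candidate k should be the exemplar for point i
    AS = A + S
    idx = np.argmax(AS, axis=1)
    first_max = AS[np.arange(n), idx]
    AS[np.arange(n), idx] = -np.inf
    second_max = AS.max(axis=1)
    R_new = S - first_max[:, None]
    R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
    R = damping * R + (1 - damping) * R_new

    # Availability: accumulated support from the other points for candidate k
    Rp = np.maximum(R, 0)
    np.fill_diagonal(Rp, R.diagonal())
    col_sum = Rp.sum(axis=0)
    A_new = np.minimum(0, col_sum[None, :] - Rp)
    np.fill_diagonal(A_new, col_sum - Rp.diagonal())
    A = damping * A + (1 - damping) * A_new

# Criterion matrix: points whose maximum falls in the same column share a cluster
C = A + R
labels = np.argmax(C, axis=1)
print(labels)
```

In practice you would not implement these updates yourself; scikit-learn provides them via <em>sklearn.cluster.AffinityPropagation</em>, which we use below.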



<h2 class="wp-block-heading" id="h-time-series-clustering-using-affinity-propagation-visualizing-cryptocurrency-market-structures-in-python">Time Series Clustering using Affinity Propagation &#8211; Visualizing Cryptocurrency Market Structures in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Ready to implement affinity propagation in Python to analyze the crypto market structure and create a visual representation of price similarity? Let&#8217;s dive in!</p>



<p>First, we define a portfolio of cryptocurrencies and download their historical price quotes from coinmarketcap. We then visualize the time series on separate line charts to ensure that the data has been loaded successfully. After preparing and cleaning the data, we can move on to clustering the cryptocurrencies into groups with similar price movements using Affinity Propagation.</p>



<p>Unlike with many other clustering algorithms, we don&#8217;t set the number of clusters in advance. Instead, we let affinity propagation determine the optimal number of clusters for our portfolio. Finally, we calculate the covariance matrix between the cryptocurrencies and arrange them on a 2D map into clusters. We create a network overlay based on covariance to better understand the relationships between different clusters.</p>



<p>With affinity propagation, we can identify hidden patterns in the crypto market and group coins into clusters based on their past price fluctuations. This process allows us to identify promising entry and exit points, ultimately helping us make smarter investment decisions. Plus, the 2D map and network overlay help us visualize the relationships between different clusters and coins.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="456" height="509" data-attachment-id="10401" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-33-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" data-orig-size="456,509" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-33" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" alt="exemplary price correlation map created with the help of the affinity propagation clustering algorithm, python scikit-learn" class="wp-image-10401" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png 456w, https://www.relataly.com/wp-content/uploads/2022/12/image-33.png 269w" sizes="(max-width: 456px) 100vw, 456px" /><figcaption class="wp-element-caption">We can use affinity propagation to cluster financial assets and visualize them on a map.</figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The Python code for this tutorial is available in the relataly repository on GitHub.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_03b447-31"><a class="kb-button kt-button button kb-btn_03610b-ed kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/00%20Data%20Visualization/042%20Vizualizing%20Stock%20Market%20Structures%20using%20Cluster%20Analysis%20in%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_81f956-88 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. Consider <a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda </a>if you don&#8217;t have a Python environment set up yet. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></li>



<li><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></li>



<li><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></li>



<li><a href="https://seaborn.pydata.org/api.html" target="_blank" rel="noreferrer noopener">Seaborn</a></li>
</ul>



<p>Please also make sure you have the <a href="https://pypi.org/project/cryptocmd/" target="_blank" rel="noreferrer noopener">cryptocmd</a> package installed. We will be using it to download past crypto prices from Coinmarketcap.</p>



<p>You can install these packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-stock-market-data">Step #1: Load the Crypto Market Data</h3>



<p>We start by loading historical crypto price data from Coinmarketcap. To download the data, we use CmcScraper from the cryptocmd package, a Python library that allows us to collect Coinmarketcap data without signing up for the official API.</p>



<p>The download returns a dataframe with daily price quotes (Close, Open, Avg) for cryptocurrencies between 2016 and today. You can use the dictionary (&#8220;symbol_dict&#8221;) to control which cryptos you want to include in the data. We limit the data we use in our cluster analysis to the last 14 days, so that the correlation reflects recent rather than earlier price developments. But it&#8217;s up to you to specify a different period. In addition, instead of using absolute price values, we will use daily percentage fluctuations.</p>
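<p>As a small illustration with hypothetical prices, the daily fluctuation columns (the &#8220;_p&#8221; columns produced by the download code below) are computed from the open and close quotes as follows:</p>

```python
import pandas as pd

# Hypothetical open/close quotes for a single coin over three days
df = pd.DataFrame({"Open": [100.0, 110.0, 99.0],
                   "Close": [110.0, 99.0, 102.3]})

# Daily fluctuation relative to the opening price, as used for the *_p columns
df["p"] = (df["Open"] - df["Close"]) / df["Open"]
print(df["p"].round(4).tolist())  # [-0.1, 0.1, -0.0333]
```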



<p>Also: <a href="https://www.relataly.com/streaming-crypto-prices-via-the-gate-io-api-with-python/3982/" target="_blank" rel="noreferrer noopener">Requesting Crypto Price Data from the Gate.io REST API in Python</a></p>



<p>Loading the data can take several minutes, depending on how many cryptocurrencies we include in the request. So it makes sense not to load the data every time you run the code. Therefore, the code below stores the historical prices in a CSV file. </p>



<p>When you run the code below, the script checks whether the data already exists. If it does, it uses the data from the CSV file. Otherwise, it loads a fresh copy of the data from Coinmarketcap.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com
# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5

from cryptocmd import CmcScraper
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn import cluster, covariance, manifold
import requests
import json


# Get a dictionary of the top 50 coin symbols and names from the Coinmarketcap API
def get_symbol_dict():
    url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&amp;limit=50&amp;sortBy=market_cap&amp;sortType=desc&amp;convert=USD&amp;cryptoType=all&amp;tagType=all&amp;audited=false'
    response = requests.get(url)
    data = json.loads(response.text)
    df = pd.DataFrame(data['data']['cryptoCurrencyList'])

    # exclude stable coins
    df = df[~df['symbol'].isin(['USDT', 'USDC', 'BUSD', 'DAI', 'TUSD', 'PAX', 'GUSD', 'HUSD', 'USDK', 'USDS', 'USDP', 'USDN', 'USDSB', 'USDX', 'USD++', 'BIDR', 'IDRT', 'VAI', 'BGBP'])]
    df = df[['symbol', 'name']]
    df = df.set_index('symbol')
    df = df.to_dict()
    df = df['name']
    return df

symbol_dict = get_symbol_dict()


# Download historic crypto prices via CmcScraper
def load_fresh_data_and_save_to_disc(symbol_dict, save_path):
    # Extract symbols and names from the symbol_dict
    symbols, names = np.array(sorted(symbol_dict.items())).T
    
    # Initialize an empty DataFrame for storing the prices
    df_crypto = pd.DataFrame()

    # Download and process the price data for each symbol
    for symbol in symbols:
        print(f&quot;Fetching prices for {symbol}...&quot;)
        
        # Download the price data using CmcScraper
        scraper = CmcScraper(symbol)
        df_coin_prices = scraper.get_dataframe()

        # Process the price data and add it to df_crypto
        df = pd.DataFrame({
            f&quot;{symbol}_Open&quot;: df_coin_prices[&quot;Open&quot;],
            f&quot;{symbol}_Close&quot;: df_coin_prices[&quot;Close&quot;],
            f&quot;{symbol}_Avg&quot;: (df_coin_prices[&quot;Close&quot;] + df_coin_prices[&quot;Open&quot;]) / 2,
            f&quot;{symbol}_p&quot;: (df_coin_prices[&quot;Open&quot;] - df_coin_prices[&quot;Close&quot;]) / df_coin_prices[&quot;Open&quot;]
        })
        df_crypto = pd.concat([df_crypto, df], axis=1)

    # Save the price data to a CSV file
    X_df_filtered = df_crypto.filter(like=&quot;_p&quot;)
    X_df_filtered.to_csv(save_path + &quot;historical_crypto_prices.csv&quot;)

    return names, symbols, X_df_filtered
        

# Set fetch_new_data to True to download a fresh copy of the data.
# If False, the script first tries to load the CSV file from disk.
fetch_new_data = True 
save_path = '' # path where the price data will be stored in a csv file

# Fetch fresh data via the scraping package, or use data from the csv file on disk
if fetch_new_data == False:
    try:
        print('loading from disk')
        X_df_filtered = pd.read_csv(save_path + 'historical_crypto_prices.csv')
        if 'Unnamed: 0' in X_df_filtered.columns: 
            X_df_filtered = X_df_filtered.drop(['Unnamed: 0'], axis=1)
        symbols, names = np.array(sorted(symbol_dict.items())).T
        print(list(X_df_filtered.columns))
    except:
        print('no existing price data found - loading fresh data from coinmarketcap and saving them to disk')
        names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
        print(list(symbols))
else:
    print('loading fresh data from coinmarketcap and saving them to disk')
    names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
    print(list(symbols))

# Limit the price data to the last t days
t = 14 # in days
X_df_filtered = X_df_filtered[:t]
X_df_filtered.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	ACM_p		ADA_p		ARK_p		ATM_p		ATOM_p		AVAX_p		BAT_p		BCH_p		BLZ_p		BNB_p		...	THETA_p		UNI_p		USDT_p		VET_p		WAVES_p		XLM_p		XMR_p		XRP_p		ZIL_p		ZRX_p
0	0.031987	-0.037645	-0.005702	0.030928	-0.005897	-0.012404	-0.012262	-0.022529	0.008072	-0.007111	...	-0.021994	-0.023758	-0.000103	-0.021024	-0.015416	-0.004096	-0.022988	-0.027397	-0.016659	-0.012255
1	0.028192	0.065034	0.122306	0.010310	0.093558	0.106811	0.082863	0.075567	0.062105	0.054733	...	0.067264	0.081040	0.000136	0.077203	0.092987	0.078562	0.111519	0.071696	0.076484	0.085094
2	0.040771	0.016097	-0.133345	0.018963	0.011304	-0.033328	-0.007616	0.011458	-0.019993	0.005134	...	-0.005104	-0.024190	0.000077	0.002218	0.008920	0.004139	-0.031822	-0.012107	-0.003906	-0.021170
3	-0.027698	0.005129	-0.031516	-0.002639	0.022235	-0.008117	0.003969	0.019119	0.015403	0.005920	...	0.007992	0.027203	0.000003	0.000701	0.010739	0.005324	-0.007914	0.007168	0.004556	-0.003786
4	-0.021129	-0.019053	0.003273	-0.008121	0.002883	-0.004927	0.002548	-0.000599	0.028492	-0.012181	...	0.000198	-0.025817	-0.000047	-0.002800	-0.051515	-0.004861	0.015134	-0.000596	-0.010343	0.004530</pre></div>



<p>The data looks good, so let&#8217;s continue.</p>



<h3 class="wp-block-heading">Step #2 Plotting Crypto Price Charts</h3>



<p>Now that the data is available, we can visualize it in various line graphs. The visualization helps us better understand what kind of data we are dealing with and check if the download was successful.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create Prices Charts for all Cryptocurrencies
list_length = X_df_filtered.shape[1]
ncols = 10
nrows = -(-list_length // ncols) # ceiling division so that every chart gets an axis
height = list_length/3 if list_length &gt; 30 else 4
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True, figsize=(20, height))
for i, ax in enumerate(fig.axes):
        if i &lt; list_length:
            sns.lineplot(data=X_df_filtered, x=X_df_filtered.index, y=X_df_filtered.iloc[:, i], ax=ax)
            ax.set_title(X_df_filtered.columns[i])
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8232" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/price-charts-stock-price-prediction/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png" data-orig-size="1176,790" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="price-charts-stock-price-prediction" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png" src="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction-1024x688.png" alt="Clustering Crypto Market Structures with Affinity Propagation: Daily Price Quotes for different Cryptocurrencies" class="wp-image-8232" width="937" height="629" srcset="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 1176w" sizes="(max-width: 937px) 100vw, 937px" /></figure>



<p>We can see the line plots for all cryptocurrencies, and everything looks as expected.</p>



<h3 class="wp-block-heading" id="h-step-3-clustering-cryptocurrencies-using-affinity-propagation">Step #3 Clustering Cryptocurrencies using Affinity Propagation</h3>



<p>Next, we must prepare the data and run the affinity propagation algorithm. For some cryptocurrencies, we may encounter data that contains NaN values. Because clustering is sensitive to missing values, we must ensure good data quality. In addition, the Python code below converts the DataFrame into a NumPy array and standardizes each time series by its standard deviation.</p>



<p>Running the code below returns a dictionary of clusters with the cryptocurrencies assigned to them by the affinity propagation algorithm.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Drop NaN values
X_df = pd.DataFrame(np.array(X_df_filtered)).dropna()
# Standardize each time series by its standard deviation
X = X_df.copy()
X /= X.std(axis=0)
X = np.array(X)
# Define an edge model that estimates a sparse covariance structure (graphical lasso)
edge_model = covariance.GraphicalLassoCV()
# Fit the edge model to the standardized time series
edge_model.fit(X)
# Group cryptos to clusters using affinity propagation
# The number of clusters will be determined by the algorithm
cluster_centers_indices , labels = cluster.affinity_propagation(edge_model.covariance_, random_state=1)
cluster_dict = {}
n_labels = labels.max()
print(f&quot;{n_labels + 1} Clusters&quot;)
for i in range(n_labels + 1):
    clusters = ', '.join(names[labels == i])
    print('Cluster %i: %s' % ((i + 1), clusters))
    cluster_dict[i] = (clusters)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">10 Clusters
Cluster 1: Binance Coin, Cake Defi
Cluster 2: Bitcoin Cash, Bitcoin, BitTorrent, Decred, EOS, Ethereum Classic, Ethereum, Ampleforth, Komodo, Solana, Sys Coin, DOT
Cluster 3: Celsius
Cluster 4: Doge Coin
Cluster 5: Cardano, ATOM, Avalance, Enjin, Internet Computer, Link, Loopring, Polygon, IOTA, NEO, Synthetix, Theta, Vechain
Cluster 6: Litecoin
Cluster 7: ACM Token, Atletico Madrid Token, Chilliz, Juventus Turin Token, PSG Token
Cluster 8: LRC
Cluster 9: Tether
Cluster 10: ARK, Battoken, BLZ, Digibyte, AS Rom Token, WAVES, Stellar Lumen, Monero, Ripple, Zilliqa, Zer0</pre></div>



<p>We can see that the algorithm has identified 10 different clusters in the data, including a couple of clusters with only a single member. You will most likely encounter different results depending on when you run the code.</p>



<h3 class="wp-block-heading">Step #4 Create a 2D Positioning Model based on the Graph Structure</h3>



<p>In addition to clusters, we want to show the covariance between cryptocurrencies in our Crypto Market map. We need a graph-like structure that contains the covariance and position data of the cryptocurrencies for each crypto pair.</p>



<p>In addition, we use a node position model that calculates their relative position on a 2D plane from the covariance of the cryptocurrencies. However, the positions are only relative, so the absolute axes have no meaning.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a node_position_model that finds the best position of the cryptos on a 2D plane
# The number of components defines the dimensions in which the nodes will be positioned
node_position_model = manifold.LocallyLinearEmbedding(n_components=2, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result is x and y coordinates for all cryptocurrencies
pd.DataFrame(embedding)
# Create an edge_model that represents the partial correlations between the nodes
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
# Only consider partial correlations above a specific threshold (0.02)
non_zero = (np.abs(np.triu(partial_correlations, k=1)) &gt; 0.02)
# Convert the Positioning Model into a DataFrame
data = pd.DataFrame.from_dict({&quot;embedding_x&quot;:embedding[0],&quot;embedding_y&quot;:embedding[1]})
# Add the labels to the 2D positioning model
data[&quot;labels&quot;] = labels
print(data.shape)
data.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(48, 3)
	embedding_x	embedding_y	labels
0	0.400590	-0.136473	6
1	-0.081908	-0.086039	4
2	-0.033982	-0.038526	9
3	0.416745	0.076849	6
4	-0.041938	0.031966	4</pre></div>



<p>The next step is to create a graph of the partial correlations. </p>



<h3 class="wp-block-heading">Step #5 Visualize the Crypto Market Structure</h3>



<p>Our goal is to visualize differences in the covariance between crypto pairs by varying the connection strengths. We calculate the line strength by min-max normalizing the partial correlations of the crypto pairs. In addition, we visualize the distribution of these partial correlations.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create an array with the segments for connecting the data points
start_idx, end_idx = np.where(non_zero) 
segments = [[np.array([embedding[:, start], embedding[:, stop]]).T, start, stop] for start, stop in zip(start_idx, end_idx)]
# Create a normalized representation of partial correlation between crypto currencies
# We can later use covariance to vizualize the strength of the connections
pc = np.abs(partial_correlations[non_zero])
normalized = (pc-min(pc))/(max(pc)-min(pc))
# plot the distribution of covariance between the cryptocurrencies
sns.histplot(pc)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8251" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/covariance-histogram/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" data-orig-size="382,248" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="covariance-histogram" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" src="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" alt="cryptocurrency market structure visualization with affinity propagation, histogram of covariance between historical crypto prices" class="wp-image-8251" width="540" height="351" srcset="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png 382w, https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png 300w" sizes="(max-width: 540px) 100vw, 540px" /></figure>



<p>The histogram shows the distribution of the partial correlations between the crypto pairs; most values are small.</p>



<p>Finally, it is time to map the cryptocurrencies on a 2D plane. To do this, we first place the cryptocurrencies with a scatterplot using their relative position data. We set the color of the points based on their clusters so that points in the same cluster share a color. Subsequently, we connect the points using the data from the edge model. The partial correlation between the crypto pairs determines the strength of their connections.</p>



<p>We also define the color of the connections as follows. </p>



<ul class="wp-block-list">
<li>The map only shows connections with a partial correlation greater than 0.02.</li>



<li>Connections with a normalized partial correlation greater than 0.5 are colored red.</li>



<li>Otherwise, connections between points within a cluster are shown in the cluster&#8217;s color. </li>



<li>We color connections in grey that are between points of different clusters.</li>
</ul>



<p>Last but not least, we add the labels of the cryptocurrencies.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualization
plt.figure(1, facecolor='w', figsize=(20, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])

# Plot the nodes using the coordinates of our embedding
sc = sns.scatterplot(
    data=data,
    x=&quot;embedding_x&quot;,
    y=&quot;embedding_y&quot;,
    zorder=1,
    s=350 * d ** 2,
    c=labels,
    cmap=plt.cm.nipy_spectral,
    alpha=.9,
    #palette=&quot;muted&quot;,
)

# Plot the covariance edges between the nodes (scatter points)
line_strength = 3.2
    
for index, ((x, y), start, stop) in enumerate(segments):     
    norm_partial_correlation = normalized[index]
    start_label = list(data.iloc[[start]]['labels'])[0]
    stop_label = list(data.iloc[[stop]]['labels'])[0]
    if norm_partial_correlation &gt; 0.5:
        # strong connections are colored red, regardless of cluster
        color = 'red'; linestyle = 'solid'
    elif start_label == stop_label:
        # connections within a cluster take the cluster's color
        color = plt.cm.nipy_spectral(start_label / float(n_labels)); linestyle = 'solid'
    else:
        # connections between different clusters are grey and dashed
        color = 'grey'; linestyle = 'dashed'
    # Plot the edges
    # if x and y larger than 0
    if x[0] &gt; 0 and y[0] &gt; 0:
        plt.plot(x, y, alpha=.4, zorder=0, linewidth=normalized[index]*line_strength, color=color, linestyle=linestyle)

    
# Labels the nodes and position the labels to avoid overlap with other labels
for name, label, (x, y) in zip(names, labels, embedding.T):
    color = plt.cm.nipy_spectral(label / float(n_labels))
    ax.annotate(
        name,
        xy=(x, y),
        xytext=(5, 2),
        textcoords='offset points',
        ha='right',
        va='bottom',
        fontsize=10,
        color='black',
        bbox=dict(facecolor='w', edgecolor=&quot;w&quot;, alpha=.0),
     )</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8229" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/cryptocurrency-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png" data-orig-size="1473,590" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cryptocurrency-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png" src="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map-1024x410.png" alt="Visualizing the Crypto Market Structure - Clusters of Cryptocurrencies determined by Affinity Propagation, Connections between cryptocurrencies defined by partial covariance in daily price fluctuations" class="wp-image-8229" width="1177" height="471" srcset="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 1473w" sizes="(max-width: 1177px) 100vw, 1177px" /></figure>



<p>Note that you will likely see a different map when you run the code on your machine. Differences result from changes in market prices and covariances, which lead to different graph structures.</p>



<p>Let&#8217;s see what the crypto market map tells us.</p>



<h4 class="wp-block-heading">Interpreting the Crypto Market Map</h4>



<p>The 2D crypto market map tells us several things:</p>



<ul class="wp-block-list">
<li>Most cryptos fall into the light green and dark green clusters corresponding to different types of crypto (Decentralized Finance Coins, NFT/Metaverse Coins).</li>



<li>There is significant covariance between large-cap players in the crypto space, such as Cardano and Loopring, or Ethereum and Bitcoin, which is plausible considering recent price movements. Some results are surprising, for example, the partial correlation between NEO and Ethereum Classic.</li>



<li>Some clusters are isolated and contain only a single member, for example, Tether, Komodo, the AC Milan token, the Wave token, and Dogecoin. The reason is that the prices of these coins/tokens have developed independently of the market.<ul class="wp-block-list">
<li>Tether is a stablecoin whose price is pegged to the US dollar and barely changes. It therefore differs strongly from the other cryptocurrencies on our map.</li>

<li>Komodo has been trading sideways without following the general market trend.</li>

<li>And the AC Milan token is a soccer token that has recently outperformed the market.</li>
</ul>
</li>



<li>Soccer tokens are colored in dark blue. These tokens&#8217; prices correlate with how the soccer clubs performed during the current season. It, therefore, makes perfect sense that these tokens are grouped into a cluster. An exception is the AC Milan token, which recently performed better than the other soccer tokens.</li>
</ul>



<h3 class="wp-block-heading">Step #6 Creating a 3D Representation</h3>



<p>Instead of a 2D representation of the data points, we can also use a 3D node-positioning model. For this purpose, the model distributes the nodes over three dimensions based on their affinities.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Find the best position of the cryptos on a 3D plane
node_position_model = manifold.LocallyLinearEmbedding(n_components=3, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result is x, y, and z coordinates for all cryptocurrencies
pd.DataFrame(embedding)
# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) &gt; 0.02)
data = pd.DataFrame.from_dict({&quot;embedding_x&quot;:embedding[0],&quot;embedding_y&quot;:embedding[1],&quot;embedding_z&quot;:embedding[2]})
data[&quot;labels&quot;] = labels
data[&quot;names&quot;] = names
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(projection='3d')
xs = data[&quot;embedding_x&quot;]
ys = data[&quot;embedding_y&quot;]
zs = data[&quot;embedding_z&quot;]
sc = ax.scatter(xs, ys, zs, c=labels, s=100)
    
for i in range(len(data)):
    x = xs[i]
    y = ys[i]
    z = zs[i]
    label = data[&quot;names&quot;][i]
    ax.text(x, y, z, label)
    
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 1), loc=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8243" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/3d-cluster-representation-affinity-propagation/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png" data-orig-size="1210,1101" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="3d-cluster-representation-affinity-propagation" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png" src="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation-1024x932.png" alt="3d representation of the cryptocurrency market structure, affinity propagation relataly" class="wp-image-8243" width="891" height="811" srcset="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 1210w" sizes="(max-width: 891px) 100vw, 891px" /></figure>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Affinity propagation is a powerful technique for clustering items when the optimal number of clusters is unknown. In this article, we&#8217;ve demonstrated how to apply affinity propagation to analyze the cryptocurrency market and identify groups of assets based on similar price fluctuations.</p>



<p>In our example, we identified 13 groups of cryptocurrencies without specifying the number of clusters in advance. We also visualized the market structure on a 2D and 3D map using a node distribution technique. This approach can be extended to analyze and cluster stock markets, highlighting complex price patterns among multiple financial assets.</p>



<p>Once you&#8217;ve identified clusters, you can dive deeper into individual groups. Sometimes, outliers that temporarily break out of their usual pattern indicate interesting investment opportunities. These outliers can eventually return to the price pattern of their group, or they may represent forerunners of their group, indicating broader market movements.</p>



<p>By using affinity propagation, we can visualize financial assets in a new and exciting way. If you have any questions or comments about this approach, please let me know.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<p>This article adapts code from the <a href="https://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#sphx-glr-auto-examples-applications-plot-stock-market-py" target="_blank" rel="noreferrer noopener">Scikit-learn stock market example</a> from stocks to cryptocurrencies.</p>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3yIQdWi" target="_blank" rel="noreferrer noopener">Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python</a></li>



<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>



<li>Images are created using Midjourney, an AI that creates images from text.</li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/">Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">8114</post-id>	</item>
		<item>
		<title>Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</title>
		<link>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/</link>
					<comments>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Thu, 07 Apr 2022 17:55:36 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Cross-Validation]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Hyperparameter Tuning]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Random Forest Regression]]></category>
		<category><![CDATA[Random Search]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Tuning Random Decision Forests]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=6875</guid>

					<description><![CDATA[<p>Perfecting your machine learning model&#8217;s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble ... <a title="Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python" class="read-more" href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" aria-label="Read more about Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Perfecting your machine learning model&#8217;s hyperparameters can often feel like hunting for a proverbial needle in a haystack. But with the Random Search algorithm, this intricate process of hyperparameter tuning can be efficiently automated, saving you valuable time and effort. Hyperparameters are properties intrinsic to your model, like the number of estimators in an ensemble model, and heavily influence its performance. Unlike model parameters, which are discovered during training by the machine learning algorithm, hyperparameters require pre-specification.</p>



<p>In this comprehensive Python tutorial, we&#8217;ll guide you on how to harness the power of Random Search to optimize a regression model&#8217;s hyperparameters. Our illustrative example uses a Random Decision Forest regressor to predict house prices. However, the fundamental principles you&#8217;ll learn can be seamlessly applied to any model. So why painstakingly fine-tune hyperparameters manually when Random Search can handle the task efficiently?</p>



<p>Here&#8217;s a preview of what this Python tutorial entails:</p>



<ol class="wp-block-list">
<li>A brief overview of how Random Search operates and instances where it might be preferable to Grid Search.</li>



<li>A hands-on Python tutorial featuring a public house price dataset from Kaggle.com. The aim here is to train a regression model capable of predicting US house prices based on various properties.</li>



<li>Training a &#8216;best-guess&#8217; model in Python, followed by using Random Search to discover a model with enhanced performance.</li>



<li>Finally, we&#8217;ll implement cross-validation to validate our models&#8217; performance.</li>
</ol>



<p>By the end of this tutorial, you&#8217;ll be well-equipped to let Random Search efficiently fine-tune your model&#8217;s hyperparameters, freeing up your time for other crucial tasks.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-hyperparameter-tuning">Hyperparameter Tuning </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Hyperparameters are configuration options that allow us to customize machine learning models and improve their performance. While normal parameters are the internal coefficients that the model learns during training, we need to specify hyperparameters before the training. It is usually impossible to find the best configuration without testing different configurations. </p>



<p>Searching for a suitable model configuration is called &#8220;hyperparameter tuning&#8221; or &#8220;hyperparameter optimization.&#8221; Different machine learning algorithms expose different hyperparameters and value ranges. For example, a random decision forest classifier lets us configure the number of trees, the maximum tree depth, and the minimum number of samples required for a new branch. </p>



<p>The hyperparameters and the range of possible parameter values span a search space in which we seek to identify the best configuration. The larger the search space, the more difficult it gets to find an optimal model. We can use random search to automatize this process.</p>
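<p>To make this concrete, here is a minimal sketch (not part of the original tutorial) of such a search space as a plain Python dictionary. The parameter names follow scikit-learn conventions for random decision forests and are illustrative only:</p>

```python
# Hypothetical search space for a random decision forest.
# The parameter names mirror scikit-learn's RandomForest arguments.
param_grid = {
    "n_estimators": [50, 100, 200, 400],   # number of trees
    "max_depth": [None, 10, 20, 40],       # maximum tree depth
    "min_samples_split": [2, 5, 10],       # min samples to open a new branch
}

# The search space is the Cartesian product of all value lists,
# so it grows multiplicatively with every added hyperparameter.
n_configurations = 1
for values in param_grid.values():
    n_configurations *= len(values)

print(n_configurations)  # 4 * 4 * 3 = 48 candidate configurations
```

<p>Even this small grid already spans 48 configurations; grid search would train all 48 models, while random search samples only a subset of them.</p>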
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="506" height="501" data-attachment-id="12416" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/random-hyperparameter-tuning-machine-learning/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" data-orig-size="506,501" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="random-hyperparameter-tuning-machine-learning" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" src="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png" alt="Random search can be an efficient way to tune the hyperparameters of a machine learning model. Image generated with Midjourney. Image of exploding dices with different colors. " class="wp-image-12416" srcset="https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 506w, https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/random-hyperparameter-tuning-machine-learning.png 140w" sizes="(max-width: 506px) 100vw, 506px" /><figcaption class="wp-element-caption">Random search can be an efficient way to tune the hyperparameters of a machine learning model. 
Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Techniques for Tuning Hyperparameters</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Hyperparameter tuning is the process of adjusting the hyperparameters of a machine learning algorithm to optimize its performance on a specific dataset or task. Several techniques can be used for hyperparameter tuning, including:</p>



<ol class="wp-block-list">
<li><strong><a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">Grid Search</a>:</strong> grid search is a brute-force search algorithm that systematically evaluates a given set of hyperparameter values by training and evaluating a model for each combination of values. It is a simple and effective technique, but it can be computationally expensive, especially for large or complex datasets.</li>



<li><strong>Random Search: </strong>As mentioned, random search is an alternative to grid search that randomly samples a given set of hyperparameter values rather than evaluating all possible combinations. It can be more efficient than grid search, but it may not find the optimal set of hyperparameters.</li>



<li><strong>Bayesian Optimization:</strong> Bayesian optimization is a probabilistic approach to hyperparameter tuning that uses Bayesian inference to model which hyperparameter values are likely to produce good performance. It can be more efficient and effective than grid search or random search, but it is more challenging to implement and interpret.</li>



<li><strong>Genetic Algorithms:</strong> genetic algorithms are optimization algorithms inspired by the principles of natural selection and genetics. They use a population of candidate solutions, which are iteratively evolved and selected based on their fitness or performance, to find the optimal set of hyperparameters.</li>
</ol>



<p>In this article, we specifically look at the Random Search technique.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="892" height="512" data-attachment-id="12417" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/image-of-an-old-mechanic-tuning-a-car-hyperparameter-tuning-machine-learning-python-tutorial-scikit-learn/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" data-orig-size="892,512" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" src="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png" alt="You can spend much time tuning a machine learning model. Image generated with Midjourney. portrait of an old mechanic working on a car. 
relataly.com" class="wp-image-12417" srcset="https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 892w, https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/Image-of-an-old-mechanic-tuning-a-car.-Hyperparameter-Tuning-Machine-Learning-Python-Tutorial-Scikit-Learn.png 768w" sizes="(max-width: 892px) 100vw, 892px" /><figcaption class="wp-element-caption">You can spend much time tuning a machine learning model. Image generated with Midjourney.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Random Search?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The random search algorithm generates models from hyperparameter permutations randomly selected from a grid of parameter values. The idea behind the randomized approach is that testing random configurations efficiently identifies a good model. We can use random search both for regression and classification models.</p>



<p>Random Search and Grid Search are the most popular techniques for hyperparameter tuning, and the two methods are often compared. Unlike random search, grid search covers the search space exhaustively by trying all possible combinations. The technique works well for testing a small number of configurations already known to work well. </p>



<p>As long as both the search space and the training time are small, grid search is excellent for finding the best model. However, the number of model variants grows exponentially with the number of hyperparameters. For large search spaces or complex models, random search is often the more efficient choice.</p>



<p>Since random search does not exhaustively cover the search space, it does not necessarily yield the best model. However, it is also much faster than grid search and efficient in delivering a suitable model in a short time.</p>
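<p>To illustrate the principle, here is a self-contained sketch of random search in plain Python. The scoring function is a made-up stand-in for real cross-validated model performance, and the parameter names are merely illustrative:</p>

```python
import random

# Hypothetical scoring function; a real workflow would train a model
# for each configuration and return a cross-validated score.
def evaluate(config):
    depth = config["max_depth"] or 50
    return 1.0 / (1.0 + abs(config["n_estimators"] - 200)) + 1.0 / depth

param_grid = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 10, 20, 40],
}

random.seed(42)  # for reproducibility
best_config, best_score = None, float("-inf")
for _ in range(10):  # sample 10 random configurations instead of all 16
    config = {name: random.choice(values) for name, values in param_grid.items()}
    score = evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_config, best_score)
```

<p>Scikit-learn implements this idea for real models in RandomizedSearchCV, which additionally cross-validates each sampled configuration.</p>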
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-full is-resized"><img decoding="async" data-attachment-id="6924" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/image-16-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" data-orig-size="640,469" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-16" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png" alt="random decision forest python,
hyperparameter tuning,
comparison between random search and grid search" class="wp-image-6924" width="409" height="301" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-16.png 640w, https://www.relataly.com/wp-content/uploads/2022/04/image-16.png 300w" sizes="(max-width: 409px) 100vw, 409px" /><figcaption class="wp-element-caption">Random Search vs. Exhaustive Grid Search</figcaption></figure>
</div></div>
</div>



<div style="height:48px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-tuning-the-hyperparameters-of-a-random-decision-forest-regressor-in-python-using-random-search">Tuning the Hyperparameters of a Random Decision Forest Regressor in Python using Random Search</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In this tutorial, we delve into the use of the Random Search algorithm in Python, specifically for predicting house prices. We&#8217;ll be using a dataset rich in diverse house characteristics. Various elements, such as data quality and quantity, model intricacy, the selection of machine learning algorithms, and housing market stability, significantly influence the accuracy of house price predictions.</p>



<p>Our initial model employs a Random Decision Forest algorithm, which we&#8217;ll optimize using a random search approach for hyperparameters tuning. By identifying and implementing a more advantageous configuration, we aim to enhance our model&#8217;s performance significantly.</p>



<p>Here&#8217;s a concise outline of the steps we&#8217;ll undertake:</p>



<ol class="wp-block-list">
<li>Loading the house price dataset</li>



<li>Exploring the dataset intricacies</li>



<li>Preparing the data for modeling</li>



<li>Training a baseline Random Decision Forest model</li>



<li>Implementing a random search approach for model optimization</li>



<li>Measuring and evaluating the performance of our optimized model</li>
</ol>



<p>Through this step-by-step guide, you&#8217;ll learn to enhance model performance, further refining your understanding of Random Search algorithm implementation in Python.</p>



<p>The Python code is available in the relataly GitHub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_f9d778-26"><a class="kb-button kt-button button kb-btn_b5d394-00 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/016%20Hyperparameter%20Tuning%20of%20Random%20Decision%20Forests%20using%20Grid%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_ce738a-fc kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="504" data-attachment-id="12422" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/isometric-house-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" data-orig-size="505,504" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-house-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png" alt="Once we have trained a house price prediction model, we can use it to assess the price of new houses. Image generated with Midjourney. Python machine learning tutorial. relataly.com" class="wp-image-12422" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Once we have trained a house price prediction model, we can use it to assess the price of new houses. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Before starting the coding part, ensure that you have set up your Python (3.8 or higher) environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>



<li><em><a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">seaborn</a></em></li>
</ul>



<p>In addition, we will be using the Python Machine Learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> to implement the random forest and the random search technique. </p>



<p>You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-house-price-prediction-about-the-use-case-and-the-data">House Price Prediction: About the Use Case and the Data</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>House price prediction is the process of using statistical and machine learning techniques to predict the future value of a house. This can be useful for a variety of applications, such as helping homeowners and real estate professionals to make informed decisions about buying and selling properties. In order to make accurate predictions, it is important to have access to high-quality data about the housing market.</p>



<p>In this tutorial, we will work with a house price dataset from the <a href="//www.kaggle.com/c/house-prices-advanced-regression-techniques" target="_blank" rel="noreferrer noopener">house price regression challenge on Kaggle.com</a>. The dataset is available via a GitHub repository. It contains information about 1,460 houses sold between 2006 and 2010 in Ames, Iowa. The data includes the sale price and 79 house characteristics, such as:</p>



<ul class="wp-block-list">
<li>YearBuilt &#8211; The year of construction </li>



<li>YrSold &#8211; The year in which the house was sold </li>



<li>LotArea &#8211; The lot size in square feet </li>



<li>OverallQual &#8211; The overall material and finish quality of the house, from one (lowest) to ten (highest)</li>



<li>Street &#8211; The type of road access, e.g., paved or gravel </li>



<li>Utilities &#8211; The type of utilities available </li>



<li>GarageArea &#8211; The garage size in square feet </li>



<li>TotRmsAbvGrd &#8211; The number of rooms above grade </li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="505" data-attachment-id="12421" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/isometric-house-collection-python-machine-learning-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" data-orig-size="505,505" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-house-collection-python-machine-learning-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png" alt="Predicting house prices with machine learning. Image generated with Midjourney. Isometric view of houses. relataly.com" class="wp-image-12421" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-house-collection-python-machine-learning-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Predicting house prices with machine learning. 
Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p>We begin by loading the house price data from the relataly GitHub repository. A separate download is not required.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.ensemble import RandomForestRegressor

# Source: 
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Load train and test datasets
path = &quot;https://raw.githubusercontent.com/flo7up/relataly_data/main/house_prices/train.csv&quot;
df = pd.read_csv(path)
print(df.columns)
df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60			RL			65.0		8450	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2008	WD			Normal			208500
1	2	20			RL			80.0		9600	Pave	NaN		Reg			Lvl			AllPub		...	0			NaN		NaN		NaN			0		5		2007	WD			Normal			181500
2	3	60			RL			68.0		11250	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		9		2008	WD			Normal			223500
3	4	70			RL			60.0		9550	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		2		2006	WD			Abnorml			140000
4	5	60			RL			84.0		14260	Pave	NaN		IR1			Lvl			AllPub		...	0			NaN		NaN		NaN			0		12		2008	WD			Normal			250000
5 rows × 81 columns</pre></div>



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p>Before jumping into preprocessing and model training, let&#8217;s quickly explore the data. A distribution plot helps us understand how the regression target, the sale price, is distributed across the dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot the distribution of the sale price
ax = sns.displot(data=df[['SalePrice']].dropna(), height=6, aspect=2)
plt.title('Sale Price Distribution')</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="944" height="440" data-attachment-id="8424" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/histplot/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" data-orig-size="944,440" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="histplot" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" src="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png" alt="random search hyperparameter tuning python. random forest regression,
sale price distribution" class="wp-image-8424" srcset="https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 944w, https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/histplot.png 768w" sizes="(max-width: 944px) 100vw, 944px" /></figure>



<p>For feature selection, it is helpful to understand the predictive power of the different variables in a dataset. We can use scatterplots to estimate the predictive power of specific features. Running the code below will create a scatterplot that visualizes the relation between the sale price, lot area, and the house&#8217;s overall quality.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Scatterplot of sale price vs. lot area, colored by overall quality
plt.figure(figsize=(16,6))
df_features = df[['SalePrice', 'LotArea', 'OverallQual']]
sns.scatterplot(data=df_features, x='LotArea', y='SalePrice', hue='OverallQual')
plt.title('Sale Price vs. Lot Area by Overall Quality')</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="966" height="387" data-attachment-id="8430" data-permalink="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/overall-quality-vs-lotarea-depending-on-sale-price/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" data-orig-size="966,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Overall-Quality-vs-LotArea-depending-on-Sale-Price" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" src="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png" alt="random search hyperparameter tuning python. random forest regression,
scatter plot, feature selection" class="wp-image-8430" srcset="https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 966w, https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/Overall-Quality-vs-LotArea-depending-on-Sale-Price.png 768w" sizes="(max-width: 966px) 100vw, 966px" /></figure>



<p>As expected, the scatterplot shows that the sale price increases with the overall quality. On the other hand, the LotArea has only a minor effect on the sale price. </p>
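<p>We can back up this visual impression with numbers by computing pairwise correlations. The snippet below is a minimal, self-contained sketch: the SalePrice and LotArea values come from the dataset preview above, while the OverallQual values are illustrative. On the full dataset, you would call <code>df[['SalePrice', 'LotArea', 'OverallQual']].corr()</code> directly.</p>

```python
import pandas as pd

# A small stand-in frame: SalePrice and LotArea from the dataset preview,
# OverallQual values are illustrative
df_demo = pd.DataFrame({
    'SalePrice':   [208500, 181500, 223500, 140000, 250000],
    'LotArea':     [8450, 9600, 11250, 9550, 14260],
    'OverallQual': [7, 6, 7, 7, 8],
})

# Pearson correlation of each column with the sale price
corr = df_demo.corr()
print(corr['SalePrice'])
```

<p>A correlation close to 1 for OverallQual and a smaller value for LotArea would confirm what the scatterplot suggests.</p>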



<h3 class="wp-block-heading">Step #3 Data Preprocessing</h3>



<p>Next, we prepare the data for use as input to train a regression model. Because we want to keep things simple, we reduce the number of variables and use only a small set of features. In addition, we encode categorical variables with integer dummy values.</p>



<p>To ensure that our regression model does not know the target variable, we separate house price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.</p>
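<p>Before running the full preprocessing, here is a minimal sketch of the dummy-encoding step in isolation: <code>pd.get_dummies</code> replaces a categorical column such as <em>Utilities</em> with one indicator column per category.</p>

```python
import pandas as pd

# A small frame with one categorical column (values taken from the dataset)
demo = pd.DataFrame({'Utilities': ['AllPub', 'AllPub', 'NoSeWa']})

# get_dummies replaces the column with one indicator column per category
encoded = pd.get_dummies(demo)
print(encoded.columns.tolist())  # ['Utilities_AllPub', 'Utilities_NoSeWa']
```

<p>This is where the <em>Utilities_AllPub</em> and <em>Utilities_NoSeWa</em> columns in the preprocessed training data come from.</p>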



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def preprocessFeatures(df):
    # Define a list of relevant features, including the target column SalePrice
    feature_list = ['SalePrice', 'OverallQual', 'Utilities', 'GarageArea', 'LotArea', 'OverallCond']
    # Encode categorical variables as dummy columns
    df_dummy = pd.get_dummies(df[feature_list])
    # Cleanse records with na values
    df_dummy = df_dummy.dropna()
    return df_dummy

df_base = preprocessFeatures(df)

# Separate the target (y) from the features (x) and split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(df_base.drop(columns=['SalePrice']), df_base['SalePrice'], train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		OverallQual	GarageArea	LotArea	OverallCond	Utilities_AllPub	Utilities_NoSeWa
682		6			431			2887	5			1					0
960		5			0			7207	7			1					0
1384	6			280			9060	5			1					0
1100	2			246			8400	5			1					0
416		6			440			7844	7			1					0</pre></div>



<h3 class="wp-block-heading" id="h-step-4-train-different-regression-models-using-random-search">Step #4 Train Different Regression Models using Random Search</h3>



<p>Now that the dataset is ready, we can train the random decision forest regressor. To do this, we first define a dictionary with different parameter ranges. In addition, we need to define the number of model variants (n) that the algorithm should try. The random search algorithm then selects n random permutations from the grid and uses them to train the model. </p>



<p>We use the RandomizedSearchCV algorithm from the scikit-learn package. The &#8220;CV&#8221; in the function name stands for cross-validation. Cross-validation involves splitting the data into subsets (folds) and rotating them between training and validation runs. This way, each candidate model is trained and tested multiple times on different data partitions. When the search algorithm finally evaluates a model configuration, it summarizes these fold results into a single test score.</p>
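<p>To illustrate the cross-validation part in isolation, the following sketch scores a single random forest configuration with 5-fold cross-validation on synthetic data (the data and parameter values here are illustrative, not from the tutorial dataset):</p>

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data as a stand-in for the house price features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# cv=5: five train/validate rotations, one R^2 score per fold
scores = cross_val_score(model, X, y, cv=5)
print(len(scores), scores.mean())
```

<p>The randomized search performs exactly this evaluation once per sampled parameter configuration and ranks the configurations by the resulting mean score.</p>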



<p>We use a Random Decision Forest &#8211; a robust machine learning algorithm that can handle both classification and regression tasks. As a so-called ensemble model, the Random Forest combines the predictions of a set of multiple independent estimators. The estimator is an important parameter to pass to the RandomizedSearchCV function. Random decision forests have several hyperparameters that we can use to influence their behavior. We define the following parameter ranges:</p>



<ul class="wp-block-list">
<li>max_leaf_nodes = [2, 3, 4, 5, 6, 7]</li>



<li>min_samples_split = [5, 10, 20, 50]</li>



<li>max_depth = [5,10,15,20]</li>



<li>max_features = [3,4,5]</li>



<li>n_estimators = [50, 100, 200]</li>
</ul>



<p>These parameter ranges define the search space from which the randomized search algorithm (RandomizedSearchCV) will select random configurations. Other parameters will use default values as defined by <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html" target="_blank" rel="noreferrer noopener">scikit-learn</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define the Estimator and the Parameter Ranges
dt = RandomForestRegressor()
number_of_iterations = 20
max_leaf_nodes = [2, 3, 4, 5, 6, 7]
min_samples_split = [5, 10, 20, 50]
max_depth = [5,10,15,20]
max_features = [3,4,5]
n_estimators = [50, 100, 200]

# Define the param distribution dictionary
param_distributions = dict(max_leaf_nodes=max_leaf_nodes, 
                           min_samples_split=min_samples_split, 
                           max_depth=max_depth,
                           max_features=max_features,
                           n_estimators=n_estimators)

# Build the gridsearch
grid = RandomizedSearchCV(estimator=dt, 
                          param_distributions=param_distributions, 
                          n_iter=number_of_iterations, 
                          cv = 5)

grid_results = grid.fit(x_train, y_train)

# Summarize the results in a readable format
print(&quot;Mean test scores: {0}\nBest params: {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean test scores: [0.68738293 0.49581669 0.52138751 0.61235299 0.65360944 0.61165147
 0.70392285 0.52278886 0.67687248 0.68219638 0.70031536 0.65842909
 0.51939338 0.70801017 0.70911805 0.69543885 0.67983801 0.60744371
 0.68270285 0.70741042]
Best params: {'n_estimators': 100, 'min_samples_split': 5, 'max_leaf_nodes': 7, 'max_features': 3, 'max_depth': 15}
	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_n_estimators	param_min_samples_split	param_max_leaf_nodes	param_max_features	param_max_depth	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.049196		0.002071		0.004074		0.000820		50					20						5	4	15	{'n_estimators': 50, 'min_samples_split': 20, ...	0.662973	0.705533	0.669520	0.702608	0.696280	0.687383	0.017637	7
1	0.041115		0.000554		0.003046		0.000094		50					50						2	3	10	{'n_estimators': 50, 'min_samples_split': 50, ...	0.490984	0.527231	0.426270	0.523086	0.511513	0.495817	0.036978	20
2	0.043325		0.000779		0.003486		0.000447		50					50						2	5	20	{'n_estimators': 50, 'min_samples_split': 50, ...	0.484524	0.559358	0.485459	0.517253	0.560343	0.521388	0.033545	18
3	0.162083		0.005665		0.012420		0.004788		200					5						3	3	20	{'n_estimators': 200, 'min_samples_split': 5, ...	0.586586	0.638341	0.573437	0.626793	0.636608	0.612353	0.027021	14
4	0.166659		0.003026		0.010958		0.000084		200					10						4	3	15	{'n_estimators': 200, 'min_samples_split': 10,...	0.633305	0.679161	0.623236	0.661864	0.670481	0.653609	0.021636	13</pre></div>



<p>The table shows the first five tested configurations and their respective cross-validated test scores.</p>



<h3 class="wp-block-heading" id="h-step-5-select-the-best-model-and-measure-performance"><strong>Step #5 Select the best Model and Measure Performance</strong></h3>



<p>Finally, we retrieve the best model from the search results via the &#8220;best_estimator_&#8221; attribute and print a comparison between actual and predicted sale prices. We then calculate the MAE and the MAPE to understand how the model performs on the overall test dataset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Select the best Model and Measure Performance
best_model = grid_results.best_estimator_
y_pred = best_model.predict(x_test)
y_df = pd.DataFrame(y_test)
y_df['PredictedPrice']=y_pred
y_df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	SalePrice	PredictedPrice
529	200624		166037.831002
491	133000		135860.757958
459	110000		123030.336177
279	192000		206488.444327
655	88000		130453.604206</pre></div>



<p>Next, let&#8217;s take a look at the regression errors.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Mean Absolute Error (MAE); argument order is (y_true, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

# Mean Absolute Percentage Error (MAPE)
MAPE = mean_absolute_percentage_error(y_test, y_pred)
print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean Absolute Error (MAE): 29591.56 
Mean Absolute Percentage Error (MAPE): 15.57 %</pre></div>



<p>On average, the model&#8217;s predictions deviate from the actual sale price by about 16%. Considering we only used a fraction of the available features and defined a small search space, there is much room for improvement.</p>
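<p>For intuition, both error metrics are easy to reproduce by hand. The sketch below recomputes MAE and MAPE for two hypothetical predictions and checks the results against scikit-learn&#8217;s functions (which expect the argument order <code>(y_true, y_pred)</code>):</p>

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Two hypothetical sale prices and model predictions (illustration only)
y_true = np.array([200000.0, 100000.0])
y_pred = np.array([180000.0, 110000.0])

# MAE: mean of absolute errors -> (20000 + 10000) / 2 = 15000
mae = np.mean(np.abs(y_true - y_pred))

# MAPE: mean of absolute errors relative to the true values
# -> (0.10 + 0.10) / 2 = 0.10, i.e. 10 %
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(mae, mape)
```

<p>Because MAPE divides by the true values, swapping the arguments changes the result, whereas MAE is symmetric in its arguments.</p>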



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>This article has shown how we can use random search in Python to efficiently search for a good hyperparameter configuration of a machine learning model. In the conceptual part, you learned about hyperparameters and how random search samples random permutations from a predefined parameter grid. The second part was a Python hands-on tutorial, in which you learned to use random search to tune the hyperparameters of a regression model. We worked with a house price dataset and trained a random decision forest regressor that predicts the sale price of houses based on several characteristics. Then we defined parameter ranges and tested random permutations. In this way, we quickly identified a configuration that outperforms our initial baseline model. </p>



<p>Remember that a random search efficiently identifies a good-performing model but does not necessarily return the best-performing one. Random search can be used to tune the hyperparameters of both regression and classification models.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>






<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p>I hope this article was helpful. If you have any questions or suggestions, please write them in the comments. </p>



<div style="display: inline-block;">
  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1999579577&amp;asins=1999579577&amp;linkId=91d862698bf9010ff4c09539e4c49bf4&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1839217715&amp;asins=1839217715&amp;linkId=356ba074068849ff54393f527190825d&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/">Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6875</post-id>	</item>
		<item>
		<title>Stock Market Forecasting Neural Networks for Multi-Output Regression in Python</title>
		<link>https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/</link>
					<comments>https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Tue, 13 Jul 2021 21:10:23 +0000</pubDate>
				<category><![CDATA[Finance]]></category>
		<category><![CDATA[Keras]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Recurrent Neural Networks]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Stock Market Forecasting]]></category>
		<category><![CDATA[Time Series Forecasting]]></category>
		<category><![CDATA[Yahoo Finance API]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multi-output Neural Network]]></category>
		<category><![CDATA[Multi-Step Time Series Forecasting]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Stock Market Prediction]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=5800</guid>

					<description><![CDATA[<p>Multi-output time series regression can forecast several steps of a time series at once. The number of neurons in the final output layer determines how many steps the model can predict. Models with one output return single-step forecasts. Models with various outputs can return entire series of time steps and thus deliver a more detailed ... <a title="Stock Market Forecasting Neural Networks for Multi-Output Regression in Python" class="read-more" href="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/" aria-label="Read more about Stock Market Forecasting Neural Networks for Multi-Output Regression in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/">Stock Market Forecasting Neural Networks for Multi-Output Regression in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Multi-output time series regression can forecast several steps of a time series at once. The number of neurons in the final output layer determines how many steps the model can predict. Models with a single output return single-step forecasts. Models with multiple outputs can return entire series of time steps and thus deliver a more detailed projection of how a time series could develop in the future. This article is a hands-on Python tutorial that shows how to design a neural network architecture with multiple outputs: by the end, you will have built a multi-output model for stock-price forecasting using Python and Keras. This knowledge transfers to other time series forecasting tasks, such as weather forecasting or sales forecasting.</p>



<p>This article proceeds as follows: We briefly discuss the architecture of a multi-output neural network. After familiarizing ourselves with the model architecture, we develop a Keras neural network for multi-output regression. For data preparation, we perform various steps, including cleaning, splitting, selecting, and scaling the data. Afterward, we define a model architecture with multiple LSTM layers and ten output neurons in the last layer. This architecture enables the model to generate projections for ten consecutive steps. After configuring the model architecture, we train the model with the historical daily prices of the Apple stock. Finally, we use this model to generate a ten-day forecast.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<div class="wp-block-kadence-infobox kt-info-box_317393-a1"><span class="kt-blocks-info-box-link-wrap info-box-link kt-blocks-info-box-media-align-top kt-info-halign-left"><div class="kt-infobox-textcontent"><h2 class="kt-blocks-info-box-title">Disclaimer</h2><p class="kt-blocks-info-box-text">This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.</p></div></span></div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:5px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-multi-output-regression-vs-single-output-regression">Multi-Output Regression vs. Single-Output Regression</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>In time series regression, we train a statistical model on the past values of a time series to make statements about how the time series develops further. During model training, we feed the model with so-called mini-batches and the corresponding target values. The model then creates forecasts for all input batches and compares these predictions to the actual target values to calculate the residuals (prediction errors). In this way, the model can adjust its parameters iteratively and learn to make better predictions.</p>



<p>Multivariate forecasting models take into account multiple input variables, such as historical time series data and additional features like moving averages or momentum indicators, to improve the accuracy of their predictions. The idea is that these various variables can help the model identify patterns in the data that suggest future price movements.</p>
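<p>As a brief illustration of such derived input features, the snippet below computes a moving average and a momentum indicator from a closing-price series with pandas. The window lengths and column names are illustrative assumptions:</p>

```python
# Illustrative sketch: deriving extra input features (moving average and
# momentum) from a closing-price series; values and windows are assumptions.
import pandas as pd

close = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.3, 11.6])
features = pd.DataFrame({"Close": close})
features["SMA_3"] = close.rolling(window=3).mean()  # 3-step simple moving average
features["Momentum_3"] = close - close.shift(3)     # price change over 3 steps
features = features.dropna()                        # drop rows without full history
print(features)
```

Each derived column becomes an additional input variable for the multivariate model.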
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-large is-resized"><img decoding="async" data-attachment-id="7569" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/multi-output-neural-networks-time-series-regression-architecture/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png" data-orig-size="2017,1342" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multi-output-neural-networks-time-series-regression-architecture" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png" src="https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture-1024x681.png" alt="An exemplary architecture of a neural network with five input neurons (blue) and four output neurons (red), keras, python, tutorial, stock market prediction" class="wp-image-7569" width="371" height="247" srcset="https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 768w, 
https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/multi-output-neural-networks-time-series-regression-architecture.png 2017w" sizes="(max-width: 371px) 100vw, 371px" /><figcaption class="wp-element-caption">An exemplary architecture of a neural network with five input neurons (blue) and four output neurons (red)</figcaption></figure>
</div></div>
</div>



<h2 class="wp-block-heading">The Architecture of a Neural Network with Multiple Outputs</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Next, we will discuss the architecture of a neural network with multiple outputs. The architecture consists of several layers, including an input layer, several hidden layers, and an output layer. The number of neurons in the first layer must match the input data, and the number of neurons in the output layer determines the period length of the predictions. </p>



<p>Models with a single neuron in the output layer are used to predict a single time step. It is possible to predict multiple price steps with a single-output model, but this requires a <a href="https://www.relataly.com/multi-step-time-series-forecasting-a-step-by-step-guide/275/" target="_blank" rel="noreferrer noopener">rolling forecasting approach</a> in which the outputs are iteratively fed back as inputs to make further-reaching predictions. However, this approach is somewhat cumbersome. A more elegant way is to train a multi-output model right away.</p>
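<p>In Keras, the multi-output idea boils down to the size of the final <code>Dense</code> layer. The following sketch shows the architecture only (untrained); the layer sizes and the 50-step input window are illustrative assumptions, not the exact configuration used later in the tutorial:</p>

```python
# Architecture sketch: a Keras model whose final Dense layer has ten neurons,
# so one forward pass yields a ten-step forecast. Layer sizes are assumptions.
import numpy as np
from keras.models import Sequential
from keras.layers import Input, LSTM, Dense

n_steps_in, n_features, n_steps_out = 50, 1, 10

model = Sequential([
    Input(shape=(n_steps_in, n_features)),
    LSTM(32),              # recurrent layer reads the 50-step input window
    Dense(n_steps_out),    # ten output neurons -> ten forecast steps per prediction
])
model.compile(optimizer="adam", loss="mse")

# One dummy batch: predictions have shape (batch_size, n_steps_out)
dummy = np.zeros((1, n_steps_in, n_features))
print(model.predict(dummy).shape)  # (1, 10)
```

Changing <code>n_steps_out</code> is all it takes to make the same architecture predict a shorter or longer horizon.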



<figure class="wp-block-image size-large is-resized is-style-default"><img decoding="async" data-attachment-id="7586" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/architecture-neural-network-multi-output-regression-model/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png" data-orig-size="3395,1503" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="architecture-neural-network-multi-output-regression-model" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png" src="https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model-1024x453.png" alt="The inputs and outputs of a neural network for time series regression with five input neurons and four outputs. 
Stock market forecasting" class="wp-image-7586" width="755" height="334" srcset="https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 2048w, https://www.relataly.com/wp-content/uploads/2022/04/architecture-neural-network-multi-output-regression-model.png 2475w" sizes="(max-width: 755px) 100vw, 755px" /><figcaption class="wp-element-caption">The inputs and outputs of a neural network for time series regression with five input neurons and four outputs</figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>






<h2 class="wp-block-heading">Training Neural Networks with Multiple Outputs</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>A model with multiple neurons in the output layer can predict several future steps with a single forward pass. Multi-output regression models train on many sequences of consecutive values, each followed by the subsequent output sequence. The model architecture thus contains multiple neurons in both the input layer and the output layer (as illustrated). </p>



<p>In a multi-output regression model, each neuron in the output layer is responsible for predicting a different time step in the future. To train such a model, you need to provide a sequence of input data followed by the corresponding sequence of output data. For example, if you want to predict the stock price for the next ten days, you would provide a sequence of input data containing the historical stock prices for the past 50 days, followed by a sequence of output data containing the stock prices for the next 10 days.</p>



<p>The model will then learn to map the input sequence to the output sequence so that it can make predictions for multiple time steps in the future based on the input data. </p>
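<p>The 50-in/10-out example above can be sketched as a sliding window over the series. The snippet below uses a simple integer sequence as a stand-in for a scaled price series; the window lengths are the example values from the text:</p>

```python
# Hedged sketch: slicing a series into overlapping training pairs of
# 50 input steps and 10 output steps (the example lengths from the text).
import numpy as np

series = np.arange(100, dtype=float)  # stand-in for a scaled price series
n_in, n_out = 50, 10

X, y = [], []
for i in range(len(series) - n_in - n_out + 1):
    X.append(series[i : i + n_in])                  # 50 historical steps
    y.append(series[i + n_in : i + n_in + n_out])   # the next 10 steps
X = np.array(X).reshape(-1, n_in, 1)  # 3D: (samples, timesteps, features)
y = np.array(y)

print(X.shape, y.shape)  # (41, 50, 1) (41, 10)
```

Each row of <code>y</code> holds the ten target values that the ten output neurons learn to predict for the corresponding input window.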



<p>In the next part of this tutorial, we will walk through the process of developing a multi-output regression model in more detail.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading has-contrast-color has-text-color" id="h-implementing-a-neural-network-model-for-multi-output-multi-step-regression-in-python">Implementing a Neural Network Model for Multi-Output Multi-Step Regression in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Let&#8217;s get started with the hands-on Python part. In the following, we will develop a neural network with Keras and Tensorflow that forecasts the Apple stock price. Most of the work goes into preparing the data and bringing it into the right shape for a neural network with multiple outputs. Broadly, this involves the following steps:</p>



<ol class="wp-block-list">
<li>Load the time series data that we want to use as input and output for the model. We use historical price data that is available via the Yahoo Finance API.</li>



<li>Then we split our data into training and testing sets. We will use the training set to fit the model and the testing set to evaluate the model&#8217;s performance.</li>



<li>Preprocess the data: This includes scaling the data and selecting relevant features.</li>



<li>Reshape the data into a format that can be fed into the neural network. For time series data, this means converting it into a 3D array.</li>



<li>Finally, we will train our model and generate the forecast.</li>
</ol>



<p>The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_e5fd46-d1"><a class="kb-button kt-button button kb-btn_25b4a6-dd kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/01%20Time%20Series%20Forecasting%20%26%20Regression/006%20Multi-Output%20Regression.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_3ee95e-9c kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="337" data-attachment-id="12797" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/multiple_waterfalls_coming_from_a_single_waterfall/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png" data-orig-size="768,505" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multiple_waterfalls_coming_from_a_single_waterfall" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png" src="https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall-512x337.png" alt="Neural network architectures with multiple outputs allow for more potent solutions but are more complex to train. Image created with Midjourney. 
Stock market forecasting, multi-output multi-step  regression, python" class="wp-image-12797" srcset="https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/multiple_waterfalls_coming_from_a_single_waterfall.png 768w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Neural network architectures with multiple outputs allow for more potent solutions but are more complex to train. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>



<p></p>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. If you don&#8217;t have a Python environment, consider <a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using the machine learning libraries Keras, Scikit-learn, and Tensorflow. For visualization, we will be using the Seaborn package.</p>



<p>Please also have either the <a href="https://pandas-datareader.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener">pandas_datareader</a> or the <a href="https://pypi.org/project/yfinance/" target="_blank" rel="noreferrer noopener">yfinance</a> package installed. You will use one of these packages to retrieve the historical stock quotes.</p>



<p>You can install these packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p>The Pandas DataReader library is our first choice for interacting with the Yahoo Finance API. If the library causes a problem (it sometimes does), you can also use the yfinance package, which should return the same data. We begin by loading historical price quotes of the Apple stock from the public Yahoo Finance API. Running the code below will load the data into a Pandas DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># import pandas_datareader as webreader # Remote data access for pandas
import math # Mathematical functions 
import numpy as np # Fundamental package for scientific computing with Python
import pandas as pd # Additional functions for analysing and manipulating data
from datetime import date, timedelta, datetime # Date Functions
from pandas.plotting import register_matplotlib_converters # This function adds plotting functions for calendar dates
import matplotlib.pyplot as plt # Important package for visualization - we use this to plot the market data
import matplotlib.dates as mdates # Formatting dates
from sklearn.metrics import mean_absolute_error, mean_squared_error # Packages for measuring model performance / errors
from keras.models import Sequential # Deep learning library, used for neural networks
from keras.layers import LSTM, Dense, Dropout # Deep learning classes for recurrent and regular densely-connected layers
from keras.callbacks import EarlyStopping # EarlyStopping during model training
from sklearn.preprocessing import RobustScaler, MinMaxScaler # This Scaler removes the median and scales the data according to the quantile range to normalize the price data 
import seaborn as sns

# from pandas_datareader.nasdaq_trader import get_nasdaq_symbols
# symbols = get_nasdaq_symbols()

# Setting the timeframe for the data extraction
today = date.today()
date_today = today.strftime(&quot;%Y-%m-%d&quot;)
date_start = '2010-01-01'

# Getting NASDAQ quotes
stockname = 'Apple'
symbol = 'AAPL'
# df = webreader.DataReader(
#     symbol, start=date_start, end=date_today, data_source=&quot;yahoo&quot;
# )

import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=date_start, end=date_today)

# # Create a quick overview of the dataset
df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Tensorflow Version: 2.6.0
Num GPUs: 1
[*********************100%***********************]  1 of 1 completed
			Open		High		Low			Close		Adj Close	Volume
Date						
2010-01-04	7.622500	7.660714	7.585000	7.643214	6.515213	493729600
2010-01-05	7.664286	7.699643	7.616071	7.656429	6.526477	601904800
2010-01-06	7.656429	7.686786	7.526786	7.534643	6.422666	552160000
2010-01-07	7.562500	7.571429	7.466071	7.520714	6.410791	477131200
2010-01-08	7.510714	7.571429	7.466429	7.570714	6.453413	447610800</pre></div>



<p>The data should comprise the following columns:</p>



<ul class="wp-block-list">
<li>Close</li>



<li>Open</li>



<li>High</li>



<li>Low</li>



<li>Adj Close</li>



<li>Volume</li>
</ul>



<p>The target variable that we are trying to predict is the Closing price (Close).</p>



<h3 class="wp-block-heading">Step #2: Explore the Data</h3>



<p>Once we have loaded the data, we plot a quick overview of the time-series data using separate line graphs. The following code will plot a line chart for each column in df_plot using the <code>seaborn</code> library. The charts are organized in a grid with nrows rows and ncols columns. The sharex parameter is set to True, which means the x-axes of the subplots are shared. The figsize parameter determines the size of the plot in inches.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot line charts
df_plot = df.copy()

ncols = 2
nrows = int(round(df_plot.shape[1] / ncols, 0))

fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(14, 7))
for i, ax in enumerate(fig.axes):
        sns.lineplot(data = df_plot.iloc[:, i], ax=ax)
        ax.tick_params(axis=&quot;x&quot;, rotation=30, labelsize=10, length=0)
        ax.xaxis.set_major_locator(mdates.AutoDateLocator())
fig.tight_layout()
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5805" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/image-5-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/07/image-5.png" data-orig-size="999,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/07/image-5.png" src="https://www.relataly.com/wp-content/uploads/2021/07/image-5.png" alt="The Apple Stock's historical price data, including quotes, highs, lows, and volume" class="wp-image-5805" width="817" height="406" srcset="https://www.relataly.com/wp-content/uploads/2021/07/image-5.png 999w, https://www.relataly.com/wp-content/uploads/2021/07/image-5.png 300w, https://www.relataly.com/wp-content/uploads/2021/07/image-5.png 768w" sizes="(max-width: 817px) 100vw, 817px" /></figure>



<p>The line plots look as expected and reflect the Apple stock price history. Note that because we fetch fresh daily data from an API, the line plots will look different depending on when you run the code.</p>



<h3 class="wp-block-heading" id="h-step-3-preprocess-the-data">Step #3: Preprocess the Data</h3>



<p>Next, we prepare the data for the training process of our multi-output forecasting model. Preparing the data for multivariate forecasting involves several steps: </p>



<ul class="wp-block-list">
<li>Selecting features for model training</li>



<li>Scaling and splitting the data into separate sets for training and testing</li>



<li>Slicing the time series into several shifted training batches</li>
</ul>



<p>Keep in mind that these steps are specific to our data and use case. Preparing data for a neural network with multiple outputs in time series forecasting always depends on the characteristics of the data and the requirements of the model, so consider these factors and tailor your data preparation carefully.</p>
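<p>Before applying these steps to the real data, it can help to see the slicing idea in miniature. The sketch below uses synthetic data and a hypothetical <code>make_windows</code> helper (the notebook&#8217;s own implementation follows in section 3.3):</p>

```python
import numpy as np

def make_windows(data, n_in, n_out):
    """Cut a (time, features) array into shifted input/output windows."""
    x, y = [], []
    for i in range(n_in, len(data) - n_out):
        x.append(data[i - n_in:i, :])   # n_in past steps, all features
        y.append(data[i:i + n_out, 0])  # n_out future values of column 0
    return np.array(x), np.array(y)

data = np.arange(200, dtype=float).reshape(100, 2)  # 100 steps, 2 features
split = int(len(data) * 0.8)
train, test = data[:split], data[split - 10:]  # overlap so the first test window has history

x_train, y_train = make_windows(train, n_in=10, n_out=3)
print(x_train.shape, y_train.shape)  # (67, 10, 2) (67, 3)
```

Each sample is shifted by one step relative to its neighbor, which is exactly how the mini-batches in section 3.3 are built.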



<h4 class="wp-block-heading">3.1 Basic Preparations</h4>



<p>We begin by creating a copy of the initial data and resetting the index.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Sort the data by date
df_train = df.sort_values(by=['Date']).copy()

# We save a copy of the date index before we reset it to numbers
date_index = df_train.index

# We reset the index, so we can convert the date-index to a number-index
df_train = df_train.reset_index(drop=True).copy()
df_train.head(5)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">			Open		High		Low			Close		Adj Close	Volume
Date						
2022-11-29	144.289993	144.809998	140.350006	141.169998	141.169998	83763800
2022-11-30	141.399994	148.720001	140.550003	148.029999	148.029999	111224400
2022-12-01	148.210007	149.130005	146.610001	148.309998	148.309998	71250400
2022-12-02	145.960007	148.000000	145.649994	147.809998	147.809998	65421400
2022-12-05	147.770004	150.919998	145.770004	146.630005	146.630005	68732400</pre></div>



<h4 class="wp-block-heading" id="h-3-2-feature-selection-and-scaling">3.2 Feature Selection and Scaling</h4>



<p>We proceed with feature selection. To keep things simple, we will use the features from the input data without any modifications. After selecting the features, we scale them to a range between 0 and 1. To ease unscaling the predictions after training, we create two different scalers: One for the training data, which takes five columns, and one for the output data that scales a single column (the Close Price). I have covered <a href="https://www.relataly.com/feature-engineering-for-multivariate-time-series-models-with-python/1813/" target="_blank" rel="noreferrer noopener">feature engineering in a separate article</a> if you want to learn more about this topic.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def prepare_data(df):

    # List of considered Features
    FEATURES = ['Open', 'High', 'Low', 'Close', 'Volume']

    print('FEATURE LIST')
    print([f for f in FEATURES])

    # Create the dataset with features and filter the data to the list of FEATURES
    df_filter = df[FEATURES]
    
    # Convert the data to numpy values
    np_filter_unscaled = np.array(df_filter)
    print(np_filter_unscaled.shape)

    np_c_unscaled = np.array(df['Close']).reshape(-1, 1)
    
    return np_filter_unscaled, np_c_unscaled, df_filter
    
np_filter_unscaled, np_c_unscaled, df_filter = prepare_data(df_train)
                                          
# Creating a separate scaler that works on a single column for scaling predictions
# Scale each feature to a range between 0 and 1
scaler_train = MinMaxScaler()
np_scaled = scaler_train.fit_transform(np_filter_unscaled)
    
# Create a separate scaler for a single column
scaler_pred = MinMaxScaler()
np_scaled_c = scaler_pred.fit_transform(np_c_unscaled)   </pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">FEATURE LIST
['Open', 'High', 'Low', 'Close', 'Volume']
(3254, 5)</pre></div>
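<p>The value of the second, single-column scaler becomes clear when unscaling predictions later: the model output contains only Close values, so it must be inverted with a scaler that was fit on one column. A minimal round-trip sketch with toy numbers (not the real quotes):</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

close = np.array([[10.0], [20.0], [30.0], [40.0]])  # single Close column

# Fit the single-column scaler, analogous to scaler_pred in the tutorial
scaler_pred = MinMaxScaler()
close_scaled = scaler_pred.fit_transform(close)

# A fake "prediction" in scaled space, e.g. one 2-step forecast
y_pred_scaled = np.array([[0.5, 0.75]])

# inverse_transform expects one value per fitted column, so reshape to (-1, 1)
y_pred = scaler_pred.inverse_transform(y_pred_scaled.reshape(-1, 1)).reshape(1, -1)
print(y_pred)  # [[25.  32.5]]
```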



<p>The final step of the data preparation is to create the structure for the input data. This structure needs to match the input layer of the model architecture.</p>



<h4 class="wp-block-heading">3.3 Slicing the Data for a Model with Multiple In- and Outputs</h4>



<p>The code below starts a sliding window process that cuts the initial time series data into multiple slices, i.e., mini-batches. Each batch is a smaller fraction of the initial time series shifted by a single step. Because we will feed our model with multivariate input data, the time series consists of five input columns/features. Each batch comprises a period of 50 steps from the time series and an output sequence of ten consecutive values. To validate that the batches have the right shape, we visualize mini-batches in a line graph with their consecutive target values. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Set the input_sequence_length - this is the timeframe used to make a single prediction
input_sequence_length = 50
# The output sequence length is the number of steps that the neural network predicts
output_sequence_length = 10

# Prediction Index
index_Close = df_train.columns.get_loc(&quot;Close&quot;)

# Split the data into train and test data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_length = math.ceil(np_scaled.shape[0] * 0.8)

# Create the training and test data
train_data = np_scaled[:train_data_length, :]
test_data = np_scaled[train_data_length - input_sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, input_sequence_length time steps per sample, and f features
def partition_dataset(input_sequence_length, output_sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]
    for i in range(input_sequence_length, data_len - output_sequence_length):
        x.append(data[i-input_sequence_length:i,:]) # contains input_sequence_length steps of all columns
        y.append(data[i:i + output_sequence_length, index_Close]) # contains the next output_sequence_length values of the Close column
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(input_sequence_length, output_sequence_length, train_data)
x_test, y_test = partition_dataset(input_sequence_length, output_sequence_length, test_data)

# Print the shapes: the result is: (samples, input_sequence_length, features) and (samples, output_sequence_length)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction values and the input match up
# The last close price of the second input sample should equal the first prediction value
nrows = 3 # number of shifted plots
fig, ax = plt.subplots(nrows=nrows, ncols=1, figsize=(16, 8))
for i, ax in enumerate(fig.axes):
    xtrain = pd.DataFrame(x_train[i][:,index_Close], columns=[f'x_train_{i}'])
    ytrain = pd.DataFrame(y_train[i][:output_sequence_length-1], columns=[f'y_train_{i}'])
    ytrain.index = np.arange(input_sequence_length, input_sequence_length + output_sequence_length-1)
    xtrain_ = pd.concat([xtrain, ytrain[:1].rename(columns={ytrain.columns[0]:xtrain.columns[0]})])
    df_merge = pd.concat([xtrain_, ytrain])
    sns.lineplot(data = df_merge, ax=ax)
plt.show()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(2544, 50, 5) (2544, 10)
(640, 50, 5) (640, 10)</pre></div>
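<p>The printed shapes follow directly from the split arithmetic: with 3,254 scaled rows, an 80% training cut, and window lengths of 50 and 10, the sample counts can be verified by hand:</p>

```python
import math

n_rows = 3254                    # rows in np_scaled (see Step #3)
input_len, output_len = 50, 10

train_len = math.ceil(n_rows * 0.8)           # 2604 rows of training data
test_len = n_rows - train_len + input_len     # 700 rows incl. the 50-step warm-up

# partition_dataset yields one sample per step in range(input_len, length - output_len)
n_train_samples = train_len - input_len - output_len
n_test_samples = test_len - input_len - output_len
print(n_train_samples, n_test_samples)  # 2544 640
```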



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8670" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png" data-orig-size="939,465" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png" src="https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png" alt="batch test visualizations, deep neural networks for multi-output stock market forecasting" class="wp-image-8670" width="891" height="441" srcset="https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png 939w, https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png 300w, 
https://www.relataly.com/wp-content/uploads/2022/05/batch-test-visualizations-deep-neural-networks-for-multi-output-stock-market-forecasting.png 768w" sizes="(max-width: 891px) 100vw, 891px" /></figure>



<h3 class="wp-block-heading" id="h-step-4-prepare-the-neural-network-architecture-and-train-the-multi-output-regression-model">Step #4: Prepare the Neural Network Architecture and Train the Multi-Output Regression Model</h3>



<p>Now that the training data is prepared, the next step is to configure the architecture of the multi-output neural network. Because we feed the model multiple input series, the architecture is multivariate, matching the shape of the input training batches.</p>



<h4 class="wp-block-heading" id="h-4-1-configuring-and-training-the-model">4.1 Configuring and Training the Model</h4>



<p>We choose a comparatively simple architecture with two LSTM layers followed by two dense layers. The first dense layer has 20 neurons, and the second is the output layer with ten output neurons. If you wonder how I arrived at the number of neurons in the dense layer: I conducted several experiments and found that this number leads to solid results.</p>



<p>To ensure that the architecture matches the structure of our input data, we derive the layer sizes (<code>n_input_neurons</code>, <code>n_output_neurons</code>) from the variables of the previous code sections. The input sequence length is 50, and the output sequence length (the number of steps we want to predict) is ten.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Configure the neural network model
model = Sequential()
n_output_neurons = output_sequence_length

# Model with n_neurons = inputshape Timestamps, each with x_train.shape[2] variables
n_input_neurons = x_train.shape[1] * x_train.shape[2]
print(n_input_neurons, x_train.shape[1], x_train.shape[2])
model.add(LSTM(n_input_neurons, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))) 
model.add(LSTM(n_input_neurons, return_sequences=False))
model.add(Dense(20))
model.add(Dense(n_output_neurons))

# Compile the model
model.compile(optimizer='adam', loss='mse')</pre></div>



<p>After configuring the model architecture, we can initiate the training process and illustrate how the loss develops over the training epochs. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Training the model
epochs = 10
batch_size = 16

# Optional early stopping; pass callbacks=[early_stop] to model.fit to activate it
early_stop = EarlyStopping(monitor='loss', patience=5, verbose=1)

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(x_test, y_test))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Epoch 1/5
159/159 [==============================] - 7s 14ms/step - loss: 0.0047 - val_loss: 0.0262
Epoch 2/5
159/159 [==============================] - 2s 11ms/step - loss: 3.6759e-04 - val_loss: 0.0097
Epoch 3/5
159/159 [==============================] - 2s 11ms/step - loss: 1.5222e-04 - val_loss: 0.0056
Epoch 4/5
159/159 [==============================] - 2s 11ms/step - loss: 1.0327e-04 - val_loss: 0.0031
Epoch 5/5
159/159 [==============================] - 2s 11ms/step - loss: 1.1690e-04 - val_loss: 0.0026</pre></div>



<h4 class="wp-block-heading" id="h-4-2-loss-curve">4.2 Loss Curve</h4>



<p>Next, we plot the loss curve, which represents the amount of error between the model&#8217;s predicted values and the actual values in the training data. A lower loss value indicates that the model makes more accurate predictions on the training data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot training &amp; validation loss values
fig, ax = plt.subplots(figsize=(10, 5), sharex=True)
plt.plot(history.history[&quot;loss&quot;])
plt.plot(history.history[&quot;val_loss&quot;])
plt.title(&quot;Model loss&quot;)
plt.ylabel(&quot;Loss&quot;)
plt.xlabel(&quot;Epoch&quot;)
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend([&quot;Train&quot;, &quot;Test&quot;], loc=&quot;upper left&quot;)
plt.grid()
plt.show()</pre></div>



<figure class="wp-block-image size-full is-resized is-style-default"><img decoding="async" data-attachment-id="5856" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/image-4-16/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/08/image-4.png" data-orig-size="628,333" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/08/image-4.png" src="https://www.relataly.com/wp-content/uploads/2021/08/image-4.png" alt="loss curve after training the multi-output neural network, keras, multi-step time series regression" class="wp-image-5856" width="720" height="381" srcset="https://www.relataly.com/wp-content/uploads/2021/08/image-4.png 628w, https://www.relataly.com/wp-content/uploads/2021/08/image-4.png 300w" sizes="(max-width: 720px) 100vw, 720px" /></figure>



<p>As we can see, the loss curve drops quickly during training, which typically means that the model is quickly learning to make accurate predictions.</p>
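<p>Beyond eyeballing the curve, you can compare the final training and validation losses programmatically; a persistently large gap hints at overfitting. The sketch below mocks <code>history.history</code> with the values from the training log above, so it runs without a trained model; the threshold of 10 is an arbitrary illustration:</p>

```python
# In the notebook, history.history is a dict like the one mocked here
history_dict = {
    "loss":     [0.0047, 3.7e-4, 1.5e-4, 1.0e-4, 1.2e-4],
    "val_loss": [0.0262, 0.0097, 0.0056, 0.0031, 0.0026],
}

final_train = history_dict["loss"][-1]
final_val = history_dict["val_loss"][-1]
gap_ratio = final_val / final_train
print(f"train={final_train:.5f} val={final_val:.5f} ratio={gap_ratio:.1f}")

# Both curves fall, but validation loss stays well above training loss,
# which is common for noisy financial data
overfitting_suspected = gap_ratio > 10
```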



<h3 class="wp-block-heading" id="h-step-5-evaluate-model-performance">Step #5 Evaluate Model Performance</h3>



<p>Now that we have trained the model, we can make forecasts on the test data and use traditional regression metrics such as the MAE, MAPE, or MDAPE to measure the performance of our model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Get the predicted values
y_pred_scaled = model.predict(x_test)

# Unscale the predicted values
y_pred = scaler_pred.inverse_transform(y_pred_scaled)
y_test_unscaled = scaler_pred.inverse_transform(y_test).reshape(-1, output_sequence_length)

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred)
print(f'Mean Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')


def prepare_df(i, x, y, y_pred_unscaled):
    # Undo the scaling on x, reshape the testset into a one-dimensional array, so that it fits to the pred scaler
    x_test_unscaled_df = pd.DataFrame(scaler_pred.inverse_transform((x[i]))[:,index_Close]).rename(columns={0:'x_test'})
    
    y_test_unscaled_df = []
    # Undo the scaling on y
    if type(y) == np.ndarray:
        y_test_unscaled_df = pd.DataFrame(scaler_pred.inverse_transform(y)[i]).rename(columns={0:'y_test'})

    # Create a dataframe for the y_pred at position i, y_pred is already unscaled
    y_pred_df = pd.DataFrame(y_pred_unscaled[i]).rename(columns={0:'y_pred'})
    return x_test_unscaled_df, y_pred_df, y_test_unscaled_df


def plot_multi_test_forecast(x_test_unscaled_df, y_test_unscaled_df, y_pred_df, title): 
    # Package y_pred_unscaled and y_test_unscaled into a dataframe with columns pred and true   
    if type(y_test_unscaled_df) == pd.core.frame.DataFrame:
        df_merge = y_pred_df.join(y_test_unscaled_df, how='left')
    else:
        df_merge = y_pred_df.copy()
    
    # Merge the dataframes 
    df_merge_ = pd.concat([x_test_unscaled_df, df_merge]).reset_index(drop=True)
    
    # Plot the linecharts
    fig, ax = plt.subplots(figsize=(20, 8))
    plt.title(title, fontsize=12)
    ax.set(ylabel = stockname + &quot;_stock_price_quotes&quot;)
    sns.lineplot(data = df_merge_, linewidth=2.0, ax=ax)

# Creates a linechart for a specific test batch_number and corresponding test predictions
batch_number = 50
x_test_unscaled_df, y_pred_df, y_test_unscaled_df = prepare_df(batch_number, x_test, y_test, y_pred)
title = f&quot;Predictions vs y_test - test batch number {batch_number}&quot;
plot_multi_test_forecast(x_test_unscaled_df, y_test_unscaled_df, y_pred_df, title) </pre></div>



<figure class="wp-block-image size-large is-style-default"><img decoding="async" width="1024" height="419" data-attachment-id="8667" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/multioutput-regression-deep-neural-networks-test-predictions/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png" data-orig-size="1170,479" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multioutput-regression-deep-neural-networks-test-predictions" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png" src="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions-1024x419.png" alt="A line chart showing predictions and test data, multioutput regression deep neural networks test predictions" class="wp-image-8667" srcset="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-test-predictions.png 1170w" sizes="(max-width: 1024px) 100vw, 1024px" 
/></figure>



<p>The quality of the predictions is acceptable, considering that this tutorial aims not at achieving excellent predictions but at demonstrating the process and architecture of training a multi-output regression model. So there is certainly room for improvement. Feel free to experiment with different features, hyperparameters, and neural network layers.</p>
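<p>One way to judge whether a reported MAPE is meaningful is to compare it against a naive persistence baseline that simply repeats the last observed Close price across the forecast horizon. The sketch below uses toy numbers and a hypothetical <code>mape</code> helper (not part of the original tutorial); the model only adds value if it beats this baseline:</p>

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Toy test windows: each row is a 3-step-ahead target sequence
y_test_unscaled = np.array([[100.0, 102.0, 101.0],
                            [102.0, 103.0, 105.0]])
last_observed = np.array([[99.0], [101.0]])  # last Close before each window

# Persistence baseline: repeat the last observed value across the horizon
y_naive = np.repeat(last_observed, y_test_unscaled.shape[1], axis=1)

print(round(mape(y_test_unscaled, y_naive), 2))  # → 2.11
```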



<h3 class="wp-block-heading" id="h-step-6-create-a-new-forecast">Step #6 Create a New Forecast</h3>



<p>Finally, let&#8217;s create a forecast on a new dataset. We take the scaled dataset from the preprocessing step (<code>np_scaled</code>) and extract a series with the latest 50 values. The data is reshaped into a 3D array with shape (1, 50, 5) to match the expected input shape of the model. We feed these values to the <code>predict</code> method to generate a forecast for the next ten days and store the result in the <code>y_pred_scaled</code> variable. We then transform the predictions back to the original scale using the <code>inverse_transform</code> method of the <code>scaler_pred</code> object, which was fit on the single Close column. Finally, we visualize the multi-step forecast in another line chart.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Get the latest input batch from the dataset: the price values of the last 50 trading days
x_test_latest_batch = np_scaled[-50:,:].reshape(1,50,5)

# Predict on the batch
y_pred_scaled = model.predict(x_test_latest_batch)
y_pred_unscaled = scaler_pred.inverse_transform(y_pred_scaled)

# Prepare the data and plot the input data and the predictions
x_test_unscaled_df, y_pred_df, _ = prepare_df(0, x_test_latest_batch, '', y_pred_unscaled)
plot_multi_test_forecast(x_test_unscaled_df, '', y_pred_df, &quot;x_new Vs. y_new_pred&quot;)</pre></div>



<figure class="wp-block-image size-large is-style-default"><img decoding="async" width="1024" height="420" data-attachment-id="8666" data-permalink="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/multioutput-regression-deep-neural-networks/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png" data-orig-size="1167,479" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="multioutput-regression-deep-neural-networks" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png" src="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks-1024x420.png" alt="multioutput regression deep neural networks - x_text and y_pred" class="wp-image-8666" srcset="https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/multioutput-regression-deep-neural-networks.png 1167w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p></p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>In this tutorial, we demonstrated how to use multi-output neural networks to predict several future time steps at once. We first discussed the architecture of a recurrent neural network and how it processes sequential data. We then showed how to preprocess the data and split it into training and test sets for training a multi-output regression model.</p>



<p>Next, we trained a model to predict the stock price of Apple ten steps into the future using historical data. We also discussed how to use the trained model to make multi-step predictions on new data and how to visualize the results.</p>
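The prediction step described above can be sketched as follows. This is a simplified, hypothetical example rather than the tutorial's exact code: the window length, horizon, and the stand-in price series are assumptions, and the fabricated <code>y_pred_scaled</code> array stands in for the output of <code>model.predict</code>.

```python
# Hypothetical sketch of the multi-step prediction workflow: scale the most
# recent window, predict, then map the scaled forecast back to price units.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

sequence_length, horizon = 50, 10                   # assumed values
prices = np.linspace(100, 150, 200).reshape(-1, 1)  # stand-in for real quotes

scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)

# Most recent window, shaped (1, timesteps, features) for the network
x_latest = scaled[-sequence_length:].reshape(1, sequence_length, 1)

# A trained model would return shape (1, horizon) from model.predict(x_latest);
# we fake that output here so the sketch runs on its own
y_pred_scaled = np.full((1, horizon), scaled[-1, 0])

# Undo the scaling to get a forecast in the original price units
y_pred = scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()
print(y_pred.shape)  # (10,)
```

The key detail is that the scaler fitted on the training data must also be used to inverse-transform the predictions; otherwise the forecast stays in the scaled range.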



<p>To further improve the performance of the model, you can experiment with different hyperparameters and adjust the model architecture. For example, adding more neurons to the output layer widens the prediction horizon, but remember that prediction error also grows as the horizon lengthens. You can also try different activation functions or add more layers to the model and observe how this affects performance.</p>
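To illustrate the relationship between the output layer and the prediction horizon, here is a minimal Keras sketch (not the tutorial's exact architecture; the layer sizes and window length are assumptions): the number of units in the final Dense layer directly sets how many future steps the model predicts in one pass.

```python
# Minimal sketch: the final Dense layer's width equals the forecast horizon.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

sequence_length = 50   # input window of past time steps (assumption)
n_features = 1         # univariate price series
forecast_horizon = 10  # widen this layer to predict further into the future

model = keras.Sequential([
    layers.LSTM(64, input_shape=(sequence_length, n_features)),
    layers.Dense(forecast_horizon),  # one output neuron per forecasted step
])
model.compile(optimizer="adam", loss="mse")

# Each prediction is a vector of `forecast_horizon` future values
x_dummy = np.random.rand(4, sequence_length, n_features).astype("float32")
print(model.predict(x_dummy).shape)  # (4, 10)
```

Changing <code>forecast_horizon</code> is all it takes to retrain for a longer or shorter outlook, though the targets in the training set must be reshaped to match.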



<p>I hope this article helped you better understand multi-output neural networks. If you have any questions or comments, please let me know.</p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3MyU6Tj" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2018) Neural Networks and Deep Learning</a></li>



<li><a href="https://amzn.to/3yIQdWi" target="_blank" rel="noreferrer noopener">Stefan Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python</a></li>



<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p>If you want to learn about an alternative approach to univariate stock market forecasting, consider taking a look at <a href="https://www.relataly.com/time-series-forecasting-using-facebook-prophet-in-python/10351/" target="_blank" rel="noreferrer noopener">Facebook Prophet</a> or <a href="https://www.relataly.com/forecasting-beer-sales-with-arima-in-python/2884/" target="_blank" rel="noreferrer noopener">ARIMA models</a>.</p>
<p>The post <a href="https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/">Stock Market Forecasting Neural Networks for Multi-Output Regression in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/stock-price-prediction-multi-output-regression-using-neural-networks-in-python/5800/feed/</wfw:commentRss>
			<slash:comments>31</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5800</post-id>	</item>
	</channel>
</rss>
