<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Gradient Boosting Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Thu, 13 Jul 2023 12:10:12 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Gradient Boosting Archives - relataly.com</title>
	<link>https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python</title>
		<link>https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/</link>
					<comments>https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 08 Jan 2023 20:34:44 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Cross-Validation]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Exploratory Data Analysis (EDA)]]></category>
		<category><![CDATA[Gradient Boosting]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Manufacturing]]></category>
		<category><![CDATA[Plotly]]></category>
		<category><![CDATA[Predictive Maintenance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Yahoo Finance API]]></category>
		<category><![CDATA[AI in Manufacturing]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=10618</guid>

					<description><![CDATA[<p>Predictive maintenance is a game-changer for the modern industry. Still, it is based on a simple idea: By using machine learning algorithms, businesses can predict equipment failures before they happen. This approach can help businesses improve their operations by reducing the need for reactive, unplanned maintenance and by enabling them to schedule maintenance activities during ... <a title="Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python" class="read-more" href="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/" aria-label="Read more about Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/">Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Predictive maintenance is a game-changer for the modern industry. Still, it is based on a simple idea: By using machine learning algorithms, businesses can predict equipment failures before they happen. This approach can help businesses improve their operations by reducing the need for reactive, unplanned maintenance and by enabling them to schedule maintenance activities during planned downtime. In this article, we&#8217;ll explore the use of machine learning algorithms to predict machine failures using the robust XGBoost algorithm in Python. By the end of this tutorial, you&#8217;ll have the knowledge and skills to start implementing predictive maintenance in your organization. So, let&#8217;s get started!</p>



<p class="wp-block-paragraph">We begin by discussing the concept of predictive maintenance and show different ways to implement it. Then we will turn to the coding part in python and implement the prediction model based on machine sensor data. We train a classification model that predicts different types of machine failure using XGBoost.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="509" height="467" data-attachment-id="12909" data-permalink="https://www.relataly.com/robot-factory-machine-learning-predictive-maintenance-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png" data-orig-size="509,467" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="robot factory machine learning predictive maintenance-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png" alt="Predictive maintenance is a game-changer for the modern industry. Image generated with Midjourney." class="wp-image-12909" srcset="https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png 509w, https://www.relataly.com/wp-content/uploads/2023/03/robot-factory-machine-learning-predictive-maintenance-min.png 300w" sizes="(max-width: 509px) 100vw, 509px" /><figcaption class="wp-element-caption">Predictive maintenance is a game-changer for the modern industry. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Predictive Maintenance?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Predictive maintenance is a data-driven approach that uses predictive modeling to assess the state of equipment and determine the optimal timing for maintenance activities. This technique is particularly beneficial in industries that heavily rely on equipment for their operations, such as manufacturing, transportation, energy, and healthcare. Depending on the requirements and challenges of an organization, predictive maintenance may contribute to one or several of the following goals:</p>



<ul class="wp-block-list">
<li><strong>Improve equipment reliability</strong>: By proactively identifying and addressing potential problems with equipment, predictive maintenance can help improve the reliability of the equipment, reducing the risk of unexpected downtime or failure.</li>



<li><strong>Increase efficiency</strong>: Predictive maintenance can help improve the efficiency of equipment by identifying and fixing problems before they cause equipment failure or downtime. This can help reduce maintenance costs and increase productivity.</li>



<li><strong>Improve safety:</strong> Predictive maintenance can help improve safety by identifying and addressing potential problems with equipment before they occur. This can help prevent accidents and injuries caused by equipment failure.</li>



<li><strong>Reduce maintenance costs</strong>: By proactively identifying and fixing potential problems with equipment, predictive maintenance can help reduce the overall cost of maintenance by minimizing the need for unscheduled downtime.</li>



<li><strong>Improve asset management</strong>: Predictive maintenance can help improve asset management by providing data and insights into the condition and performance of equipment. This can help organizations decide when to replace or upgrade equipment.</li>
</ul>



<p class="wp-block-paragraph">Next, we look at the different ways organizations can implement predictive maintenance.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="511" height="510" data-attachment-id="12380" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/monitoring-predictive-maintenance-safety-manufacturing-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png" data-orig-size="511,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="monitoring-predictive-maintenance-safety-manufacturing-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png" alt="" class="wp-image-12380" srcset="https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png 511w, https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/monitoring-predictive-maintenance-safety-manufacturing-min.png 140w" sizes="(max-width: 511px) 100vw, 511px" /><figcaption class="wp-element-caption">Utilities and manufacturing are only two of the many industries that use predictive maintenance. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>



<p class="wp-block-paragraph"></p>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading">Approaches to Predictive Maintenance</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">There are several approaches to implementing a predictive maintenance solution, depending on the type of equipment being monitored and the resources available. These approaches include:</p>



<ul class="wp-block-list">
<li><strong>Condition-based monitoring:</strong> This involves continuously monitoring the condition of the equipment using sensors. When certain thresholds or conditions are met, an alert is triggered, or corrective measures are launched. The goal is to reduce the risk of failure. For example, if the temperature of a motor exceeds a certain level, this may indicate that the motor is about to fail.</li>



<li><strong>Predictive modeling:</strong> This approach involves using machine learning algorithms to analyze historical lifetime data about the equipment to identify patterns that may indicate an impending failure. This can be done using data from sensors, as well as operational data and maintenance records. When historical or failure data is not available, a degradation model can be created to estimate failure times based on a threshold value. This approach is often used when there is limited data available.</li>



<li><strong>Prognostic algorithms: </strong>By using data from sensors and other sources, prognostic algorithms can predict the remaining useful life of a piece of equipment. This information can help organizations determine the likelihood of a breakdown and plan for replacements or maintenance activities. By understanding the equipment better, organizations can potentially extend maintenance cycles, which can reduce costs for replacements and maintenance.</li>
</ul>



<p class="wp-block-paragraph">It is important to choose an approach that is appropriate for the specific equipment and maintenance challenges faced by the organization. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Data Requirements</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">When implementing predictive maintenance, it is important to consider that each approach comes with its own set of data requirements. Types of data include the following:</p>



<ul class="wp-block-list">
<li><strong>Current condition data</strong> includes information about the state of the equipment, such as its temperature, pressure, vibration, and other physical parameters.</li>



<li><strong>Operating data </strong>includes information about how the equipment is being used, such as its load, speed, and other operating parameters.</li>



<li><strong>Maintenance history data</strong> includes information about past maintenance activities that have been performed on the equipment.</li>



<li><strong>Failure history data</strong> includes information about past equipment failures, such as the date of the failure, the cause of the failure, and the impact on operations.</li>
</ul>



<p class="wp-block-paragraph">Collecting these data requires investing in sensors and other data collection infrastructure and ensuring that data collection is accurate and storage is proper. By combining various data types, organizations can create a comprehensive view of equipment condition and performance and use it to predict maintenance requirements.</p>



<p class="wp-block-paragraph">The specific types of data needed will depend on the implementation approach. Organizations must ensure they have access to the necessary data to implement the selected approach effectively. Some specific data requirements for each approach include the following:</p>



<figure class="wp-block-table"><table><thead><tr><th>Approach</th><th>Data Requirements</th></tr></thead><tbody><tr><td>Condition-based monitoring</td><td>Sensor data from the equipment being monitored. </td></tr><tr><td>Predictive modeling</td><td>A combination of sensor data, operational data, and maintenance records. </td></tr><tr><td>Prognostic algorithms</td><td>Sensor data, as well as data about past failures and maintenance events. </td></tr></tbody></table><figcaption class="wp-element-caption">Data requirements per implementation approach</figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="1018" height="856" data-attachment-id="12379" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png" data-orig-size="1018,856" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png" alt="Predictive maintenance - Machine learning can make maintenance cycles more cost-efficient. Image generated using Midjourney" class="wp-image-12379" srcset="https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png 1018w, https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/jejimga_a_factory_using_technology_for_safety_efficiency_qualit_5daef8a5-5ab0-49d2-9821-4588049635a2-min.png 768w" sizes="(max-width: 1018px) 100vw, 1018px" /><figcaption class="wp-element-caption">Predictive maintenance &#8211; Machine learning can make maintenance cycles more cost-efficient. Image generated using&nbsp;<a href="http://www.Midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Predicting Failures in Milling Machines using XGBoost in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Now that we have a basic understanding of predictive maintenance, it&#8217;s time to get hands-on with Python. We will use sensor data and machine learning to predict failures in milling machines. But why do these machines break down in the first place? Milling machines have many moving parts that can suffer from wear and tear over time, leading to failures. Additionally, improper maintenance can cause issues with machine operation and lead to costly damage. Efficient maintenance can be challenging due to the varying loads that milling machines are subjected to. However, by implementing a predictive maintenance solution with Python, we can proactively identify and address issues to prevent costly downtime and ensure the smooth operation of our milling machines. Our goal is to predict one of five failure types, which corresponds to a predictive modeling approach. Let&#8217;s get started on building our predictive maintenance solution.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_c6038a-a9"><a class="kb-button kt-button button kb-btn_614436-7b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/022%20Predicting%20Machine%20Malfunction%20of%20Milling%20Machines%20in%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_a119d2-89 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="12384" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/cnc_milling_machine_cyberpunk/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png" data-orig-size="253,253" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cnc_milling_machine_cyberpunk" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png" src="https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png" alt="Image of a CNC milling machine. Image created with Midjourney" class="wp-image-12384" width="375" height="375" srcset="https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png 253w, https://www.relataly.com/wp-content/uploads/2023/02/cnc_milling_machine_cyberpunk.png 140w" sizes="(max-width: 375px) 100vw, 375px" /><figcaption class="wp-element-caption">Image of a CNC milling machine. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. </p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph"><strong>Python Environment</strong></p>



<p class="wp-block-paragraph">Before diving into the FairLearn Python tutorial, it is important to take the necessary steps to ensure that your Python environment is properly set up and that you have all the required packages installed. This will ensure a seamless learning experience and prevent any potential roadblocks or issues that may arise due to an improperly configured environment.</p>



<p class="wp-block-paragraph">If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph"><strong>Python Packages</strong></p>



<p class="wp-block-paragraph">Make sure you install all required packages. In this tutorial, we will be working with the following packages:&nbsp;</p>



<ul class="wp-block-list">
<li>Pandas</li>



<li>NumPy</li>



<li>Matplotlib</li>



<li>Seaborn</li>



<li>Plotly</li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the machine learning library <strong><em>Scikit-learn</em></strong> and the XGBoost library, which is a popular library for training gradient-boosting models.</p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">pip install &lt;package name&gt;
conda install &lt;package name&gt; (if you are using the anaconda packet manager)</pre></div>
</div>
</div>



<h3 class="wp-block-heading">About the Sensor Dataset</h3>



<p class="wp-block-paragraph">In this tutorial, we will work with a synthetic sensor dataset from the <a href="https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset" target="_blank" rel="noreferrer noopener">UCL ML archives</a> that simulates the typical life cycle of a milling machine. The dataset contains the following fields:</p>



<p class="wp-block-paragraph">The dataset consists of 10 000 data points stored as rows with 14 features in columns:</p>



<ul class="wp-block-list">
<li>UID: unique identifier ranging from 1 to 10000</li>



<li>productID: consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number</li>



<li>air temperature [K]</li>



<li>process temperature [K]</li>



<li>rotational speed [rpm]</li>



<li>torque [Nm]</li>



<li>tool wear [min]</li>



<li>machine failure. A label that indicates whether the machine has failed or not</li>



<li>Failure type (prediction label). The label contains five failure types: tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), random failures (RNF)</li>
</ul>



<p class="wp-block-paragraph">Source: <a href="https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset" target="_blank" rel="noreferrer noopener">UCL ML Repository</a></p>



<p class="wp-block-paragraph">You can download the dataset from <a href="https://www.kaggle.com/code/potongpasir/predicting-machine-malfunction/data" target="_blank" rel="noreferrer noopener">Kaggle.com</a>. Unzip the file predictive_maintenance.csv and save it under the following file path: &#8220;/data/iot/classification/&#8221;</p>



<h3 class="wp-block-heading">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by importing the required libraries. This also includes the XGBoost library, which is a popular library for training gradient-boosting models. In addition, we will load the dataset using the pandas library. Then we define our target variable as Failure Type. The dataset contains a second target column, which only contains the binary information of machine failures. We will drop this column, as our goal is to predict the specific type of failure. Then we print the first three rows of the loaded dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com
# Tested with Python 3.9.13, Matplotlib 3.6.2, Scikit-learn 1.2, Seaborn 0.12.1, numpy 1.21.5, xgboost 1.7.2

import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np
import seaborn as sns
import plotly.express as px
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support as score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from sklearn.utils import compute_sample_weight
from xgboost import XGBClassifier

# load the train data
path = '/data/iot/classification/'
df = pd.read_csv(path + &quot;predictive_maintenance.csv&quot;) 

# define the target
target_name='Failure Type'

# drop a redundant columns
df.drop(columns=['Target'], inplace=True)

# print a summary of the train data
print(df.shape[0])
df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	UDI	Product ID	Type	Air temperature [K]	Process temperature [K]	Rotational speed [rpm]	Torque [Nm]	Tool wear [min]	Failure Type
0	1	M14860		M		298.1				308.6				1551						42.8		0				No Failure
1	2	L47181		L		298.2				308.7				1408						46.3		3				No Failure
2	3	L47182		L		298.1				308.5				1498						49.4		5				No Failure</pre></div>



<h3 class="wp-block-heading">Step #2 Clean the Data</h3>



<p class="wp-block-paragraph">Next, we quickly check the data quality of our dataset. The following code block checks if there are any missing values in our dataset. If there are missing values, it creates a barplot showing the number of missing values for each column, along with the percentage of missing values. If there are no missing values, it prints a message saying &#8220;no missing values.&#8221;</p>



<p class="wp-block-paragraph">The function then drops any columns with more than 5% missing values from the DataFrame. Finally, it prints the names of the remaining columns in the DataFrame. This function can be used to identify and handle missing values in a dataset before applying machine learning algorithms to it.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check for missing values
def print_missing_values(df):
    null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
    fig = plt.subplots(figsize=(16, 6))
    ax = sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue')
    pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
    ax.set_title('Overview of missing values')
    ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)

if df.isna().sum().sum() &gt; 0:
    print_missing_values(df)
else:
    print('no missing values')

# drop all columns with more than 5% missing values
for col_name in df.columns:
    if df[col_name].isna().sum()/df.shape[0] &gt; 0.05:
        df.drop(columns=[col_name], inplace=True) 

df.columns</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">no missing values
Index(['UDI', 'Product ID', 'Type', 'Air temperature [K]',
       'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
       'Tool wear [min]', 'Failure Type'],
      dtype='object')</pre></div>



<p class="wp-block-paragraph">Next, we will drop two unnecessary columns and rename the remaining ones to make them easier to work with. The original column names are quite long and contain special characters that could cause errors during the training process. Once the columns are renamed, we will print the updated DataFrame to verify the changes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># drop id columns
df_base = df.drop(columns=['Product ID', 'UDI'])

# adjust column names
df_base.rename(columns={'Air temperature [K]': 'air_temperature', 
                        'Process temperature [K]': 'process_temperature', 
                        'Rotational speed [rpm]':'rotational_speed', 
                        'Torque [Nm]': 'torque', 
                        'Tool wear [min]': 'tool_wear'}, inplace=True)
df_base.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Type	air_temperature	process_temperature	rotational_speed	torque	tool_wear	Failure Type
0	M		298.1			308.6				1551				42.8	0			No Failure
1	L		298.2			308.7				1408				46.3	3			No Failure
2	L		298.1			308.5				1498				49.4	5			No Failure
3	L		298.2			308.6				1433				39.5	7			No Failure
4	L		298.2			308.7				1408				40.0	9			No Failure</pre></div>



<p class="wp-block-paragraph">Everything looks as expected: Our dataset contains six features and the target column with the five failure types.</p>



<h3 class="wp-block-heading" id="h-step-3-explore-the-data">Step #3 Explore the Data</h3>



<p class="wp-block-paragraph">Next, let&#8217;s explore the dataset. </p>



<h4 class="wp-block-heading">Target Class Distribution</h4>



<p class="wp-block-paragraph">The following code uses the plotly express library to create a histogram showing the class distribution of the &#8220;Failure Type&#8221; column in a DataFrame called &#8220;df_base.&#8221; The histogram will have one bar for each unique value in the &#8220;Failure Type&#8221; column, and the height of each bar will represent the number of occurrences of that value in the column. This can be useful for understanding the imbalance in the distribution of classes in a classification problem.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># display class distribution of the target variable
px.histogram(df_base, y=&quot;Failure Type&quot;, color=&quot;Failure Type&quot;) </pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11828" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot.png" data-orig-size="2042,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1024x226.png" alt="Target class distribution in our predictive maintenance dataset" class="wp-image-11828" width="1115" height="246" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 1536w, https://www.relataly.com/wp-content/uploads/2023/01/newplot.png 2042w" sizes="(max-width: 1115px) 100vw, 1115px" /></figure>



<p class="wp-block-paragraph">Our dataset is highly imbalanced, with the vast majority of cases having a &#8220;No Failure&#8221; label. If the dataset is highly imbalanced, with a disproportionate number of cases in one class compared to the others, it can impact the performance of machine learning models. This is because imbalanced datasets can lead to models that are biased towards the majority class, and may not perform well on the minority class. In order to improve model performance on imbalanced datasets, we will later adjust the model hyperparameters accordingly. </p>



<h4 class="wp-block-heading">Feature Pairplots</h4>



<p class="wp-block-paragraph">Next, let&#8217;s construct pair plots to explore feature relations with the target variable. Pair plots, also known as scatter plots, are a type of plot that shows the relationship between two variables. In the context of a predictive maintenance dataset, pair plots can be useful for exploring the relationships between different features and the target variable (e.g., the likelihood of a machine failure). By creating pair plots and visualizing the relationships between different features and the target variable, you can gain insights into which features might be most useful for building a predictive model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># pairplots on failure type
sns.pairplot(df_base, height=2.5, hue='Failure Type')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11829" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/image-3-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/image-3.png" data-orig-size="1476,1226" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/image-3.png" src="https://www.relataly.com/wp-content/uploads/2023/01/image-3-1024x851.png" alt="feature plot for our predictive maintenance dataset" class="wp-image-11829" width="874" height="726" srcset="https://www.relataly.com/wp-content/uploads/2023/01/image-3.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/image-3.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/image-3.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/image-3.png 1476w" sizes="(max-width: 874px) 100vw, 874px" /></figure>



<p class="wp-block-paragraph">The pair plots reveal valuable patterns in our features that can inform the predictions of our model. For instance, we see that Power Failures tend to be correlated with torque values that are either close to the maximum or minimum. Such patterns should allow our predictive model to make solid predictions. </p>



<h4 class="wp-block-heading">Feature Correlation</h4>



<p class="wp-block-paragraph">Next, we will look at feature correlation. The following code block creates a heatmap using the seaborn library that shows the correlation between all pairs of columns in a DataFrame called &#8220;df_base&#8221;. The heatmap is plotted using a color scale, with warmer colors indicating stronger correlations and cooler colors indicating weaker correlations. The correlation values are also displayed in the cells of the heatmap, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). By creating a heatmap, you can quickly see which variables are positively or negatively correlated with each other, and to what degree. This can be helpful for identifying which features might be most useful for building a predictive model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># correlation plot
plt.figure(figsize=(6,4))
sns.heatmap(df_base.corr(), cbar=True, fmt='.1f', vmax=0.8, annot=True, cmap='Blues')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11830" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/image-4-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/image-4.png" data-orig-size="649,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/image-4.png" src="https://www.relataly.com/wp-content/uploads/2023/01/image-4.png" alt="Feature correlation for our predictive maintenance dataset" class="wp-image-11830" width="689" height="537" srcset="https://www.relataly.com/wp-content/uploads/2023/01/image-4.png 649w, https://www.relataly.com/wp-content/uploads/2023/01/image-4.png 300w" sizes="(max-width: 689px) 100vw, 689px" /></figure>



<p class="wp-block-paragraph">From the table, it looks like there is a strong positive correlation between &#8220;air_temperature&#8221; and &#8220;process_temperature&#8221; (0.87). This makes sense since a high process temperature will naturally also heat up the air around the machine. In addition, there is a strong negative correlation between &#8220;rotational_speed&#8221; and &#8220;torque&#8221; (-0.87). The other correlations are weaker and closer to 0, indicating weaker relationships.</p>



<p class="wp-block-paragraph">Understanding the correlations between different variables in a dataset can be helpful for building predictive models, as it can give you an idea of which features might be most important for predicting a given target. It can also help you identify any redundant features that might not add much value to your model. Since our dataset only contains six features, we will keep all of them. </p>



<h4 class="wp-block-heading">Feature Boxplots</h4>



<p class="wp-block-paragraph">Box plots are a useful visualization tool for understanding the distribution of values in a dataset. They show the minimum, first quartile, median, third quartile, and maximum values for each group, as well as any outliers. By creating box plots separated by a categorical variable, you can compare the distributions of values between different groups and see if there are any significant differences. This can be useful for identifying trends or patterns in the data that might be useful for building a predictive model.</p>



<p class="wp-block-paragraph">If there are significant differences between the boxplots for different categories, it could be a good sign for building a predictive model. For example, if the boxplots for one category tend to have higher values for a particular feature than the boxplots for another category, it could indicate that the feature is related to the target variable and could be useful for making predictions.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create histograms for feature columns separated by target column
def create_histogram(column_name):
    plt.figure(figsize=(16,6))
    return px.box(data_frame=df_base, y=column_name, color='Failure Type', points=&quot;all&quot;, width=1200)

create_histogram('air_temperature')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11831" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: air temperature" class="wp-image-11831" width="1078" height="405" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-1.png 1200w" sizes="(max-width: 1078px) 100vw, 1078px" /></figure>



<p class="wp-block-paragraph">Feature boxplot for process_temperature.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">create_histogram('process_temperature')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11832" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-2-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: air temperature" class="wp-image-11832" width="1087" height="408" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-2.png 1200w" sizes="(max-width: 1087px) 100vw, 1087px" /><figcaption class="wp-element-caption">Feature boxplot for rotational speed.</figcaption></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">create_histogram('rotational_speed')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11833" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-3-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: rotational speed" class="wp-image-11833" width="1110" height="417" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-3.png 1200w" sizes="(max-width: 1110px) 100vw, 1110px" /></figure>



<p class="wp-block-paragraph">Feature boxplot for torque.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">create_histogram('torque')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11834" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-4-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: torque" class="wp-image-11834" width="1082" height="406" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-4.png 1200w" sizes="(max-width: 1082px) 100vw, 1082px" /></figure>



<p class="wp-block-paragraph">Feature boxplot for tool wear.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">create_histogram('tool_wear')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11835" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png" data-orig-size="1200,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-5-1024x384.png" alt="feature boxplot for different failure types in predictive maintenance dataset. feature: tool wear" class="wp-image-11835" width="1097" height="411" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png 1024w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png 768w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-5.png 1200w" sizes="(max-width: 1097px) 100vw, 1097px" /></figure>



<p class="wp-block-paragraph">Now that we have a good understanding of our dataset, we can prepare the data for model training. </p>



<h3 class="wp-block-heading" id="h-step-4-data-preparation">Step #4 Data Preparation</h3>



<p class="wp-block-paragraph">To prepare the data for model training, we will need to split our dataset and make additional modifications. </p>



<p class="wp-block-paragraph">The following code block contains a reusable function called data_preparation. The purpose of this function is to prepare the data in a way that is suitable for building and evaluating machine learning models. It performs several preprocessing steps, such as encoding categorical variables and splitting the data into training and test sets. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def data_preparation(df_base, target_name):
    df = df_base.dropna()

    df['target_name_encoded'] = df[target_name].replace({'No Failure': 0, 'Power Failure': 1, 'Tool Wear Failure': 2, 'Overstrain Failure': 3, 'Random Failures': 4, 'Heat Dissipation Failure': 5})
    df['Type'].replace({'L': 0, 'M': 1, 'H': 2}, inplace=True)
    X = df.drop(columns=[target_name, 'target_name_encoded'])
    y = df['target_name_encoded'] #Prediction label

    # split the data into x_train and y_train data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

    # print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test

# remove target from training data
X, y, X_train, X_test, y_train, y_test = data_preparation(df_base, target_name)</pre></div>



<h3 class="wp-block-heading" id="h-step-5-model-training">Step #5 Model Training</h3>



<p class="wp-block-paragraph">Now that we have prepared the dataset, we can train the XGBoost classification model. The basic idea behind XGBoost is to train a series of weak models, such as decision trees, and then combine their predictions using gradient boosting. During training, XGBoost uses an optimization algorithm to adjust the weight of each model in the ensemble in order to improve the overall prediction accuracy. XGBoost also includes a number of additional features and techniques that help to improve the performance of the model, such as regularization, feature selection, and handling missing values.</p>



<p class="wp-block-paragraph">XGboost provides several configuration options that we can use to finetune performance and adjust the training process to our dataset. For a complete list of hyperparameters, please see the <a href="https://xgboost.readthedocs.io/en/stable/python/index.html" target="_blank" rel="noreferrer noopener">library documentation</a>.</p>



<p class="wp-block-paragraph">Remember that our class labels are imbalanced. Therefore, we will provide the model with sample weights. The following code creates a weight array for the training and test sets using the &#8220;compute_sample_weight&#8221; function from scikit-learn. We calculate the weight array based on the &#8220;balanced&#8221; mode. This means that the weights are calculated such that the class distribution in the sample is balanced. This can be useful when working with imbalanced datasets, as it helps to mitigate the effects of class imbalance on the model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">weight_train = compute_sample_weight('balanced', y_train)
weight_test = compute_sample_weight('balanced', y_test)

xgb_clf = XGBClassifier(booster='gbtree', 
                        tree_method='gpu_hist', 
                        sampling_method='gradient_based', 
                        eval_metric='aucpr', 
                        objective='multi:softmax', 
                        num_class=6)
# fit the model to the data
xgb_clf.fit(X_train, y_train.ravel(), sample_weight=weight_train)</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="842" height="270" data-attachment-id="11836" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/image-5-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/image-5.png" data-orig-size="842,270" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/image-5.png" src="https://www.relataly.com/wp-content/uploads/2023/01/image-5.png" alt="summary of our XGBoost classifier of our predictive maintenance solution" class="wp-image-11836" srcset="https://www.relataly.com/wp-content/uploads/2023/01/image-5.png 842w, https://www.relataly.com/wp-content/uploads/2023/01/image-5.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/image-5.png 768w" sizes="(max-width: 842px) 100vw, 842px" /></figure>



<p class="wp-block-paragraph">We can see that the blue box summarizes the configuration of our model and indicates that the training process has been successful. Now that we have the classifier, we can use it to make predictions on new data.</p>



<h3 class="wp-block-heading" id="h-step-6-model-evaluation">Step #6 Model Evaluation</h3>



<p class="wp-block-paragraph">Finally, we will evaluate the model&#8217;s performance. This will involve three steps:</p>



<ul class="wp-block-list">
<li>Model scoring</li>



<li>Cross-validation</li>



<li>Confusion matrix</li>
</ul>



<h4 class="wp-block-heading">Model Scoring</h4>



<p class="wp-block-paragraph">First, we calculate the accuracy of the classifier on the test set using the &#8220;score&#8221; method. To account for the imbalance of class labels, we pass in the weight array for the test set as an additional parameter. This returns the fraction of correct predictions made by the classifier. Next, the code uses the classifier to make predictions on the test set using the &#8220;predict&#8221; method. It then generates a classification report using the &#8220;classification_report&#8221; function from scikit-learn. The report displays a summary of the model&#8217;s performance in terms of various evaluation metrics such as precision, recall, and f1-score.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># score the model with the test dataset
score = xgb_clf.score(X_test, y_test.ravel(), sample_weight=weight_test)

# predict on the test dataset
y_pred = xgb_clf.predict(X_test)

# print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">precision    recall  f1-score   support

           0       0.99      0.98      0.99      2903
           1       0.64      0.88      0.74        24
           2       0.04      0.08      0.06        12
           3       0.77      0.89      0.83        27
           4       0.00      0.00      0.00         4
           5       0.76      0.97      0.85        30

    accuracy                           0.98      3000
   macro avg       0.53      0.63      0.58      3000
weighted avg       0.98      0.98      0.98      3000</pre></div>



<p class="wp-block-paragraph">The classification report shows the performance of our XGBoost classifier on the test dataset. The model appears to perform well, with a high accuracy of 0.98 and a high weighted average f1-score of 0.98. </p>



<p class="wp-block-paragraph">However, there are a few classes where the model&#8217;s performance is not as strong. Class 1 has a relatively low precision of 0.64 and a low f1-score of 0.74, while class 2 has a very low precision of 0.04 and a low f1-score of 0.06. Class 4 has a precision and f1-score of 0.00, which suggests that the model is not making any correct predictions for this class.</p>



<p class="wp-block-paragraph">It is also worth noting that the support for some classes is much lower than for others. Class 1 has a support of 24, while class 0 has a support of 2903. This is due to the fact that there are relatively few instances of class 1 in the test dataset compared to class 0, which affects the model&#8217;s performance on class 1.</p>



<h4 class="wp-block-heading">Confusion Matrix</h4>



<p class="wp-block-paragraph">Next, we create a confusion matrix. We input the true labels of the test set (y_test) and the predicted labels produced by the model (y_pred) to generate the matrix. The matrix shows us the number of correct and incorrect predictions made by the model for each class.</p>



<p class="wp-block-paragraph">We then create a DataFrame from the confusion matrix and use the seaborn library to visualize the matrix as a heatmap. The heatmap allows us to easily see which classes are being predicted correctly and which are being misclassified. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create predictions on the test dataset
y_pred = xgb_clf.predict(X_test)

# print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index=np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (8, 5))
sns.set(font_scale=1.1) #for label size
sns.heatmap(df_cm, cbar=True, cmap= &quot;inferno&quot;, annot=True, fmt='.0f') </pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11837" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/image-6-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/image-6.png" data-orig-size="668,456" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/image-6.png" src="https://www.relataly.com/wp-content/uploads/2023/01/image-6.png" alt="Evaluating the performance of our predictive maintenance solution using a confusion matrix" class="wp-image-11837" width="663" height="452" srcset="https://www.relataly.com/wp-content/uploads/2023/01/image-6.png 668w, https://www.relataly.com/wp-content/uploads/2023/01/image-6.png 300w" sizes="(max-width: 663px) 100vw, 663px" /></figure>



<p class="wp-block-paragraph">The color scale of the heatmap indicates the magnitude of the values in the matrix. In this case, the darker the color, the higher the number of predictions. This visualization helps us to understand the performance of the model and identify areas for improvement. </p>



<p class="wp-block-paragraph">Here are a few things that we can learn from this matrix:</p>



<ul class="wp-block-list">
<li>The model made a total of 2902 correct predictions and 67 incorrect predictions.</li>



<li>For the &#8220;No Failure&#8221; class, the model made 2854 correct predictions and 29 incorrect predictions. The majority of the incorrect predictions were false negatives.</li>



<li>For the &#8220;Power Failure&#8221; class, the model made 21 correct predictions and three incorrect predictions. </li>



<li>For the &#8220;Tool Wear Failure&#8221; class, the model made 1 correct prediction and 1 incorrect prediction. </li>



<li>For the &#8220;Overstrain Failure&#8221; class, the model made 24 correct predictions and 2 incorrect predictions. </li>



<li>For the &#8220;Random Failures&#8221; class, the model made 29 correct predictions and 4 incorrect predictions. </li>



<li>For the &#8220;Heat Dissipation Failure&#8221; class, the model made 29 correct predictions and 1 incorrect prediction. </li>
</ul>



<p class="wp-block-paragraph">Overall, the model seems to be performing relatively well, but it is making a lot of false negatives for some classes. </p>



<h4 class="wp-block-heading">Cross Validation</h4>



<p class="wp-block-paragraph">Finally, we perform cross-validation on the training set using the &#8220;cross_validate&#8221; function from scikit-learn. Cross-validation is a technique for evaluating the performance of a machine learning model by training it on different subsets of the data and evaluating it on the remaining data. </p>



<p class="wp-block-paragraph">In this case, we will train and evaluate our model 10 times using different splits of the data (specified by the &#8220;cv&#8221; parameter). We also specify that the evaluation metric should be the weighted f1-score (specified by the &#8220;scoring&#8221; parameter). We then pass the weight array for the training set to the classifier.</p>



<p class="wp-block-paragraph">The &#8220;cross_validate&#8221; function returns a dictionary containing various evaluation metrics for each fold of the cross-validation. We will convert the dictionary to a DataFrame and create a bar plot using the plotly express library to visualize the results. This helps us to understand the consistency and stability of the model&#8217;s performance.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># cross validation
scores  = cross_validate(xgb_clf, X_train, y_train, cv=10, scoring=&quot;f1_weighted&quot;, fit_params={ &quot;sample_weight&quot; :weight_train})
scores_df = pd.DataFrame(scores)
px.bar(x=scores_df.index, y=scores_df.test_score, width=800)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11838" data-permalink="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/newplot-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png" data-orig-size="800,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="newplot-6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png" src="https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png" alt="Evaluation the performance of our predictive maintenance solution. cross validation scores for the XGBoost model. " class="wp-image-11838" width="644" height="362" srcset="https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png 800w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/newplot-6.png 768w" sizes="(max-width: 644px) 100vw, 644px" /></figure>



<p class="wp-block-paragraph">The model performance remains consistent across all folds. </p>



<h2 class="wp-block-heading">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this article, we have presented the concept of predictive maintenance and demonstrated how organizations can use this approach to improve their maintenance cycles. The second part of the article provided a hands-on tutorial showing how to implement a predictive maintenance solution for predicting different failure types of a milling machine. We trained a classification model using the XGBoost algorithm and sensor data from the machine. </p>



<p class="wp-block-paragraph">While the model demonstrated good performance overall, we observed that it was not able to predict all classes with the same level of accuracy. This suggests that there may be opportunities to improve the model&#8217;s performance. One potential approach is to balance the dataset by up or down-sampling the data to achieve a more even distribution of classes. By doing so, we can mitigate the effects of class imbalance and potentially improve the model&#8217;s predictions for all classes.</p>



<p class="wp-block-paragraph">By implementing such a predictive maintenance approach, organizations can improve their operational efficiency and ensure the smooth running of their machinery.</p>



<p class="wp-block-paragraph">I hope this article was helpful. If you have any questions or feedback, let me know in the comments. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="497" height="493" data-attachment-id="12901" data-permalink="https://www.relataly.com/smart-factory-iot-sensors-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png" data-orig-size="497,493" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="smart factory iot sensors relataly midjourney-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png" alt="" class="wp-image-12901" srcset="https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png 497w, https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/smart-factory-iot-sensors-relataly-midjourney-min.png 140w" sizes="(max-width: 497px) 100vw, 497px" /><figcaption class="wp-element-caption">Predictive maintenance also plays an essential role in a smart factory. Image created with Midjourney. </figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">There are many books available on the topics of IoT and predictive maintenance. Here are a few recommendations:</p>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3XgrX7L" target="_blank" rel="noreferrer noopener">An Introduction to Predictive Maintenance</a> by R Keith Mobley</li>



<li><a href="https://amzn.to/3CzYL3A" target="_blank" rel="noreferrer noopener">Predictive Analytics: The Secret to Predicting Future Events Using Big Data and Data Science Techniques Such as Data Mining, Predictive Modelling, Statistics, Data Analysis, and Machine</a> by Richard Hurley</li>



<li>Stephan Matzka, <a href="https://ieeexplore.ieee.org/document/9253083" target="_blank" rel="noreferrer noopener">Explainable Artificial Intelligence for Predictive Maintenance Applications</a>, Third International Conference on Artificial Intelligence for Industries (AI4I 2020)</li>



<li><a href="https://amzn.to/3TrBdDY" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li>ChatGPT was used to revise certain parts of this article</li>



<li>Images created using Midjourney and OpenAI Dall-E</li>
</ul>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/">Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predictive-maintenance-predicting-machine-failure-with-python/10618/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">10618</post-id>	</item>
		<item>
		<title>Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</title>
		<link>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/</link>
					<comments>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 07 Mar 2021 16:16:19 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Decision Trees]]></category>
		<category><![CDATA[Fighting Crime]]></category>
		<category><![CDATA[Gradient Boosting]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[mplleaflet]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Crime Data]]></category>
		<category><![CDATA[Geographic Maps]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Kaggle]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Smart City]]></category>
		<category><![CDATA[Spatial Data]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2960</guid>

					<description><![CDATA[<p>In this tutorial, we&#8217;ll be using machine learning to predict and map out crime in San Francisco. We&#8217;ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the powerful Scikit-Learn library, we&#8217;ll train a classification model using ... <a title="Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python" class="read-more" href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/" aria-label="Read more about Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/">Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we&#8217;ll be using machine learning to predict and map out crime in San Francisco. We&#8217;ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the powerful Scikit-Learn library, we&#8217;ll train a classification model using the <a href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/" target="_blank" rel="noreferrer noopener">XGboost algorithm</a> to predict 39 types of crimes based on when and where it occurred. We&#8217;ll then use the Plotly library to visualize the results on a map of the city, highlighting areas with higher rates of certain crimes. This type of prediction and mapping is similar to what the San Francisco Police Department uses in their practice of predictive policing, where they allocate resources to at-risk areas in an effort to prevent crime.</p>



<p class="wp-block-paragraph">As we embark on this thrilling journey, we&#8217;ll start by downloading and preprocessing the San Francisco crime data. Next, we&#8217;ll channel the data to train two distinct classification models. The first model will utilize a standard Random Forest Classifier, while the second will leverage the exceptional XGBoost package. We&#8217;ll experiment with various models that boast different hyperparameters. Ultimately, we&#8217;ll visualize our predictions on a striking SF crime map and assess the performance of our diverse models. So, buckle up and let&#8217;s dive into the exhilarating world of crime prediction and mapping!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="508" height="513" data-attachment-id="12478" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" data-orig-size="508,513" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" alt="crime prediction san francisco city map xgboost python tutorial. Image generated using Midjourney. relataly.com" class="wp-image-12478" srcset="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 297w, https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Predictive policing can make police work much more efficient and effective. Image generated using <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Predictive Policing?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The use case we are looking at in this article falls into predictive policing. Predictive policing uses data, algorithms, and other technological tools to predict where and when crimes are likely to occur. The goal of predictive policing is to help law enforcement agencies better allocate their resources and focus their efforts on areas where crime is likely to happen, with the ultimate goal of reducing crime and improving public safety. This approach to policing is based on the idea that by using data and other tools to identify patterns and trends, law enforcement agencies can better anticipate where crimes are likely to occur and take steps to prevent them from happening.</p>



<p class="wp-block-paragraph">The benefits of predictive policing include the ability to allocate law enforcement resources better, the potential to reduce crime and improve public safety, and the ability to identify trends and patterns that may not be immediately obvious to law enforcement officers. Additionally, by using data and other tools to anticipate where crimes are likely to occur, law enforcement agencies can take proactive steps to prevent those crimes from happening, which can save time and money.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-creating-a-crime-map-for-predictive-policing-using-xgboost-in-python">Creating a Crime Map for Predictive Policing using XGBoost in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this practical tutorial, we&#8217;ll construct an XGBoost multi-label classifier to predict crime types in San Francisco. Urban crime, such as in San Francisco, is a dynamic and multifaceted issue that can dramatically vary based on location, time, and other factors. Our aim is to develop a predictive algorithm capable of forecasting specific crime types based on a given location and time parameters. The end product is an interactive San Francisco crime map providing a snapshot of crime hotspots throughout the city.</p>



<p class="wp-block-paragraph">Law enforcement agencies, like the San Francisco Police Department, use similar maps for strategic resource allocation to curb crime rates effectively. Additionally, this SF crime map will underscore crime clusters &#8211; areas notorious for particular types of crime incidents. By the end of this tutorial, you&#8217;ll have a deeper understanding of using machine learning in practical scenarios and aiding real-world decision-making.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_262c97-e5"><a class="kb-button kt-button button kb-btn_e6ce86-27 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/018%20Forecasting%20Criminal%20Activity%20in%20San%20Francisco.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_3b4a4f-fe kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="500" height="487" data-attachment-id="12476" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" data-orig-size="500,487" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" src="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" alt="" class="wp-image-12476" srcset="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png 500w, https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png 300w" sizes="(max-width: 500px) 100vw, 500px" /><figcaption class="wp-element-caption">Crime doesn&#8217;t sleep in San Francisco. That&#8217;s why predictive policing can make a real impact. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before starting the Python coding part, ensure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>



<li>Seaborn</li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using XGBoost (&#8216;xgboost&#8217;) and the machine learning library scikit-learn. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by downloading the San Francisco crime challenge data on kaggle.com. Once you have downloaded the dataset, place the CSV files (train.csv) into your Python working folder.</p>



<p class="wp-block-paragraph">The dataset was collected by the SFO police department between 2003 and 2015. According to the data description from the SF crime challenge, the dataset contains the following variables:</p>



<ul class="wp-block-list">
<li><strong>Dates</strong>: timestamp of the crime incident</li>



<li><strong>Category</strong>: Category of the crime incident (only in train.csv) that we will use as the target variable</li>



<li><strong>Descript</strong>: detailed description of the crime incident&nbsp;(only in train.csv)</li>



<li><strong>DayOfWeek:</strong> the day of the week</li>



<li><strong>PdDistrict</strong>: the name of the Police Department District</li>



<li><strong>Resolution</strong>: how the crime incident was resolved&nbsp;(only in train.csv)</li>



<li><strong>Address</strong>: the approximate street address of the crime incident&nbsp;</li>



<li><strong>X</strong>: Longitude</li>



<li><strong>Y</strong>: Latitude</li>
</ul>



<p class="wp-block-paragraph">The next step is to load the data into a dataframe. Then we use the head() command to print the first five lines and ensure you can see the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import plotly.express as px

# The Data is part of the Kaggle Competition: https://www.kaggle.com/c/sf-crime/data
df_base = pd.read_csv(&quot;data/crime/sf-crime/train.csv&quot;)

print(df_base.describe())
df_base.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		X              Y
count  	878049.000000  878049.000000
mean     -122.422616      37.771020
std         0.030354       0.456893
min      -122.513642      37.707879
25%      -122.432952      37.752427
50%      -122.416420      37.775421
75%      -122.406959      37.784369
max      -120.500000      90.000000</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Dates				Category	Descript			DayOfWeek	PdDistrict	Resolution	Address				X			Y
0	2015-05-13 23:53:00	WARRANTS	WARRANT ARREST		Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
1	2015-05-13 23:53:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
2	2015-05-13 23:33:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	VANNESS AV... ST	-122.424363	37.800414
3	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT...	Wednesday	NORTHERN	NONE		1500 Block... ST	-122.426995	37.800873
4	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT ...	Wednesday	PARK		NONE		100 Block... ST		-122.438738	37.771541</pre></div>



<p class="wp-block-paragraph">If the data was loaded correctly, you should see the first five records of the dataframe, as shown above.</p>



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p class="wp-block-paragraph">At the beginning of a new project, we usually don&#8217;t understand the data well and need to acquire that understanding. Therefore, next, we will explore the data and familiarize ourselves with its characteristics. </p>



<p class="wp-block-paragraph">The following examples will help us better understand our data&#8217;s characteristics. For example, you can use whisker charts and a correlation matrix to understand better the correlation between variables, such as between weekdays and prediction categories. Feel free to create more charts.</p>



<h4 class="wp-block-heading" id="h-2-1-prediction-labels">2.1 Prediction Labels</h4>



<p class="wp-block-paragraph">Running the code below shows a bar plot of the prediction labels. The plot shows the frequency in which the class labels occur in the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># print the value counts of the categories
plt.figure(figsize=(15,5))
ax = sns.countplot(x = df_base['Category'], orient='v', order = df_base['Category'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3167" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-13-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" data-orig-size="910,467" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-13" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" alt="crime types in San Francisco" class="wp-image-3167" width="914" height="470" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 910w, https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 300w, https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 768w" sizes="(max-width: 914px) 100vw, 914px" /></figure>



<p class="wp-block-paragraph">As shown above, our class labels are highly imbalanced, affecting model accuracy. When we evaluate the performance of our model, we need to consider this.</p>



<h4 class="wp-block-heading" id="h-2-2-when-a-crime-occured-considering-dates-and-time">2.2 When a Crime Occured &#8211; Considering Dates and Time</h4>



<p class="wp-block-paragraph">We assume that when a crime occurs impacts the type of crime. For this reason, we look at how crimes distribute across different days of the week and times of the day. First, we look at crime numbers per weekday. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print Crime Counts per Weekday
plt.figure(figsize=(6,3))
ax = sns.countplot(y = df_base['DayOfWeek'], orient='h', order = df_base['DayOfWeek'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3164" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-12-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" data-orig-size="437,238" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" alt="" class="wp-image-3164" width="669" height="364" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png 437w, https://www.relataly.com/wp-content/uploads/2021/04/image-12.png 300w" sizes="(max-width: 669px) 100vw, 669px" /></figure>



<p class="wp-block-paragraph">Fewer crimes happen on Sundays, and most are on Fridays. So it seems that even criminals like to have a weekend. For the sake of clarity, we thereby limit the categories. Let&#8217;s take a look at the time when certain crimes are reported.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Convert the time to minutes
df_base['Hour_Min'] = pd.to_datetime(df_base['Dates']).dt.hour  + pd.to_datetime(df_base['Dates']).dt.minute / 60

# Print Crime Counts per Time and Category
df_base_filtered = df_base[df_base['Category'].isin([
    'PROSTITUTION', 
    'VEHICLE THEFT', 
    'DRUG/NARCOTIC', 
    'WARRENTS', 
    'BURGLERY', 
    'FRAUD', 
    'ASSAULT',
    'LARCENY/THEFT',
    'VANDALISM'])]

plt.figure(figsize=(16,10))
ax = sns.displot(x = 'Hour_Min', hue=&quot;Category&quot;, data = df_base_filtered, kind=&quot;kde&quot;, height=8, aspect=1.5)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3174" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-17-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" data-orig-size="983,568" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" alt="different crime types in San Francisco and how often they occur during the day" class="wp-image-3174" width="906" height="522"/></figure>



<p class="wp-block-paragraph">In addition, the time when a crime happens affects the likelihood of certain types. For example, we can see that FRAUD rarely occurs at night and usually during the day. We can see that criminals often go to work in the afternoon and at midnight. On the other hand, certain crimes, such as VEHICLE THEFT, mainly occur at night and late afternoon but less often in the morning.</p>



<p class="wp-block-paragraph">If you want to gain an overview of additional features, you can use the pair plot function. Because our dataset is large, we reduce the computation time by plotting 1/100 of the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sns.pairplot(data = df_base_filtered[0::100], height=4, aspect=1.5, hue='Category')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8386" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/pairplot-by-category-san-francisco-crime-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png" data-orig-size="1464,844" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="pairplot-by-category-san-francisco-crime-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png" src="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map-1024x590.png" alt="pairplot by category, san francisco crime map" class="wp-image-8386" width="1080" height="622" srcset="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 1464w" sizes="(max-width: 1080px) 100vw, 1080px" /></figure>



<h4 class="wp-block-heading" id="h-2-3-where-a-crime-occured-considering-address">2.3 Where a Crime Occured &#8211; Considering Address</h4>



<p class="wp-block-paragraph">Next, we look at the address information, from which we can often extract additional information. We do this by printing some sample address values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Extracting information from the streetnames
for i in df_base['Address'][0:10]:
    print(i)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">OAK ST / LAGUNA ST
OAK ST / LAGUNA ST
VANNESS AV / GREENWICH ST
1500 Block of LOMBARD ST
100 Block of BRODERICK ST
0 Block of TEDDY AV
AVALON AV / PERU AV
KIRKWOOD AV / DONAHUE ST
600 Block of 47TH AV
JEFFERSON ST / LEAVENWORTH ST</pre></div>



<p class="wp-block-paragraph">The street names alone are not so helpful. However, the address data does provide additional information. For example, it tells us whether the location is a street intersection or not. In addition, it contains the type of street. This information is valuable because now we can extract parts of the text and use them as separate features.</p>



<p class="wp-block-paragraph">We could do a lot more, but we&#8217;ve got a good enough idea of the data.</p>



<h3 class="wp-block-heading" id="h-step-3-data-preprocessing">Step #3 Data Preprocessing</h3>



<p class="wp-block-paragraph">Probably the most exciting and important aspect of model development is feature engineering. Compared to model parameterization, the right features can often achieve more significant leaps in performance. </p>



<h4 class="wp-block-heading" id="h-3-1-remarks-on-data-preprocessing-for-xgboost">3.1 Remarks on Data Preprocessing for XGBoost </h4>



<p class="wp-block-paragraph">When preprocessing the data, it is helpful to know which algorithms to use because some algorithms are picky about the shape of the data. We will prepare the data to train a gradient-boosting model (XGBoost). This algorithm uses a random forest ensemble, which can only handle integer and Boolean values, but no categorical data. Therefore we need to encode our values. We also need to map the categorical labels to integer values.</p>



<p class="wp-block-paragraph">We don&#8217;t need to scale the continuous feature variables because gradient boosting and decision trees, generally, are not sensitive to variables that have different scales.</p>



<h4 class="wp-block-heading" id="h-3-2-feature-engineering">3.2 Feature Engineering</h4>



<p class="wp-block-paragraph">Based on the data exploration that we have done in the previous section, we create three feature types:</p>



<ul class="wp-block-list">
<li><strong>Date &amp; Time:</strong> When a crime happens is essential. For example, when there is a lot of traffic on the street, there is a higher likelihood of traffic-related crimes. For example, when it is Saturday, more people will usually come to the nightlife district, which attracts certain crimes, e.g., drug-related. Therefore, we will create different features for the time, the day, the month, and the year. </li>



<li><strong>Address</strong>: As mentioned, we will extract additional features from the address column. First, we create different features for the street type (for example, ST, AV, WY, TR, DR). In addition, we check whether the address contains the word &#8220;Block.&#8221; In addition, we will let our model know whether the address is a street crossing.</li>



<li><strong>Latitude &amp; <strong>Longitude</strong></strong>: We will transform the latitude and longitude values into polar coordinates. We will also remove some outliers from the dataset whose latitude is far off the grid. Above all, this will make it easier for our model to make sense of the location.</li>
</ul>



<p class="wp-block-paragraph">Considering these features, the primary input to our crime-type prediction model is the information on when and where a crime occurs. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Processing Function for Features
def cart2polar(x, y):
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

def preprocessFeatures(dfx):
    
    # Time Feature Engineering
    df = pd.get_dummies(dfx[['DayOfWeek' , 'PdDistrict']])
    df['Hour_Min'] = pd.to_datetime(dfx['Dates']).dt.hour + pd.to_datetime(dfx['Dates']).dt.minute / 60
    # We add a feature that contains the expontential time
    df['Hour_Min_Exp'] = np.exp(df['Hour_Min'])
    
    df['Day'] = pd.to_datetime(dfx['Dates']).dt.day
    df['Month'] = pd.to_datetime(dfx['Dates']).dt.month
    df['Year'] = pd.to_datetime(dfx['Dates']).dt.year

    month_one_hot_encoded = pd.get_dummies(pd.to_datetime(dfx['Dates']).dt.month, prefix='Month')
    df = pd.concat([df, month_one_hot_encoded], axis=1, join=&quot;inner&quot;)
    
    # Convert Carthesian Coordinates to Polar Coordinates
    df[['X', 'Y']] = dfx[['X', 'Y']] # we maintain the original coordindates as additional features
    df['dist'], df['phi'] = cart2polar(dfx['X'], dfx['Y'])
  
    # Extracting Street Types
    df['Is_ST'] = dfx['Address'].str.contains(&quot; ST&quot;, case=True)
    df['Is_AV'] = dfx['Address'].str.contains(&quot; AV&quot;, case=True)
    df['Is_WY'] = dfx['Address'].str.contains(&quot; WY&quot;, case=True)
    df['Is_TR'] = dfx['Address'].str.contains(&quot; TR&quot;, case=True)
    df['Is_DR'] = dfx['Address'].str.contains(&quot; DR&quot;, case=True)
    df['Is_Block'] = dfx['Address'].str.contains(&quot; Block&quot;, case=True)
    df['Is_crossing'] = dfx['Address'].str.contains(&quot; / &quot;, case=True)
    
    return df

# Processing Function for Labels
def encodeLabels(dfx):
    df = pd.DataFrame (columns = [])
    factor = pd.factorize(dfx['Category'])
    return factor

# Remove Outliers by Longitude
df_cleaned = df_base[df_base['Y']&lt;70]

# Encode Labels as Integer
factor = encodeLabels(df_cleaned)
y_df = factor[0]
labels = list(factor[1])
# for val, i in enumerate(labels):
#     print(val, i)</pre></div>



<p class="wp-block-paragraph">We could also try to further improve our features by using additional data sources, such as weather data. However, there is no guarantee that this will improve the model results, and it did not in the case of criminal records. Therefore, we have omitted this part.</p>



<h3 class="wp-block-heading" id="h-step-4-visualize-crime-types-on-a-map-of-san-francisco">Step #4 Visualize Crime Types on a Map of San Francisco</h3>



<p class="wp-block-paragraph">Next, we create a San Francisco crime map using the cartesian coordinates indicating where a crime has occurred. First, we only plot the data without a geographical map. Later we will use these spatial data to create a dot plot and overlay it with a map of San Francisco. Visualizing the crime types on a map helps us understand how crime types distribute across the city. </p>



<h4 class="wp-block-heading" id="h-4-1-plot-crime-types-using-a-scatter-plot">4.1 Plot Crime Types using a Scatter Plot</h4>



<p class="wp-block-paragraph">Next, we want to gain an overview of possible spatial patterns and hotspots. We expect to see streets and neighborhoods where certain crimes are more common than in the more expensive areas of the city. In addition, we expect to see places in the city where certain crime types occur relatively rarely. To gain an overview of the crime distribution in San Francisco, we use a scatter plot to display the crime coordinates on a blank chart. </p>



<p class="wp-block-paragraph">Running the code below creates the crime map of San Francisco with all crime types. Depending on the speed of your machine, the creation of the map may take several minutes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot Criminal Activities by Lat and Long
df_filtered = df_cleaned.sample(frac=0.05)  
#df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05) # to filter 

groups = df_filtered.groupby('Category')

fig, ax = plt.subplots(sharex=False, figsize=(20, 12))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name, alpha=0.9)
ax.legend()
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8389" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/output-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/output.png" data-orig-size="1170,683" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/output.png" src="https://www.relataly.com/wp-content/uploads/2022/05/output-1024x598.png" alt="Crime Map of San Francisco (sf crime map) - Kaggle Crime Prediction Challenge. Classification with XGBoost" class="wp-image-8389" width="1111" height="649" srcset="https://www.relataly.com/wp-content/uploads/2022/05/output.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 1170w" sizes="(max-width: 1111px) 100vw, 1111px" /></figure>



<p class="wp-block-paragraph">The plot shows that certain streets in San Francisco are more prone to specific crime types than others. It is also clear that there are certain crime hotspots in the city, especially in the center. We can also see that few crimes are reported in public park areas. </p>



<h4 class="wp-block-heading" id="h-4-2-create-a-crime-map-of-san-francisco-using-plotly">4.2 Create a Crime Map of San Francisco using Plotly</h4>



<p class="wp-block-paragraph">Next, we will create a San Francisco crime map using the Plotly Python library. Because the plugin can handle a limited amount of data simultaneously, we will reduce our data to a fraction of 1% and a few selected crime types. </p>



<p class="wp-block-paragraph">Running the code below opens a _map.html file in your browser that displays the SF crime map. The result is a zoomable geographic map of San Francisco that shows how the selected crime types distribute across the city.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 4.2 Create a Crime Map of San Francisco using Plotly
# Limit the data to a fraction and selected categories
df_filtered = df_cleaned.sample(frac=0.01) 
fig = px.scatter_mapbox(df_filtered, lat=&quot;Y&quot;, lon=&quot;X&quot;, hover_name=&quot;Category&quot;, color='Category', hover_data=[&quot;Y&quot;, &quot;X&quot;], zoom=12, height=800)
fig.update_layout(mapbox_style=&quot;open-street-map&quot;)
fig.update_layout(margin={&quot;r&quot;:0,&quot;t&quot;:0,&quot;l&quot;:0,&quot;b&quot;:0})
fig.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8401" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/map-sf-crime/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png" data-orig-size="1600,650" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="map-sf-crime" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png" src="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime-1024x416.png" alt="Crime Map of San Francisco (SF Crime Map) - Kaggle Crime Prediction Challenge. Crime Classification XGBoost" class="wp-image-8401" width="1097" height="445" srcset="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1536w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1600w" sizes="(max-width: 1097px) 100vw, 1097px" /></figure>



<p class="wp-block-paragraph">The SF crime map shows different types of crimes, including prostitution, vehicle theft, and fraud. The interactive map allows you to change zoom levels and filter the type of crime displayed on the map. For example, if you filter DRUG/NARCOTIC-related crimes, you can see that these crimes mainly occur in the city center near the financial district and the nightlife area.</p>



<h3 class="wp-block-heading" id="h-step-5-split-the-data">Step #5 Split the Data</h3>



<p class="wp-block-paragraph">Before training our predictive model, we will split our data into separate datasets for training and testing. For this purpose, we use the train_test_split function of scikit-learn and configure a split ratio of 70%. Then we output the data, which we employ in the next step to train and validate a model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create train_df &amp; test_df
x_df = preprocessFeatures(df_cleaned).copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		DayOfWeek_Friday	DayOfWeek_Monday	DayOfWeek_Saturday	DayOfWeek_Sunday	DayOfWeek_Thursday	DayOfWeek_Tuesday	DayOfWeek_Wednesday	PdDistrict_BAYVIEW	PdDistrict_CENTRAL	PdDistrict_INGLESIDE	...	Y			dist		phi			Is_ST	Is_AV	Is_WY	Is_TR	Is_DR	Is_Block	Is_crossing
276998	0					0					0					0					0					1					0					0					0					0						...	37.785023	128.110900	2.842200	True	False	False	False	False	True		False
81579	0					0					0					0					0					1					0					0					0					0						...	37.748470	128.185052	2.842677	False	True	False	False	False	True		False
206676	0					0					0					1					0					0					0					0					0					0						...	37.762744	128.113657	2.842389	True	False	False	False	False	True		False
732006	0					0					0					0					0					0					1					0					0					0						...	37.784140	128.109653	2.842204	True	False	False	False	False	False		True
796194	1					0					0					0					0					0					0					0					0					0						...	37.791333	128.125982	2.842185	True	False	False	False	False	True		False
5 rows × 45 columns</pre></div>



<h3 class="wp-block-heading" id="h-step-6-train-a-random-forest-classifier">Step #6 Train a Random Forest Classifier</h3>



<p class="wp-block-paragraph">We can train the predictive models now that we have prepared the data. We train a basic model based on the Random Forest algorithm in the first step. The Random Forest is a robust algorithm that can handle regression and classification problems. One of our <a href="https://www.relataly.com/anyone-about-to-leave-predicting-the-customer-churn-of-a-telecommunications-provider/2378/" target="_blank" rel="noreferrer noopener">recent articles provides more information on Random Forests</a> and how you can find the optimal configuration of their hyperparameters. In this tutorial, we use the Random Forest to establish a baseline against which we can measure the performance of our XGboost model.  We, therefore, use the Random Forest with a simple parameter configuration without tuning the hyperparameters.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a single random forest classifier - parameters are a best guess
clf = RandomForestClassifier(max_depth=100, random_state=0, n_estimators = 200)
clf.fit(x_train, y_train.ravel())
y_pred = clf.predict(x_test)

results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output exceeds the size limit. Open the full output data in a text editor
              precision    recall  f1-score   support

           0       0.15      0.10      0.12     12657
           1       0.29      0.35      0.32     37898
           2       0.38      0.63      0.47     52237
           3       0.46      0.40      0.43     16136
           4       0.16      0.08      0.10     13426
           5       0.25      0.21      0.23     27798
           6       0.10      0.04      0.06      6850
           7       0.23      0.22      0.23     23087
           8       0.19      0.12      0.15      2586
           9       0.20      0.13      0.15     10942
          10       0.08      0.03      0.05      9559
          11       0.00      0.00      0.00      1300
          12       0.20      0.10      0.14      3200
          13       0.37      0.43      0.40     16282
          14       0.02      0.02      0.02      1350
          15       0.01      0.00      0.00      2912
          16       0.05      0.03      0.04      2217
          17       0.61      0.52      0.56      7865
          18       0.11      0.06      0.08      4954
          19       0.04      0.03      0.03       723
          20       0.28      0.19      0.23       581
          21       0.05      0.02      0.03       708
          22       0.25      0.13      0.17      1333
...
    accuracy                           0.31    263395
   macro avg       0.15      0.12      0.13    263395
weighted avg       0.28      0.31      0.28    263395</pre></div>



<p class="wp-block-paragraph">The baseline model is a random forest classifier with 31% percent accuracy on the test dataset.</p>



<h3 class="wp-block-heading" id="h-step-7-train-an-xgboost-classifier">Step #7 Train an XGBoost Classifier</h3>



<p class="wp-block-paragraph">Now that we have a baseline model, we can train our gradient boosting classifier using the XGBoost package. We expect this model to perform better than the baseline. </p>



<h4 class="wp-block-heading" id="h-7-1-about-gradient-boosting">7.1 About Gradient Boosting</h4>



<p class="wp-block-paragraph">XGBoost is an implementation of a <a href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/" target="_blank" rel="noreferrer noopener">gradient-boosting algorithm</a> that uses a decision-tree-based ensemble machine learning algorithm. The algorithm searches for an optimal ensemble of trees. In this process, the algorithm iteratively adds trees to the model or removes them to reduce the prediction error of the previous tree constellation. The algorithm repeats these steps until it can make no further improvements. Thus, training does not optimize the model against the predictions but the previous model&#8217;s residuals (prediction errors).</p>



<p class="wp-block-paragraph" id="h-xgboost-an-implementation-of-a-gradient-boosting-algorithm-that-creates-a-random-forest-which-is-an-ensemble-of-decision-trees-the-basic-idea-of-a-decision-forest-is-to-have-an-ensemble-of-decision-trees-that-come-to-a-conclusion-on-the-prediction-label-by-casting-a-vote-gradient-boosting-is-used-to-find-an-optimal-ensemble-of-trees-and-parameters-boosting-is-the-process-of-iteratively-adding-trees-to-correct-the-prediction-error-of-a-previous-ensemble-of-trees-until-no-further-improvements-can-be-achieved-the-models-are-thereby-trained-to-predict-the-residuals-prediction-errors-of-the-previous-model">But XGBoost does more! It is an extreme version of gradient boosting that uses additional optimization techniques to achieve the best result with minimal effort. In contrast to the random decision forest, the XGBoost classification algorithm determines an optimal number of trees in the training process. We do not have to specify this number in advance.</p>



<p class="wp-block-paragraph" id="h-xgboost-an-implementation-of-a-gradient-boosting-algorithm-that-creates-a-random-forest-which-is-an-ensemble-of-decision-trees-the-basic-idea-of-a-decision-forest-is-to-have-an-ensemble-of-decision-trees-that-come-to-a-conclusion-on-the-prediction-label-by-casting-a-vote-gradient-boosting-is-used-to-find-an-optimal-ensemble-of-trees-and-parameters-boosting-is-the-process-of-iteratively-adding-trees-to-correct-the-prediction-error-of-a-previous-ensemble-of-trees-until-no-further-improvements-can-be-achieved-the-models-are-thereby-trained-to-predict-the-residuals-prediction-errors-of-the-previous-model">A disadvantage of XGBoost is that it tends to overfit the data. Therefore, testing against unseen data is essential. This tutorial will test only against a single test sample for simplicity, but using cross-validation would be a better choice.</p>



<h4 class="wp-block-heading" id="h-7-2-train-the-xgboost-classifier">7.2 Train the XGBoost Classifier</h4>



<p class="wp-block-paragraph">Various Gradient Boosting Algorithms are available for Python, including one from scikit-learn. However, scikit-learn does not support multi-threading, which makes the training process slower than necessary. For this reason, we will use the gradient boosting classifier from the XGBoost package.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Configure the XGBoost model
param = {'booster': 'gbtree', 
         'tree_method': 'gpu_hist',
         'predictor': 'gpu_predictor',
         'max_depth': 140, 
         'eta': 0.3, 
         'objective': '{multi:softmax}', 
         'eval_metric': 'mlogloss', 
         'num_round': 30,
         'feature_selector ': 'cyclic'
        }

xgb_clf = XGBClassifier(param)
xgb_clf.fit(x_train, y_train.ravel())
score = xgb_clf.score(x_test, y_test.ravel())
print(score)

# Create predictions on the test dataset
y_pred = xgb_clf.predict(x_test)

# Print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output exceeds the size limit. Open the full output data in a text editor
0.30852142219859907
              precision    recall  f1-score   support

           0       0.17      0.01      0.02     12657
           1       0.30      0.42      0.35     37898
           2       0.33      0.72      0.46     52237
           3       0.31      0.27      0.29     16136
           4       0.21      0.03      0.05     13426
           5       0.24      0.18      0.21     27798
           6       0.17      0.01      0.01      6850
           7       0.21      0.19      0.20     23087
           8       0.26      0.01      0.02      2586
           9       0.22      0.08      0.12     10942
          10       0.13      0.00      0.00      9559
          11       0.07      0.00      0.01      1300
          12       0.20      0.08      0.11      3200
          13       0.34      0.43      0.38     16282
          14       0.00      0.00      0.00      1350
          15       0.12      0.00      0.01      2912
          16       0.15      0.02      0.03      2217
          17       0.57      0.34      0.43      7865
          18       0.19      0.03      0.05      4954
          19       0.00      0.00      0.00       723
          20       0.50      0.24      0.32       581
          21       0.10      0.01      0.01       708
...
    accuracy                           0.31    263395
   macro avg       0.18      0.11      0.11    263395
weighted avg       0.27      0.31      0.25    263395
</pre></div>



<p class="wp-block-paragraph">Now that we have trained our classification model, let&#8217;s see how it performs. For this purpose, we will generate predictions (y_pred) on the test dataset (x_test). Afterward, we use the predictions and the valid values (y_test) to create a classification report.</p>



<p class="wp-block-paragraph">Our model achieves an accuracy score of 31%. At first hand, this might not look so good, but considering that we have 39 categories and only sparse information available, this performance is quite impressive.   </p>



<h3 class="wp-block-heading" id="h-step-8-measure-model-performance">Step #8 Measure Model Performance</h3>



<p class="wp-block-paragraph">So how well does our XGboost model perform? To measure the performance of our model, we create a confusion matrix that visualizes the performance of the XGboost classifier. If you want to learn more about measuring the performance of classification models, check out<a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener"> this tutorial on measuring classification performance</a>.</p>



<p class="wp-block-paragraph">Running the code below creates the confusion matrix that shows the number of correct and false predictions for each crime category.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test.reshape(-1), y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index = np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (16,12))
plt.tight_layout()
sns.set(font_scale=1.4) #for label size
sns.heatmap(df_cm, cbar=True, cmap= &quot;inferno&quot;, annot=False, fmt='.0f' #, annot_kws={&quot;size&quot;: 13}
           )</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3213" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-22-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" data-orig-size="903,709" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-22" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" alt="Evaluating the performance of our XGboost classifier; sfo crime map" class="wp-image-3213" width="669" height="525" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 903w, https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 300w, https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 768w" sizes="(max-width: 669px) 100vw, 669px" /></figure>



<p class="wp-block-paragraph">The confusion matrix shows that our model frequently predicts crime category two and neglects the other crime types. The reason is the uneven distribution of crime types in the training data. As a result, when we evaluate the model, we need to pay attention to the importance of the different crime types. For example, we might train the model to predict certain crime types accurately, although this might come at a lower accuracy when predicting other crime types. However, such optimizations depend on the technical context and the goals one wants to achieve with the prediction model. </p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This tutorial has presented the machine learning use case &#8220;Predictive Policing&#8221; and showed how to implement it in Python. We have trained an XGBoost model that predicts crime types in San Francisco based on the information on when and where specific crimes have occurred. We also illustrated our data on an interactive crime map of San Francisco with the Plotly Python library. The Crime Map is an intuitive way of visualizing crime in a city and highlighting particular hotspots. Finally, we have used the prediction model to make test predictions and evaluate the model performance against other algorithms, such as a classic Random Decision Forest. The XGBoost model achieves a prediction accuracy of about 31%—a respectable performance, considering that the prediction problem involves 39 crime classes.</p>



<p class="wp-block-paragraph">We hope this tutorial was helpful. If you have any questions or suggestions on what we could improve, feel free to post them in the comments. We appreciate your feedback.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8410" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-12-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" data-orig-size="658,592" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" alt="Crime Map of San Francisco - Kaggle Crime Prediction Challenge. Crim Classification XGBoost" class="wp-image-8410" width="346" height="311" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png 658w, https://www.relataly.com/wp-content/uploads/2022/05/image-12.png 300w" sizes="(max-width: 346px) 100vw, 346px" /><figcaption class="wp-element-caption">Predictive policing with machine learning &#8211; Crime map of San Francisco, created with Python and Plotly</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">Looking for more esciting map vizualizations? Consider the relataly tutorial on <a href="https://www.relataly.com/visualize-covid-19-data-on-a-geographic-heat-maps/291/" target="_blank" rel="noreferrer noopener">visualizing COVID-19 data on geographic heatmaps using GeoPandas</a>.</p>



<div style="display: inline-block;">
  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1999579577&amp;asins=1999579577&amp;linkId=91d862698bf9010ff4c09539e4c49bf4&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1839217715&amp;asins=1839217715&amp;linkId=356ba074068849ff54393f527190825d&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/">Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2960</post-id>	</item>
	</channel>
</rss>
