<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kaggle Competitions Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/data-science/kaggle-competitions/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/data-science/kaggle-competitions/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Thu, 13 Jul 2023 12:10:12 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Kaggle Competitions Archives - relataly.com</title>
	<link>https://www.relataly.com/category/data-science/kaggle-competitions/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>How to Use Hierarchical Clustering For Customer Segmentation in Python</title>
		<link>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/</link>
					<comments>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Thu, 22 Dec 2022 18:50:14 +0000</pubDate>
				<category><![CDATA[Agglomerative Clustering]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Customer Segmentation]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Exploratory Data Analysis (EDA)]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Marketing Automation]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[AI in Insurance]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Digital Transformation]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=11335</guid>

					<description><![CDATA[<p>Have you ever found yourself wondering how you can better understand your customer base and target your marketing efforts more effectively? One solution is to use hierarchical clustering, a method of grouping customers into clusters based on their characteristics and behaviors. By dividing your customers into distinct groups, you can tailor your marketing campaigns and ... <a title="How to Use Hierarchical Clustering For Customer Segmentation in Python" class="read-more" href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/" aria-label="Read more about How to Use Hierarchical Clustering For Customer Segmentation in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/">How to Use Hierarchical Clustering For Customer Segmentation in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Have you ever wondered how to better understand your customer base and target your marketing more effectively? One solution is hierarchical clustering, a method that groups customers into clusters based on their characteristics and behaviors. By dividing your customers into distinct groups, you can tailor your campaigns to the specific needs of each group. This is especially useful for businesses with large customer bases, as it lets them address specific segments rather than trying to appeal to everyone at once. Hierarchical clustering can also reveal common patterns and trends among customers, which helps in planning future marketing efforts and improving the overall customer experience. In this tutorial, we will use Python and the scikit-learn library to apply hierarchical (agglomerative) clustering to a dataset of customer data. </p>



<p class="wp-block-paragraph">The rest of this tutorial proceeds in two parts. The first part will discuss hierarchical clustering and how we can use it to identify clusters in a set of customer data. The second part is a hands-on Python tutorial. We will explore customer health insurance data and apply an agglomerative clustering approach to group the customers into meaningful segments. Finally, we will use a tree-like diagram called a dendrogram, which is helpful for visualizing the structure of the data. The resulting segments could inform our marketing strategies and help us better understand our customers. So let&#8217;s get started!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="896" height="510" data-attachment-id="12402" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" data-orig-size="896,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" alt="isometric view of people customer segmentation using machine learning python tutorial" class="wp-image-12402" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 896w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 300w, 
https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 768w" sizes="(max-width: 896px) 100vw, 896px" /><figcaption class="wp-element-caption">Customer segmentation is a typical use case for clustering. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>. </figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Hierarchical Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">So what is hierarchical clustering? Hierarchical clustering is a method of cluster analysis that aims to build a hierarchy of clusters. It creates a tree-like diagram called a dendrogram, which shows the relationships between clusters. There are two main types of hierarchical clustering: agglomerative and divisive. </p>



<ol class="wp-block-list">
<li>Agglomerative hierarchical clustering: This is a bottom-up approach in which each data point is treated as a single cluster at the outset. The algorithm iteratively merges the most similar pairs of clusters until all data points are in a single cluster.</li>



<li>Divisive hierarchical clustering: This is a top-down approach in which all data points are treated as a single cluster at the outset. The algorithm iteratively splits the cluster into smaller and smaller subclusters until each data point is in its own cluster.</li>
</ol>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading">Agglomerative Clustering</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this article, we will apply the agglomerative clustering approach, which is a bottom-up approach to clustering. The idea is to initially treat each data point in a dataset as its own cluster and then combine the points with other clusters as the algorithm progresses. The process of agglomerative clustering can be broken down into the following steps:</p>



<ol class="wp-block-list">
<li>Start with each data point in its own cluster.</li>



<li>Calculate the similarity between all pairs of clusters.</li>



<li>Merge the two most similar clusters.</li>



<li>Repeat steps 2 and 3 until all the data points are in a single cluster or until a predetermined number of clusters is reached.</li>
</ol>
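<p class="wp-block-paragraph">The four steps above can be sketched in a few lines of NumPy. This is a deliberately naive illustration, not how scikit-learn implements the algorithm. It assumes Euclidean distance with average linkage, and the function name <code>naive_agglomerative</code> and the toy points are made up for this example.</p>

```python
import numpy as np

def naive_agglomerative(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain (average linkage)."""
    # Step 1: every point starts in its own cluster (stored as lists of row indices)
    clusters = [[i] for i in range(len(points))]
    merges = []  # merge history: (members_a, members_b, distance)

    # Step 4: repeat until the requested number of clusters is reached
    while len(clusters) > n_clusters:
        best = None
        # Step 2: compute the distance between all pairs of clusters
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage: mean Euclidean distance over all cross-cluster pairs
                diffs = points[clusters[a]][:, None, :] - points[clusters[b]][None, :, :]
                dist = np.linalg.norm(diffs, axis=2).mean()
                if best is None or best[0] > dist:
                    best = (dist, a, b)
        # Step 3: merge the two closest clusters
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merges

# toy data: two tight pairs and one isolated point (illustrative values only)
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
clusters, merges = naive_agglomerative(points, n_clusters=3)
print(sorted(sorted(c) for c in clusters))
```

<p class="wp-block-paragraph">The nested loops recompute every pairwise cluster distance in each round, which is fine for a handful of points but far too slow for real data. Library implementations avoid this by updating distances incrementally.</p>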



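<p class="wp-block-paragraph">There are several ways to measure the similarity between data points, such as the Euclidean distance, cosine similarity, or the Jaccard index. In addition, a linkage criterion (for example, single, complete, average, or Ward linkage) defines how the distance between two clusters is derived from the distances between their points. Both choices can affect the results of the clustering algorithm.</p>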
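<p class="wp-block-paragraph">To see how the choice of measure changes which points count as similar, here is a small sketch using <code>pdist</code> from SciPy. SciPy is not otherwise used in this tutorial, so treat it as an assumption; it is installed alongside scikit-learn. The arrays are made-up toy values. Note that <code>pdist</code> returns distances, so for the cosine metric a value of 0 means the same direction and 1 means orthogonal vectors.</p>

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# three toy points: rows 0 and 1 point in the same direction, row 2 is orthogonal
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])

# squareform turns the condensed distance vector into a full pairwise matrix
euclidean = squareform(pdist(X, metric="euclidean"))
cosine = squareform(pdist(X, metric="cosine"))

# the Jaccard distance is defined on boolean vectors, e.g. product-ownership flags
B = np.array([[True, True, False],
              [True, False, False],
              [False, False, True]])
jaccard = squareform(pdist(B, metric="jaccard"))

print("euclidean:\n", euclidean.round(3))
print("cosine:\n", cosine.round(3))
print("jaccard:\n", jaccard.round(3))
```

<p class="wp-block-paragraph">Rows 0 and 1 are one unit apart in Euclidean terms but indistinguishable under the cosine distance, so a clustering based on one measure can group them differently than a clustering based on the other.</p>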



<p class="wp-block-paragraph">For details on how the clustering approach works, see the&nbsp;<a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">Wikipedia page</a>.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="430" height="512" data-attachment-id="13027" data-permalink="https://www.relataly.com/mushrooms_and_fruits_pattern-min-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png" data-orig-size="506,602" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="mushrooms_and_fruits_pattern-min" data-image-description="&lt;p&gt;Hierarchical clustering is an unsupervised way to classify things. &lt;/p&gt;
" data-image-caption="&lt;p&gt;Hierarchical clustering is an unsupervised way to classify things. &lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min-430x512.png" alt="Hierarchical clustering is an unsupervised way to classify things. " class="wp-image-13027" srcset="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 430w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 252w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 506w" sizes="(max-width: 430px) 100vw, 430px" /><figcaption class="wp-element-caption">Hierarchical clustering is an unsupervised technique to classify things based on patterns in their data. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">Hierarchical Clustering vs. K-means</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In a previous article, we discussed the popular <a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/" target="_blank" rel="noreferrer noopener">clustering approach k-means</a>. So how do k-means and hierarchical clustering differ? Both are clustering algorithms that group similar data points together. However, there are several key differences between the two approaches:</p>



<ol class="wp-block-list">
<li><strong>The number of clusters:</strong> In k-means, the number of clusters must be specified in advance. Hierarchical clustering does not require this: it builds a complete hierarchy, starting with each data point as its own cluster and merging the most similar clusters until all data points are in a single cluster. The number of clusters can then be chosen afterward by cutting the hierarchy at the desired level.</li>



<li><strong>Cluster shape:</strong> K-means tends to produce compact, roughly spherical clusters because it assigns each point to the nearest centroid. With hierarchical clustering, the shape of the clusters depends on the linkage criterion: single linkage can follow elongated or irregular shapes, while Ward linkage also favors compact clusters. K-means is therefore best suited for data that separates into distinct, compact groups, while hierarchical clustering offers more flexibility for complex cluster shapes.</li>



<li><strong>Distance measure:</strong> K-means is built around the Euclidean distance between data points and cluster centroids, while hierarchical clustering can work with a variety of distance measures and linkage criteria. Both approaches are sensitive to the scale of the features whenever the distance measure depends on it, which is why we standardize the data before clustering.</li>



<li><strong>Computational complexity:</strong> K-means is generally faster than hierarchical clustering, especially for large datasets. K-means is iterative, but the cost of each iteration grows roughly linearly with the number of data points. Standard agglomerative clustering has to compare pairs of clusters, so its time and memory requirements grow at least quadratically with the number of data points.</li>



<li><strong>Visualization: </strong>Hierarchical clustering produces a tree-like diagram called a &#8220;dendrogram.&#8221; The dendrogram shows the relationships between clusters. This can be useful for visualizing the structure of the data and understanding how clusters are related.</li>
</ol>
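<p class="wp-block-paragraph">Two of the points above, choosing the number of clusters after the hierarchy has been built and reading that hierarchy as a dendrogram, can be sketched with SciPy. This is an illustration on synthetic data; SciPy's <code>linkage</code>, <code>fcluster</code>, and <code>dendrogram</code> functions are an assumption here, since the tutorial itself relies on scikit-learn for the clustering.</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# three synthetic, well-separated groups of 20 points each (illustrative data only)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# build the full merge hierarchy once; each row of Z describes one merge
Z = linkage(X, method="ward")

# cut the same hierarchy at two different levels without re-running the clustering
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_3 = fcluster(Z, t=3, criterion="maxclust")

# no_plot=True returns the dendrogram layout as a dict; drop it to draw with matplotlib
tree = dendrogram(Z, no_plot=True)

print(Z.shape)
print(len(set(labels_2)), len(set(labels_3)))
print(len(tree["leaves"]))
```

<p class="wp-block-paragraph">With k-means, moving from two to three clusters would mean fitting a new model. Here, both label sets come from the same hierarchy.</p>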



<p class="wp-block-paragraph">Next, let&#8217;s look at how we can implement a hierarchical clustering model in Python. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<p class="wp-block-paragraph"></p>
</div>
</div>



<h2 class="wp-block-heading">Customer Segmentation using Hierarchical Clustering in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this comprehensive guide, we explore the application of hierarchical clustering for effective customer segmentation using a customer dataset. This data-driven segmentation method enables businesses to identify distinct customer clusters based on various factors, including demographics, behaviors, and preferences.</p>



<p class="wp-block-paragraph">Customer segmentation is a strategic approach that splits a customer base into smaller, more manageable groups with similar characteristics. It aims to better understand the diverse needs and wants of different customer segments to enhance marketing strategies and product development.</p>



<p class="wp-block-paragraph">Applying customer segmentation through hierarchical clustering allows businesses to personalize their marketing messages, design targeted campaigns, and tailor products to meet the unique needs of each segment. This proactive approach can stimulate increased customer loyalty and sales.</p>



<p class="wp-block-paragraph">We begin by loading the customer data and selecting the relevant features we want to use for clustering. We then standardize the data using the StandardScaler from scikit-learn. Next, we apply hierarchical clustering using scikit-learn's AgglomerativeClustering class, specifying the number of clusters we want to create. Finally, we add the resulting cluster labels to the original data as a new column and inspect the segments by calculating the mean of each feature for each segment.</p>
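<p class="wp-block-paragraph">The workflow described above can be sketched end to end on a small synthetic table before we turn to the real data. The dataframe <code>toy_df</code>, its values, and the choice of two clusters with Ward linkage are assumptions made for this illustration only.</p>

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
# hypothetical stand-in for the customer data: two numeric features, two obvious groups
toy_df = pd.DataFrame({
    "age": np.concatenate([rng.normal(25, 2, 30), rng.normal(60, 2, 30)]),
    "charges": np.concatenate([rng.normal(3000, 300, 30), rng.normal(30000, 3000, 30)]),
})

# select the features and standardize them so both contribute equally to the distance
features = ["age", "charges"]
X_scaled = StandardScaler().fit_transform(toy_df[features])

# fit the agglomerative model and attach the cluster labels as a new column
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
toy_df["segment"] = model.fit_predict(X_scaled)

# profile each segment by the mean of every feature (on the original, unscaled values)
segment_profile = toy_df.groupby("segment")[features].mean()
print(segment_profile)
```

<p class="wp-block-paragraph">Clustering on the scaled values but profiling on the original columns keeps the segment means in units that are easy to interpret, such as years and dollars.</p>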



<p class="wp-block-paragraph">The code is available in the Relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_bada6f-73"><a class="kb-button kt-button button kb-btn_43f94b-af kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/03%20Clustering/043%20Customer%20Segmentation%20using%20Hierarchical%20Clustering%20with%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_17702b-41 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="512" height="513" data-attachment-id="12366" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/the_future_of_the_healthcare_using_blockchain-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" data-orig-size="512,513" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="the_future_of_the_healthcare_using_blockchain-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" src="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" alt="In this machine learning tutorial, we will run a hierarchical clustering algorithm on health data." class="wp-image-12366" srcset="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 512w, https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 140w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">The future of healthcare will see a tight collaboration between humans and AI. Image generated using&nbsp;Midjourney</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">About the Customer Health Insurance Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we will work with a public health insurance customer dataset from kaggle.com. Download the <a href="https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset" target="_blank" rel="noreferrer noopener">CSV file from Kaggle</a> and save it as customer_health_insurance.csv, the file name that the loading code in Step #1 expects, in the following path, relative to the folder with your Python notebook: data/customer/</p>



<p class="wp-block-paragraph">The dataset is relatively simple and contains 1338 rows of insured customers. It includes the insurance charges, as well as demographic and personal information such as Age, Sex, BMI, Number of Children, Smoker, and Region. The dataset does not have any undefined or missing values.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before we start the coding part, ensure that you have set up your Python 3 environment and the required packages. If you don’t have an environment, follow&nbsp;this tutorial&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li>pandas</li>



<li>NumPy</li>



<li>matplotlib</li>



<li>scikit-learn</li>



<li>seaborn</li>
</ul>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">pip install &lt;package name&gt; 
conda install &lt;package name&gt; (if you are using the Anaconda package manager)</pre></div>



<h3 class="wp-block-heading">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">To begin, we need to load the required packages and the data we want to cluster. We will load the data by reading the CSV file via the pandas library. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># import necessary libraries
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder
from pandas.api.types import is_string_dtype
import pandas as pd
import math
import seaborn as sns

# load customer data
customer_df = pd.read_csv(&quot;data/customer/customer_health_insurance.csv&quot;)
customer_df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">   age     sex    bmi  children smoker     region     charges
0   19  female  27.90         0    yes  southwest  16884.9240
1   18    male  33.77         1     no  southeast   1725.5523
2   28    male  33.00         3     no  southeast   4449.4620</pre></div>



<h3 class="wp-block-heading">Step #2 Explore the Data</h3>



<p class="wp-block-paragraph">Next, it is a good idea to explore the data and get a sense of its structure and content. This can be done using a variety of methods, such as examining the shape of the dataframe, checking for missing values, and plotting some basic statistics. For example, the following plots will explore the relationships between some of the variables. We won&#8217;t go into too much detail here.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_kdeplot(df, column_name, target_name):
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.kdeplot(data=df, x=target_name, hue=column_name, ax=ax, linewidth=2)
    ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()

# KDE plot of charges, split by smoker status
make_kdeplot(customer_df, 'smoker', 'charges')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11363" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-17-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" data-orig-size="833,571" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" alt="" class="wp-image-11363" width="567" height="389" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 833w, https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 768w" sizes="(max-width: 567px) 100vw, 567px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># KDE plot of charges, split by sex
make_kdeplot(customer_df, 'sex', 'charges')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11364" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-44-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" data-orig-size="846,571" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-44" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" alt="" class="wp-image-11364" width="572" height="386" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 846w, https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 768w" sizes="(max-width: 572px) 100vw, 572px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sns.lmplot(x=&quot;charges&quot;, y=&quot;age&quot;, hue=&quot;smoker&quot;, data=customer_df, aspect=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11365" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-45/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png" data-orig-size="1067,489" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-45" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-45-1024x469.png" alt="" class="wp-image-11365" width="700" height="321" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 1067w" sizes="(max-width: 700px) 100vw, 700px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_boxplot(customer_df, x,y,h):
    fig, ax = plt.subplots(figsize=(10,4))
    box = sns.boxplot(x=x, y=y, hue=h, data=customer_df)
    box.set_xticklabels(box.get_xticklabels())
    fig.subplots_adjust(bottom=0.2)
    plt.tight_layout()

make_boxplot(customer_df, &quot;smoker&quot;, &quot;charges&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11366" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-46/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-46" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" alt="" class="wp-image-11366" width="675" height="266" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 768w" sizes="(max-width: 675px) 100vw, 675px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_boxplot(customer_df, &quot;region&quot;, &quot;charges&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11367" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-47-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-47" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" alt="" class="wp-image-11367" width="693" height="273" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 768w" sizes="(max-width: 693px) 100vw, 693px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_boxplot(customer_df, &quot;children&quot;, &quot;bmi&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11368" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-48-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-48" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" alt="" class="wp-image-11368" width="705" height="278" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 768w" sizes="(max-width: 705px) 100vw, 705px" /></figure>



<p class="wp-block-paragraph">Next, let&#8217;s prepare the data for model training. </p>



<h3 class="wp-block-heading" id="h-step-3-prepare-the-data">Step #3 Prepare the Data</h3>



<p class="wp-block-paragraph">Before we can train a model on the data, we must prepare it for modeling. This typically involves selecting the relevant features, handling missing values, and scaling the data. However, we are using a very simple dataset that already has good data quality. Therefore we can limit our data preparation activities to encoding the labels and scaling the data. </p>



<p class="wp-block-paragraph">To encode the categorical values, we will use the LabelEncoder class from the scikit-learn library.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># encode categorical features
label_encoder = LabelEncoder()

for col_name in customer_df.columns:
    if (is_string_dtype(customer_df[col_name])):
        customer_df[col_name] = label_encoder.fit_transform(customer_df[col_name])
customer_df.head(3)</pre></div>
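
<p class="wp-block-paragraph">Label encoding assigns integer codes in sorted order, so a distance-based method will treat a nominal column such as region as if its categories were ordered. The following is a minimal sketch that contrasts label encoding with one-hot encoding. The toy_df frame is made-up illustration data, and one-hot encoding is shown as a possible alternative, not as the approach used in this tutorial.</p>

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy data, not the tutorial's dataset
toy_df = pd.DataFrame({
    "region": ["southwest", "northeast", "southeast", "northeast"],
    "smoker": ["yes", "no", "no", "yes"],
})

# label encoding: one integer per category, assigned in sorted order
label_encoded = toy_df.copy()
for col_name in label_encoded.columns:
    label_encoded[col_name] = LabelEncoder().fit_transform(label_encoded[col_name])

# one-hot encoding: one 0/1 column per category, no implied order
one_hot_encoded = pd.get_dummies(toy_df, columns=["region"], dtype=int)

print(label_encoded)
print(one_hot_encoded.columns.tolist())
```

<p class="wp-block-paragraph">With label encoding, northeast and southwest end up two units apart while northeast and southeast are one unit apart, although the regions have no natural order. One-hot columns avoid this at the cost of more features.</p>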



<p class="wp-block-paragraph">Next, we will scale the numeric variables. Hierarchical clustering groups samples based on their pairwise distances, so it is sensitive to the scale of the features when we use a distance measure such as Euclidean distance. Without scaling, features with large value ranges (here, charges) would dominate features with small ranges (such as children). Standardizing the data ensures that all features carry comparable weight in the distance calculation.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># select features
X = customer_df # we will select all features

# standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">array([[-1.43876426, -1.0105187 , -0.45332   , ...,  1.34390459,
         0.2985838 ,  1.97058663],
       [-1.50996545,  0.98959079,  0.5096211 , ...,  0.43849455,
        -0.95368917, -0.5074631 ],
       [-0.79795355,  0.98959079,  0.38330685, ...,  0.43849455,
        -0.72867467, -0.5074631 ],
       ...,
       [-1.50996545, -1.0105187 ,  1.0148781 , ...,  0.43849455,
        -0.96159623, -0.5074631 ],
       [-1.29636188, -1.0105187 , -0.79781341, ...,  1.34390459,
        -0.93036151, -0.5074631 ],
       [ 1.55168573, -1.0105187 , -0.26138796, ..., -0.46691549,
         1.31105347,  1.97058663]])</pre></div>
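
<p class="wp-block-paragraph">To illustrate why scaling matters for Euclidean distance, here is a small sketch with three made-up customers described by age and charges. The toy_X values and the euclidean helper are assumptions for illustration only and are not part of the tutorial dataset.</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical customers: columns are age (years) and charges (dollars)
toy_X = np.array([
    [20.0, 2000.0],
    [60.0, 2100.0],
    [21.0, 9000.0],
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# unscaled: charges dominate, so customer 0 looks far closer to
# customer 1 (40 years older) than to customer 2 (1 year older)
raw_d01 = euclidean(toy_X[0], toy_X[1])
raw_d02 = euclidean(toy_X[0], toy_X[2])

# standardized: both features contribute on a comparable scale
toy_scaled = StandardScaler().fit_transform(toy_X)
scaled_d01 = euclidean(toy_scaled[0], toy_scaled[1])
scaled_d02 = euclidean(toy_scaled[0], toy_scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

<p class="wp-block-paragraph">On the raw values, the difference in charges swamps the 40-year age gap. After standardization, the age gap and the charges gap produce distances of similar size.</p>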



<h3 class="wp-block-heading">Step #4 Train the Hierarchical Clustering Algorithm</h3>



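<p class="wp-block-paragraph">To train a hierarchical clustering model using scikit-learn, we can use the AgglomerativeClustering class. (The separate Ward class found in very old scikit-learn versions has been removed; ward linkage is now selected through the linkage parameter.) The main parameters are:</p>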



<ul class="wp-block-list">
<li><strong>n_clusters: </strong>The number of clusters to form. The default is 2. It must be set to <code>None</code> when distance_threshold is used.</li>



<li><strong>affinity: </strong>The distance measure used to compare pairs of samples, for example Euclidean, Manhattan, or cosine distance. With ward linkage, only Euclidean distance is supported. Note that scikit-learn 1.2 renamed this parameter to <code>metric</code>, and later versions removed <code>affinity</code> entirely.</li>



<li><strong>linkage: </strong>The method used to calculate the distance between clusters. This can be one of &#8220;ward,&#8221; &#8220;complete,&#8221; &#8220;average,&#8221; or &#8220;single.&#8221;</li>



<li><strong>distance_threshold:</strong> The linkage distance at or above which clusters are no longer merged. If this parameter is set, n_clusters must be <code>None</code>.</li>
</ul>
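
<p class="wp-block-paragraph">The interplay between n_clusters and distance_threshold can be sketched as follows. The toy_points array is made-up illustration data, and the threshold of 5.0 is an arbitrary value chosen to fit it. The distance measure is left at its default (Euclidean), so the snippet does not depend on whether your scikit-learn version calls the parameter affinity or metric.</p>

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# hypothetical data: two well-separated groups of three points each
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# option 1: ask for a fixed number of clusters
fixed_model = AgglomerativeClustering(n_clusters=2, linkage="ward")
fixed_labels = fixed_model.fit_predict(toy_points)

# option 2: set n_clusters=None and stop merging at a distance threshold
threshold_model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="average"
)
threshold_labels = threshold_model.fit_predict(toy_points)

print(fixed_labels)
print(threshold_model.n_clusters_, threshold_labels)
```

<p class="wp-block-paragraph">With a threshold, the number of clusters is a result of the fit and can be read from the n_clusters_ attribute afterwards.</p>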



<p class="wp-block-paragraph">To train the model, we specify the desired parameters and call the fit_predict method, which fits the model to the data and returns a cluster label for each sample in one step.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># apply hierarchical clustering 
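# compute_distances=True stores the merge distances that the dendrogram in Step #5 needs;
# Euclidean distance and ward linkage are the defaults
model = AgglomerativeClustering(n_clusters=2, compute_distances=True)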
predicted_segments = model.fit_predict(X_scaled)</pre></div>



<p class="wp-block-paragraph">We now have a trained clustering model and a predicted segment for each customer in our data.</p>



<h3 class="wp-block-heading">Step #5 Visualize the Results</h3>



<p class="wp-block-paragraph">After the model is trained, we can visualize the results to get a better understanding of the clusters that were formed. There is a wide range of plots and tools to visualize clusters. In this tutorial, we will use a scatterplot and a dendrogram. </p>



<h4 class="wp-block-heading">5.1 Scatterplot</h4>



<p class="wp-block-paragraph">For this, we can use the lmplot function in Seaborn. It draws a 2D scatterplot and overlays a linear regression line for each group defined by the hue parameter. In the following, we plot age against charges and color the points by segment, so the regression lines show how the relationship between the two variables differs between our two clusters.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># add predictions to data as a new column
customer_df['segment'] = predicted_segments

# create a scatter plot of age against charges, colored by segment
sns.lmplot(x=&quot;charges&quot;, y=&quot;age&quot;, hue=&quot;segment&quot;, data=customer_df, aspect=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="470" data-attachment-id="11370" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-49-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png" data-orig-size="1065,489" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-49" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-49-1024x470.png" alt="" class="wp-image-11370" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 1065w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">We can see that our model has determined two clusters in our data. The clusters seem to correspond well with the smoker category, which indicates that this attribute is decisive in forming relevant groups.</p>
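
<p class="wp-block-paragraph">One way to check such a correspondence numerically, rather than by eye, is a cross-tabulation of segment against smoker. The sketch below uses a small made-up frame (toy_df) in place of customer_df, so the counts are illustrative and say nothing about the real dataset. Keep in mind that cluster labels are arbitrary: segment 1 could just as well correspond to non-smokers.</p>

```python
import pandas as pd

# hypothetical toy frame standing in for customer_df after clustering
toy_df = pd.DataFrame({
    "smoker": [1, 1, 1, 0, 0, 0, 0, 1],
    "segment": [1, 1, 1, 0, 0, 0, 1, 1],
})

# rows: cluster segment, columns: smoker flag, cells: number of customers
overlap = pd.crosstab(toy_df["segment"], toy_df["smoker"])

# share of customers whose segment label equals their smoker flag
agreement = (toy_df["segment"] == toy_df["smoker"]).mean()

print(overlap)
print(agreement)
```

<p class="wp-block-paragraph">If most customers fall on one diagonal of the table, the segments largely reproduce the smoker attribute.</p>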



<h4 class="wp-block-heading" id="h-5-2-dendrogram">5.2 Dendrogram</h4>



<p class="wp-block-paragraph">The hierarchical clustering approach lets us visualize relationships between different groups in our dataset in a dendrogram. A dendrogram is a graphical representation of a hierarchical structure, such as the relationships between different groups of objects or organisms. It is typically used in biology to show the relationships between different species or taxonomic groups, but it can also be used in other fields to represent the hierarchical structure of any set of data. In a dendrogram, the objects or groups being studied are represented as branches on a tree-like diagram. The branches are usually labeled with the names of the objects or groups, and the lengths of the branches represent the distances or dissimilarities between the objects or groups. The branches are also arranged in a hierarchical manner, with the most closely related objects or groups being placed closer together and the more distantly related ones being placed farther apart.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualize data similarity in a dendrogram
def plot_dendrogram(model, **kwargs):
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx &lt; n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, orientation='right',**kwargs)


plt.title(&quot;Hierarchical Clustering Dendrogram&quot;)
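# plot the top four levels of the dendrogram (p=4)
# note: model.distances_ is only available if the model was fitted with
# compute_distances=True or with a distance_threshold
plot_dendrogram(model, truncate_mode=&quot;level&quot;, p=4)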
plt.xlabel(&quot;Euclidean Distance&quot;)
plt.ylabel(&quot;Number of points in node (or index of point if no parenthesis).&quot;)
plt.show()</pre></div>



<p class="wp-block-paragraph">Source: This code block is based on code <a href="https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html" target="_blank" rel="noreferrer noopener">from the scikit-learn page</a></p>



<figure class="wp-block-image size-full"><img decoding="async" width="575" height="453" data-attachment-id="11396" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-53-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" data-orig-size="575,453" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-53" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" alt="" class="wp-image-11396" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png 575w, https://www.relataly.com/wp-content/uploads/2022/12/image-53.png 300w" sizes="(max-width: 575px) 100vw, 575px" /></figure>
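
<p class="wp-block-paragraph">The helper above rebuilds a SciPy linkage matrix from the fitted scikit-learn model. As a sketch of an alternative, SciPy can compute the linkage matrix directly and cut the tree into flat clusters with fcluster. The toy_points array is made-up illustration data, and this route is a substitute technique rather than the method used in this tutorial.</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# hypothetical data: two well-separated groups of three points each
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# each row of the linkage matrix describes one merge:
# [index of cluster a, index of cluster b, merge distance, samples in new cluster]
linkage_matrix = linkage(toy_points, method="ward")

# cut the tree so that at most two flat clusters remain
flat_labels = fcluster(linkage_matrix, t=2, criterion="maxclust")

print(linkage_matrix.shape)
print(flat_labels)
```

<p class="wp-block-paragraph">For n samples, the linkage matrix has n - 1 rows, one per merge, and its last row describes the final merge that joins all samples. The same matrix can be passed to SciPy&#8217;s dendrogram function.</p>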



<h2 class="wp-block-heading">Summary</h2>



<p class="wp-block-paragraph">In conclusion, hierarchical clustering is a powerful tool for customer segmentation that can help businesses better understand their customer base and target their marketing efforts more effectively. By grouping customers into clusters based on their characteristics and behaviors, companies can create targeted campaigns and personalize their marketing efforts to better meet the needs of each group. Using Python and the scikit-learn library, we were able to apply an agglomerative clustering approach to a dataset of customer data and identify two distinct segments. We can then use these segments to inform our marketing strategies and get a better understanding of our customers.</p>



<p class="wp-block-paragraph">By the way, customer segmentation is an area where real-world data can be prone to bias and unfairness. If you&#8217;re concerned about this, check out our latest article on <a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">addressing fairness in machine learning with fairlearn</a>.</p>



<p class="wp-block-paragraph">I hope this article was useful. If you have any feedback, please write your thoughts in the comments. </p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">Articles</p>



<ul class="wp-block-list">
<li><a href="https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html" target="_blank" rel="noreferrer noopener">https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html</a></li>



<li>Images generated with OpenAI Dall-E and Midjourney.</li>
</ul>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Clustering</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3Gb5kfj" target="_blank" rel="noreferrer noopener">&#8220;Data Clustering: Algorithms and Applications&#8221; by Charu C. Aggarwal</a>: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.</li>



<li><a href="https://amzn.to/3WmhGXB" target="_blank" rel="noreferrer noopener">&#8220;Data Mining: Practical Machine Learning Tools and Techniques&#8221; by Ian H. Witten and Eibe Frank</a>: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.</li>
</ul>



</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Machine Learning</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning</a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>
</ul>



</div></div>
</div>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph"><strong>Relataly articles on clustering and machine learning</strong></p>



<ul class="wp-block-list">
<li><a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/" target="_blank" rel="noreferrer noopener">Simple Clustering using K-means in Python</a>: This article gives an overview of cluster analysis with k-means.</li>



<li><a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" target="_blank" rel="noreferrer noopener">Clustering crypto markets using affinity propagation in Python</a>: This article applies cluster analysis to crypto markets and creates a market map for various cryptocurrencies.</li>



<li><a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">Addressing fairness in machine learning with the fairlearn library</a></li>
</ul>



<p>The post <a href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/">How to Use Hierarchical Clustering For Customer Segmentation in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">11335</post-id>	</item>
		<item>
		<title>Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</title>
		<link>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/</link>
					<comments>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 07 Mar 2021 16:16:19 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Decision Trees]]></category>
		<category><![CDATA[Fighting Crime]]></category>
		<category><![CDATA[Gradient Boosting]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[mplleaflet]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Crime Data]]></category>
		<category><![CDATA[Geographic Maps]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Kaggle]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Smart City]]></category>
		<category><![CDATA[Spatial Data]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2960</guid>

					<description><![CDATA[<p>In this tutorial, we&#8217;ll be using machine learning to predict and map out crime in San Francisco. We&#8217;ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the powerful Scikit-Learn library, we&#8217;ll train a classification model using ... <a title="Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python" class="read-more" href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/" aria-label="Read more about Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/">Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we&#8217;ll use machine learning to predict and map crime in San Francisco. We&#8217;ll work with a dataset from Kaggle that covers 39 different types of crime, from vehicle theft to drug offenses. Using Python and the Scikit-Learn library, we&#8217;ll train a classification model with the <a href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/" target="_blank" rel="noreferrer noopener">XGBoost algorithm</a> to predict the type of crime based on when and where an incident occurred. We&#8217;ll then use the Plotly library to visualize the results on a map of the city, highlighting areas with higher rates of certain crimes. This type of prediction and mapping is similar to what the San Francisco Police Department uses in their practice of predictive policing, where they allocate resources to at-risk areas in an effort to prevent crime.</p>



<p class="wp-block-paragraph">We&#8217;ll start by downloading and preprocessing the San Francisco crime data. Next, we&#8217;ll use the data to train two classification models: the first is a standard Random Forest classifier, and the second uses the XGBoost package. We&#8217;ll experiment with different hyperparameters, visualize our predictions on an SF crime map, and compare the performance of the models. So, let&#8217;s dive into crime prediction and mapping!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="508" height="513" data-attachment-id="12478" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" data-orig-size="508,513" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png" alt="crime prediction san francisco city map xgboost python tutorial. Image generated using Midjourney. 
relataly.com" class="wp-image-12478" srcset="https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 297w, https://www.relataly.com/wp-content/uploads/2023/02/crime-prediction-san-francisco-city-map-xgboost-python-tutorial-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Predictive policing can make police work much more efficient and effective. Image generated using <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Predictive Policing?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The use case in this article falls under predictive policing. Predictive policing uses data, algorithms, and other technological tools to predict where and when crimes are likely to occur. The goal is to help law enforcement agencies allocate their resources to areas where crime is likely to happen, ultimately reducing crime and improving public safety. The approach rests on the idea that by identifying patterns and trends in data, agencies can anticipate crimes and take steps to prevent them from happening.</p>



<p class="wp-block-paragraph">The benefits of predictive policing include the ability to allocate law enforcement resources better, the potential to reduce crime and improve public safety, and the ability to identify trends and patterns that may not be immediately obvious to law enforcement officers. Additionally, by using data and other tools to anticipate where crimes are likely to occur, law enforcement agencies can take proactive steps to prevent those crimes from happening, which can save time and money.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-creating-a-crime-map-for-predictive-policing-using-xgboost-in-python">Creating a Crime Map for Predictive Policing using XGBoost in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this practical tutorial, we&#8217;ll construct an XGBoost multi-class classifier to predict crime types in San Francisco. Each incident belongs to exactly one crime category, so this is a multi-class rather than a multi-label problem. Urban crime, such as in San Francisco, is a dynamic and multifaceted issue that can vary dramatically with location, time, and other factors. Our aim is to develop a predictive algorithm that forecasts the crime type for a given location and time. The end product is an interactive San Francisco crime map providing a snapshot of crime hotspots throughout the city.</p>



<p class="wp-block-paragraph">Law enforcement agencies, like the San Francisco Police Department, use similar maps for strategic resource allocation to curb crime rates effectively. Additionally, this SF crime map will underscore crime clusters &#8211; areas notorious for particular types of crime incidents. By the end of this tutorial, you&#8217;ll have a deeper understanding of using machine learning in practical scenarios and aiding real-world decision-making.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_262c97-e5"><a class="kb-button kt-button button kb-btn_e6ce86-27 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/018%20Forecasting%20Criminal%20Activity%20in%20San%20Francisco.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_3b4a4f-fe kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="500" height="487" data-attachment-id="12476" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" data-orig-size="500,487" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" src="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png" alt="" class="wp-image-12476" srcset="https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png 500w, https://www.relataly.com/wp-content/uploads/2023/02/san-francisco-crime-prediction-machine-learning-python-tutorial-crime-map.png 300w" sizes="(max-width: 500px) 100vw, 500px" /><figcaption class="wp-element-caption">Crime doesn&#8217;t sleep in San Francisco. That&#8217;s why predictive policing can make a real impact. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before starting the Python coding part, ensure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>



<li>Seaborn</li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using XGBoost (&#8216;xgboost&#8217;), the machine learning library scikit-learn, and Plotly (&#8216;plotly&#8217;), which the code in Step #1 imports for the crime map. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
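<p class="wp-block-paragraph">Before moving on, it can help to check which of the packages are already available. The following is a minimal sketch that uses only the Python standard library. The helper name find_missing is hypothetical, and the list of import names is an assumption based on the imports used in Step #1.</p>

```python
import importlib.util

# Import names of the packages used in this tutorial
# (note that scikit-learn is imported as 'sklearn').
required = ['pandas', 'numpy', 'matplotlib', 'seaborn', 'sklearn', 'xgboost', 'plotly']

def find_missing(packages):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = find_missing(required)
print('Missing packages:', missing if missing else 'none')
```

<p class="wp-block-paragraph">Any package reported as missing can then be installed with one of the commands above.</p>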
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



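<p class="wp-block-paragraph">We begin by downloading the San Francisco crime challenge data from kaggle.com. Once you have downloaded the dataset, place the CSV file (train.csv) into your Python working folder. The code below reads it from the relative path data/crime/sf-crime/train.csv, so adjust the path if you store the file elsewhere.</p>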



<p class="wp-block-paragraph">The dataset was collected by the San Francisco Police Department (SFPD) between 2003 and 2015. According to the data description from the SF crime challenge, the dataset contains the following variables:</p>



<ul class="wp-block-list">
<li><strong>Dates</strong>: timestamp of the crime incident</li>



<li><strong>Category</strong>: Category of the crime incident (only in train.csv) that we will use as the target variable</li>



<li><strong>Descript</strong>: detailed description of the crime incident&nbsp;(only in train.csv)</li>



<li><strong>DayOfWeek</strong>: the day of the week</li>



<li><strong>PdDistrict</strong>: the name of the Police Department District</li>



<li><strong>Resolution</strong>: how the crime incident was resolved&nbsp;(only in train.csv)</li>



<li><strong>Address</strong>: the approximate street address of the crime incident&nbsp;</li>



<li><strong>X</strong>: Longitude</li>



<li><strong>Y</strong>: Latitude</li>
</ul>



<p class="wp-block-paragraph">The next step is to load the data into a dataframe. We then print summary statistics with describe() and use the head() command to display the first five rows, so we can confirm that the data loaded correctly.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import plotly.express as px

# The Data is part of the Kaggle Competition: https://www.kaggle.com/c/sf-crime/data
df_base = pd.read_csv(&quot;data/crime/sf-crime/train.csv&quot;)

print(df_base.describe())
df_base.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		X              Y
count  	878049.000000  878049.000000
mean     -122.422616      37.771020
std         0.030354       0.456893
min      -122.513642      37.707879
25%      -122.432952      37.752427
50%      -122.416420      37.775421
75%      -122.406959      37.784369
max      -120.500000      90.000000</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Dates				Category	Descript			DayOfWeek	PdDistrict	Resolution	Address				X			Y
0	2015-05-13 23:53:00	WARRANTS	WARRANT ARREST		Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
1	2015-05-13 23:53:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
2	2015-05-13 23:33:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	VANNESS AV... ST	-122.424363	37.800414
3	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT...	Wednesday	NORTHERN	NONE		1500 Block... ST	-122.426995	37.800873
4	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT ...	Wednesday	PARK		NONE		100 Block... ST		-122.438738	37.771541</pre></div>



<p class="wp-block-paragraph">If the data was loaded correctly, you should see the first five records of the dataframe, as shown above.</p>
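<p class="wp-block-paragraph">The summary statistics above also reveal something worth checking: the maximum values of X (-120.5) and Y (90.0) lie far outside San Francisco and look like placeholder coordinates. The following sketch shows one way to flag such rows. It uses a small toy dataframe in place of df_base, and the bounding box values are rough assumptions for illustration, not part of the dataset description.</p>

```python
import pandas as pd

# Toy frame standing in for df_base; the last row mimics the placeholder
# coordinates (X = -120.5, Y = 90.0) visible in the describe() output.
sample = pd.DataFrame({
    'X': [-122.425892, -122.424363, -120.5],
    'Y': [37.774599, 37.800414, 90.0],
})

# Rough bounding box around San Francisco (assumed values for illustration)
LON_MIN, LON_MAX = -122.52, -122.35
LAT_MIN, LAT_MAX = 37.70, 37.84

# True for rows whose longitude and latitude both fall inside the box
in_city = sample['X'].between(LON_MIN, LON_MAX) & sample['Y'].between(LAT_MIN, LAT_MAX)
outliers = sample[~in_city]

print(len(outliers), 'row(s) outside the bounding box')
print(outliers)
```

<p class="wp-block-paragraph">Applied to df_base, the same mask could be used to inspect the affected rows before deciding whether to drop or correct them.</p>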



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p class="wp-block-paragraph">At the beginning of a new project, we usually don&#8217;t understand the data well and need to acquire that understanding. Therefore, next, we will explore the data and familiarize ourselves with its characteristics. </p>



<p class="wp-block-paragraph">The following examples will help us better understand our data&#8217;s characteristics. For example, you can use box-and-whisker plots or a correlation matrix to better understand the relationships between variables, such as between weekdays and prediction categories. Feel free to create more charts.</p>



<h4 class="wp-block-heading" id="h-2-1-prediction-labels">2.1 Prediction Labels</h4>



<p class="wp-block-paragraph">Running the code below shows a bar plot of the prediction labels. The plot shows how frequently each class label occurs in the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># print the value counts of the categories
plt.figure(figsize=(15,5))
ax = sns.countplot(x = df_base['Category'], orient='v', order = df_base['Category'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3167" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-13-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" data-orig-size="910,467" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-13" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png" alt="crime types in San Francisco" class="wp-image-3167" width="914" height="470" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 910w, https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 300w, https://www.relataly.com/wp-content/uploads/2021/04/image-13.png 768w" sizes="(max-width: 914px) 100vw, 914px" /></figure>



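<p class="wp-block-paragraph">As shown above, our class labels are highly imbalanced. This affects how we should read model accuracy, because a model can score well simply by favoring the most frequent classes. We need to keep this in mind when we evaluate the performance of our model.</p>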
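<p class="wp-block-paragraph">To put numbers on the imbalance, we can compute the share of each class and the accuracy of a naive model that always predicts the most frequent class. The sketch below uses a toy label series with made-up counts in place of df_base['Category'].</p>

```python
import pandas as pd

# Toy labels standing in for df_base['Category'] (made-up counts for illustration)
labels = pd.Series(['LARCENY/THEFT'] * 6 + ['ASSAULT'] * 3 + ['FRAUD'] * 1)

# Relative frequency of each class, sorted from most to least common
class_share = labels.value_counts(normalize=True)
print(class_share)

# Accuracy of a naive model that always predicts the most frequent class
baseline_accuracy = class_share.iloc[0]
print('Majority-class baseline accuracy:', round(baseline_accuracy, 2))
```

<p class="wp-block-paragraph">A trained classifier should clearly beat this baseline. If it does not, a high accuracy value says little about how well the model separates the rarer classes.</p>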



<h4 class="wp-block-heading" id="h-2-2-when-a-crime-occured-considering-dates-and-time">2.2 When a Crime Occurred &#8211; Considering Dates and Time</h4>



<p class="wp-block-paragraph">We assume that the time at which a crime occurs influences the type of crime. For this reason, we look at how crimes are distributed across the days of the week and the times of the day. First, we look at crime numbers per weekday. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print Crime Counts per Weekday
plt.figure(figsize=(6,3))
ax = sns.countplot(y = df_base['DayOfWeek'], orient='h', order = df_base['DayOfWeek'].value_counts().index)
# Rotate the tick labels on the count axis
ax.tick_params(axis='x', rotation=90)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3164" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-12-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" data-orig-size="437,238" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png" alt="" class="wp-image-3164" width="669" height="364" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-12.png 437w, https://www.relataly.com/wp-content/uploads/2021/04/image-12.png 300w" sizes="(max-width: 669px) 100vw, 669px" /></figure>



<p class="wp-block-paragraph">Fewer crimes happen on Sundays, and most happen on Fridays. So it seems that even criminals like to have a weekend. Next, let&#8217;s take a look at the time of day when certain crimes are reported. For the sake of clarity, we limit the plot to a selection of categories.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Convert the time of day to decimal hours (e.g., 23:30 becomes 23.5)
dates = pd.to_datetime(df_base['Dates'])
df_base['Hour_Min'] = dates.dt.hour + dates.dt.minute / 60

# Print Crime Counts per Time and Category
df_base_filtered = df_base[df_base['Category'].isin([
    'PROSTITUTION', 
    'VEHICLE THEFT', 
    'DRUG/NARCOTIC', 
    'WARRANTS', 
    'BURGLARY', 
    'FRAUD', 
    'ASSAULT',
    'LARCENY/THEFT',
    'VANDALISM'])]

# displot is a figure-level function that creates its own figure; height and aspect set its size
ax = sns.displot(x = 'Hour_Min', hue=&quot;Category&quot;, data = df_base_filtered, kind=&quot;kde&quot;, height=8, aspect=1.5)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3174" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-17-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" data-orig-size="983,568" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-17.png" alt="different crime types in San Francisco and how often they occur during the day" class="wp-image-3174" width="906" height="522"/></figure>



<p class="wp-block-paragraph">The time of day also affects the likelihood of certain crime types. For example, we can see that FRAUD rarely occurs at night and usually occurs during the day. Overall, criminal activity peaks in the afternoon and around midnight. Certain crimes, such as VEHICLE THEFT, mainly occur at night and in the late afternoon but less often in the morning.</p>
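<p class="wp-block-paragraph">The density plot gives a visual impression. If you prefer numbers, a cross-tabulation of category and hour shows the hour in which each crime type is reported most often. The following sketch uses a toy dataframe with made-up timestamps in place of df_base.</p>

```python
import pandas as pd

# Toy records standing in for df_base (timestamps and categories are made up)
toy = pd.DataFrame({
    'Dates': ['2015-05-13 23:53:00', '2015-05-13 22:10:00', '2015-05-13 23:05:00',
              '2015-05-12 11:30:00', '2015-05-12 11:45:00', '2015-05-12 14:00:00'],
    'Category': ['VEHICLE THEFT', 'VEHICLE THEFT', 'VEHICLE THEFT',
                 'FRAUD', 'FRAUD', 'FRAUD'],
})
toy['Hour'] = pd.to_datetime(toy['Dates']).dt.hour

# Rows: categories, columns: hours, values: number of incidents
hourly_counts = pd.crosstab(toy['Category'], toy['Hour'])

# Hour with the most incidents for each category
peak_hour = hourly_counts.idxmax(axis=1)
print(peak_hour)
```

<p class="wp-block-paragraph">On the full dataset, the same table can also be normalized per row to compare the hourly profile of frequent and rare categories on the same scale.</p>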



<p class="wp-block-paragraph">If you want to gain an overview of additional features, you can use Seaborn&#8217;s pairplot function. Because our dataset is large, we reduce the computation time by plotting only every 100th record.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sns.pairplot(data = df_base_filtered[0::100], height=4, aspect=1.5, hue='Category')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8386" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/pairplot-by-category-san-francisco-crime-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png" data-orig-size="1464,844" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="pairplot-by-category-san-francisco-crime-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png" src="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map-1024x590.png" alt="pairplot by category, san francisco crime map" class="wp-image-8386" width="1080" height="622" srcset="https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/pairplot-by-category-san-francisco-crime-map.png 1464w" sizes="(max-width: 1080px) 100vw, 1080px" /></figure>



<h4 class="wp-block-heading" id="h-2-3-where-a-crime-occured-considering-address">2.3 Where a Crime Occurred &#8211; Considering Address</h4>



<p class="wp-block-paragraph">Next, we look at the address column, which often contains information that we can turn into additional features. We start by printing a few sample address values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Extracting information from the streetnames
for i in df_base['Address'][0:10]:
    print(i)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">OAK ST / LAGUNA ST
OAK ST / LAGUNA ST
VANNESS AV / GREENWICH ST
1500 Block of LOMBARD ST
100 Block of BRODERICK ST
0 Block of TEDDY AV
AVALON AV / PERU AV
KIRKWOOD AV / DONAHUE ST
600 Block of 47TH AV
JEFFERSON ST / LEAVENWORTH ST</pre></div>



<p class="wp-block-paragraph">The street names alone are not very helpful. However, the address data does provide additional information. For example, it tells us whether the location is a street intersection (written as two street names separated by a slash) or a block of a single street. In addition, it contains the type of street, such as ST or AV. This information is valuable because we can extract these parts of the text and use them as separate features.</p>
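


<p class="wp-block-paragraph">As a minimal, dependency-free sketch of this idea, the snippet below derives a few flags from single address strings. The helper name address_features and the list of street types are assumptions made for illustration. It matches whole words, which is one possible design choice: a plain substring search for &#8220; ST&#8221; would also match street names that merely start with those letters.</p>

```python
import re

# Street-type suffixes mentioned in this tutorial; the list is illustrative, not exhaustive.
STREET_TYPES = ('ST', 'AV', 'WY', 'TR', 'DR')

def address_features(address):
    """Derive simple flags from one raw address string (hypothetical helper)."""
    # Split into whole words so that 'ST' only matches the suffix, not e.g. 'STOCKTON'.
    tokens = re.findall(r'[A-Za-z0-9]+', address)
    features = {
        'is_crossing': ' / ' in address,   # intersections are written as 'A / B'
        'is_block': 'Block' in tokens,     # e.g. '1500 Block of LOMBARD ST'
    }
    for street_type in STREET_TYPES:
        features['is_' + street_type] = street_type in tokens
    return features

for sample in ['OAK ST / LAGUNA ST', '1500 Block of LOMBARD ST', '0 Block of TEDDY AV']:
    print(sample, address_features(sample))
```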



<p class="wp-block-paragraph">We could do a lot more, but we&#8217;ve got a good enough idea of the data.</p>



<h3 class="wp-block-heading" id="h-step-3-data-preprocessing">Step #3 Data Preprocessing</h3>



<p class="wp-block-paragraph">Feature engineering is probably the most exciting and important part of model development. The right features often improve performance more than tuning the model parameters does.</p>



<h4 class="wp-block-heading" id="h-3-1-remarks-on-data-preprocessing-for-xgboost">3.1 Remarks on Data Preprocessing for XGBoost </h4>



<p class="wp-block-paragraph">When preprocessing the data, it helps to know which algorithm we will use because some algorithms are picky about the shape of the data. We will prepare the data to train a gradient-boosting model (XGBoost). This algorithm builds an ensemble of decision trees that split on numeric feature values (including integer and Boolean values), so in the setup used here it cannot work with raw text categories. Therefore, we need to encode the categorical features. We also need to map the categorical labels to integer values.</p>



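<p class="wp-block-paragraph">We don&#8217;t need to scale the continuous feature variables because gradient boosting and decision trees in general are not sensitive to variables with different scales. A tree only compares a feature value against a threshold, so the order of the values matters, not their magnitude.</p>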
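


<p class="wp-block-paragraph">To illustrate why tree-based models do not depend on feature scale, here is a small sketch in plain Python. It searches for the best single split of a toy feature using Gini impurity and shows that the same rows end up in the left branch after the feature is rescaled. The data and the helper names (gini, best_split) are made up for this example; real tree libraries are more elaborate, but they rank candidate splits in a comparable way.</p>

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Return the row indices sent to the left child by the best single threshold."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = None, None
    for cut in range(1, len(order)):
        # Only split between two distinct feature values.
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        # Weighted impurity of the two child nodes
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right])) / len(xs)
        if best_score is None or score < best_score:
            best_score, best_left = score, sorted(left)
    return best_left

hours = [0.5, 2.0, 3.5, 13.0, 18.5, 23.0]   # made-up feature values
labels = [1, 1, 1, 0, 0, 0]                  # made-up class labels
minutes = [h * 60 for h in hours]            # same feature on a different scale

print(best_split(hours, labels))
print(best_split(minutes, labels))
```

<p class="wp-block-paragraph">Both calls select the same rows because rescaling changes the threshold value but not the order of the observations.</p>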



<h4 class="wp-block-heading" id="h-3-2-feature-engineering">3.2 Feature Engineering</h4>



<p class="wp-block-paragraph">Based on the data exploration that we have done in the previous section, we create three feature types:</p>



<ul class="wp-block-list">
<li><strong>Date &amp; Time:</strong> When a crime happens is essential. For example, when there is a lot of traffic on the street, traffic-related crimes are more likely. Similarly, on Saturdays more people usually visit the nightlife district, which attracts certain crimes, e.g., drug-related ones. Therefore, we will create different features for the time, the day, the month, and the year. </li>



<li><strong>Address</strong>: As mentioned, we will extract additional features from the address column. First, we create different features for the street type (for example, ST, AV, WY, TR, DR). Second, we check whether the address contains the word &#8220;Block.&#8221; Finally, we let our model know whether the address is a street crossing.</li>



<li><strong>Latitude &amp; Longitude</strong>: We will transform the latitude and longitude values into polar coordinates and keep the original values as additional features. This gives our model a second representation of the location to work with. We will also remove some outliers from the dataset whose latitude is far off the grid.</li>
</ul>



<p class="wp-block-paragraph">Considering these features, the primary input to our crime-type prediction model is the information on when and where a crime occurs. </p>
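


<p class="wp-block-paragraph">The date and time features from the list above can be sketched with the Python standard library for a single timestamp. The timestamp format and the helper name time_features are assumptions for illustration; the tutorial code derives the same kind of values column-wise with pandas.</p>

```python
from datetime import datetime

DAY_NAMES = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

def time_features(timestamp):
    """Split one timestamp string into simple date and time features (hypothetical helper)."""
    # Assumes timestamps in the form 'YYYY-MM-DD HH:MM:SS'
    dt = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    return {
        'hour_min': dt.hour + dt.minute / 60,   # time of day as a fractional hour
        'day': dt.day,
        'month': dt.month,
        'year': dt.year,
        'day_of_week': DAY_NAMES[dt.weekday()],
    }

# Made-up example timestamp
print(time_features('2015-05-13 23:30:00'))
```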



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Processing Functions for Features and Labels
import numpy as np
import pandas as pd

def cart2polar(x, y):
    # Convert Cartesian coordinates to polar coordinates (distance, angle in radians)
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

def preprocessFeatures(dfx):
    # Parse the timestamp column once and reuse it below
    dates = pd.to_datetime(dfx['Dates'])
    
    # Time Feature Engineering
    df = pd.get_dummies(dfx[['DayOfWeek' , 'PdDistrict']])
    df['Hour_Min'] = dates.dt.hour + dates.dt.minute / 60
    # We add a feature that contains the exponential of the time of day
    df['Hour_Min_Exp'] = np.exp(df['Hour_Min'])
    
    df['Day'] = dates.dt.day
    df['Month'] = dates.dt.month
    df['Year'] = dates.dt.year

    # One-hot encode the month in addition to the numeric month column
    month_one_hot_encoded = pd.get_dummies(dates.dt.month, prefix='Month')
    df = pd.concat([df, month_one_hot_encoded], axis=1, join=&quot;inner&quot;)
    
    # Convert Cartesian Coordinates to Polar Coordinates
    df[['X', 'Y']] = dfx[['X', 'Y']] # we maintain the original coordinates as additional features
    df['dist'], df['phi'] = cart2polar(dfx['X'], dfx['Y'])
  
    # Extracting Street Types
    df['Is_ST'] = dfx['Address'].str.contains(&quot; ST&quot;, case=True)
    df['Is_AV'] = dfx['Address'].str.contains(&quot; AV&quot;, case=True)
    df['Is_WY'] = dfx['Address'].str.contains(&quot; WY&quot;, case=True)
    df['Is_TR'] = dfx['Address'].str.contains(&quot; TR&quot;, case=True)
    df['Is_DR'] = dfx['Address'].str.contains(&quot; DR&quot;, case=True)
    df['Is_Block'] = dfx['Address'].str.contains(&quot; Block&quot;, case=True)
    df['Is_crossing'] = dfx['Address'].str.contains(&quot; / &quot;, case=True)
    
    return df

# Processing Function for Labels
def encodeLabels(dfx):
    # pd.factorize returns a tuple: (integer codes, unique category labels)
    factor = pd.factorize(dfx['Category'])
    return factor

# Remove Outliers by Latitude (Y)
df_cleaned = df_base[df_base['Y']&lt;70]

# Encode Labels as Integer
factor = encodeLabels(df_cleaned)
y_df = factor[0]
labels = list(factor[1])
# for val, i in enumerate(labels):
#     print(val, i)</pre></div>
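


<p class="wp-block-paragraph">As a quick worked example of the polar conversion, the sketch below applies the same formulas to a single point using the standard math module. The coordinates are approximate, illustrative values for a location in San Francisco, and cart2polar_scalar is a hypothetical scalar counterpart of the function above.</p>

```python
import math

def cart2polar_scalar(x, y):
    """Scalar polar conversion: distance from the origin and angle in radians."""
    return math.hypot(x, y), math.atan2(y, x)

# Approximate longitude (X) and latitude (Y) of a point in San Francisco (illustrative values)
dist, phi = cart2polar_scalar(-122.412, 37.785)
print(round(dist, 2), round(phi, 4))   # roughly 128.11 and 2.8422
```

<p class="wp-block-paragraph">Because all crimes lie within one city, dist and phi vary only in their later decimal places. Since we do not scale the features for tree-based models, this small range is not a problem.</p>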



<p class="wp-block-paragraph">We could also try to improve our features further by adding data sources such as weather data. However, there is no guarantee that this will improve the model results, and in our tests with the crime data, it did not. Therefore, we have omitted this part.</p>



<h3 class="wp-block-heading" id="h-step-4-visualize-crime-types-on-a-map-of-san-francisco">Step #4 Visualize Crime Types on a Map of San Francisco</h3>



<p class="wp-block-paragraph">Next, we create a San Francisco crime map using the coordinates (longitude X and latitude Y) that indicate where a crime has occurred. First, we only plot the data without a geographical map. Later, we will use the same spatial data to create a dot plot and overlay it on a map of San Francisco. Visualizing the crime types on a map helps us understand how they are distributed across the city. </p>



<h4 class="wp-block-heading" id="h-4-1-plot-crime-types-using-a-scatter-plot">4.1 Plot Crime Types using a Scatter Plot</h4>



<p class="wp-block-paragraph">First, we want to gain an overview of possible spatial patterns and hotspots. We expect to see streets and neighborhoods where certain crimes are more common than in the more expensive areas of the city. In addition, we expect to see places in the city where certain crime types occur relatively rarely. To gain an overview of the crime distribution in San Francisco, we use a scatter plot to display the crime coordinates on a blank chart. </p>



<p class="wp-block-paragraph">Running the code below creates the crime map of San Francisco with all crime types. Depending on the speed of your machine, the creation of the map may take several minutes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot Criminal Activities by Lat and Long
df_filtered = df_cleaned.sample(frac=0.05)  
#df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05) # to filter 

groups = df_filtered.groupby('Category')

fig, ax = plt.subplots(sharex=False, figsize=(20, 12))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name, alpha=0.9)
ax.legend()
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8389" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/output-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/output.png" data-orig-size="1170,683" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/output.png" src="https://www.relataly.com/wp-content/uploads/2022/05/output-1024x598.png" alt="Crime Map of San Francisco (sf crime map) - Kaggle Crime Prediction Challenge. Classification with XGBoost" class="wp-image-8389" width="1111" height="649" srcset="https://www.relataly.com/wp-content/uploads/2022/05/output.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/output.png 1170w" sizes="(max-width: 1111px) 100vw, 1111px" /></figure>



<p class="wp-block-paragraph">The plot shows that certain streets in San Francisco are more prone to specific crime types than others. It is also clear that there are certain crime hotspots in the city, especially in the center. We can also see that few crimes are reported in public park areas. </p>



<h4 class="wp-block-heading" id="h-4-2-create-a-crime-map-of-san-francisco-using-plotly">4.2 Create a Crime Map of San Francisco using Plotly</h4>



<p class="wp-block-paragraph">Next, we will create a San Francisco crime map using the Plotly Python library. Because the interactive map can only handle a limited number of data points at once, we will reduce our data to a random sample of 1%. If the map is still too crowded, you can additionally restrict the data to a few selected crime types. </p>



<p class="wp-block-paragraph">Running the code below opens a _map.html file in your browser that displays the SF crime map. The result is a zoomable geographic map of San Francisco that shows how the crime types are distributed across the city.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 4.2 Create a Crime Map of San Francisco using Plotly
# Limit the data to a 1% random sample
df_filtered = df_cleaned.sample(frac=0.01) 
fig = px.scatter_mapbox(df_filtered, lat=&quot;Y&quot;, lon=&quot;X&quot;, hover_name=&quot;Category&quot;, color='Category', hover_data=[&quot;Y&quot;, &quot;X&quot;], zoom=12, height=800)
fig.update_layout(mapbox_style=&quot;open-street-map&quot;)
fig.update_layout(margin={&quot;r&quot;:0,&quot;t&quot;:0,&quot;l&quot;:0,&quot;b&quot;:0})
fig.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8401" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/map-sf-crime/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png" data-orig-size="1600,650" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="map-sf-crime" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png" src="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime-1024x416.png" alt="Crime Map of San Francisco (SF Crime Map) - Kaggle Crime Prediction Challenge. Crime Classification XGBoost" class="wp-image-8401" width="1097" height="445" srcset="https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1536w, https://www.relataly.com/wp-content/uploads/2022/05/map-sf-crime.png 1600w" sizes="(max-width: 1097px) 100vw, 1097px" /></figure>



<p class="wp-block-paragraph">The SF crime map shows different types of crimes, including prostitution, vehicle theft, and fraud. The interactive map allows you to change zoom levels and filter the type of crime displayed on the map. For example, if you filter DRUG/NARCOTIC-related crimes, you can see that these crimes mainly occur in the city center near the financial district and the nightlife area.</p>



<h3 class="wp-block-heading" id="h-step-5-split-the-data">Step #5 Split the Data</h3>



<p class="wp-block-paragraph">Before training our predictive model, we split our data into separate datasets for training and testing. For this purpose, we use the train_test_split function of scikit-learn and assign 70% of the data to the training set and the remaining 30% to the test set. Then we display the training features, which we use in the next step to train a model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create train_df &amp; test_df
x_df = preprocessFeatures(df_cleaned).copy()

# Split the data into training (70%) and test (30%) data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		DayOfWeek_Friday	DayOfWeek_Monday	DayOfWeek_Saturday	DayOfWeek_Sunday	DayOfWeek_Thursday	DayOfWeek_Tuesday	DayOfWeek_Wednesday	PdDistrict_BAYVIEW	PdDistrict_CENTRAL	PdDistrict_INGLESIDE	...	Y			dist		phi			Is_ST	Is_AV	Is_WY	Is_TR	Is_DR	Is_Block	Is_crossing
276998	0					0					0					0					0					1					0					0					0					0						...	37.785023	128.110900	2.842200	True	False	False	False	False	True		False
81579	0					0					0					0					0					1					0					0					0					0						...	37.748470	128.185052	2.842677	False	True	False	False	False	True		False
206676	0					0					0					1					0					0					0					0					0					0						...	37.762744	128.113657	2.842389	True	False	False	False	False	True		False
732006	0					0					0					0					0					0					1					0					0					0						...	37.784140	128.109653	2.842204	True	False	False	False	False	False		True
796194	1					0					0					0					0					0					0					0					0					0						...	37.791333	128.125982	2.842185	True	False	False	False	False	True		False
5 rows × 45 columns</pre></div>



<h3 class="wp-block-heading" id="h-step-6-train-a-random-forest-classifier">Step #6 Train a Random Forest Classifier</h3>



<p class="wp-block-paragraph">Now that we have prepared the data, we can train the predictive models. In the first step, we train a basic model based on the Random Forest algorithm. The Random Forest is a robust algorithm that can handle regression and classification problems. One of our <a href="https://www.relataly.com/anyone-about-to-leave-predicting-the-customer-churn-of-a-telecommunications-provider/2378/" target="_blank" rel="noreferrer noopener">recent articles provides more information on Random Forests</a> and how you can find the optimal configuration of their hyperparameters. In this tutorial, we use the Random Forest to establish a baseline against which we can measure the performance of our XGBoost model. We, therefore, use the Random Forest with a simple parameter configuration without tuning the hyperparameters.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a single random forest classifier - parameters are a best guess
clf = RandomForestClassifier(max_depth=100, random_state=0, n_estimators = 200)
clf.fit(x_train, y_train.ravel())
y_pred = clf.predict(x_test)

results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">
              precision    recall  f1-score   support

           0       0.15      0.10      0.12     12657
           1       0.29      0.35      0.32     37898
           2       0.38      0.63      0.47     52237
           3       0.46      0.40      0.43     16136
           4       0.16      0.08      0.10     13426
           5       0.25      0.21      0.23     27798
           6       0.10      0.04      0.06      6850
           7       0.23      0.22      0.23     23087
           8       0.19      0.12      0.15      2586
           9       0.20      0.13      0.15     10942
          10       0.08      0.03      0.05      9559
          11       0.00      0.00      0.00      1300
          12       0.20      0.10      0.14      3200
          13       0.37      0.43      0.40     16282
          14       0.02      0.02      0.02      1350
          15       0.01      0.00      0.00      2912
          16       0.05      0.03      0.04      2217
          17       0.61      0.52      0.56      7865
          18       0.11      0.06      0.08      4954
          19       0.04      0.03      0.03       723
          20       0.28      0.19      0.23       581
          21       0.05      0.02      0.03       708
          22       0.25      0.13      0.17      1333
...
    accuracy                           0.31    263395
   macro avg       0.15      0.12      0.13    263395
weighted avg       0.28      0.31      0.28    263395</pre></div>



<p class="wp-block-paragraph">The baseline model is a random forest classifier that achieves 31% accuracy on the test dataset.</p>



<h3 class="wp-block-heading" id="h-step-7-train-an-xgboost-classifier">Step #7 Train an XGBoost Classifier</h3>



<p class="wp-block-paragraph">Now that we have a baseline model, we can train our gradient boosting classifier using the XGBoost package. We expect this model to perform better than the baseline. </p>



<h4 class="wp-block-heading" id="h-7-1-about-gradient-boosting">7.1 About Gradient Boosting</h4>



<p class="wp-block-paragraph">XGBoost is an implementation of the <a href="https://www.relataly.com/category/machine-learning-algorithms/gradient-boosting/" target="_blank" rel="noreferrer noopener">gradient-boosting algorithm</a>, an ensemble method based on decision trees. The algorithm builds the ensemble sequentially: in each boosting round, it adds a new tree that is trained to reduce the prediction error of the trees already in the model. Trees that have been added are not removed again. The algorithm repeats this step for a configured number of rounds or until a stopping criterion is met. Thus, each new tree is not fitted to the original labels but to the previous ensemble&#8217;s residuals (prediction errors, or more generally the gradient of the loss).</p>
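


<p class="wp-block-paragraph">The core idea of fitting each new tree to the residuals of the current ensemble can be sketched in a few lines of plain Python. The example below is a deliberately simplified illustration, not XGBoost itself: it uses one-split regression trees (stumps), a squared-error loss on a made-up one-dimensional dataset, and no regularization. The function names fit_stump and boost are hypothetical.</p>

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=20, eta=0.3):
    """Add stumps one at a time, each fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    trees, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Shrink each tree's contribution by the learning rate eta
        preds = [p + eta * tree(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return trees, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]                    # made-up feature
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]    # made-up target
trees, history = boost(xs, ys)
print(round(history[0], 3), round(history[-1], 3))   # training error after the first and last round
```

<p class="wp-block-paragraph">The training error shrinks from round to round because every new stump corrects part of what the previous ensemble got wrong. The learning rate eta controls how much of each correction is applied.</p>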



<p class="wp-block-paragraph" id="h-xgboost-an-implementation-of-a-gradient-boosting-algorithm-that-creates-a-random-forest-which-is-an-ensemble-of-decision-trees-the-basic-idea-of-a-decision-forest-is-to-have-an-ensemble-of-decision-trees-that-come-to-a-conclusion-on-the-prediction-label-by-casting-a-vote-gradient-boosting-is-used-to-find-an-optimal-ensemble-of-trees-and-parameters-boosting-is-the-process-of-iteratively-adding-trees-to-correct-the-prediction-error-of-a-previous-ensemble-of-trees-until-no-further-improvements-can-be-achieved-the-models-are-thereby-trained-to-predict-the-residuals-prediction-errors-of-the-previous-model">But XGBoost does more! It is an optimized (&#8220;extreme&#8221;) version of gradient boosting that adds regularization and several performance optimizations to achieve good results efficiently. The number of trees is still a parameter that we set via the number of boosting rounds. However, when we provide a validation set and enable early stopping, XGBoost can stop adding trees once the validation score no longer improves, so we do not have to know the exact number in advance.</p>



<p class="wp-block-paragraph" id="h-xgboost-an-implementation-of-a-gradient-boosting-algorithm-that-creates-a-random-forest-which-is-an-ensemble-of-decision-trees-the-basic-idea-of-a-decision-forest-is-to-have-an-ensemble-of-decision-trees-that-come-to-a-conclusion-on-the-prediction-label-by-casting-a-vote-gradient-boosting-is-used-to-find-an-optimal-ensemble-of-trees-and-parameters-boosting-is-the-process-of-iteratively-adding-trees-to-correct-the-prediction-error-of-a-previous-ensemble-of-trees-until-no-further-improvements-can-be-achieved-the-models-are-thereby-trained-to-predict-the-residuals-prediction-errors-of-the-previous-model">A disadvantage of XGBoost is that it can overfit the training data, especially with deep trees and many boosting rounds. Therefore, testing against unseen data is essential. For simplicity, this tutorial evaluates the model on a single held-out test set, but using cross-validation would be a better choice.</p>
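


<p class="wp-block-paragraph">For readers who want to see what cross-validation does under the hood, here is a minimal sketch of k-fold index generation in plain Python. The function name k_fold_indices is hypothetical. In each of the k iterations, one fold serves as the test set and the remaining folds form the training set, so every row is used for testing exactly once.</p>

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(10, k=5):
    print(sorted(test), len(train))
```

<p class="wp-block-paragraph">In practice, you would train the model once per fold and average the scores. scikit-learn offers ready-made helpers for this, such as KFold, StratifiedKFold, and cross_val_score.</p>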



<h4 class="wp-block-heading" id="h-7-2-train-the-xgboost-classifier">7.2 Train the XGBoost Classifier</h4>



<p class="wp-block-paragraph">Various gradient boosting implementations are available for Python, including those in scikit-learn. However, scikit-learn&#8217;s classic GradientBoostingClassifier builds its trees on a single thread, which makes the training process on a dataset of this size slower than necessary. For this reason, we will use the gradient boosting classifier from the XGBoost package, which supports multi-threaded and GPU-based training.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Configure the XGBoost model
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

param = {'booster': 'gbtree', 
         'tree_method': 'gpu_hist',     # requires a GPU; XGBoost 2.0+ uses tree_method='hist' with device='cuda'
         'predictor': 'gpu_predictor',  # GPU prediction setting of older XGBoost versions
         'max_depth': 140, 
         'learning_rate': 0.3,          # called eta in the native XGBoost API
         'objective': 'multi:softmax',  # multi-class classification
         'eval_metric': 'mlogloss', 
         'n_estimators': 30             # number of boosting rounds
        }
# The feature_selector setting only applies to the gblinear booster, so it is omitted here

# Unpack the dictionary so that each entry is passed as a keyword argument
xgb_clf = XGBClassifier(**param)
xgb_clf.fit(x_train, y_train.ravel())
score = xgb_clf.score(x_test, y_test.ravel())
print(score)

# Create predictions on the test dataset
y_pred = xgb_clf.predict(x_test)

# Print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">
0.30852142219859907
              precision    recall  f1-score   support

           0       0.17      0.01      0.02     12657
           1       0.30      0.42      0.35     37898
           2       0.33      0.72      0.46     52237
           3       0.31      0.27      0.29     16136
           4       0.21      0.03      0.05     13426
           5       0.24      0.18      0.21     27798
           6       0.17      0.01      0.01      6850
           7       0.21      0.19      0.20     23087
           8       0.26      0.01      0.02      2586
           9       0.22      0.08      0.12     10942
          10       0.13      0.00      0.00      9559
          11       0.07      0.00      0.01      1300
          12       0.20      0.08      0.11      3200
          13       0.34      0.43      0.38     16282
          14       0.00      0.00      0.00      1350
          15       0.12      0.00      0.01      2912
          16       0.15      0.02      0.03      2217
          17       0.57      0.34      0.43      7865
          18       0.19      0.03      0.05      4954
          19       0.00      0.00      0.00       723
          20       0.50      0.24      0.32       581
          21       0.10      0.01      0.01       708
...
    accuracy                           0.31    263395
   macro avg       0.18      0.11      0.11    263395
weighted avg       0.27      0.31      0.25    263395
</pre></div>



<p class="wp-block-paragraph">Now that we have trained our classification model, let&#8217;s see how it performs. The code above generates predictions (y_pred) on the test dataset (x_test) and compares them with the actual values (y_test) in a classification report.</p>



<p class="wp-block-paragraph">Our model achieves an accuracy score of 31%, which is about the same as the Random Forest baseline. At first glance, this might not look so good, but considering that we have 39 categories and only sparse information available, it is a respectable result.</p>
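


<p class="wp-block-paragraph">To put the 31% accuracy into context, we can compare it with two naive strategies using only numbers from the classification report: guessing uniformly among the 39 categories, and always predicting class 2, which has the largest support among the rows shown in the report. This is a back-of-the-envelope calculation, not an additional model.</p>

```python
n_classes = 39              # number of crime categories stated in the text
test_rows = 263395          # total support from the classification report
largest_support = 52237     # support of class 2, the largest class among the rows shown

uniform_guess = 1 / n_classes
always_class_2 = largest_support / test_rows

print(f'uniform random guess:   {uniform_guess:.3f}')
print(f'always predict class 2: {always_class_2:.3f}')
```

<p class="wp-block-paragraph">Both naive strategies stay clearly below 31%, so the model has learned something from the time and location features, even though there is plenty of room for improvement.</p>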



<h3 class="wp-block-heading" id="h-step-8-measure-model-performance">Step #8 Measure Model Performance</h3>



<p class="wp-block-paragraph">So how well does our XGBoost model perform? To measure the performance of our model, we create a confusion matrix that visualizes the predictions of the XGBoost classifier. If you want to learn more about measuring the performance of classification models, check out <a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">this tutorial on measuring classification performance</a>.</p>



<p class="wp-block-paragraph">Running the code below creates the confusion matrix that shows the number of correct and false predictions for each crime category.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print a multi-Class Confusion Matrix
# Use one fixed label order for both axes of the confusion matrix
labels = np.unique(y_test)
cnf_matrix = confusion_matrix(np.ravel(y_test), y_pred, labels=labels)
df_cm = pd.DataFrame(cnf_matrix, columns=labels, index=labels)
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize=(16, 12))
sns.set(font_scale=1.4)  # for label size
sns.heatmap(df_cm, cbar=True, cmap=&quot;inferno&quot;, annot=False)
plt.tight_layout()  # call after plotting so the layout accounts for the heatmap
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="3213" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-22-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" data-orig-size="903,709" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-22" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" src="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png" alt="Evaluating the performance of our XGboost classifier; sfo crime map" class="wp-image-3213" width="669" height="525" srcset="https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 903w, https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 300w, https://www.relataly.com/wp-content/uploads/2021/04/image-22.png 768w" sizes="(max-width: 669px) 100vw, 669px" /></figure>



<p class="wp-block-paragraph">The confusion matrix shows that our model frequently predicts crime category two and neglects the other crime types. The reason is the uneven distribution of crime types in the training data. When we evaluate the model, we therefore need to consider how important the different crime types are. For example, we might tune the model to predict certain crime types more accurately, although this may come at the cost of lower accuracy for other crime types. Such optimizations depend on the application context and the goals one wants to achieve with the prediction model.</p>
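<p class="wp-block-paragraph">One way to see how unevenly a model treats the individual classes is to derive per-class recall and precision directly from a confusion matrix. The sketch below uses a small, made-up three-class matrix (an assumption for illustration, not the result of the crime model) in which the first class dominates.</p>

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cnf_matrix = np.array([
    [50,  5,  5],
    [30, 10,  0],
    [20,  0,  5],
])

actual_totals = cnf_matrix.sum(axis=1)     # number of samples per actual class
predicted_totals = cnf_matrix.sum(axis=0)  # number of predictions per class
correct = np.diag(cnf_matrix)              # correctly classified samples per class

# Recall: share of each actual class that the model found
recall = correct / actual_totals

# Precision: share of the predictions for each class that were correct
# (classes that are never predicted get a precision of 0 instead of a division error)
precision = np.divide(correct, predicted_totals,
                      out=np.zeros(len(correct)), where=predicted_totals > 0)

accuracy = correct.sum() / cnf_matrix.sum()

print("Recall per class:   ", np.round(recall, 3))
print("Precision per class:", np.round(precision, 3))
print("Overall accuracy:   ", round(accuracy, 3))
```

<p class="wp-block-paragraph">In this made-up example, the overall accuracy hides the fact that the two smaller classes are rarely recognized. This is the same pattern the confusion matrix above reveals for the less frequent crime types.</p>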



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This tutorial has presented the machine learning use case &#8220;Predictive Policing&#8221; and shown how to implement it in Python. We have trained an XGBoost model that predicts crime types in San Francisco based on information about when and where specific crimes have occurred. We also illustrated our data on an interactive crime map of San Francisco with the Plotly Python library. The crime map is an intuitive way of visualizing crime in a city and highlighting particular hotspots. Finally, we have used the prediction model to make test predictions and evaluated the model performance against other algorithms, such as a classic Random Decision Forest. The XGBoost model achieves a prediction accuracy of about 31%, a respectable performance considering that the prediction problem involves 39 crime classes.</p>



<p class="wp-block-paragraph">We hope this tutorial was helpful. If you have any questions or suggestions on what we could improve, feel free to post them in the comments. We appreciate your feedback.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8410" data-permalink="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/image-12-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" data-orig-size="658,592" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png" alt="Crime Map of San Francisco - Kaggle Crime Prediction Challenge. Crim Classification XGBoost" class="wp-image-8410" width="346" height="311" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-12.png 658w, https://www.relataly.com/wp-content/uploads/2022/05/image-12.png 300w" sizes="(max-width: 346px) 100vw, 346px" /><figcaption class="wp-element-caption">Predictive policing with machine learning &#8211; Crime map of San Francisco, created with Python and Plotly</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">Looking for more exciting map visualizations? Consider the relataly tutorial on <a href="https://www.relataly.com/visualize-covid-19-data-on-a-geographic-heat-maps/291/" target="_blank" rel="noreferrer noopener">visualizing COVID-19 data on geographic heatmaps using GeoPandas</a>.</p>



<p>The post <a href="https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/">Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-crimes-in-san-francisco-creatingsf-crime-map-using-xgboost/2960/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2960</post-id>	</item>
		<item>
		<title>Classifying Purchase Intention of Online Shoppers with Python</title>
		<link>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/</link>
					<comments>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 11 May 2020 21:42:35 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Sources]]></category>
		<category><![CDATA[Feature Permutation Importance]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Marketing Automation]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[AI in Marketing]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Classification Error Metrics]]></category>
		<category><![CDATA[Confusion Matrix]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Whisker Plots]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=982</guid>

					<description><![CDATA[<p>Online shopping has become a part of our daily lives, and online stores are continually seeking to improve their sales. One way to achieve this is by using machine learning to predict customers&#8217; purchase intentions. This innovative process can help businesses understand their customers&#8217; behavior and tailor their marketing strategies accordingly. In this article, we ... <a title="Classifying Purchase Intention of Online Shoppers with Python" class="read-more" href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/" aria-label="Read more about Classifying Purchase Intention of Online Shoppers with Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/">Classifying Purchase Intention of Online Shoppers with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Online shopping has become a part of our daily lives, and online stores are continually seeking to improve their sales. One way to achieve this is by using machine learning to predict customers&#8217; purchase intentions. This innovative process can help businesses understand their customers&#8217; behavior and tailor their marketing strategies accordingly.</p>



<p class="wp-block-paragraph">In this article, we will explore the practical side of purchase intention prediction. Our focus is on developing a classification model that predicts whether a visitor will make a purchase or not. We&#8217;ll use Scikit-Learn&#8217;s machine learning library to train a Logistic Regression algorithm, and evaluate the model&#8217;s performance. Our ultimate goal is to provide insights into the circumstances under which customers make purchase decisions.</p>



<p class="wp-block-paragraph">Predicting purchase intentions can offer significant benefits to online stores, such as identifying potential customers who are most likely to buy and targeting their marketing efforts accordingly. By understanding the practical application of machine learning for purchase intention prediction, online businesses can gain a competitive edge and increase their revenue.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/" target="_blank" rel="noreferrer noopener">Sentiment Analysis with Naive Bayes and Logistic Regression in Python</a></p>



<h2 class="wp-block-heading">About Modeling Customer Purchase Intentions</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Customer purchase intention prediction is the process of using machine learning algorithms to predict the likelihood that a particular customer will make a purchase. This can be useful for various applications, such as identifying potential customers most likely interested in a particular product or service and targeting marketing and sales efforts accordingly.</p>



<p class="wp-block-paragraph">To make accurate predictions about customer purchase intentions, it is important to have access to high-quality data about the customer, such as their demographic information, purchasing history, and other relevant factors. By analyzing this data and applying appropriate machine learning algorithms, it is possible to identify patterns and trends that can predict the likelihood that a particular customer will make a purchase.</p>



<p class="wp-block-paragraph">There are many different approaches to customer purchase intention prediction, and the specific methods used can vary depending on the application and the data available. Some common techniques for predicting customer purchase intentions include using regression analysis to model the relationship between purchase intentions and other variables and using classification algorithms to classify customers as likely or unlikely to make a purchase. By using these techniques, it is possible to make more accurate and useful predictions about customer purchase intentions.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/" target="_blank" rel="noreferrer noopener">Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="478" height="500" data-attachment-id="12685" data-permalink="https://www.relataly.com/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" data-orig-size="478,500" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="men and woman doing groceries machine learning customer purchase intention prediction relataly midjourney-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" alt="Customer purchase intentions sometimes follow patterns that can be used for predictive purposes. Image created with Midjourney." 
class="wp-image-12685" srcset="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png 478w, https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png 287w" sizes="(max-width: 478px) 100vw, 478px" /><figcaption class="wp-element-caption">Customer purchase intentions sometimes follow patterns that can be used for predictive purposes. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h2 class="wp-block-heading">How Modeling Purchase Intentions can Lead to a Better Customer Understanding</h2>



<p class="wp-block-paragraph">Predicting the purchase intentions of online shoppers can help online stores understand their customers better. Predictive models make it possible to draw conclusions about the factors that influence customers&#8217; buying behavior. At what time of day are our customers most inclined to buy? For which products do customers often abandon the purchase process? Such questions are highly relevant for marketing departments. Once answered, they enable marketers to optimize their customers&#8217; buying experience and achieve a higher conversion rate. In this way, intention prediction can help online stores target customers with the right products at the right time and thus take a step toward marketing automation.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="6828" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-13-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png" data-orig-size="1846,861" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Classifying Purchase Intentions of Online Shoppers with Python" data-image-description="&lt;p&gt;Classifying Purchase Intentions of Online Shoppers with Python&lt;/p&gt;
" data-image-caption="&lt;p&gt;Classifying Purchase Intentions of Online Shoppers with Python&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-13-1024x478.png" alt="A classification model that predicts the buying intention of online shoppers" class="wp-image-6828" width="760" height="355" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1846w" sizes="(max-width: 760px) 100vw, 760px" /></figure>



<h2 class="wp-block-heading" id="h-implementing-a-prediction-model-for-purchase-intentions-with-python">Implementing a Prediction Model for Purchase Intentions with Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Logistic regression is a widely used machine learning algorithm that is particularly well suited to two-class classification problems. One of the primary benefits of logistic regression models is that they can help us understand the factors that influence the model&#8217;s predictions. This interpretability is a key advantage of logistic regression, making it a popular choice in many real-world applications.</p>



<p class="wp-block-paragraph">In the next steps of our analysis, we will develop a two-class classification model that uses the logistic regression algorithm to predict the purchase intentions of online shoppers. By analyzing a set of features that are likely to influence a shopper&#8217;s decision to purchase, such as the number and type of pages visited, bounce and exit rates, and page values, we can build a model that estimates the likelihood of a shopper completing a purchase. The logistic regression algorithm is particularly useful in this case, as it allows us to identify which features are the most significant predictors of purchase intention.</p>
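<p class="wp-block-paragraph">To illustrate what this interpretability looks like in practice, the following sketch trains a logistic regression model on synthetic data and inspects the learned coefficients. The feature names are borrowed loosely from the dataset, but the values, the label-generating rule, and the &#8220;Noise&#8221; column are assumptions made purely for illustration.</p>

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
n = 2000

# Hypothetical session features (synthetic values, not the real dataset)
df = pd.DataFrame({
    "PageValues": rng.exponential(10.0, n),
    "ExitRates": rng.uniform(0.0, 0.2, n),
    "Noise": rng.normal(0.0, 1.0, n),  # carries no information about the label
})

# Synthetic label: a purchase becomes more likely with high page values
# and less likely with high exit rates
logit = 0.15 * df["PageValues"] - 20.0 * df["ExitRates"] - 1.0
prob = 1.0 / (1.0 + np.exp(-logit))
y = (rng.uniform(size=n) < prob).astype(int)

# Scale all features to [0, 1] so that the coefficients are comparable
X = MinMaxScaler().fit_transform(df)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Positive coefficients increase the predicted purchase probability,
# negative coefficients decrease it
coefs = pd.Series(model.coef_[0], index=df.columns).sort_values()
print(coefs)
```

<p class="wp-block-paragraph">Because the features are scaled to the same range, the sign and magnitude of each coefficient give a rough indication of the direction and strength of its influence on the prediction.</p>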



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_d5d832-9e"><a class="kb-button kt-button button kb-btn_7d1c88-9e kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/019%20%20Classifying%20Shopper%20Buying%20Intention%20using%20Logistic%20Regression.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_040040-16 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and the required packages. If you don&#8217;t have an environment yet, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>. Please make sure that the following packages are installed:</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> and <a data-type="URL" data-id="https://seaborn.pydata.org/" href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn</a> for visualization. You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
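<p class="wp-block-paragraph">If you are unsure whether everything is in place, a short script can check the environment before you start. The snippet below is a minimal sketch that relies only on the Python standard library. The helper name find_missing is made up for this example, and the list contains the import names of the packages mentioned above (scikit-learn is imported as sklearn).</p>

```python
import importlib.util

# Import names of the packages used in this tutorial
# (note that scikit-learn is imported as "sklearn")
required = ["pandas", "numpy", "matplotlib", "sklearn", "seaborn"]


def find_missing(packages):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```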



<h3 class="wp-block-heading" id="h-about-the-dataset">About the Dataset</h3>



<p class="wp-block-paragraph">In this tutorial, we will be working with a public dataset from <a href="https://www.kaggle.com/roshansharma/online-shoppers-intention" target="_blank" rel="noreferrer noopener">Kaggle.com</a>. The data covers 12,330 shopping sessions, each described by 18 attributes (including the class label). You can download the data via the link below:</p>



<div class="wp-block-file"><a id="wp-block-file--media-3f304c01-ab35-4462-bda0-88dce356d27e" href="https://www.relataly.com/wp-content/uploads/2020/05/online_shoppers_intention.csv">online_shoppers_intention.csv</a><a href="https://www.relataly.com/wp-content/uploads/2020/05/online_shoppers_intention.csv" class="wp-block-file__button wp-element-button" download aria-describedby="wp-block-file--media-3f304c01-ab35-4462-bda0-88dce356d27e">Download</a></div>



<p class="wp-block-paragraph">The data comes from a large shopping website that recorded sessions over a period of one year. Each record belongs to a separate shopping session and a different user. This design avoids a bias toward a specific period, user, or day.</p>



<p class="wp-block-paragraph">Below you will find an overview of the features contained in the data (Source: Kaggle.com): </p>



<ul class="wp-block-list">
<li>&#8220;Administrative,&#8221; &#8220;Administrative Duration,&#8221; &#8220;Informational,&#8221; &#8220;Informational Duration,&#8221; &#8220;Product Related,&#8221; and &#8220;Product-Related Duration&#8221; represent the number of different types of pages visited by the visitor in that session and the total time spent in each of these page categories.&nbsp;</li>



<li>The &#8220;Bounce Rate,&#8221; &#8220;Exit Rate,&#8221; and &#8220;Page Value&#8221; features represent the metrics measured by &#8220;Google Analytics&#8221; for each page on the e-commerce site. </li>



<li>The &#8220;Special Day&#8221; feature indicates the closeness of the site visiting time to a specific special day (e.g., Mother&#8217;s Day, Valentine&#8217;s Day).</li>



<li>The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.</li>
</ul>



<p class="wp-block-paragraph">The &#8220;Revenue&#8221; attribute is the class label (also called the prediction label). It indicates whether a session ended with a purchase.</p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by loading the shopping dataset into a Pandas DataFrame. Afterward, we will print a brief overview of the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import calendar
import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib import cm
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
filepath = &quot;data/classification-online-shopping/&quot;
df_shopping_base = pd.read_csv(filepath + 'online_shoppers_intention.csv')
# Show the first five rows
df_shopping_base.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType			Weekend	Revenue
0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			Feb		1					1		1		1			Returning_Visitor	False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			Feb		2					2		1		2			Returning_Visitor	False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			Feb		4					1		9		3			Returning_Visitor	False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			Feb		3					2		2		4			Returning_Visitor	False	False
4	0.0				0.0						0.0				0.0						10.0			627.500000				0.02		0.05		0.0			0.0			Feb		3					3		1		4			Returning_Visitor	True	False</pre></div>



<h3 class="wp-block-heading" id="h-step-2-cleaning-the-data">Step #2 Cleaning the Data</h3>



<p class="wp-block-paragraph">Before we can start training our prediction model, we&#8217;ll do some cleanups (handling missing data, data type conversions, treating outliers, and so on).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Replacing visitor_type to int
print(df_shopping_base['VisitorType'].unique())
df_shop = df_shopping_base.replace({'VisitorType' : { 'New_Visitor' : 0, 'Returning_Visitor' : 1, 'Other' : 2 }})

# Converting the month column to numeric values (the dataset spells out 'June')
month_numbers = {abbr: num for num, abbr in enumerate(calendar.month_abbr) if abbr}
df_shop['Month'] = df_shop['Month'].replace('June', 'Jun').map(month_numbers)

# Delete records with NAs
df_shop.dropna(inplace=True)

df_shop.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">['Returning_Visitor' 'New_Visitor' 'Other']
	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType	Weekend	Revenue
  0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			2		1					1		1		1			1			False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			2		2					2		1		2			1			False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			2		4					1		9		3			1			False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			2		3					2		2		4			1			False	False
4	0.0				0.0						0.0				0.0						10.0			627.50</pre></div>



<h3 class="wp-block-heading" id="h-step-3-exploring-the-data">Step #3 Exploring the Data</h3>



<p class="wp-block-paragraph">Next, we will familiarize ourselves with the data. </p>



<h4 class="wp-block-heading" id="h-3-1-class-labels">3.1 Class Labels</h4>



<p class="wp-block-paragraph">First, we take a look at the class labels to see how balanced they are. If class labels are balanced, it means that each class has an approximately equal number of examples in the training data. This is important because it helps ensure that the trained model will be able to make accurate predictions on new data. If the class labels are unbalanced, then the model is more likely to be biased towards the more common classes, which can lead to poor performance on less common classes. Additionally, unbalanced class labels can make it more difficult to evaluate the performance of a machine learning model, because the model&#8217;s accuracy may not be an accurate reflection of its ability to generalize to new data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Checking the balance of prediction labels
plt.figure(figsize=(16,2))
fig = sns.countplot(y=&quot;Revenue&quot;, data=df_shop, palette=&quot;muted&quot;)
plt.show()</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="6830" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/output-3-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" data-orig-size="953,154" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" alt="" class="wp-image-6830" width="946" height="153" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 953w, https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 768w" sizes="(max-width: 946px) 100vw, 946px" /></figure>



<p class="wp-block-paragraph">Our class labels are imbalanced: there are many more sessions in the data with the label &#8220;False&#8221; than with the label &#8220;True.&#8221; The reason is that most visitors don&#8217;t buy anything. Imbalanced data can affect the performance of classification models. But now that we are aware of the imbalance in our data, we can choose appropriate evaluation metrics later.</p>
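


<p class="wp-block-paragraph">To put a number on the imbalance rather than reading it off the chart, a minimal sketch like the following could be used. The series below is made-up toy data standing in for the Revenue column, so the printed shares are illustrative only.</p>

```python
import pandas as pd

# Toy stand-in for df_shop['Revenue']; the values are made up for illustration
revenue = pd.Series([False] * 85 + [True] * 15, name="Revenue")

# Relative frequency of each class label
print(revenue.value_counts(normalize=True))

# The mean of a boolean series is the share of True values (buyers)
buyer_share = revenue.mean()

# Ratio of majority to minority class as a single imbalance figure
imbalance_ratio = (1 - buyer_share) / buyer_share
print(f"Buyer share: {buyer_share:.2f}, imbalance ratio: {imbalance_ratio:.1f}")
```

<p class="wp-block-paragraph">On the real data, the same two lines can be applied directly to the Revenue column.</p>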



<h4 class="wp-block-heading" id="h-3-2-feature-correlation">3.2 Feature Correlation</h4>



<p class="wp-block-paragraph">When developing classification models, not all features are usually equally useful. It is important that features are not correlated because correlated features can provide redundant information to a machine learning model. If two or more features are highly correlated, they may convey the same information to the model, which can make the model&#8217;s predictions less accurate. Additionally, having correlated features can make it more difficult to interpret the model&#8217;s predictions, because it is not clear which features are actually contributing to the model&#8217;s decision-making process. </p>



<p class="wp-block-paragraph">Let&#8217;s check which of our features are correlated. First, we will create a series of Whiskerplots for the features in our dataset. They help us identify potential outliers and get a better idea of how the data looks.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Whiskerplots
df_shop.drop('Revenue', axis=1).plot(kind='box', 
                                subplots=True, layout=(4,4), 
                                sharex=False, sharey=False, 
                                figsize=(14,14), 
                                title='Whisker plot for input variables')
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="986" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-35-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" data-orig-size="821,893" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-35" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" src="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" alt="Purchase Intention Prediction, Feature Permutation Importance, Feature Correlation plot" class="wp-image-986" width="664" height="721" srcset="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 821w, https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 276w, https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 768w" sizes="(max-width: 664px) 100vw, 664px" /><figcaption class="wp-element-caption">Feature Whiskerplots</figcaption></figure>



<p class="wp-block-paragraph">The Whiskerplots show that several features contain outliers. However, the outliers are not extreme enough to require special treatment here.</p>
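


<p class="wp-block-paragraph">The points drawn beyond the whiskers follow the common 1.5 x IQR rule. As a rough sketch, the share of such points per feature could be computed as shown here. The helper name <code>iqr_outlier_share</code> and the toy durations are assumptions made for illustration; they are not part of the original notebook.</p>

```python
import pandas as pd

def iqr_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Return the share of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return float(outside.mean())

# Toy data (made up): nine ordinary durations and one extreme session
durations = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 500])
print(iqr_outlier_share(durations))
```

<p class="wp-block-paragraph">Applied column by column, this gives a quick check on whether the outliers are rare enough to ignore.</p>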



<p class="wp-block-paragraph">Pair plots are another way of visualizing the data. The plots on the diagonal show the distribution of each feature, separated by class label, while the remaining cells show scatter plots for every pair of features. To create the pair plots, run the code below.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create pair plots for the feature columns, separated by prediction label value
df_plot = df_shop.copy()

# class_columnname = 'Revenue'
sns.pairplot(df_plot, hue=&quot;Revenue&quot;, height=2.5)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="6829" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/shopper-buying-intention/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png" data-orig-size="2560,2485" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Shopper-Buying-Intention pair plots with seaborn" data-image-description="&lt;p&gt;Shopper-Buying-Intention pair plots with seaborn&lt;/p&gt;
" data-image-caption="&lt;p&gt;Shopper-Buying-Intention pair plots with seaborn&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png" src="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention-1024x994.png" alt="Purchase Intention Prediction, Feature Permutation Importance, Feature Correlation plot" class="wp-image-6829" width="1117" height="1085" srcset="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 2048w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 2475w" sizes="(max-width: 1117px) 100vw, 1117px" /></figure>



<p class="wp-block-paragraph">Finally, we create a correlation matrix and visualize it as a heat map. The matrix provides a quick overview of which features are correlated and which are not.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Feature correlation
plt.figure(figsize=(15,4))
f_cor = df_shop.corr()
sns.heatmap(f_cor, cmap=&quot;Blues_r&quot;)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4662" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-50-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" data-orig-size="899,367" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-50" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4662" width="674" height="275" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 899w, https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 768w" sizes="(max-width: 674px) 100vw, 674px" /></figure>



<p class="wp-block-paragraph">The correlation plot shows that two pairs of features are highly correlated:</p>



<ul class="wp-block-list">
<li>ProductRelated and ProductRelated_Duration</li>



<li>BounceRates and ExitRates</li>
</ul>
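


<p class="wp-block-paragraph">Instead of reading such pairs off the heat map, they can also be listed programmatically. The following is a minimal sketch on made-up toy data; the helper <code>correlated_pairs</code> and the threshold of 0.8 are assumptions chosen for illustration.</p>

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, correlation) for pairs above the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return [(a, b, round(float(r), 2)) for (a, b), r in pairs[pairs > threshold].items()]

# Toy data (made up): 'b' is a noisy copy of 'a', 'c' is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(correlated_pairs(toy))
```

<p class="wp-block-paragraph">Calling the helper on the shopping DataFrame would be one way to double-check the pairs identified visually.</p>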



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">plt.figure(figsize=(8,5))
sns.scatterplot(x= 'BounceRates',y='ExitRates',data=df_shop,hue='Revenue')
plt.title('Bounce Rate vs. Exit Rate', fontweight='bold', fontsize=15)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4674" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-51-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" data-orig-size="510,335" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-51" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4674" width="537" height="352" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png 510w, https://www.relataly.com/wp-content/uploads/2021/06/image-51.png 300w" sizes="(max-width: 537px) 100vw, 537px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">plt.figure(figsize=(8,5))
sns.scatterplot(x= 'ProductRelated',y='ProductRelated_Duration',data=df_shop,hue='Revenue')
plt.title('Product Related vs. Product Related Duration', fontweight='bold', fontsize=15)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4675" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-52-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" data-orig-size="514,335" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-52" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4675" width="528" height="343" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png 514w, https://www.relataly.com/wp-content/uploads/2021/06/image-52.png 300w" sizes="(max-width: 528px) 100vw, 528px" /></figure>



<p class="wp-block-paragraph">When we train our model, we will use only one feature from each of the two pairs.</p>



<h3 class="wp-block-heading" id="h-step-4-data-preprocessing">Step #4 Data Preprocessing </h3>



<p class="wp-block-paragraph">Now that we are familiar with the data, we can prepare it to train the purchase intention classification model. First, we select the features we want to use from the shopping dataset. Second, we split the data into a training set (70%) and a test set (30%). X_train and X_test contain the features, while y_train and y_test contain the respective prediction labels. Third, we use the MinMaxScaler to scale the features to a range between 0 and 1. The scaler is fitted on the training data only and then applied to both sets. Scaling puts all features on a comparable range, which helps the solver converge and can improve classification performance.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Separate labels from training data
features = ['Administrative', 'Administrative_Duration', 'Informational', 
            'Informational_Duration', 'ProductRelated', 'BounceRates', 'PageValues', 
            'Month', 'Region', 'TrafficType', 'VisitorType']
X = df_shop[features] # Feature columns
y = df_shop['Revenue'] # Prediction label

# Split the data into train and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Scale the numeric values
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)</pre></div>
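


<p class="wp-block-paragraph">To make the scaling step more concrete, the sketch below reproduces the MinMaxScaler transformation by hand on a tiny made-up array. The numbers are assumptions for illustration only. It also shows why test values can fall outside the 0 to 1 range: the minimum and maximum come from the training data alone.</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature columns (made up): a page count and a duration in seconds
X_train_toy = np.array([[1.0, 0.0], [5.0, 100.0], [9.0, 400.0]])
X_test_toy = np.array([[3.0, 200.0], [11.0, 50.0]])

scaler_toy = MinMaxScaler().fit(X_train_toy)

# Manual version of the same transformation: (x - min) / (max - min),
# with min and max taken from the training data only
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)
manual = (X_test_toy - col_min) / (col_max - col_min)

print(scaler_toy.transform(X_test_toy))
print(manual)
```

<p class="wp-block-paragraph">The second test row has a page count larger than anything seen during training, so its scaled value exceeds 1.</p>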



<h3 class="wp-block-heading" id="h-step-5-train-a-purchase-intention-classifier">Step #5 Train a Purchase Intention Classifier</h3>



<p class="wp-block-paragraph">Next, it is time to train our prediction model. Various classification algorithms could be used to solve this problem, for example, decision trees, random forests, neural networks, or support-vector machines. We will use the logistic regression algorithm, a common choice for simple two-class prediction problems. </p>



<p class="wp-block-paragraph">We start the training process using the &#8220;fit&#8221; method of the logistic regression algorithm. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Training a classification model using logistic regression 
logreg = LogisticRegression(solver='lbfgs')
score = logreg.fit(X_train, y_train).decision_function(X_test)</pre></div>



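<p class="wp-block-paragraph">The decision_function method returns a confidence score for each session in the test dataset: positive values correspond to a predicted purchase, negative values to no purchase. These scores are not a measure of model performance; we will evaluate the performance in the next step.</p>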
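


<p class="wp-block-paragraph">The relationship between the values returned by decision_function and the predicted probabilities can be checked with a short sketch. It uses synthetic data from make_classification as a stand-in, not the shopping dataset, and the variable names are assumptions made for this example.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the shopping dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(solver="lbfgs").fit(X_demo, y_demo)

raw_scores = clf.decision_function(X_demo)   # one confidence score per sample
proba = clf.predict_proba(X_demo)[:, 1]      # probability of the positive class

# For binary logistic regression, the sigmoid of the raw score is the probability
sigmoid = 1.0 / (1.0 + np.exp(-raw_scores))
same_probability = np.allclose(sigmoid, proba)

# A positive raw score corresponds to predicting the positive class
same_prediction = np.array_equal(raw_scores > 0, clf.predict(X_demo) == 1)
print(same_probability, same_prediction)
```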



<h3 class="wp-block-heading" id="h-step-6-evaluate-model-performance">Step #6 Evaluate Model Performance</h3>



<p class="wp-block-paragraph">Finally, we will evaluate the performance of our classification model. For this purpose, we first create a confusion matrix. Then we calculate and compare different error metrics.</p>



<h4 class="wp-block-heading" id="h-6-1-confusion-matrix">6.1 Confusion Matrix</h4>



<p class="wp-block-paragraph">The confusion matrix is a holistic and clean way to illustrate the results of a classification model. It differentiates between predicted labels and actual labels. For a binary classification model, the matrix comprises 2&#215;2 quadrants that show the number of cases in each quadrant. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

# create heatmap
%matplotlib inline
class_names = [False, True] # names of the classes
fig, ax = plt.subplots(figsize=(7, 6))
# Label rows and columns with the class names
df_cm = pd.DataFrame(cnf_matrix, index=class_names, columns=class_names)
sns.heatmap(df_cm, annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="990" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-39-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" data-orig-size="492,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-39" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" src="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" alt="confusion matrix on the results of our classification model that predicts purchase intentions, purchase intention prediction model" class="wp-image-990" width="374" height="344" srcset="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png 492w, https://www.relataly.com/wp-content/uploads/2020/05/image-39.png 300w" sizes="(max-width: 374px) 100vw, 374px" /></figure>



<p class="wp-block-paragraph">In the upper left (0,0), we see that the model correctly predicted for 3102 online shopping sessions that they would not lead to a purchase (True negatives). In 30 cases, the model wrongly predicted a purchase that did not happen (False positives). For 412 buyers, the model predicted that they would not buy anything, even though they did (False negatives). In the lower right corner, we see that only 151 buyers were correctly identified as such (True positives). </p>
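


<p class="wp-block-paragraph">The four quadrants can also be unpacked into named variables. The sketch below uses a handful of made-up labels to show the layout that scikit-learn uses: rows are actual labels, columns are predicted labels.</p>

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up) to show the layout: rows are actual, columns are predicted
y_true_toy = [False, False, False, True, True, False]
y_pred_toy = [False, True, False, True, False, False]

# With labels ordered [False, True], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[False, True]).ravel()
print(tn, fp, fn, tp)
```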



<h4 class="wp-block-heading" id="h-6-2-performance-metrics-for-classification-models">6.2 Performance Metrics for Classification Models</h4>



<p class="wp-block-paragraph">Next, let&#8217;s take a brief look at the performance metrics. Four standard metrics that measure the performance of classification models are Accuracy, Precision, Recall, and F1-Score. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))</pre></div>



<h5 class="wp-block-heading" id="h-accuracy"><strong>Accuracy</strong></h5>



<p class="wp-block-paragraph">The accuracy on the test set shows that 88% of the online shopper sessions were correctly classified. However, our data is imbalanced: most labels have the value &#8220;False,&#8221; and only a few are &#8220;True.&#8221; Consequently, we must check that our model does not simply classify all online shoppers as &#8220;non-buyers&#8221; (label: False) but also correctly predicts the buyers (label: True). </p>
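


<p class="wp-block-paragraph">A quick way to see why accuracy alone is not enough is to compare it with a trivial baseline that always predicts &#8220;False.&#8221; The following arithmetic uses the four counts from the confusion matrix in section 6.1.</p>

```python
# Counts taken from the confusion matrix in section 6.1
tn, fp, fn, tp = 3102, 30, 412, 151

total = tn + fp + fn + tp
model_accuracy = (tn + tp) / total

# A trivial model that always predicts False gets every actual non-buyer right
baseline_accuracy = (tn + fp) / total

print(f"Model accuracy:    {model_accuracy:.3f}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

<p class="wp-block-paragraph">The trivial baseline already reaches roughly 85%, so an accuracy of 88% is only a small improvement over predicting that nobody buys.</p>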



<h5 class="wp-block-heading" id="h-precision"><strong>Precision</strong></h5>



<p class="wp-block-paragraph">We calculate the precision as the number of True Positives divided by the sum of True Positives and False Positives. Precision tells us how many of the sessions predicted as purchases actually ended in a purchase, but it ignores the False Negatives, that is, the buyers the model missed. Therefore, it does not tell the whole story about our model. The precision score for our model is 83%, just a little lower than the accuracy.</p>



<h5 class="wp-block-heading" id="h-recall"><strong>Recall</strong></h5>



<p class="wp-block-paragraph">We calculate the Recall&nbsp;by dividing the number of True Positives by the sum of the True Positives and the False Negatives. The Recall of our model is 27%, which is significantly below accuracy and precision. In our case, the Recall is more meaningful than accuracy and precision because it penalizes the high number of False Negatives, i.e., the buyers the model failed to identify.</p>



<h5 class="wp-block-heading" id="h-f1-score"><strong>F1-Score</strong></h5>



<p class="wp-block-paragraph">The formula for the F1-Score is 2*((precision*recall)/(precision+recall)). Because the formula includes the Recall, the F1-Score of our model is only 41%. Imagine we want to optimize our classification model further. In this case, we should look out for both F1-Score and Recall.</p>
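


<p class="wp-block-paragraph">As a worked check, Precision, Recall, and F1-Score can be recomputed by hand from the confusion matrix counts reported in section 6.1. This is plain arithmetic on those four numbers.</p>

```python
# Counts taken from the confusion matrix in section 6.1
tn, fp, fn, tp = 3102, 30, 412, 151

precision = tp / (tp + fp)   # share of predicted buyers that really bought
recall = tp / (tp + fn)      # share of actual buyers that were found
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {f1:.2f}")
```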



<h4 class="wp-block-heading" id="h-6-3-interpretation">6.3 Interpretation</h4>



<p class="wp-block-paragraph">Metrics for classification models can be misleading. We should thus choose them carefully. Depending on which use case we are dealing with, False-negative and False-positive predictions can have different costs. Therefore, model evaluation is not always about exactness (precision and accuracy). Instead, the choice of performance metrics depends on what we want to achieve.</p>



<p class="wp-block-paragraph">The challenge for our model is to correctly classify the smaller group of buyers (True positives). So, optimizing our model would be about improving Recall and F1-Score without significantly lowering accuracy.</p>
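


<p class="wp-block-paragraph">One possible lever for this trade-off, not used in this tutorial, is the decision threshold. The sketch below is an assumption-laden illustration on synthetic imbalanced data: lowering the threshold from 0.5 to 0.3 marks more sessions as purchases, which cannot reduce Recall but typically costs some Precision.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% positives); a stand-in, not the shopping data
X_syn, y_syn = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, train_size=0.7,
                                          random_state=0, stratify=y_syn)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    # Predict a purchase whenever the probability reaches the threshold
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_te, y_hat, zero_division=0),
                          recall_score(y_te, y_hat))
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

<p class="wp-block-paragraph">Which threshold is appropriate depends on the relative cost of missed buyers and false alarms in the use case at hand.</p>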



<h3 class="wp-block-heading" id="h-step-7-insights-on-customer-purchase-intentions">Step #7 Insights on Customer Purchase Intentions</h3>



<p class="wp-block-paragraph">Finally, we will use permutation feature importance to gain additional insights into our prediction model&#8217;s features. Permutation Feature Importance is a technique that measures how much the model&#8217;s score drops when the values of a single feature are randomly shuffled. Features with a high score substantially impact the predicted label. In contrast, features with scores close to zero (or below zero) play a lesser role in the predictions.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from sklearn.inspection import permutation_importance

# Compute the permutation importance of each feature on the test set
r = permutation_importance(model_lgr, X_test, y_test, n_repeats=30, random_state=0)

# Collect the mean importance scores and sort them in descending order
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permutation_score'])
data_im['feature_names'] = X.columns
data_im = data_im.sort_values('feature_permutation_score', ascending=False)

# Plot the bar chart
fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y=data_im['feature_names'], x=&quot;feature_permutation_score&quot;, data=data_im, palette='nipy_spectral')
ax.set_title(&quot;Logistic Regression Feature Importances&quot;)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="326" data-attachment-id="4684" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-56-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png" data-orig-size="1050,334" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-56" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-56-1024x326.png" alt="online purchase intention prediction - results of the feature permutation importance technique" class="wp-image-4684" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 1050w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">We can see that the three features with the highest impact are PageValues, BounceRates and Administration_Duration. </p>



<ul class="wp-block-list">
<li>The higher the page&#8217;s value, the higher the chance that the customer makes a purchase.</li>



<li>The higher the average bounce rate of the pages that the customer visits, the higher the chance that the customer makes a purchase.</li>



<li>In contrast, the more time a customer spends on administrative settings, the lower the chance the customer completes the purchase.</li>
</ul>
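<p class="wp-block-paragraph">Note that permutation importance only reports how strongly a feature matters, not in which direction. For a linear model such as logistic regression, one way to check the direction is to look at the sign of the fitted coefficients. The following is a minimal sketch of that idea on synthetic data. The feature names <code>f0</code> to <code>f3</code> and the variables <code>X_demo</code> and <code>clf</code> are hypothetical and do not come from the shopper dataset.</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names; not the shopper dataset.
X_arr, y_demo = make_classification(n_samples=1000, n_features=4, n_informative=3,
                                    n_redundant=1, random_state=0)
# Scale the features so that coefficient sizes are roughly comparable
X_demo = pd.DataFrame(StandardScaler().fit_transform(X_arr), columns=['f0', 'f1', 'f2', 'f3'])

# For brevity, the model is fitted and evaluated on the same data
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
perm = permutation_importance(clf, X_demo, y_demo, n_repeats=10, random_state=0)

summary = pd.DataFrame({
    'importance': perm.importances_mean,  # magnitude: score drop after shuffling
    'coefficient': clf.coef_[0],          # sign: direction of the effect on the log-odds
}, index=X_demo.columns).sort_values('importance', ascending=False)
print(summary)
```

<p class="wp-block-paragraph">A positive coefficient means that higher feature values push the prediction towards the positive class, and a negative coefficient pushes it away. Correlated features can distort both columns, so the signs are best read as a hint rather than proof.</p>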



<p class="wp-block-paragraph">These were just a few sample findings. There is much more to explore in the data, and deeper analysis can uncover much more about the customers&#8217; buying decisions.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">This article has presented customer purchase prediction as an interesting use case for machine learning in e-commerce. After discussing the use case, we have developed a classification model that predicts the purchase intentions of online shoppers. You have learned to preprocess the data, train a logistic regression model and evaluate the model&#8217;s performance. Classifying purchase intentions can help online shops understand their customers better and automate certain online marketing activities. The previous section showed how marketers could use this to gain further insights into their customers&#8217; behavior.</p>



<p class="wp-block-paragraph">Thanks for reading and if you have any questions, let me know in the comments. </p>



<p>The post <a href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/">Classifying Purchase Intention of Online Shoppers with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">982</post-id>	</item>
	</channel>
</rss>
