<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Unsupervised Learning Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/tag/unsupervised-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/tag/unsupervised-learning/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Sat, 27 May 2023 10:21:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Unsupervised Learning Archives - relataly.com</title>
	<link>https://www.relataly.com/tag/unsupervised-learning/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Cluster Analysis with k-Means in Python</title>
		<link>https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/</link>
					<comments>https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 27 Jun 2021 18:21:01 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Correlation]]></category>
		<category><![CDATA[Covariance]]></category>
		<category><![CDATA[K-Means]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Synthetic Data]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Unsupervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=5070</guid>

					<description><![CDATA[<p>Embark on a journey into the world of unsupervised machine learning with this beginner-friendly Python tutorial focusing on K-Means clustering, a powerful technique used to group similar data points into distinct clusters. This invaluable tool helps us make sense of complex datasets, finding hidden patterns and associations without the need for a predetermined target variable. ... <a title="Cluster Analysis with k-Means in Python" class="read-more" href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/" aria-label="Read more about Cluster Analysis with k-Means in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/">Cluster Analysis with k-Means in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Embark on a journey into the world of unsupervised machine learning with this beginner-friendly Python tutorial focusing on K-Means clustering, a powerful technique used to group similar data points into distinct clusters. This invaluable tool helps us make sense of complex datasets, finding hidden patterns and associations without the need for a predetermined target variable. This comes in handy, especially when dealing with data whose similarities and differences aren&#8217;t immediately apparent.</p>



<p>Divided into two insightful sections, this blog post first delves into the theoretical foundation of the K-Means clustering algorithm. We&#8217;ll explore its real-world applications, its strengths, and its weaknesses, providing a comprehensive overview of what makes K-Means an essential tool in any data scientist&#8217;s toolkit.</p>



<p>In the second part of the blog, we switch gears and dive into a hands-on Python tutorial. Here, we&#8217;ll walk you through a practical example of K-Means clustering, using it to uncover three distinct spherical clusters within a synthetic dataset. To round up, we&#8217;ll put our model to the test, using it to predict clusters within a test dataset, and then we&#8217;ll visualize the results for easy understanding.</p>



<p>Whether you&#8217;re a seasoned data scientist or a beginner looking to dip your toes into the field, this blog post offers a simple and accessible introduction to cluster engineering with K-Means in Python. Follow along, and uncover the hidden potential of your data today!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="508" height="512" data-attachment-id="12672" data-permalink="https://www.relataly.com/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png" data-orig-size="508,512" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rubber ducks machine learning python tutorial relataly midjourney-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png" alt="Patterns are everywhere, and machine learning can help to expose and understand them better. Image created with Midjourney. " class="wp-image-12672" srcset="https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png 298w, https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Patterns are everywhere, and machine learning can help to expose and understand them better. 
Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>. </figcaption></figure>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-findings-clusters-in-multivariate-data-with-k-means">Finding Clusters in Multivariate Data with k-Means</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>K-Means clustering is a prominent unsupervised machine learning algorithm utilized to partition datasets into a predefined number (&#8216;k&#8217;) of distinct clusters, each brimming with similar data points. To begin, the algorithm identifies &#8216;k&#8217; random data points which serve as the initial centroids &#8211; the central points representing each cluster. It then proceeds in an iterative fashion, continuously reassigning each data point to the centroid nearest to it, followed by updating each centroid to the average of the data points assigned to its cluster. This process repeats until the centroids no longer change positions or until a set maximum number of iterations is reached.</p>



<p>The underlying objective of the K-Means algorithm is to minimize the within-cluster sum of squares (WCSS), a metric indicating the cumulative distance between each data point within a cluster and its corresponding centroid. By reducing the WCSS, K-Means aims to form the most compact and clearly separable clusters.</p>



<p>The K-Means algorithm has garnered substantial popularity in the field of data science for its speed and simplicity in implementation. However, it isn&#8217;t without its limitations. It requires the number of clusters to be specified beforehand and operates under the assumption that all clusters are spherical and uniformly sized. In the following sections, we&#8217;ll delve into the intricacies of this powerful yet straightforward algorithm and discuss its use in real-world applications.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5079" data-permalink="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/image-83-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-83.png" data-orig-size="567,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-83" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-83.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-83.png" alt="Three clusters in a dataset that were separated by the k-Means algorithms" class="wp-image-5079" width="355" height="283" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-83.png 567w, https://www.relataly.com/wp-content/uploads/2021/06/image-83.png 300w" sizes="(max-width: 355px) 100vw, 355px" /><figcaption class="wp-element-caption">Three clusters in a dataset that were separated by the k-Means algorithms</figcaption></figure>
</div>
</div>



<p></p>



<h3 class="wp-block-heading" id="h-understanding-the-mechanics-of-k-means-clustering-algorithm">Understanding the Mechanics of K-Means Clustering Algorithm</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>At its core, K-Means clustering is a dynamic algorithm designed to simplify complex data patterns. This machine learning model is a champion of unsupervised learning, adept at grouping similar data points into a predetermined number (&#8216;k&#8217;) of unique clusters.</p>



<p>Starting the process, the K-Means algorithm selects &#8216;k&#8217; random data points, which act as the initial centroids or the central figures of the clusters. The algorithm then goes through a series of iterative steps, continually allocating each data point to its closest centroid, and recalibrating the centroid to be the mean of all data points that are assigned to its cluster. This cycle repeats until we achieve stable centroids, i.e., they stop moving, or until a pre-set maximum number of iterations is completed.</p>



<p>The fundamental goal of K-Means lies in minimizing the within-cluster sum of squares (WCSS). This term refers to the sum of the distances of each data point in a cluster to its centroid. By lowering WCSS, K-Means strives to construct compact clusters that are distinctly separate from each other.</p>
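<p>To make the WCSS objective concrete, here is a minimal sketch (on hypothetical synthetic data) that recomputes it by hand and compares it with the value scikit-learn reports as <em>inertia</em>:</p>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: three well-separated blobs
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Recompute the WCSS by hand: squared distance of every point
# to the centroid of its assigned cluster, summed over clusters
wcss = sum(
    np.sum((X[kmeans.labels_ == i] - c) ** 2)
    for i, c in enumerate(kmeans.cluster_centers_)
)

# scikit-learn exposes the same quantity as `inertia_`
print(round(wcss, 2), round(kmeans.inertia_, 2))
```

The two printed values agree, which shows that minimizing WCSS is exactly what the fitted model optimized.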



<p>In the illustration below, we assume a dataset with three cluster centers. K-Means carries out several steps to partition the data.</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<ol class="wp-block-list">
<li>Initially, the algorithm chooses k random starting positions for the centroids. Alternatively, we can position the centroids manually.</li>



<li>Then the algorithm calculates the distance between each data point and the three centroids. Each data point is assigned to the closest centroid, i.e., the cluster whose variance increases the least.</li>



<li>These assignments induce linear decision boundaries (based on the Euclidean distance to the centroids) that separate the clusters but are not yet optimal.</li>



<li>From then on, the algorithm optimizes the centroids&#8217; positions to lower the resulting clusters&#8217; variance. Then, the previous steps are repeated: averaging, assigning the data points to clusters, and shifting the centroids.</li>
</ol>



<p>The process ends when the positions of the centroids do not change anymore.</p>
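<p>The steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the scikit-learn one used later in the tutorial:</p>

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans_sketch(X, k=3, max_iter=100, seed=42):
    """Minimal k-Means loop illustrating the iterative steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 3-4: shift each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
centroids, labels = kmeans_sketch(X)
print(centroids.shape)  # (3, 2)
```

Production implementations add refinements such as smarter initialization (k-means++) and handling of empty clusters, which this sketch omits for clarity.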



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5099" data-permalink="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/image-88-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-88.png" data-orig-size="1183,1093" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-88" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-88.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-88-1024x946.png" alt="K-means partitions a set of data by iteratively optimizing the centroids and the decision boundaries that result from them. " class="wp-image-5099" width="787" height="726" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-88.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-88.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-88.png 1183w" sizes="(max-width: 787px) 100vw, 787px" /><figcaption class="wp-element-caption">k-Means iteratively optimizes the decision boundaries between clusters</figcaption></figure>
</div>
</div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-how-many-clusters">How many Clusters?</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>A particular challenge is that k-Means requires us to specify the number of clusters k. When tackling new clustering problems, we usually don&#8217;t know the optimal number of clusters. If the data is not too complex, we can often estimate the number of centers by looking at one or more scatter plots. However, this approach only works when the data has few dimensions. For complex, high-dimensional data, it is common to experiment with varying values of k to find an appropriate size for the problem. We can automate this process using hyperparameter tuning techniques such as <a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">grid search</a> or <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" target="_blank" rel="noreferrer noopener">random search</a>. The idea is to try out different cluster sizes and identify the size that best differentiates between clusters.</p>
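<p>One common way to run this trial-and-error search is the so-called elbow method: fit the model for a range of candidate k values and watch where the WCSS stops dropping sharply. A minimal sketch on hypothetical synthetic data:</p>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with three true centers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Record the WCSS (inertia) for a range of candidate cluster counts
inertias = {}
for k in range(1, 8):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_

# The curve drops sharply until k=3 and flattens afterwards --
# that "elbow" suggests three clusters
for k, wcss in inertias.items():
    print(k, round(wcss))
```

Plotting <code>inertias</code> as a line chart makes the elbow easy to spot visually.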



<h3 class="wp-block-heading" id="h-pros-and-cons-of-k-means-clustering">Pros and Cons of k-Means Clustering</h3>



<p>Although K-Means is esteemed for its swift operation and uncomplicated implementation, it comes with a set of constraints. We need to know the strengths and weaknesses of clustering techniques such as k-Means. In general, clustering can reveal structures and relationships in data that supervised machine learning methods like classification would likely not uncover. In particular, when we suspect different subgroups in the data that differ in their behavior, clustering can help discover what makes these groups unique.</p>



<p>K-Means, in particular, possesses a unique ability to detect and segregate spherical-shaped clusters efficiently. However, its performance may falter when faced with clusters embodying more intricate structures like half-moons or circles, often struggling to differentiate them accurately.</p>



<p>Another potential limitation of K-Means lies in its prerequisite of specifying the number of clusters in advance. This requirement can prove challenging when the true number of clusters within a dataset is unknown or ambiguous. In such scenarios, alternative clustering techniques, such as <a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" target="_blank" rel="noreferrer noopener">affinity propagation</a> or <a href="https://wp.me/pbUnJO-2WP" target="_blank" rel="noreferrer noopener">hierarchical clustering</a>, might be more suitable, as they possess the ability to automatically determine the optimal number of clusters.</p>



<h3 class="wp-block-heading" id="h-applications-of-clustering">Applications of Clustering</h3>



<p>The k-means algorithm is used in a variety of applications, including the following:</p>



<ul class="wp-block-list">
<li>A typical use case for clustering is in <strong>marketing and market segmentation</strong>. Here clustering is used to identify meaningful segments of similar customers. The similarity can be based on demographic data (age, gender, etc.) or customer behavior (for example, the time and amount of a purchase).</li>



<li><strong>Medical research</strong> uses clustering to divide patient groups into different subgroups, for example, to assess stroke risk. After clustering, the next step is to develop separate prediction models for the subgroups to estimate the risk more accurately.</li>



<li>An application in the financial sector is outlier detection in <strong>fraud detection</strong>. Banks and credit card companies use clustering to detect unusual transactions and flag them for verification. </li>



<li><strong>Spam filtering</strong>: The input data are attributes of emails (text length, words contained, etc.) and help separate spam from non-spam emails.</li>
</ul>



<h2 class="wp-block-heading" id="h-implementing-a-k-means-clustering-model-in-python">Implementing a K-Means Clustering Model in Python</h2>



<p>In the following, we run a cluster analysis on <a href="https://www.relataly.com/category/apis/synthetic-data/" target="_blank" rel="noreferrer noopener">synthetic data</a> using Python and scikit-learn. We aim to train a K-Means cluster model in Python that distinguishes three clusters in the data. Given that the data is synthetic, we&#8217;re already privy to which cluster each data point pertains to. This foreknowledge enables us to evaluate the performance of our model post-training, gauging how effectively it can differentiate between the three predefined clusters.</p>



<p>The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_3c55f5-f1"><a class="kb-button kt-button button kb-btn_79ce02-f1 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/03%20Clustering/041%20Simple%20Clustering%20using%20K-means%20with%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_172639-a2 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>This article uses the k-means clustering algorithm from the Python library <a rel="noreferrer noopener" href="https://scikit-learn.org/stable/" target="_blank">Scikit-learn</a>. We also use <a data-type="URL" data-id="https://seaborn.pydata.org/" href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn </a>for visualization.</p>



<p>You can install these libraries using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-step-1-generate-synthetic-data">Step #1: Generate Synthetic Data</h3>



<p>We start with the generation of synthetic data. For this purpose, we use the make_blobs function from the scikit-learn library. The function generates random data points in two dimensions, spherically arranged around a given number of cluster centers. In addition, it returns the cluster label each data point belongs to. We use a scatterplot to visualize the data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split as train_test_split
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
features, labels = make_blobs(
    n_samples=400,
    centers=3,
    cluster_std=2.75,
    random_state=42
)

# Visualize the data in scatterplots
def scatter_plots(df, palette):
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    

    ax1 = plt.subplot(1,2,1)
    sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'labels', palette=palette)
    plt.title('Clusters')
    
    ax2 = plt.subplot(1,2,2)
    sns.scatterplot(ax = ax2, data=df, x='x', y='y')
    plt.title('How the model sees the data during training')

palette = {1:&quot;tab:cyan&quot;,0:&quot;tab:orange&quot;, 2:&quot;tab:purple&quot;}
df = pd.DataFrame(features, columns=['x', 'y'])
df['labels'] = labels
scatter_plots(df, palette)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="433" data-attachment-id="5061" data-permalink="https://www.relataly.com/image-77-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-77.png" data-orig-size="1172,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-77" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-77.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-77-1024x433.png" alt="Clusters vs how the model sees the data during training" class="wp-image-5061" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-77.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-77.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-77.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-77.png 1172w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="h-step-2-preprocessing">Step #2: Preprocessing</h3>



<p>There are some general things to keep in mind when preparing data for use with K-Means clustering:</p>



<ul class="wp-block-list">
<li><strong>Missing data and outliers</strong>: if we have missing entries in our data, we need to handle these, for example, by removing the records or filling the missing values with a mean or median. In addition, K-means is sensitive to outliers. Therefore, make sure that you eliminate outliers from the training data. </li>



<li><strong>Encoding categorical variables</strong>: K-Means can only deal with numerical values. So, either we map the categorical variables to integer values or use one-hot encoding to create separate binary variables. </li>



<li><strong>Dimensionality reduction</strong>: In general, having too many variables in a dataset can negatively affect the performance of clustering algorithms. A good practice is to keep the number of variables below 30, for example, by using techniques for dimensionality reduction such as Principal Component Analysis (PCA).</li>



<li><strong>Scaling</strong>: It is also important to note that K-Means requires scaled data, since the distance calculations would otherwise be dominated by features with large value ranges.</li>
</ul>
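<p>The encoding and scaling steps can be sketched as follows, on a hypothetical customer dataset with invented column names:</p>

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data with one categorical column
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "spend": [120.0, 80.5, 200.0, 45.0],
    "channel": ["web", "store", "web", "store"],
})

# One-hot encode the categorical variable ...
encoded = pd.get_dummies(df, columns=["channel"])

# ... then bring all features onto a comparable scale
scaled = StandardScaler().fit_transform(encoded)
print(scaled.shape)  # (4, 4)
```

After these steps, every column has zero mean and unit variance, so no single feature dominates the distance calculations.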



<p>Our synthetic dataset is free of outliers or missing values. Therefore, we only need to scale the data. In addition, we separate the class labels of the clusters from the training set and split the data into a train- and a test dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

X = scaled_features #Training data
y = labels #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)</pre></div>



<h3 class="wp-block-heading" id="h-step-3-train-a-k-means-clustering-model">Step #3: Train a k-Means Clustering Model</h3>



<p>Once we have prepared the data, we can begin the cluster analysis by training a K-means model. Our model uses the k-means algorithm from Python <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html?highlight=kmeans#sklearn.cluster.KMeans" target="_blank" rel="noreferrer noopener">scikit-learn library</a>. We have various options to configure the clustering process: </p>



<ul class="wp-block-list">
<li><strong>n_clusters</strong><em>: </em>The number of clusters we expect in the data. </li>



<li><strong>n_init</strong><em>: </em>The number of iterations k-means will run with different initial centroids. The algorithm returns the best model. </li>



<li><strong>max_iter</strong><em>: </em>The maximum number of iterations for a single run. </li>
</ul>



<p>We expect three clusters and configure the algorithm to run ten initializations with different centroid seeds.</p>
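<p>As a side note, here the number of clusters is known by construction. On real data, a common heuristic for choosing <em>n_clusters</em> is the elbow method, which inspects the inertia (within-cluster sum of squares) for increasing k. The following sketch illustrates the idea on stand-in blob data, not on the tutorial's exact dataset:</p>

```python
# Hypothetical sketch: estimating the number of clusters with the elbow
# method on stand-in blob data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(features)

# Inertia drops sharply until k reaches the true number of clusters,
# then flattens out (the "elbow")
inertias = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 8), inertias):
    print(f'k={k}: inertia={inertia:.1f}')
```

<p>Plotting these values and looking for the kink in the curve is a quick sanity check before committing to a cluster count.</p>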



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a k-means model with n clusters
kmeans = KMeans(
    init=&quot;random&quot;,
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)

# fit the model
kmeans.fit(X_train)
print(f'Converged after {kmeans.n_iter_} iterations')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Converged after 4 iterations</pre></div>



<p>Our model has already converged after four iterations. Next, we will look at the results.</p>
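<p>A quick aside before we continue: the fitted model can also assign previously unseen points to the learned clusters via <code>predict()</code>, which maps each point to its nearest centroid. This is how the held-out test set from the split above could be scored. The following self-contained sketch uses stand-in blob data and a hypothetical train/test split to illustrate this:</p>

```python
# Hypothetical sketch: assigning held-out points to learned clusters
# with predict(), using stand-in blob data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features, labels = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(features)
X_train, X_test = train_test_split(X, train_size=0.7, random_state=0)

kmeans = KMeans(init="random", n_clusters=3, n_init=10,
                max_iter=300, random_state=42)
kmeans.fit(X_train)

# predict() maps each new point to the nearest learned centroid
test_clusters = kmeans.predict(X_test)
print(test_clusters[:10])
```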



<h3 class="wp-block-heading" id="h-step-4-make-and-visualize-predictions">Step #4: Make and Visualize Predictions</h3>



<p>Next, we turn to the practical part of our Python tutorial and visualize the predictions of our trained K-Means model on the synthetic data.</p>



<p>In the following Python code, we:</p>



<ul class="wp-block-list">
<li>Extract the cluster centers from the trained K-Means model and unscale them.</li>



<li>Add the predicted and actual labels to our unscaled training data.</li>



<li>Define a function, <code>scatter_plots()</code>, that creates two scatter plots: one for the predicted labels and another for the actual labels.</li>



<li>Call this function, passing the training data, cluster centers, and a color palette as arguments.</li>
</ul>



<p>Please note:</p>



<ul class="wp-block-list">
<li>We use a dictionary as a color palette to differentiate between clusters.</li>



<li>The colors between the two plots may not match due to K-Means assigning numbers to clusters without prior knowledge of initial labels.</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Get the cluster centers from the trained K-means model
cluster_center = scaler.inverse_transform(kmeans.cluster_centers_)
df_cluster_centers = pd.DataFrame(cluster_center, columns=['x', 'y'])

# Unscale the predictions
X_train_unscaled = scaler.inverse_transform(X_train)
df_train = pd.DataFrame(X_train_unscaled, columns=['x', 'y'])
df_train['pred_label'] = kmeans.labels_
df_train['true_label'] = y_train

def scatter_plots(df, cc, palette):
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    
    # Print the predictions    
    ax2 = plt.subplot(1,2,1)
    sns.scatterplot(ax = ax2, data=df, x='x', y='y', hue='pred_label', palette=palette)
    sns.scatterplot(ax = ax2, data=cc, x='x', y='y', color='r', marker=&quot;X&quot;)
    plt.title('Predicted Labels')

    # Print the actual values
    ax1 = plt.subplot(1,2,2)
    sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'true_label', palette=palette)
    sns.scatterplot(ax = ax1, data=cc, x='x', y='y', color='r', marker=&quot;X&quot;)
    plt.title('Actual Labels')


# The colors between the two plots may not match.
# This is because K-means does not know the initial labels and assigns numbers to clusters 
palette = {1:&quot;tab:cyan&quot;,0:&quot;tab:orange&quot;, 2:&quot;tab:purple&quot;}
scatter_plots(df_train, df_cluster_centers, palette)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5062" data-permalink="https://www.relataly.com/image-78-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-78.png" data-orig-size="1172,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-78" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-78.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-78-1024x433.png" alt="actual clusters vs predicted cluster of the K-means model" class="wp-image-5062" width="1106" height="468" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-78.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-78.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-78.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-78.png 1172w" sizes="(max-width: 1106px) 100vw, 1106px" /></figure>



<p>The scatterplot above shows that K-Means recovered all three clusters. As a side note, the colors between the two plots do not match because K-means does not know the initial labels and assigns arbitrary numbers to the clusters. </p>



<h3 class="wp-block-heading" id="h-step-5-measuring-model-performance">Step #5: Measuring Model Performance</h3>



<p>Next, we measure the performance of our clustering model. K-means is an unsupervised machine learning algorithm: it clusters data points into groups based on their similarity, without using labeled training data. However, because we generated the data ourselves, we can compare the cluster assignments to the labels attached to the data and check whether the model predicts them correctly. In this way, we can use traditional classification metrics such as accuracy and the F1 score to measure the performance of our clustering model.</p>
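<p>As an alternative to classification metrics, permutation-invariant clustering metrics avoid the label-matching problem altogether. The sketch below, on stand-in blob data, uses the adjusted Rand index (which ignores the arbitrary cluster numbering) and the silhouette score (which needs no ground truth at all):</p>

```python
# Hypothetical sketch: permutation-invariant clustering metrics on
# stand-in blob data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Adjusted Rand index: 1.0 means the partitions agree, regardless of
# which number each cluster received
ari = adjusted_rand_score(y_true, y_pred)

# Silhouette score: needs no ground truth (range -1 to 1, higher is better)
sil = silhouette_score(X, y_pred)
print(f'ARI: {ari:.2f}, silhouette: {sil:.2f}')
```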



<p>First, we unify the cluster labels as A, B, and C.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Map the arbitrary cluster numbers and the true labels onto common names (A, B, C)
# The mapping was read off the scatterplots in the previous step
df_eval = df_train.copy()
df_eval['true_label'] = df_eval['true_label'].map({0:'A', 1:'B', 2:'C'})
df_eval['pred_label'] = df_eval['pred_label'].map({0:'B', 1:'A', 2:'C'})
df_eval.head(10)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">    x           y           pred_label  true_label
0    -9.007547  -10.302910  C           C
1     1.009238    7.009681  A           B
2    -6.565501   -6.466780  C           C
3     2.389772    7.727235  B           B
4    -5.422666   -2.915796  C           C
5   -12.024305   -7.846772  C           C
6    -4.006250    9.319323  A           A
7    -6.297788    6.435267  A           A
8     2.169238    3.325947  B           B
9    -5.140506   -4.205585  C           C</pre></div>
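<p>The mapping above was read off the scatterplots by eye. As a side note, such an alignment between arbitrary cluster numbers and true labels can also be computed automatically, for example with the Hungarian algorithm from SciPy applied to the confusion matrix. A minimal sketch with hypothetical stand-in labels:</p>

```python
# Hypothetical sketch: aligning arbitrary cluster numbers to true labels
# automatically via the Hungarian algorithm on the confusion matrix
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

# Stand-in labels: clusters 0 and 1 are swapped relative to the classes
true_label = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2])
pred_label = np.array([1, 1, 0, 0, 2, 2, 1, 0, 2])

# Maximize the diagonal of the confusion matrix (hence the minus sign)
cost = confusion_matrix(true_label, pred_label)
row_ind, col_ind = linear_sum_assignment(-cost)
mapping = {int(cluster): int(klass) for klass, cluster in zip(row_ind, col_ind)}

aligned = np.array([mapping[c] for c in pred_label])
print(mapping)  # cluster number -> class, e.g. {1: 0, 0: 1, 2: 2}
```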



<p>It is common practice to create scatterplots of the predictions to visually verify the quality of the clusters and their decision boundaries. The following scatterplot highlights the correctly assigned points and the points where our model was wrong.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Scatterplot on correctly and falsely classified values
df_eval.loc[df_eval['pred_label'] == df_eval['true_label'], 'Correctly classified?'] = 'True' 
df_eval.loc[df_eval['pred_label'] != df_eval['true_label'], 'Correctly classified?'] = 'False' 

plt.rcParams[&quot;figure.figsize&quot;] = (10,8)
sns.scatterplot(data=df_eval, x='x', y='y', color='r', hue='Correctly classified?')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5065" data-permalink="https://www.relataly.com/image-81-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-81.png" data-orig-size="614,480" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-81" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-81.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-81.png" alt="A 2-d scatterplot that shows the correctly assigned values and where our  K-means model was wrong." class="wp-image-5065" width="711" height="556" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-81.png 614w, https://www.relataly.com/wp-content/uploads/2021/06/image-81.png 300w" sizes="(max-width: 711px) 100vw, 711px" /></figure>



<p>The K-Means model correctly assigned most data points (blue) to their actual cluster. The few misclassified points are located at a decision boundary between two clusters (marked in orange).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a confusion matrix and a classification report
def evaluate_results(y_true, y_pred, class_names):
    tick_marks = [0.5, 1.5, 2.5]
    
    # Classification report with per-class precision, recall, and F1 score
    fig, ax = plt.subplots(figsize=(10, 6))
    results_log = classification_report(y_true, y_pred, target_names=class_names, output_dict=True)
    results_df_log = pd.DataFrame(results_log).transpose()
    matrix = confusion_matrix(y_true, y_pred)
    
    # Plot the confusion matrix as a heatmap
    sns.heatmap(pd.DataFrame(matrix), annot=True, fmt=&quot;d&quot;, linewidths=.5, cmap=&quot;YlGnBu&quot;)
    plt.xlabel('Predicted label'); plt.ylabel('Actual label')
    plt.title('Confusion Matrix on the Test Data')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
    print(results_df_log)


y_true = df_eval['true_label']
y_pred = df_eval['pred_label']
class_names = ['A', 'B', 'C']
evaluate_results(y_true, y_pred, class_names)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5064" data-permalink="https://www.relataly.com/image-80-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-80.png" data-orig-size="682,588" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-80" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-80.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-80.png" alt="confusion matrix on the predictions generated by our k-means model" class="wp-image-5064" width="658" height="568" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-80.png 682w, https://www.relataly.com/wp-content/uploads/2021/06/image-80.png 300w" sizes="(max-width: 658px) 100vw, 658px" /></figure>



<p>As we can see, the model has done a good job of assigning the data points to their actual classes.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>k-Means is a versatile clustering algorithm, capable of parsing intricate datasets into discrete, non-overlapping clusters. With this guide, you should have a clear understanding of how this algorithm operates, as well as its unique strengths and potential limitations. Remember, k-Means excels when dealing with spherical clusters but may struggle with clusters of more complex shapes.</p>



<p>We walked through how to implement the k-Means algorithm in Python, using it to identify spherical clusters within synthetic data. Moreover, we explored different methods to evaluate and visualize a clustering model&#8217;s performance, providing you with valuable tools to effectively analyze and group your data.</p>



<p>Whether it&#8217;s anomaly detection in time-series data or customer segmentation, k-Means can be a powerful tool in your data science arsenal. Armed with this knowledge, you&#8217;re ready to embark on your own data exploration journey using k-Means clustering.</p>



<p>If you are interested in this topic, check out my recent article on <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/" target="_blank" rel="noreferrer noopener">anomaly detection with Python</a>. And if you have any questions, please ask them in the comments.</p>



<p>Did you know that customer segmentation is an area where real-world data can be prone to bias and unfairness? If you&#8217;re concerned that your models may reflect the same bias, check out our latest article on <a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">addressing fairness in machine learning with fairlearn</a>.</p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Clustering</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3Gb5kfj" target="_blank" rel="noreferrer noopener">&#8220;Data Clustering: Algorithms and Applications&#8221; by Charu C. Aggarwal</a>: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.</li>



<li><a href="https://amzn.to/3WmhGXB" target="_blank" rel="noreferrer noopener">&#8220;Data Mining: Practical Machine Learning Tools and Techniques&#8221; by Ian H. Witten and Eibe Frank</a>: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.</li>
</ul>



<div style="display: inline-block;">
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=0128042915&amp;asins=0128042915&amp;linkId=1e9fe160a76f7255e3eea8e0119ca74f&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>

<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=B00EYROAQU&amp;asins=B00EYROAQU&amp;linkId=ba1fcb8a59417e729afe33f6eceb2a9f&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Machine Learning</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning</a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>
</ul>



<div style="display: inline-block;">

  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>

<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>
</div></div>
</div>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<h4 class="wp-block-heading">Related Articles on Clustering</h4>



<ol class="wp-block-list">
<li><a href="https://wp.me/pbUnJO-2WP" target="_blank" rel="noreferrer noopener">Hierarchical clustering for customer segmentation in Python</a>: We present an alternative approach to k-means that can determine the number of clusters itself and works well if the distributions of the groups in the data are more complex.</li>



<li><a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" target="_blank" rel="noreferrer noopener">Clustering crypto markets using affinity propagation in Python</a>: This article applies cluster analysis to crypto markets and creates a market map for various cryptocurrencies.</li>
</ol>
<p>The post <a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/">Cluster Analysis with k-Means in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5070</post-id>	</item>
		<item>
		<title>Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</title>
		<link>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/</link>
					<comments>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Wed, 16 Jun 2021 09:33:00 +0000</pubDate>
				<category><![CDATA[Anomaly Detection]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Fraud Detection]]></category>
		<category><![CDATA[K-Nearest Neighbors (KNN)]]></category>
		<category><![CDATA[Local Outlier Factor]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Isolation Forest]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in Business]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Unsupervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=4233</guid>

					<description><![CDATA[<p>Credit card fraud has become one of the most common use cases for anomaly detection systems. The number of fraud attempts has risen sharply, resulting in billions of dollars in losses. Early detection of fraud attempts with machine learning is therefore becoming increasingly important. In this article, we take on the fight against international credit ... <a title="Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud" class="read-more" href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/" aria-label="Read more about Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud">Read more</a></p>
<p>The post <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/">Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Credit card fraud has become one of the most common use cases for anomaly detection systems. The number of fraud attempts has risen sharply, resulting in billions of dollars in losses. Early detection of fraud attempts with <a href="https://www.relataly.com/category/machine-learning/" target="_blank" rel="noreferrer noopener">machine learning</a> is therefore becoming increasingly important. In this article, we take on the fight against international credit card fraud and develop a multivariate anomaly detection model in Python that spots fraudulent payment transactions. The model will use the Isolation Forest algorithm, one of the most effective techniques for detecting outliers. Isolation Forests are so-called ensemble models. They have various hyperparameters with which we can optimize model performance. However, we will not do this manually; instead, we will use grid search for hyperparameter tuning. To assess the performance of our model, we will also compare it with other models.</p>



<p>The remainder of this article is structured as follows: We start with a brief introduction to anomaly detection and look at the Isolation Forest algorithm. Equipped with these theoretical foundations, we then turn to the practical part, in which we train and validate an isolation forest that detects credit card fraud. We use an unsupervised learning approach, where the model learns to distinguish regular from suspicious card transactions. We will train our model on a public dataset from Kaggle that contains credit card transactions. Finally, we will compare the performance of our model against two nearest neighbor algorithms (LOF and <a href="https://www.relataly.com/category/machine-learning-algorithms/k-nearest-neighbors-knn/" target="_blank" rel="noreferrer noopener">KNN</a>).</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="12702" data-permalink="https://www.relataly.com/financial-crime-serious-business-python-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="financial crime serious business python relataly midjourney-min" data-image-description="&lt;p&gt;Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney&lt;/p&gt;
" data-image-caption="&lt;p&gt;Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min-512x512.png" alt="Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney" class="wp-image-12702" srcset="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Financial crime is a real problem, although cats are relatively seldom involved. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-multivariate-anomaly-detection">Multivariate Anomaly Detection</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Before we take a closer look at the use case and our unsupervised approach, let&#8217;s briefly discuss anomaly detection. Anomaly detection deals with finding data points that deviate markedly from the bulk of a distribution, for example, points far from its mean or median. In machine learning, the term is often used synonymously with outlier detection. </p>



<p>Some anomaly detection models work with a single feature (univariate data), for example, in monitoring electronic signals. However, most anomaly detection models use multivariate data, which means they have two (bivariate) or more (multivariate) features. They find a wide range of applications, including the following:</p>



<ul class="wp-block-list">
<li>Predictive Maintenance and Detection of Malfunctions and Decay</li>



<li>Detection of Retail Bank Credit Card Fraud</li>



<li>Detection of Pricing Errors</li>



<li>Cyber Security, for example, Network Intrusion Detection</li>



<li>Detecting Fraudulent Market Behavior in Investment Banking</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="410" data-attachment-id="9019" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" data-orig-size="1184,474" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="identifying anomalous datapoints in two dimensions, credit card fraud detection" data-image-description="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-image-caption="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output4-1024x410.png" alt="identifying anomalous data points in two dimensions, credit card fraud detection" class="wp-image-9019" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1184w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Identifying anomalous data points in two dimensions in credit card fraud detection</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-unsupervised-algorithms-for-anomaly-detection">Unsupervised Algorithms for Anomaly Detection</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>At its core, outlier detection is a classification problem, but one we can approach with both supervised and unsupervised machine learning techniques. Explaining the multitude of outlier detection techniques would go beyond the scope of this article. Still, the following chart provides a good overview of standard algorithms that learn unsupervised.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="7649" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-21-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png" data-orig-size="1983,994" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-21" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-21-1024x513.png" alt="Unsupervised Algorithms for Anomaly Detection. Outlier detection using random isolation forests" class="wp-image-7649" width="759" height="380" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1983w" sizes="(max-width: 759px) 100vw, 759px" /><figcaption class="wp-element-caption">Unsupervised Algorithms for Anomaly Detection. </figcaption></figure>



<p>A prerequisite for supervised learning is that we know which data points are outliers and which belong to the regular data. In credit card fraud detection, this information is available because banks can validate with their customers whether a suspicious transaction is fraudulent. In many other outlier detection cases, it remains unclear which outliers are genuine anomalies and which are just noise or other uninteresting events in the data. </p>



<p>Whether class labels are available thus determines which algorithms we can use to solve the outlier detection problem. Unsupervised learning techniques are a natural choice if the class labels are unavailable; if they are available, we can use both unsupervised and supervised learning algorithms.</p>



<p>In the following, we will focus on Isolation Forests.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-the-isolation-forest-iforest-algorithm">The Isolation Forest (&#8220;iForest&#8221;) Algorithm</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Isolation forests (sometimes called iForests) are among the most powerful techniques for identifying anomalies in a dataset. They belong to the group of so-called ensemble models. The predictions of ensemble models do not rely on a single model. Instead, they combine the results of multiple independent models (decision trees). Nevertheless, isolation forests should not be confused with traditional <a href="https://www.relataly.com/category/machine-learning-algorithms/decision-forests/" target="_blank" rel="noreferrer noopener">random decision forests</a>. While random forests predict given class labels (supervised learning), isolation forests learn to distinguish outliers from inliers (regular data) in an unsupervised learning process.</p>



<p>An Isolation Forest contains multiple independent isolation trees. The algorithm invokes a process that recursively divides the training data at random points to isolate data points from each other to build an Isolation Tree. The number of partitions required to isolate a point tells us whether it is an anomalous or regular point. The underlying assumption is that random splits can isolate an anomalous data point much sooner than nominal ones.</p>
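<p>The core idea, that random splits isolate anomalous points after far fewer partitions than regular ones, can be sketched in a few lines of plain Python. This is a toy one-dimensional illustration with made-up numbers, not the scikit-learn implementation:</p>

```python
import random

def splits_to_isolate(point, data, rng):
    """Count the random splits needed to isolate `point` within 1-D `data`."""
    depth = 0
    while len(data) > 1:
        lo, hi = min(data), max(data)
        if lo == hi:                       # remaining values identical: stop
            break
        cut = rng.uniform(lo, hi)          # split the data at a random value
        # keep only the side of the cut that still contains `point`
        data = [x for x in data if (x < cut) == (point < cut)]
        depth += 1
    return depth

rng = random.Random(42)
sample = [10.0, 10.5, 11.0, 11.5, 12.0, 50.0]   # 50.0 is an obvious outlier

def avg_depth(point, trials=200):
    return sum(splits_to_isolate(point, list(sample), rng)
               for _ in range(trials)) / trials

avg_outlier = avg_depth(50.0)   # the outlier separates after few splits
avg_inlier = avg_depth(11.0)    # inliers take more splits on average
print(avg_outlier, avg_inlier)
```

Averaged over many random trials, the outlier consistently needs fewer splits than the inliers, which is exactly the signal the isolation score is built from.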



<p>Also: <a href="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/" target="_blank" rel="noreferrer noopener">Stock Market Prediction using Multivariate Time Series Data</a> </p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4616" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-40-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png" data-orig-size="1106,461" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-40" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-40-1024x427.png" alt="Isolation Tree and Isolation Forest (Tree Ensemble), Outlier detection using random isolation forests" class="wp-image-4616" width="754" height="314" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 1106w" sizes="(max-width: 754px) 100vw, 754px" /><figcaption class="wp-element-caption">Isolation Tree and Isolation Forest (Tree Ensemble)</figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-how-the-isolation-forest-algorithm-works">How the Isolation Forest Algorithm Works</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The illustration below shows exemplary training of an Isolation Tree on univariate data, i.e., with only one feature. The algorithm has already split the data at five random points between the minimum and maximum values of a random sample. The isolated points are colored in purple. In this example, it took only two partitions to isolate the point on the far left, while the other purple points required four and five splits.</p>



<p>The partitioning process ends when the algorithm has isolated all points from each other or when all remaining points have equal values. At the end of the process, the algorithm assigns each point an outlier score based on how many splits it took to isolate it.</p>



<p>When using an isolation forest model on unseen data to detect outliers, the algorithm will assign an anomaly score to the new data points. These scores will be calculated based on the ensemble trees we built during model training.</p>
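<p>In scikit-learn, this train-then-score workflow looks roughly as follows. This is a minimal sketch on synthetic Gaussian data; the feature values and model settings are illustrative, not those of the credit card model we build below:</p>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 2))      # stand-in for "normal" data points

# two unseen points: one unremarkable, one far outside the training cloud
X_new = np.array([[0.1, -0.2],
                  [8.0, 8.0]])

model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
labels = model.predict(X_new)            # 1 = inlier, -1 = outlier
scores = model.decision_function(X_new)  # lower score = more anomalous
print(labels)   # the second point is flagged as an outlier
```

Note that `predict` gives hard labels based on an internal threshold, while `decision_function` returns the underlying anomaly scores, which is useful when you want to rank points or tune the threshold yourself.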



<p>So how does this process work when our dataset involves multiple features? For multivariate anomaly detection, the partitioning process remains almost the same. The significant difference is that, before each split, the algorithm selects a random feature along which the partitioning will occur. Consequently, multivariate isolation forests split the data along multiple dimensions (features).</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4615" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-39-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" data-orig-size="685,430" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-39" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" alt="Exemplary partitioning process of an isolation tree (5 Steps); Outlier detection using random isolation forests;  Outlier detection using isolation forests" class="wp-image-4615" width="375" height="235" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 685w, https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 80w" sizes="(max-width: 375px) 100vw, 375px" /><figcaption class="wp-element-caption">Exemplary partitioning process of an isolation tree (5 Steps)</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading" id="h-credit-card-fraud-detection-using-isolation-forests">Credit Card Fraud Detection using Isolation Forests</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Monitoring transactions has become a crucial task for financial institutions. In 2019 alone, more than 271,000 cases of credit card theft were reported in the U.S., causing billions of dollars in losses and making credit card fraud one of the most common types of identity theft. The vast majority of fraud cases are attributable to organized crime, which often specializes in this particular crime.</p>



<p>Anything that deviates from the customer&#8217;s normal payment behavior can make a transaction suspicious, including an unusual location, time, or country in which the customer conducted the transaction. Credit card providers use similar anomaly detection systems to monitor their customers&#8217; transactions and look for potential fraud attempts. They can halt the transaction and inform their customer as soon as they detect a fraud attempt. In the following, we train an Isolation Forest algorithm for credit card fraud detection using Python.</p>



<p>Now that we have established the context for our machine learning problem, we can begin implementing an anomaly detection model in Python.</p>



<p>The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_60179c-c7"><a class="kb-button kt-button button kb-btn_ef16c1-1b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/04%20Anomaly%20Detection/026%20Credit%20Card%20Fraud%20Detection%20with%20Isolation%20Forests.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_adf11b-23 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="512" height="496" data-attachment-id="12451" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" data-orig-size="512,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" src="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" alt="" class="wp-image-12451" srcset="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png 512w, https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png 300w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Each application of a credit card creates a new data point to review. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> and <a data-type="URL" data-id="https://seaborn.pydata.org/" href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn</a> for visualization. </p>



<p>You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-dataset-credit-card-transactions">Dataset: Credit Card Transactions</h3>



<p>In the following, we will work with a public dataset containing anonymized credit card transactions made by European cardholders in September 2013. You can download the dataset from <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud" target="_blank" rel="noreferrer noopener">Kaggle.com.</a> </p>



<p>The dataset contains 28 features (V1-V28) obtained from the source data using Principal Component Analysis (PCA). In addition, the data includes the date and the amount of the transaction. </p>



<p>Transactions are labeled fraudulent or genuine, with 492 fraudulent cases out of 284,807 transactions. The positive class (frauds) accounts for only 0.172% of all credit card transactions, so the classes are highly unbalanced.</p>
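<p>As a quick check, the quoted fraud share follows directly from these counts (plain Python, using the figures above):</p>

```python
# Verify the class imbalance from the label counts quoted above
n_fraud, n_total = 492, 284_807
fraud_rate = n_fraud / n_total
print(f"{fraud_rate:.2%}")   # → 0.17%
```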






<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p>In the following, we will go through several steps of training an Anomaly detection model for credit card fraud. We will carry out several activities, such as:</p>



<ol class="wp-block-list">
<li>Loading and preprocessing the data: this involves cleaning, transforming, and preparing the data for analysis, in order to make it suitable for use with the isolation forest algorithm.</li>



<li>Feature engineering: this involves extracting and selecting relevant features from the data, such as transaction amounts, merchant categories, and time of day, in order to create a set of inputs for the anomaly detection algorithm.</li>



<li>Model training: We will train several machine learning models based on different algorithms (including the Isolation Forest) on the preprocessed and engineered data. The models will learn the normal patterns and behaviors in credit card transactions. This activity includes hyperparameter tuning.</li>



<li>Model evaluation and testing: this involves evaluating the performance of the trained model on a test dataset in order to assess its accuracy, precision, recall, and other metrics and to identify any potential issues or improvements. As part of this activity, we compare the performance of the isolation forest to other models.</li>
</ol>



<p>We begin by setting up imports and loading the data into our Python project. Then we&#8217;ll quickly verify that the dataset looks as expected. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from datetime import date, timedelta, datetime
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor, KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2

# The Data can be downloaded from Kaggle.com: https://www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv
path = 'data/credit-card-transactions/'
df = pd.read_csv(f'{path}creditcard.csv')
df</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		Time	V1			V2			V3			V4			V5			V6			V7			V8			V9			...	V21	V22	V23	V24	V25	V26	V27	V28	Amount	Class
0		0.0		-1.359807	-0.072781	2.536347	1.378155	-0.338321	0.462388	0.239599	0.098698	0.363787	...	-0.018307	0.277838	-0.110474	0.066928	0.128539	-0.189115	0.133558	-0.021053	149.62	0
1		0.0		1.191857	0.266151	0.166480	0.448154	0.060018	-0.082361	-0.078803	0.085102	-0.255425	...	-0.225775	-0.638672	0.101288	-0.339846	0.167170	0.125895	-0.008983	0.014724	2.69	0
2		1.0		-1.358354	-1.340163	1.773209	0.379780	-0.503198	1.800499	0.791461	0.247676	-1.514654	...	0.247998	0.771679	0.909412	-0.689281	-0.327642	-0.139097	-0.055353	-0.059752	378.66	0
3		1.0		-0.966272	-0.185226	1.792993	-0.863291	-0.010309	1.247203	0.237609	0.377436	-1.387024	...	-0.108300	0.005274	-0.190321	-1.175575	0.647376	-0.221929	0.062723	0.061458	123.50	0
4		2.0		-1.158233	0.877737	1.548718	0.403034	-0.407193	0.095921	0.592941	-0.270533	0.817739	...	-0.009431	0.798278	-0.137458	0.141267	-0.206010	0.502292	0.219422	0.215153	69.99	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
284802	172786.0	-11.881118	10.071785	-9.834783	-2.066656	-5.364473	-2.606837	-4.918215	7.305334	1.914428	...	0.213454	0.111864	1.014480	-0.509348	1.436807	0.250034	0.943651	0.823731	0.77	0
284803	172787.0	-0.732789	-0.055080	2.035030	-0.738589	0.868229	1.058415	0.024330	0.294869	0.584800	...	0.214205	0.924384	0.012463	-1.016226	-0.606624	-0.395255	0.068472	-0.053527	24.79	0
284804	172788.0	1.919565	-0.301254	-3.249640	-0.557828	2.630515	3.031260	-0.296827	0.708417	0.432454	...	0.232045	0.578229	-0.037501	0.640134	0.265745	-0.087371	0.004455	-0.026561	67.88	0
284805	172788.0	-0.240440	0.530483	0.702510	0.689799	-0.377961	0.623708	-0.686180	0.679145	0.392087	...	0.265245	0.800049	-0.163298	0.123205	-0.569159	0.546668	0.108821	0.104533	10.00	0
284806	172792.0	-0.533413	-0.189733	0.703337	-0.506271	-0.012546	-0.649617	1.577006	-0.414650	0.486180	...	0.261057	0.643078	0.376777	0.008797	-0.473649	-0.818267	-0.002415	0.013649	217.00	0</pre></div>



<p>Everything should look good so that we can continue.</p>



<h3 class="wp-block-heading" id="h-step-2-data-exploration">Step #2: Data Exploration</h3>



<p>The purpose of data exploration in anomaly detection is to gain a better understanding of the data and the underlying patterns and trends that it contains. This can help to identify potential anomalies or outliers in the data and to determine the appropriate approaches and algorithms for detecting them.</p>



<p>In the following, we will create histograms that visualize the distribution of the different features. </p>



<h4 class="wp-block-heading" id="h-2-1-features">2.1 Features</h4>



<p>First, we will create a series of frequency histograms for our dataset&#8217;s features (V1 &#8211; V28). We will take a closer look at Class, Time, and Amount later, so we can drop them for now. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create histograms on all features
df_hist = df.drop(['Time','Amount','Class'], axis=1)
df_hist.hist(figsize=(20,20), bins=50, color='c', edgecolor='black')
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9013" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output1-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png" data-orig-size="1170,1131" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output1-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output1-1-1024x990.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Feature frequency distributions on credit card data" class="wp-image-9013" width="1011" height="976" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 1170w" sizes="(max-width: 1011px) 100vw, 1011px" /></figure>



<p>Next, we will look at the correlation between the 28 features. We expect the features to be uncorrelated due to the use of PCA. Let&#8217;s verify that by creating a heatmap on their correlation values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># feature correlation
f_cor = df_hist.corr()
sns.heatmap(f_cor)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9014" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output2-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" data-orig-size="782,251" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output2-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" alt="" class="wp-image-9014" width="796" height="255" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 782w, https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 768w" sizes="(max-width: 796px) 100vw, 796px" /></figure>



<p>As we expected, our features are uncorrelated. </p>



<h4 class="wp-block-heading" id="h-2-2-class-labels">2.2 Class Labels</h4>



<p>Next, let&#8217;s print an overview of the class labels to understand better how balanced the two classes are.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot the balance of class labels
fig1, ax1 = plt.subplots(figsize=(14, 7))
plt.pie(df[['Class']].value_counts(), explode=[0,0.1], labels=[0,1], autopct='%1.2f%%', shadow=True, startangle=45)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9015" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" data-orig-size="394,394" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" alt="balance of class labels in the credit card fraud dataset" class="wp-image-9015" width="494" height="494" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output6.png 394w, https://www.relataly.com/wp-content/uploads/2022/07/output6.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output6.png 150w" sizes="(max-width: 494px) 100vw, 494px" /></figure>



<p>We see that the data set is highly unbalanced. While such an imbalance poses a problem for traditional classification techniques, it makes the dataset well suited for outlier detection algorithms like the Isolation Forest.</p>
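<p>To put a number on this imbalance, we can look at the relative class frequencies. The following sketch uses a hypothetical miniature dataframe in place of the actual credit card data:</p>

```python
import pandas as pd

# Hypothetical miniature stand-in for the credit card dataset:
# 0 = regular transaction, 1 = fraud
df = pd.DataFrame({'Class': [0] * 998 + [1] * 2})

# Relative class frequencies reveal the imbalance
counts = df['Class'].value_counts(normalize=True)

# Ratio of regular to fraudulent transactions
ratio = counts[0] / counts[1]
print(f'{ratio:.0f} regular transactions per fraud case')
```

On the real dataset, the same two lines applied to df_base['Class'] quantify how rare the fraud class is.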



<h4 class="wp-block-heading" id="h-2-3-time-and-amount">2.3 Time and Amount</h4>



<p>Finally, we will create some plots to gain insights into time and amount. Let&#8217;s first have a look at the time variable. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot distribution of the Time variable, which contains transaction data for two days
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(14, 7))
sns.histplot(data=df_base[df_base['Class'] == 0], x='Time', kde=True, ax=ax[0])
sns.histplot(data=df_base[df_base['Class'] == 1], x='Time', kde=True, ax=ax[1])
plt.show()</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9016" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" data-orig-size="842,423" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: histogram of credit card transactions" class="wp-image-9016" width="933" height="469" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output3.png 842w, https://www.relataly.com/wp-content/uploads/2022/07/output3.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output3.png 768w" sizes="(max-width: 933px) 100vw, 933px" /></figure>



<p>Our dataset covers a time frame of two days, which the distribution plot reflects well. Most transactions happen during the day &#8211; which is plausible.</p>



<p>Next, let&#8217;s examine the correlation between transaction size and fraud cases. To do this, we create a scatterplot that distinguishes between the two classes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot amount against time, split by class
fig, ax = plt.subplots(figsize=(20, 7))
ax.set(xlim=(0, 1500))

# subsample only the majority class; fraud cases are rare and are plotted in full
sns.scatterplot(data=df_base[df_base['Class']==0][::15], x='Amount', y='Time', hue='Class', palette=['#BECEE9'], alpha=.5, ax=ax)
sns.scatterplot(data=df_base[df_base['Class']==1], x='Amount', y='Time', hue='Class', palette=['#EF1B1B'], zorder=100, ax=ax)
plt.legend(['no fraud', 'fraud'], loc='lower right')
fig.suptitle('Transaction Amount over Time split by Class')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9019" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" data-orig-size="1184,474" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="identifying anomalous datapoints in two dimensions, credit card fraud detection" data-image-description="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-image-caption="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output4-1024x410.png" alt="Transaction Amount over Time splot by Class; Credit card fraud detection with isolation forests" class="wp-image-9019" width="1053" height="422" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1184w" sizes="(max-width: 1053px) 100vw, 1053px" /></figure>



<p>The scatterplot shows that fraudulent transactions tend to involve relatively low amounts. In other words, there is a slight inverse correlation between the class label and the transaction amount. </p>
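<p>We can check this impression numerically with a correlation coefficient. The sketch below uses hypothetical toy data in place of df_base; on the real dataset, the same one-liner applies to the Class and Amount columns:</p>

```python
import pandas as pd

# Hypothetical toy data: fraud cases (Class=1) carry low amounts
df = pd.DataFrame({
    'Amount': [250.0, 120.0, 900.0, 5.0, 1.0, 480.0, 3.0, 760.0],
    'Class':  [0,      0,     0,     1,   1,   0,     1,   0],
})

# Pearson correlation between class label and transaction amount;
# a negative value indicates fraud is associated with lower amounts
corr = df['Class'].corr(df['Amount'])
print(f'correlation: {corr:.2f}')
```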



<h3 class="wp-block-heading" id="h-step-3-preprocessing">Step #3: Preprocessing</h3>



<p>Now that we have a rough idea of the data, we will prepare it for training the model. For the training of the isolation forest, we drop the class label from the base dataset and then divide the data into separate datasets for training (70%) and testing (30%). We do not have to normalize or standardize the data when using a decision tree-based algorithm. </p>



<p>We will use all features from the dataset. So our model will be a multivariate anomaly detection model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Separate the classes from the train set
df_classes = df_base['Class']
df_train = df_base.drop(['Class'], axis=1)

# split the data into train and test 
X_train, X_test, y_train, y_test = train_test_split(df_train, df_classes, test_size=0.30, random_state=42)</pre></div>



<h3 class="wp-block-heading" id="h-step-4-model-training">Step #4: Model Training</h3>



<p>Once we have prepared the data, it&#8217;s time to start training the Isolation Forest. However, to compare the performance of our model with other algorithms, we will train several different models. In total, we will prepare and compare the following five outlier detection models:</p>



<ul class="wp-block-list">
<li>Isolation Forest (default)</li>



<li>Isolation Forest (hypertuned)</li>



<li>Local Outlier Factor (default)</li>



<li>K-Nearest Neighbors (default)</li>



<li>K-Nearest Neighbors (hypertuned)</li>
</ul>



<p>For hyperparameter tuning of the models, we use Grid Search.</p>



<h4 class="wp-block-heading" id="h-4-1-train-an-isolation-forest">4.1 Train an Isolation Forest</h4>



<p>Next, we train our isolation forest algorithm. An isolation forest is a machine learning algorithm for anomaly detection. It is closely related to the random forest, a widely used ensemble learning method that combines multiple decision trees &#8211; but unlike a random forest, it is trained without labels, and its trees serve to isolate individual data points rather than to predict a class.</p>



<p>The isolation forest algorithm works by randomly selecting a feature and a random split value for that feature, recursively dividing the data into ever smaller subsets. Because anomalies are few and different, they tend to be isolated after far fewer splits than regular points. Each tree in the ensemble repeats this process, and the average number of splits needed to isolate a point across all trees yields its anomaly score.</p>
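<p>This isolation mechanism is easy to see on a toy dataset (a sketch with synthetic data, not the tutorial's credit card data): a point far from the bulk of the data receives the lowest score from scikit-learn&#8217;s score_samples.</p>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# 100 regular points clustered around the origin, plus one planted outlier
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_all = np.vstack([X_inliers, [[10.0, 10.0]]])

model = IsolationForest(random_state=42).fit(X_all)

# score_samples returns the (negated) anomaly score: lower = more anomalous.
# The planted outlier is isolated after very few random splits.
scores = model.score_samples(X_all)
print(scores.argmin())  # index 100 -> the planted outlier
```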



<p>The isolation forest algorithm is designed to be efficient and effective for detecting anomalies in high-dimensional datasets. It has a number of advantages, such as its ability to handle large and complex datasets, and its high accuracy and low false positive rate. It is widely used in a variety of applications, such as fraud detection, intrusion detection, and anomaly detection in manufacturing.</p>



<h5 class="wp-block-heading" id="h-4-1-1-isolation-forest-baseline">4.1.1 Isolation Forest (baseline)</h5>



<p>First, we train a baseline model. A baseline model is a simple or reference model used as a starting point for evaluating the performance of more complex or sophisticated models in machine learning. It provides a baseline or benchmark for comparison, which allows us to assess the relative performance of different models and to identify which models are more accurate, effective, or efficient.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># train the model on the nominal train set
model_isf = IsolationForest().fit(X_train)</pre></div>



<p>We create a function to measure the performance of our baseline model and illustrate the results in a confusion matrix. Later, when we go into hyperparameter tuning, we can use this function to objectively compare the performance of more sophisticated models.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def measure_performance(model, X_test, y_true, map_labels):
    # predict on the test set
    y_pred = pd.Series(model.predict(X_test), index=X_test.index)
    if map_labels:
        # outlier detectors return 1 (inlier) / -1 (outlier); map to class labels
        y_pred = y_pred.map({1: 0, -1: 1})

    # confusion matrix: rows = actual class, columns = predicted class
    matrix = confusion_matrix(y_true, y_pred)
    sns.heatmap(matrix,
                xticklabels=['Regular [0]', 'Fraud [1]'], 
                yticklabels=['Regular [0]', 'Fraud [1]'], 
                annot=True, fmt=&quot;d&quot;, linewidths=.5, cmap=&quot;YlGnBu&quot;)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    
    print(classification_report(y_true, y_pred))
    
    # score returns (precision, recall, f1_score, support)
    model_score = score(y_true, y_pred, average='macro')
    print(f'f1_score: {np.round(model_score[2]*100, 2)}%')
    
    return model_score

model_name = 'Isolation Forest (baseline)'
print(f'{model_name} model')

map_labels = True
model_score = measure_performance(model_isf, X_test, y_test, map_labels)

performance_df = pd.DataFrame([{'model_name': model_name, 
                                'precision': model_score[0], 
                                'recall': model_score[1], 
                                'f1_score': model_score[2]}])</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4585" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-31-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" data-orig-size="668,699" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-31" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" alt="confusion matrix for our credit card fraud detection algorithm" class="wp-image-4585" width="490" height="514" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png 668w, https://www.relataly.com/wp-content/uploads/2021/06/image-31.png 287w" sizes="(max-width: 490px) 100vw, 490px" /></figure>



<h5 class="wp-block-heading" id="h-4-1-2-isolation-forest-hypertuning">4.1.2 Isolation Forest (Hypertuning)</h5>



<p>Next, we will train another Isolation Forest Model using grid search hyperparameter tuning to test different parameter configurations. </p>



<p>The hyperparameters of an isolation forest include:</p>



<ul class="wp-block-list">
<li><strong>n_estimators</strong>: The number of decision trees in the forest.</li>



<li><strong>max_samples</strong>: The number of samples to draw from the dataset to train each decision tree.</li>



<li><strong>contamination: </strong>The expected proportion of anomalies in the dataset.</li>



<li><strong>max_features: </strong>The number of features to consider when choosing the split points in the decision trees.</li>



<li><strong>bootstrap: </strong>Whether or not to use bootstrap sampling when drawing samples to train the decision trees.</li>
</ul>
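<p>As a sketch of how these hyperparameters map onto the scikit-learn constructor (the values shown here are illustrative, not tuned choices):</p>

```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(
    n_estimators=100,     # number of isolation trees in the forest
    max_samples='auto',   # samples drawn per tree (min(256, n_samples))
    contamination=0.002,  # expected share of anomalies in the data
    max_features=1.0,     # fraction of features considered per tree
    bootstrap=False,      # sample without replacement
    random_state=42,
)
print(model.get_params()['n_estimators'])  # 100
```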



<p>These hyperparameters can be adjusted to improve the performance of the isolation forest. The optimal values for these hyperparameters will depend on the specific characteristics of the dataset and the task at hand, which is why we require several experiments.</p>



<p>The code below will evaluate the different parameter configurations based on their f1_score and automatically choose the best-performing model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define the parameter grid
n_estimators=[50, 100]
max_features=[1.0, 5, 10]
bootstrap=[True]
param_grid = dict(n_estimators=n_estimators, max_features=max_features, bootstrap=bootstrap)

# Build the gridsearch; the candidate values are passed via param_grid, not the constructor
model_isf = IsolationForest(contamination=contamination_rate, n_jobs=-1)

# Define an f1_scorer
f1sc = make_scorer(f1_score, average='macro')

grid = GridSearchCV(estimator=model_isf, param_grid=param_grid, cv=3, scoring=f1sc)
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print(&quot;Best: {0}, using {1}&quot;.format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'Isolation Forest (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = True # maps 1 to 0 and -1 to 1 - required for outlier detectors, not for classifiers
model_score = measure_performance(best_model, X_test, y_test, map_labels)
performance_df = pd.concat([performance_df,
                            pd.DataFrame([{'model_name': model_name, 
                                           'precision': model_score[0], 
                                           'recall': model_score[1], 
                                           'f1_score': model_score[2]}])], ignore_index=True)
results_df</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Best: [0.61083219 0.55718259 0.55912644 0.52670328 0.5317127 ], using {'n_neighbors': 1}
KNN (tuned) model
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85385
           1       0.21      0.48      0.29        58

    accuracy                           1.00     85443
   macro avg       0.60      0.74      0.64     85443
weighted avg       1.00      1.00      1.00     85443

f1_score: 64.39%</pre></div>



<h4 class="wp-block-heading" id="h-4-2-lof-model">4.2 LOF Model</h4>



<p>We train the Local Outlier Factor Model using the same training data and evaluation procedure. The local outlier factor (LOF) is a measure of the local deviation of a data point with respect to its neighbors. It is used to identify points in a dataset that are significantly different from their surrounding points and that may therefore be considered outliers. </p>



<p>The LOF is a useful tool for detecting outliers in a dataset, as it considers the local context of each data point rather than the global distribution of the data. This makes it more robust to outliers that are only significant within a specific region of the dataset. However, isolation forests can often outperform LOF models.</p>
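<p>The local context idea is easy to demonstrate on a toy dataset (a sketch with synthetic points, not the tutorial's data): a point whose density is much lower than that of its neighbors is flagged as an outlier.</p>

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A tight cluster of points plus one point far away from its neighbors
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

# fit_predict labels inliers as 1 and outliers as -1
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)
print(labels)  # the last point is flagged: [ 1  1  1  1 -1]
```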



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a local outlier factor model
model_lof = LocalOutlierFactor(n_neighbors=3, contamination=contamination_rate, novelty=True)
model_lof.fit(X_train)

# Evaluate model performance
model_name = 'LOF (baseline)'
print(f'{model_name} model')

map_labels = True 
model_score = measure_performance(model_lof, X_test, y_test, map_labels)
performance_df = pd.concat([performance_df,
                            pd.DataFrame([{'model_name': model_name, 
                                           'precision': model_score[0], 
                                           'recall': model_score[1], 
                                           'f1_score': model_score[2]}])], ignore_index=True)</pre></div>



<h4 class="wp-block-heading" id="h-4-3-knn-model">4.3 KNN Model</h4>



<p>Next, we train the KNN models. KNN is a type of machine learning algorithm for classification and regression. It is a type of instance-based learning, which means that it stores and uses the training data instances themselves to make predictions, rather than building a model that summarizes or generalizes the data.</p>
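<p>A minimal sketch of this instance-based behavior on hypothetical one-dimensional data: the classifier simply stores the training points and predicts by majority vote among the k nearest ones.</p>

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny illustrative dataset: a single feature, two classes
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# KNN stores the training instances and predicts by majority vote
# among the k nearest stored points
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2.5], [10.5]]))  # [0 1]
```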



<p>Below we add two K-Nearest Neighbor models to our list. We use the default parameter hyperparameter configuration for the first model. The second model will most likely perform better because we optimize its hyperparameters using the grid search technique.</p>



<h5 class="wp-block-heading" id="h-4-3-1-knn-default">4.3.1 KNN (default)</h5>



<p>First, we train the default model using the same training data as before. The most important hyperparameter of KNN is n_neighbors, the number of neighbors considered in the majority vote. By experimenting with different values of this parameter, you can try to identify the number of neighbors that maximizes the model&#8217;s performance on the given dataset, since the default settings of the KNN algorithm may not suit the specific dataset we are working with.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a KNN Model
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X=X_train, y=y_train)

# Evaluate model performance
model_name = 'KNN (baseline)'
print(f'{model_name} model')

map_labels = False # classifiers already predict 0/1, so no label mapping is needed
model_score = measure_performance(model_knn, X_test, y_test, map_labels)
performance_df = pd.concat([performance_df,
                            pd.DataFrame([{'model_name': model_name, 
                                           'precision': model_score[0], 
                                           'recall': model_score[1], 
                                           'f1_score': model_score[2]}])], ignore_index=True)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4595" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-35-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" data-orig-size="533,532" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-35" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Confusion matrix for our credit card fraud detection algorithm" class="wp-image-4595" width="586" height="585" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 533w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 150w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 120w" sizes="(max-width: 586px) 100vw, 586px" /></figure>



<h5 class="wp-block-heading" id="h-4-3-1-knn-hypertuned">4.3.2 KNN (hypertuned)</h5>



<p>In the next step, we will train a second KNN model to improve its performance by fine-tuning its hyperparameters. Despite having only a few parameters, hyperparameter tuning can enhance the model&#8217;s ability to make accurate predictions. In this case, we will concentrate on optimizing the number of nearest neighbors considered in the KNN algorithm. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define hypertuning parameters
n_neighbors=[1, 2, 3, 4, 5]
param_grid = dict(n_neighbors=n_neighbors)

# Build the gridsearch; the candidate values are passed via param_grid, not the constructor
model_knn = KNeighborsClassifier()
grid = GridSearchCV(estimator=model_knn, param_grid=param_grid, cv=5)
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print(&quot;Best: {0}, using {1}&quot;.format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'KNN (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = False # classifiers already predict 0/1, so no label mapping is needed
model_score = measure_performance(best_model, X_test, y_test, map_labels)
performance_df = pd.concat([performance_df,
                            pd.DataFrame([{'model_name': model_name, 
                                           'precision': model_score[0], 
                                           'recall': model_score[1], 
                                           'f1_score': model_score[2]}])], ignore_index=True)
results_df</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4596" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-36-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" data-orig-size="379,262" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-36" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" alt="confusion matrix for our credit card fraud detection algorithm" class="wp-image-4596" width="521" height="360" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png 379w, https://www.relataly.com/wp-content/uploads/2021/06/image-36.png 300w" sizes="(max-width: 521px) 100vw, 521px" /></figure>



<h3 class="wp-block-heading" id="h-step-5-measuring-and-comparing-performance">Step #5: Measuring and Comparing Performance</h3>



<p>Finally, we will compare the performance of our models with a bar chart that shows the f1_score, precision, and recall. If you want to learn more about classification performance, <a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">this tutorial discusses the different metrics in more detail</a>.</p>
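<p>As a quick refresher on what the three metrics measure, here is a sketch on hypothetical predictions for ten transactions (1 = fraud): precision is the share of flagged cases that are real fraud, recall is the share of fraud cases we caught, and the f1_score is their harmonic mean.</p>

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical predictions for 10 transactions (1 = fraud)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(precision_score(y_true, y_pred))  # 2 of 3 flagged cases are real fraud: 2/3
print(recall_score(y_true, y_pred))     # 2 of 4 fraud cases were caught: 1/2
print(f1_score(y_true, y_pred))         # harmonic mean of the two: 4/7
```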



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">print(performance_df)

performance_df = performance_df.sort_values('model_name')

fig, ax = plt.subplots(figsize=(12, 4))
tidy = performance_df.melt(id_vars='model_name').rename(columns=str.title)
sns.barplot(y='Model_Name', x='Value', hue='Variable', data=tidy, ax=ax, palette='nipy_spectral', linewidth=1, edgecolor=&quot;w&quot;)
plt.title('Model Outlier Detection Performance (Macro)')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4679" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-54-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png" data-orig-size="1446,650" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-54" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-54-1024x460.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Performance comparison between different algorithms for credit card fraud detection" class="wp-image-4679" width="982" height="441" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 1446w" sizes="(max-width: 982px) 100vw, 982px" /></figure>



<p>All three metrics play an important role in evaluating performance: on the one hand, we want to capture as many fraud cases as possible; on the other hand, we don&#8217;t want to raise false alarms too frequently. </p>



<ul class="wp-block-list">
<li>As we can see, the optimized Isolation Forest achieves a particularly well-balanced performance across the three metrics.</li>



<li>The default Isolation Forest has a high f1_score and detects many fraud cases, but it also raises false alarms frequently.</li>



<li>The opposite is true for the KNN model: it detects only a few fraud cases, but when it flags one, it is usually correct.</li>



<li>The default LOF model performs slightly worse than the other models. Compared to the optimized Isolation Forest, it performs worse in all three metrics.</li>
</ul>
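<p>To make the trade-off between these metrics concrete, here is a small pure-Python sketch with made-up labels (purely illustrative, not taken from the dataset above) that computes precision, recall, and F1 for the fraud class:</p>

```python
# Toy ground-truth and predicted labels (1 = fraud, 0 = normal); values are illustrative only
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # caught fraud cases
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed fraud cases

precision = tp / (tp + fp)  # how often a raised alarm is actually fraud
recall = tp / (tp + fn)     # share of fraud cases that are caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(precision, recall, f1)  # 0.75 0.75 0.75
```

<p>A model can trade one metric for the other: raising more alarms increases recall but tends to lower precision, which is why the balanced F1 score is useful for comparing models.</p>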



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>Credit card fraud detection is important because it helps to protect consumers and businesses, to maintain trust and confidence in the financial system, and to reduce financial losses. It is a critical part of ensuring the security and reliability of credit card transactions. </p>



<p>This article has shown how to use Python and the Isolation Forest algorithm to implement a credit card fraud detection system. We developed a multivariate anomaly detection model to spot fraudulent credit card transactions. You learned how to prepare the data for training and testing an Isolation Forest model and how to validate this model. Finally, we have seen that the Isolation Forest is a robust algorithm for anomaly detection that can outperform traditional techniques.</p>



<p>I hope you enjoyed the article and can apply what you learned to your projects. Have a great day!</p>



<p>The post <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/">Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4233</post-id>	</item>
		<item>
		<title>Build a High-Performing Movie Recommender System using Collaborative Filtering in Python</title>
		<link>https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/</link>
					<comments>https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 31 May 2021 10:29:35 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Collaborative Filtering]]></category>
		<category><![CDATA[Recommender Systems]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Surprise Lib]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[Collaborative Filtering Recommender System]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[SVD]]></category>
		<category><![CDATA[Unsupervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=4376</guid>

					<description><![CDATA[<p>The digital age presents us with an unmanageable number of decisions and even more options. Which series to watch today? What song to listen to next? Nowadays, the internet and its vast content offer too many choices. But there is hope &#8211; recommender systems are here to solve this problem and support our decision-making. They ... <a title="Build a High-Performing Movie Recommender System using Collaborative Filtering in Python" class="read-more" href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/" aria-label="Read more about Build a High-Performing Movie Recommender System using Collaborative Filtering in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/">Build a High-Performing Movie Recommender System using Collaborative Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The digital age presents us with an unmanageable number of decisions and even more options. Which series to watch today? What song to listen to next? Nowadays, the internet and its vast content offer too many choices. But there is hope &#8211; recommender systems are here to solve this problem and support our decision-making. They rank among the fascinating use cases for machine learning and intelligently filter information to present us with a smaller set of options that most likely fit our tastes and needs. This tutorial will explore recommender systems and implement a movie recommender in Python that uses &#8220;Collaborative Filtering.&#8221;</p>



<p>This article is structured as follows: We begin by briefly going through the basics of different types of recommender systems. Then we look at the most common recommender algorithms and go into more detail on Collaborative Filtering. Once equipped with this conceptual understanding, we develop our recommender system using the popular 100k Movies Dataset and train and test a model that predicts movie ratings with the Scikit-Surprise Python library. The recommendation approach combines Collaborative Filtering and Singular Value Decomposition (SVD).</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-an-overview-of-recommender-techniques">An Overview of Recommender Techniques</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The first attempts at recommendation systems date back to the 1970s. The approach was relatively simple: it categorized users into groups and suggested the same content to all users in the same group. Nevertheless, it was a breakthrough because, for the first time, a program could make personalized recommendations. As the importance of recommendation engines increased, so did the interest in improving their predictions.</p>



<p>With the rise of the internet and the rapidly growing amount of data available on the web, filtering relevant content has become increasingly important. Large tech companies such as Netflix (TV shows and movies), Amazon (products), or Facebook (user content and profiles) understood early on that they could use recommendation systems to personalize the selection of user content shown to their customers. They all face the same challenge of having massive content and only limited space to display it to their users. Therefore, it has become crucial for them to select and display only the content that matches the individual interests of their users.</p>



<p>Internet companies are ideally positioned to use recommender systems, as they have massive amounts of user data available. This user data is the foundation for generating personalized recommendations on a large scale: by analyzing behavior patterns among larger groups of users, the engine can tailor suggestions to the taste of individuals.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-full"><img decoding="async" width="512" height="512" data-attachment-id="11938" data-permalink="https://www.relataly.com/reinforcement-learning-agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png" data-orig-size="512,512" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Reinforcement Learning &amp;#8211; Agents learn from their activities if our reward function tells them how they perform" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png" src="https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png" alt="" class="wp-image-11938" srcset="https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png 512w, https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png 140w" 
sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">The more choices we have, the more difficult it gets to decide </figcaption></figure>
</div></div>
</div>



<p></p>



<h3 class="wp-block-heading" id="h-three-common-approaches-to-recommender-systems">Three Common Approaches to Recommender Systems</h3>



<p>Nowadays, many different approaches can generate recommendations. However, most recommender systems use one of the following three techniques:</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4348" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-29-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/05/image-29.png" data-orig-size="1834,751" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-29" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/05/image-29.png" src="https://www.relataly.com/wp-content/uploads/2021/05/image-29-1024x419.png" alt="machine learning for recommender systems -  comparison of three different techniques used to build recommender systems - content based filtering, collaborative filtering, hybrid approach" class="wp-image-4348" width="1056" height="430" srcset="https://www.relataly.com/wp-content/uploads/2021/05/image-29.png 300w, https://www.relataly.com/wp-content/uploads/2021/05/image-29.png 768w" sizes="(max-width: 1056px) 100vw, 1056px" /><figcaption class="wp-element-caption">Popular techniques used to build recommender systems</figcaption></figure>



<h4 class="wp-block-heading" id="h-content-based-filtering">Content-based Filtering&nbsp;</h4>



<p id="h-content-based-filtering-is-a-technique-that-recommends-similar-items-based-on-item-content-naturally-this-approach-is-based-on-metadata-to-find-out-which-items-are-similar-for-example-in-the-case-of-movie-recommendations-the-algorithm-would-look-at-the-genre-cast-or-director-of-a-movie-models-may-also-consider-metadata-on-users-such-as-age-gender-and-so-on-to-suggest-similar-content-to-similar-users-the-similarity-can-be-calculated-using-different-methods-for-example-cosine-similarity-or-minkowski-distance">Content-based filtering&nbsp;is a technique that recommends similar items based on item content. Naturally, this approach is based on metadata to determine which items are similar. For example, in the case of movie recommendations, the algorithm would look at the genre, cast, or director of a movie. Models may also consider metadata on users, such as age, gender, etc., to suggest similar content to similar users. There are different methods to calculate the similarity, for example, Cosine Similarity or Minkowski Distance.</p>
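<p>As a quick illustration of the similarity computation, the following sketch compares hypothetical one-hot genre vectors with Cosine Similarity (the movies and their genre encodings are invented for this example):</p>

```python
import math

# Hypothetical genre vectors (1 = movie has the genre, 0 = it does not)
movie_a = [1, 0, 1, 1, 0]
movie_b = [1, 0, 1, 0, 0]  # shares two genres with movie_a
movie_c = [0, 1, 0, 0, 1]  # shares no genre with movie_a

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot product over the product of norms
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(movie_a, movie_b))  # high similarity (~0.82)
print(cosine_similarity(movie_a, movie_c))  # 0.0, nothing in common
```

<p>A content-based engine would recommend movie_b, the item most similar to what the user already liked.</p>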



<p>A significant challenge in content-based filtering is the transferability of user preference insights from one item type to another. Content-based recommenders often struggle to transfer user actions on one item type (e.g., books) to other content types (e.g., clothing). In addition, content-based systems tend to develop tunnel vision, leading the engine to recommend more and more of the same.</p>



<p>A separate article covers content-based filtering in more detail and shows <a href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/" target="_blank" rel="noreferrer noopener">how to implement this approach for movie recommendations. </a></p>



<h4 class="wp-block-heading" id="h-collaborative-filtering">Collaborative Filtering</h4>



<p>Collaborative Filtering&nbsp;is a well-established approach to building recommendation systems. The recommendations generated through Collaborative Filtering are based on past interactions between a user and a set of items (movies, products, etc.), which are matched against past item-user interactions within a larger group of people. The main idea is to use the interactions between a group of users and a group of items to predict how users would rate items they have not rated yet.</p>



<p>A well-known challenge of collaborative filtering is the cold start problem, which arises when new users enter the system without any ratings. In that case, the engine does not know their interests and cannot make meaningful recommendations. The same applies to new items entering the system (e.g., products) that have not yet received any ratings. In addition, recommendations can become self-reinforcing: popular content that many users have rated is recommended to almost all other users, making it even more popular, while content with few or no ratings is hardly ever suggested, so no one gets the chance to rate it.</p>



<h4 class="wp-block-heading" id="h-hybrid-approach">Hybrid Approach</h4>



<p>It is also possible to combine the two previous techniques in a hybrid approach. We can implement hybrid approaches by generating content-based and collaborative predictions separately and then combining them. The result is a model that considers both the interactions between users and items and the context information. Hybrid recommender systems often achieve better results than approaches that rely on a single one of the underlying techniques.</p>
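<p>A minimal way to build such a hybrid is a weighted blend of the two separately generated predictions. The ratings and the weight below are arbitrary values, chosen only to illustrate the idea:</p>

```python
# Hypothetical predicted ratings for three unseen movies from two separate models
content_based_pred = [4.0, 2.5, 3.0]
collaborative_pred = [4.5, 2.0, 3.5]

alpha = 0.4  # illustrative weight given to the content-based signal
hybrid_pred = [alpha * cb + (1 - alpha) * cf
               for cb, cf in zip(content_based_pred, collaborative_pred)]

print([round(p, 2) for p in hybrid_pred])  # [4.3, 2.2, 3.3]
```

<p>In practice, the weight would be tuned on validation data, and more elaborate hybrids feed both signals into a single learned model instead of a fixed blend.</p>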



<p>Netflix is known to use a hybrid recommendation system. Its engine recommends content to its users based on similar users&#8217; viewing and search habits (Collaborative Filtering). At the same time, it also recommends series and movies whose characteristics match the content users rated highly (content-based Filtering).</p>



<h3 class="wp-block-heading" id="h-how-model-based-collaborative-filtering-works">How Model-based Collaborative Filtering Works</h3>



<p>We can further differentiate between memory-based and model-based Collaborative Filtering. This tutorial focuses on model-based Collaborative Filtering, which is more commonly used. </p>



<h3 class="wp-block-heading" id="h-behavioral-patterns-dependencies-among-users-and-items">Behavioral Patterns: Dependencies among Users and Items</h3>



<p></p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>Collaborative filtering searches for behavioral patterns in interactions between a group of users and a group of items to infer the interests of individuals. The input data for collaborative Filtering is typically in the form of a user/item matrix filled with ratings (as shown below).</p>



<p>Patterns in the user/item matrix appear as dependencies between users and between items. Some dependencies are easy to grasp; for instance, some movies receive high ratings from the same group of users. As an example, assume two users, Ron and Joe, have rated movies. Ron enjoyed Batman, Indiana Jones, Star Wars, and Godzilla. Joe enjoyed the same movies as Ron, except Godzilla, which he has not yet rated. Based on the similarity between Joe and Ron, we can assume that Joe would also enjoy Godzilla. </p>
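<p>The Ron and Joe example can be written down as a tiny user/item structure in which a missing rating is filled in from the most similar user. This is a deliberately naive sketch of the intuition, not a real algorithm, and the ratings are invented:</p>

```python
# Toy ratings on a 5-star scale; None means the movie has not been rated yet
ratings = {
    "Ron": {"Batman": 5, "Indiana Jones": 4, "Star Wars": 5, "Godzilla": 4},
    "Joe": {"Batman": 5, "Indiana Jones": 4, "Star Wars": 5, "Godzilla": None},
}

# Movies Joe has already rated
shared = [m for m, r in ratings["Joe"].items() if r is not None]

# Joe agrees with Ron on every shared movie, so we adopt Ron's rating for Godzilla
if all(ratings["Joe"][m] == ratings["Ron"][m] for m in shared):
    guess = ratings["Ron"]["Godzilla"]
    print(f"Predicted rating for Joe -> Godzilla: {guess}")
```

<p>Real collaborative filtering generalizes this idea: instead of copying a single neighbor's rating, it weights the ratings of many similar users, or learns a statistical model of the whole matrix.</p>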



<p>Things get more complex when latent dependencies are present in the data. Imagine Bob gave a three-star rating to five different movies, while another user, Jenny, rated the same movies but always gave four stars. There is some form of dependency between the two users, and although it is less obvious than in the first example, taking such latent dependencies into account will improve predictions.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4441" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/image-3-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-3.png" data-orig-size="402,303" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-3.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-3.png" alt="Dependencies among users and items for movie recommendations" class="wp-image-4441" width="327" height="247" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-3.png 402w, https://www.relataly.com/wp-content/uploads/2021/06/image-3.png 300w" sizes="(max-width: 327px) 100vw, 327px" /><figcaption class="wp-element-caption">user/movies matrix</figcaption></figure>
</div>
</div>



<p></p>



<h3 class="wp-block-heading" id="h-machine-learning-and-dimensionality-reduction">Machine Learning and Dimensionality Reduction</h3>



<p>Model-based collaborative filtering techniques estimate the parameters of statistical models to predict how individual users would rate an unrated item. A widely used approach formulates this problem as a classification task with the items as features, the users as samples, and the ratings as prediction labels (as shown in the matrix). We can use various algorithms to solve such an optimization problem, including gradient-based techniques or alternating least squares.</p>



<p>However, user/item matrices can become very large, making the search for patterns computationally expensive. Also, users typically rate only a tiny fraction of the items in the matrix, so algorithms must deal with a large number of missing values (a sparse matrix). Therefore, combining machine learning or deep learning with techniques for dimensionality reduction has become state-of-the-art.</p>



<p>One of the most widely used techniques for dimensionality reduction is matrix factorization. This approach compresses the initial sparse user/item matrix by decomposing it into separate matrices that represent items and users as latent feature vectors (as shown below). These factor matrices are densely populated and thus easier to handle, and the factorization also enables the model to uncover latent dependencies among items and users, which increases model accuracy. </p>
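<p>The idea can be sketched with NumPy's singular value decomposition on a tiny, fully observed toy matrix (the ratings are invented; real collaborative filtering must also handle the missing entries, which libraries such as Scikit-Surprise take care of):</p>

```python
import numpy as np

# Tiny illustrative rating matrix: rows = users, columns = movies
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
])

# Factorize and keep only k latent features
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]  # each user as a k-dimensional feature vector
item_factors = Vt[:k, :]         # each movie as a k-dimensional feature vector

# Multiplying the two dense factor matrices approximates the original ratings
R_approx = user_factors @ item_factors
print(np.round(R_approx, 1))
```

<p>Even with only two latent features, the reconstruction stays close to the original ratings, which is why a compact factorization can stand in for the full sparse matrix.</p>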



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="276" data-attachment-id="4462" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/image-7-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-7.png" data-orig-size="1299,350" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-7" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-7.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-7-1024x276.png" alt="Matrix Factorization applied to the sparse Items / User Matrix" class="wp-image-4462" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-7.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-7.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-7.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-7.png 1299w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Matrix Factorization is applied to the sparse Items / User Matrix.</figcaption></figure>



<h3 class="wp-block-heading" id="h-python-libraries-for-collaborative-filtering">Python Libraries for Collaborative Filtering</h3>



<p>So far, only a few Python libraries support model-based collaborative filtering out of the box. The most well-known libraries for recommender systems are probably <a href="https://github.com/NicolasHug/Surprise" target="_blank" rel="noreferrer noopener">Scikit-Surprise</a> and <a href="https://www.fast.ai/" target="_blank" rel="noreferrer noopener">Fast.ai for PyTorch</a>. </p>



<p>Below is an overview of the different algorithms that these libraries support. </p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4303" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-18-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/05/image-18.png" data-orig-size="989,218" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-18" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/05/image-18.png" src="https://www.relataly.com/wp-content/uploads/2021/05/image-18.png" alt="different algorithms used to train recommender systems based on collaborative filtering " class="wp-image-4303" width="592" height="130" srcset="https://www.relataly.com/wp-content/uploads/2021/05/image-18.png 989w, https://www.relataly.com/wp-content/uploads/2021/05/image-18.png 300w, https://www.relataly.com/wp-content/uploads/2021/05/image-18.png 768w" sizes="(max-width: 592px) 100vw, 592px" /></figure>



<h2 class="wp-block-heading" id="h-implementing-a-movie-recommender-in-python-using-collaborative-filtering">Implementing a Movie Recommender in Python using Collaborative Filtering</h2>



<p>Now it&#8217;s time to get our hands dirty and begin implementing our movie recommender. As always, you can find the code in the relataly GitHub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_6a7ed1-b5"><a class="kb-button kt-button button kb-btn_7902e3-f8 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/05%20Recommender%20Systems/031Movie%20Recommender%20using%20Collaborative%20Filtering.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_26bd47-d7 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. Follow <a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a> to set it up.</p>



<p>Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p>In addition, we will be using <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn</a> for visualization and the recommender systems library <em><a href="https://github.com/NicolasHug/Surprise" target="_blank" rel="noreferrer noopener">Scikit-Surprise</a></em>. You can install the surprise package from the conda-forge channel with the following command: </p>



<ul class="wp-block-list">
<li><em>conda install -c conda-forge scikit-surprise</em></li>
</ul>



<p>You can install the other packages using standard console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
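<p>To verify that the environment is ready before you start, you can run a quick import check. This is an optional sketch; the dictionary maps each import name to the install name used above:</p>

```python
import importlib.util

# map import names to the install names used above
packages = {'pandas': 'pandas', 'numpy': 'numpy',
            'matplotlib': 'matplotlib', 'seaborn': 'seaborn',
            'surprise': 'scikit-surprise'}

# collect the install names of packages that cannot be imported
missing = [install for module, install in packages.items()
           if importlib.util.find_spec(module) is None]

if missing:
    print('please install:', ', '.join(missing))
else:
    print('all required packages are installed')
```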



<h3 class="wp-block-heading" id="h-about-the-imdb-movies-dataset">About the IMDB Movies Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>We will train our movie recommendation model on a popular Movies Dataset (you can download it from <a href="https://grouplens.org/datasets/movielens/latest/" target="_blank" rel="noreferrer noopener">grouplens.org</a>). The data stems from the MovieLens recommendation service and was collected between 1996 and 2018. Unpack the data into the working folder of your project.</p>



<p>The full Dataset contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users. However, we will be working with a subset of the data, &#8220;ratings_small.csv,&#8221; which contains 100,004 ratings from 671 users on about 9,000 movies.</p>



<p>The Dataset contains the following files, from which we will only use the first two (Source of the data description: <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset" target="_blank" rel="noreferrer noopener">Kaggle.com</a>):</p>



<ul class="wp-block-list">
<li><strong>movies_metadata.csv:</strong>&nbsp;The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens Dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.</li>



<li><strong>ratings_small.csv:</strong>&nbsp;The subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a single 5-star movie rating with half-star increments (0.5 &#8211; 5.0 stars).</li>
</ul>



<p>There are several other files included that we won&#8217;t use. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7128" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/mdb-movie-database/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" data-orig-size="1024,537" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="recomender systems collaborative filtering imdb movies" data-image-description="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-image-caption="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" src="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" alt="MDB Movie Database Recommender Systems Collaborative Filtering " class="wp-image-7128" width="366" height="192" srcset="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 768w" sizes="(max-width: 366px) 100vw, 366px" /><figcaption class="wp-element-caption">IMDB Movie Database </figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p>Ensure you have downloaded and unpacked the data and installed the required packages.</p>



<p>You can then load the movie data into your Python project using the code snippet below. We do not need all of the files in the movie dataset; we will only work with the following two:</p>



<ul class="wp-block-list">
<li>movies_metadata.csv</li>



<li>ratings_small.csv</li>
</ul>



<h4 class="wp-block-heading" id="h-1-1-load-the-movies-data">1.1 Load the Movies Data</h4>



<p>First, we will load the movies_metadata, which contains a list of all movies and meta information such as the release year, a short description, etc. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split, cross_validate
from ast import literal_eval

# in case you have placed the files outside of your working directory, you need to specify a path
path = '' # for example: 'data/movie_recommendations/'  

# load the movie metadata
df_moviesmetadata=pd.read_csv(path + 'movies_metadata.csv', low_memory=False) 
print(df_moviesmetadata.shape)
print(df_moviesmetadata.columns)
df_moviesmetadata.head(1)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(45466, 24)
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
	adult	belongs_to_collection								budget		genres												homepage								id	imdb_id		original_language	original_title	overview	...	release_date	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862	tt0114709	en					Toy Story		Led by Woody, Andy's toys live happily in his ...	...	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0
1 rows × 24 columns</pre></div>



<h4 class="wp-block-heading" id="h-1-2-load-the-ratings-data">1.2 Load the Ratings Data</h4>



<p>We proceed by loading the rating file. This file contains the movie ratings for each user, the movieId, and a timestamp.</p>



<p>In addition, we print the value counts for rankings in our Dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># load the movie ratings
df_ratings=pd.read_csv(path + 'ratings_small.csv', low_memory=False) 

print(df_ratings.shape)
print(df_ratings.columns)
df_ratings.head(3)

rankings_count = df_ratings.rating.value_counts().sort_values()
sns.barplot(x=rankings_count.index.sort_values(), y=rankings_count, color=&quot;b&quot;)
sns.set_theme(style=&quot;whitegrid&quot;)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(100004, 4)
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
	userId	movieId	rating	timestamp
0	1		31		2.5		1260759144
1	1		1029	3.0		1260759179
2	1		1061	3.0		1260759182</pre></div>



<p>As we can see, most of the ratings in our Dataset are positive.</p>



<h3 class="wp-block-heading" id="h-step-2-preprocessing-and-cleaning-the-data">Step #2 Preprocessing and Cleaning the Data</h3>



<p>We continue with preprocessing the data. The recommendations of a user-based collaborative filtering approach rely solely on the interactions between users and items. This means that training a prediction model does not require the movies&#8217; meta-information. Nevertheless, we will load the metadata because it is much nicer to display recommendations with movie titles, release years, and so on, instead of just IDs.</p>



<h4 class="wp-block-heading" id="h-2-1-clean-the-movie-data">2.1 Clean the Movie Data</h4>



<p>Unfortunately, the data quality of the movies&#8217; metadata is not excellent, so we need to fix a few things. The following operations will change some data types to integers, extract the release year and genres, and remove some records with incorrect data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># remove invalid records with invalid ids
df_mmeta = df_moviesmetadata.drop([19730, 29503, 35587])

df_movies = pd.DataFrame()

# extract the release year (invalid dates become NaN)
df_movies['year'] = pd.to_datetime(df_mmeta['release_date'], errors='coerce').dt.year

# extract genres
df_movies['genres'] = df_mmeta['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# add title, vote count, and vote average from the metadata
df_movies['title'] = df_mmeta['title']
df_movies['vote_count'] = pd.to_numeric(df_mmeta['vote_count'], errors='coerce').fillna(0).astype('int')
df_movies['vote_average'] = df_mmeta['vote_average']

# change the index to movie_id
df_movies['movieId'] = pd.to_numeric(df_mmeta['id'])
df_movies = df_movies.set_index('movieId')
df_movies</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">title									vote_count	vote_average	year	genres
		movieId					
862		Toy Story						5415			7.7	1995	[Animation, Comedy, Family]
8844	Jumanji							2413			6.9	1995	[Adventure, Fantasy, Family]
15602	Grumpier Old Men				92				6.5	1995	[Romance, Comedy]
31357	Waiting to Exhale				34				6.1	1995	[Comedy, Drama, Romance]
11862	Father of the Bride Part II		173				5.7	1995	[Comedy]
...	...	...	...	...	...
49279	The Man with the Rubber Head	29				7.6	1901	[Comedy, Fantasy, Science Fiction]
49271	The Devilish Tenant				12				6.7	1909	[Fantasy, Comedy]
49280	The One-Man Band				22				6.5	1900	[Fantasy, Action, Thriller]
404604	Mom								14				6.6	2017	[Crime, Drama, Thriller]
30840	Robin Hood						26				5.7	1991	[Drama, Action, Romance]
22931 rows × 5 columns</pre></div>



<h4 class="wp-block-heading" id="h-2-2-clean-the-ratings-data">2.2 Clean the Ratings Data</h4>



<p>Compared to the movie metadata, not much more needs to be done to the rating data. Here we just put the timestamp into a readable format.</p>



<p>In one of the following steps, we will use the Reader class from the Surprise library to parse the ratings and put them into a format compatible with the standard recommendation algorithms from Surprise. The Reader expects data where each line contains exactly one rating, following this structure:</p>



<p>user ; item ; rating ; timestamp</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># drop na values
df_ratings_temp = df_ratings.dropna()

# convert datetime
df_ratings_temp['timestamp'] = pd. to_datetime(df_ratings_temp['timestamp'], unit='s')

print(f'unique users: {len(df_ratings_temp.userId.unique())}, ratings: {len(df_ratings_temp)}')
df_ratings_temp.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">unique users: 671, ratings: 100004
	userId		movieId	rating	timestamp
0	1			31		2.5		2009-12-14 02:52:24
1	1			1029	3.0		2009-12-14 02:52:59
2	1			1061	3.0		2009-12-14 02:53:02
3	1			1129	2.0		2009-12-14 02:53:05
4	1			1172	4.0		2009-12-14 02:53:25</pre></div>



<h3 class="wp-block-heading" id="h-step-3-split-the-data-in-train-and-test">Step #3: Split the Data in Train and Test</h3>



<p>Next, we will split the data into train and test sets. In this way, we ensure that we can later evaluate the performance of our recommender model on data that the model has not yet seen. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># The Reader class is used to parse a file containing ratings.
# The file is assumed to specify only one rating per line, such as in the df_ratings_temp file above.
reader = Reader()
ratings_by_users = Dataset.load_from_df(df_ratings_temp[['userId', 'movieId', 'rating']], reader)

# Split the Data into train and test
train_df, test_df = train_test_split(ratings_by_users, test_size=.2)</pre></div>



<p>Once we have split the data into train and test, we can train the recommender model.</p>



<h3 class="wp-block-heading" id="h-step-4-train-a-movie-recommender-using-collaborative-filtering">Step #4: Train a Movie Recommender using Collaborative Filtering</h3>



<p>Training the SVD model requires only two lines of code. The first line creates an untrained model that uses probabilistic matrix factorization for dimensionality reduction. The second line fits this model to the training data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># train an SVD model
svd_model = SVD()
svd_model_trained = svd_model.fit(train_df)</pre></div>
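<p>To build intuition for what the SVD model learns: matrix factorization represents each user and each item as a small latent vector, and predicts a rating as their dot product. The following NumPy sketch is a simplified stand-in for the idea (not the actual Surprise implementation), trained with stochastic gradient descent on a handful of made-up ratings:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# toy ratings as (user, item, rating) triples; illustrative values only
ratings = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 5.0), (1, 2, 2.0), (2, 1, 4.5)]
n_users, n_items, n_factors = 3, 3, 2

# latent factor matrices, initialized with small random values
P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors

lr, reg = 0.01, 0.02  # learning rate and regularization strength
for epoch in range(500):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]               # prediction error
        pu, qi = P[u].copy(), Q[i].copy()   # update both from the old values
        P[u] += lr * (err * qi - reg * pu)  # gradient step on user factors
        Q[i] += lr * (err * pu - reg * qi)  # gradient step on item factors

# after training, the model roughly reconstructs the known rating of 4.0
print(round(float(P[0] @ Q[0]), 2))
```

<p>Surprise&#8217;s SVD does essentially this at scale, additionally learning per-user and per-item bias terms on top of the global rating mean.</p>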



<h3 class="wp-block-heading" id="h-step-5-evaluate-prediction-performance-using-cross-validation">Step #5: Evaluate Prediction Performance using Cross-Validation</h3>



<p>Next, it is time to validate the performance of our movie recommendation program. For this, we use k-fold cross-validation. As a reminder, cross-validation involves splitting the Dataset into different folds and then measuring the prediction performance based on each fold.</p>



<p>We can measure model performance using indicators such as the mean absolute error (MAE) or the mean squared error (MSE). The MAE is the average absolute difference between the predicted and the actual ratings. We chose this measure because it is easy to interpret.</p>
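<p>For intuition, the MAE can be computed by hand from a handful of predicted and actual ratings (made-up values for illustration):</p>

```python
import numpy as np

actual    = np.array([4.0, 3.5, 5.0, 2.0])
predicted = np.array([3.7, 4.0, 4.4, 2.5])

# mean absolute error: the average absolute difference
# between predicted and actual ratings
mae = np.mean(np.abs(predicted - actual))
print(round(float(mae), 3))  # 0.475
```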



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 10-fold cross validation 
cross_val_results = cross_validate(svd_model_trained, ratings_by_users, measures=['RMSE', 'MAE', 'MSE'], cv=10, verbose=False)
test_mae = cross_val_results['test_mae']

# mean absolute error per fold
df_test_mae = pd.DataFrame(test_mae, columns=['Mean Absolute Error'])
df_test_mae.index = np.arange(1, len(df_test_mae) + 1)
df_test_mae.sort_values(by='Mean Absolute Error', ascending=False).head(15)

# plot an overview of the performance per fold
plt.figure(figsize=(6,4))
sns.set_theme(style=&quot;whitegrid&quot;)
sns.barplot(y='Mean Absolute Error', x=df_test_mae.index, data=df_test_mae, color=&quot;b&quot;)
# plt.title('Mean Absolute Error')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4385" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/image-41-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/05/image-41.png" data-orig-size="598,366" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-41" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/05/image-41.png" src="https://www.relataly.com/wp-content/uploads/2021/05/image-41.png" alt="movie recommender performance" class="wp-image-4385" width="651" height="399" srcset="https://www.relataly.com/wp-content/uploads/2021/05/image-41.png 598w, https://www.relataly.com/wp-content/uploads/2021/05/image-41.png 300w, https://www.relataly.com/wp-content/uploads/2021/05/image-41.png 80w" sizes="(max-width: 651px) 100vw, 651px" /></figure>



<p>The chart above shows that the mean deviation of our predictions from the actual ratings is a little below 0.7. The result is not terrific, but acceptable for a first model. In addition, there are no significant differences in performance between the folds. Keep in mind that the MAE says little about possible outliers in the predictions. However, since ratings are bounded (0.5&#8211;5.0 stars), the influence of outliers is naturally limited.</p>
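<p>The last point is easy to check numerically: a single badly missed prediction moves the MAE far less than it moves a squared-error metric such as the RMSE. A small sketch with made-up per-rating errors:</p>

```python
import numpy as np

errors = np.array([0.5, 0.5, 0.5, 0.5])           # uniformly small misses
errors_outlier = np.array([0.5, 0.5, 0.5, 3.5])   # one large miss

def mae(e):
    return float(np.mean(np.abs(e)))

def rmse(e):
    return float(np.sqrt(np.mean(e ** 2)))

print(mae(errors), rmse(errors))                  # 0.5 0.5
print(mae(errors_outlier), rmse(errors_outlier))  # MAE 1.25, RMSE ~1.80
```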



<h3 class="wp-block-heading" id="h-step-6-generate-predictions">Step #6: Generate Predictions </h3>



<p>Finally, we will use our movie recommender to generate a list of suggested movies for a specific test user. The predictions will be based on the user&#8217;s previous movie ratings.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># predict ratings for a single user_id and for all movies
user_id = 400 # some test user from the ratings file

# create the predictions
pred_series= []
df_ratings_filtered = df_ratings[df_ratings['userId'] == user_id]

print(f'number of ratings: {df_ratings_filtered.shape[0]}')
for movie_id, name in zip(df_movies.index, df_movies['title']):
    # check if the user has already rated a specific movie from the list
    rating_real = df_ratings_filtered.query(f'movieId == {movie_id}')['rating'].values[0] if movie_id in df_ratings_filtered['movieId'].values else 0
    # generate the prediction
    rating_pred = svd_model_trained.predict(user_id, movie_id, rating_real, verbose=False)
    # add the prediction to the list of predictions
    pred_series.append([movie_id, name, rating_pred.est, rating_real])

# print the results
df_recommendations = pd.DataFrame(pred_series, columns=['movieId', 'title', 'predicted_rating', 'actual_rating'])
df_recommendations.sort_values(by='predicted_rating', ascending=False).head(15)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		movieId	title								predicted_rating	actual_rating
4234	4993	5 Card Stud							4.721481			0.0
3194	318		The Million Dollar Hotel			4.648623			5.0
442		858		Sleepless in Seattle				4.506962			0.0
236		527		Once Were Warriors					4.484758			4.0
2426	926		Galaxy Quest						4.465653			0.0
6532	905		Pandora's Box						4.452688			0.0
710		260		The 39 Steps						4.390318			0.0
8787	3683	Flags of Our Fathers				4.386821			0.0
5400	899		Broken Blossoms						4.384220			0.0
5068	296		Terminator 3: Rise of the Machines	4.383057			4.0
372		2019	Hard Target							4.365834			0.0
7254	919		Blood: The Last Vampire				4.356012			0.0
3295	4973	Under the Sand						4.355750			0.0
3869	194		Amélie								4.353614			0.0
8631	1948	Crank								4.344286			0.0</pre></div>



<p>Alternatively, we can predict how well a specific user will rate a movie by handing the user_id and the movie_id to the model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># predict ratings for the combination of user_id and movie_id
user_id = 217 # some test user from the ratings file
movie_id = 4002
rating_real = df_ratings.query(f'movieId == {movie_id} &amp; userId == {user_id}')['rating'].values[0]
movie_title = df_movies[df_movies.index == movie_id]['title'].values[0]

print(f'Movie title: {movie_title}')
print(f'Actual rating: {rating_real}')

# predict and show the result
rating_pred = svd_model_trained.predict(user_id, movie_id, rating_real, verbose=True)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Movie title: Toy Story
Actual rating: 4.5
user: 217        item: 4002       r_ui = 4.50   est = 3.98   {'was_impossible': False}</pre></div>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>Congratulations on learning how to develop a movie recommendation system in Python! In this article, you learned about the SVD model, which uses matrix factorization and collaborative filtering to predict movie ratings for a given user. We also demonstrated how to perform cross-validation on a movie dataset and use the model to generate movie recommendations.</p>



<p>Several other approaches can be used to develop a movie recommendation system, <a href="https://wp.me/pbUnJO-17g" target="_blank" rel="noreferrer noopener">including content-based filtering</a>, which uses features of the movies themselves (such as genre, director, and cast) to make recommendations, and hybrid systems, which combine the strengths of multiple approaches.</p>
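<p>To give a taste of the content-based alternative, one could compare movies purely by their genre overlap. The following sketch scores similarity with the Jaccard index, using the genre lists we saw in the cleaned movie data (a deliberately simplistic stand-in for a real content-based model):</p>

```python
# content-based similarity sketch: Jaccard overlap between genre sets
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# genre lists as shown in the cleaned movie metadata above
movies = {
    'Toy Story': ['Animation', 'Comedy', 'Family'],
    'Jumanji': ['Adventure', 'Fantasy', 'Family'],
    'Grumpier Old Men': ['Romance', 'Comedy'],
}

# rank the other movies by genre similarity to Toy Story
query = movies['Toy Story']
scores = {title: jaccard(query, genres)
          for title, genres in movies.items() if title != 'Toy Story'}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

<p>A real content-based system would fold many more features (cast, director, plot descriptions) into the similarity score.</p>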



<p>Regardless of the approach used, building a movie recommendation system can be a useful tool for recommending movies to users based on their past preferences and can help increase engagement and satisfaction with a movie streaming service or website.</p>



<p>If you like the post, please let me know in the comments, and don&#8217;t forget to subscribe to our <a href="https://twitter.com/home" target="_blank" rel="noreferrer noopener">Twitter account</a> to stay up to date on upcoming articles.</p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p>Below are some resources for further reading on recommender systems and content-based models.</p>



<p><strong>Books</strong></p>



<ol class="wp-block-list"><li><a href="https://amzn.to/3T3Sl2V" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2016) Recommender Systems: The Textbook</a></li><li><a href="https://amzn.to/3D0P8eX" target="_blank" rel="noreferrer noopener">Kin Falk (2019) Practical Recommender Systems</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p><strong>Articles</strong></p>



<ol class="wp-block-list"><li><a href="https://surprise.readthedocs.io/en/stable/getting_started.html" target="_blank" rel="noreferrer noopener">Getting started with Surprise</a></li><li><a href="https://onespire.hu/sap-news-en/history-of-recommender-systems/" target="_blank" rel="noreferrer noopener">About the history of Recommender Systems</a></li><li><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener">Singular value decomposition vs. matrix factorization</a></li><li><a href="https://proceedings.neurips.cc/paper/2007/file/d7322ed717dedf1eb4e6e52a37ea7bcd-Paper.pdf" target="_blank" rel="noreferrer noopener">Probabilistic Matrix Factorization</a></li></ol>
<p>The post <a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/">Build a High-Performing Movie Recommender System using Collaborative Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4376</post-id>	</item>
	</channel>
</rss>
