<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fraud Detection Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/use-case/anomaly-detection/fraud-detection/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/use-case/anomaly-detection/fraud-detection/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Sat, 27 May 2023 10:18:05 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Fraud Detection Archives - relataly.com</title>
	<link>https://www.relataly.com/category/use-case/anomaly-detection/fraud-detection/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</title>
		<link>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/</link>
					<comments>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Wed, 16 Jun 2021 09:33:00 +0000</pubDate>
				<category><![CDATA[Anomaly Detection]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Fraud Detection]]></category>
		<category><![CDATA[K-Nearest Neighbors (KNN)]]></category>
		<category><![CDATA[Local Outlier Factor]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Isolation Forest]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in Business]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Unsupervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=4233</guid>

					<description><![CDATA[<p>Credit card fraud has become one of the most common use cases for anomaly detection systems. The number of fraud attempts has risen sharply, resulting in billions of dollars in losses. Early detection of fraud attempts with machine learning is therefore becoming increasingly important. In this article, we take on the fight against international credit ... <a title="Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud" class="read-more" href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/" aria-label="Read more about Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud">Read more</a></p>
<p>The post <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/">Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Credit card fraud has become one of the most common use cases for anomaly detection systems. The number of fraud attempts has risen sharply, resulting in billions of dollars in losses. Early detection of fraud attempts with <a href="https://www.relataly.com/category/machine-learning/" target="_blank" rel="noreferrer noopener">machine learning</a> is therefore becoming increasingly important. In this article, we take on the fight against international credit card fraud and develop a multivariate anomaly detection model in Python that spots fraudulent payment transactions. The model will use the Isolation Forest algorithm, one of the most effective techniques for detecting outliers. Isolation Forests are so-called ensemble models. They have various hyperparameters with which we can optimize model performance. However, we will not do this manually but instead, use grid search for hyperparameter tuning. To assess the performance of our model, we will also compare it with other models.</p>



<p class="wp-block-paragraph">The remainder of this article is structured as follows: We start with a brief introduction to anomaly detection and look at the Isolation Forest algorithm. Equipped with these theoretical foundations, we then turn to the practical part, in which we train and validate an isolation forest that detects credit card fraud. We use an unsupervised learning approach, where the model learns to distinguish regular from suspicious card transactions. We will train our model on a public dataset from Kaggle that contains credit card transactions. Finally, we will compare the performance of our model against two nearest neighbor algorithms (LOF and <a href="https://www.relataly.com/category/machine-learning-algorithms/k-nearest-neighbors-knn/" target="_blank" rel="noreferrer noopener">KNN</a>).</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="512" height="512" data-attachment-id="12702" data-permalink="https://www.relataly.com/financial-crime-serious-business-python-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="financial crime serious business python relataly midjourney-min" data-image-description="&lt;p&gt;Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney&lt;/p&gt;
" data-image-caption="&lt;p&gt;Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min-512x512.png" alt="Financial crime is a real problem, although its usually not commited by cats. Image created with Midjourney" class="wp-image-12702" srcset="https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/financial-crime-serious-business-python-relataly-midjourney-min.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Financial crime is a real problem, although cats are relatively seldom involved. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-multivariate-anomaly-detection">Multivariate Anomaly Detection</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before we take a closer look at the use case and our unsupervised approach, let&#8217;s briefly discuss anomaly detection. Anomaly detection deals with finding points that deviate from legitimate data regarding their mean or median in a distribution. In machine learning, the term is often used synonymously with outlier detection. </p>



<p class="wp-block-paragraph">Some anomaly detection models work with a single feature (univariate data), for example, in monitoring electronic signals. However, most anomaly detection models use multivariate data, which means they have two (bivariate) or more (multivariate) features. They find a wide range of applications, including the following:</p>



<ul class="wp-block-list">
<li>Predictive Maintenance and Detection of Malfunctions and Decay</li>



<li>Detection of Retail Bank Credit Card Fraud</li>



<li>Detection of Pricing Errors</li>



<li>Cyber Security, for example, Network Intrusion Detection</li>



<li>Detecting Fraudulent Market Behavior in Investment Banking</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="410" data-attachment-id="9019" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" data-orig-size="1184,474" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="identifying anomalous datapoints in two dimensions, credit card fraud detection" data-image-description="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-image-caption="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output4-1024x410.png" alt="identifying anomalous data points in two dimensions, credit card fraud detection" class="wp-image-9019" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1184w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Identifying anomalous data points in two dimensions in credit card fraud detection</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-unsupervised-algorithms-for-anomaly-detection">Unsupervised Algorithms for Anomaly Detection</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Outlier detection is a classification problem. However, the field is more diverse as outlier detection is a problem we can approach with supervised and unsupervised machine learning techniques. It would go beyond the scope of this article to explain the multitude of outlier detection techniques. Still, the following chart provides a good overview of standard algorithms that learn unsupervised.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="7649" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-21-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png" data-orig-size="1983,994" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-21" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-21-1024x513.png" alt="Unsupervised Algorithms for Anomaly Detection. Outlier detection using random isolation forests" class="wp-image-7649" width="759" height="380" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/image-21.png 1983w" sizes="(max-width: 759px) 100vw, 759px" /><figcaption class="wp-element-caption">Unsupervised Algorithms for Anomaly Detection. </figcaption></figure>



<p class="wp-block-paragraph">A prerequisite for supervised learning is that we have information about which data points are outliers and belong to regular data. In credit card fraud detection, this information is available because banks can validate with their customers whether a suspicious transaction is a fraud or not. In many other outlier detection cases, it remains unclear which outliers are legitimate and which are just noise or other uninteresting events in the data. </p>



<p class="wp-block-paragraph">Whether we know which classes in our dataset are outliers and which are not affects the selection of possible algorithms we could use to solve the outlier detection problem. Unsupervised learning techniques are a natural choice if the class labels are unavailable. And if the class labels are available, we could use both unsupervised and supervised learning algorithms.</p>



<p class="wp-block-paragraph">In the following, we will focus on Isolation Forests.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-the-isolation-forest-iforest-algorithm">The Isolation Forest (&#8220;iForest&#8221;) Algorithm</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Isolation forests (sometimes called iForests) are among the most powerful techniques for identifying anomalies in a dataset. They belong to the group of so-called ensemble models. The predictions of ensemble models do not rely on a single model. Instead, they combine the results of multiple independent models (decision trees). Nevertheless, isolation forests should not be confused with traditional <a href="https://www.relataly.com/category/machine-learning-algorithms/decision-forests/" target="_blank" rel="noreferrer noopener">random decision forests</a>. While random forests predict given class labels (supervised learning), isolation forests learn to distinguish outliers from inliers (regular data) in an unsupervised learning process.</p>



<p class="wp-block-paragraph">An Isolation Forest contains multiple independent isolation trees. The algorithm invokes a process that recursively divides the training data at random points to isolate data points from each other to build an Isolation Tree. The number of partitions required to isolate a point tells us whether it is an anomalous or regular point. The underlying assumption is that random splits can isolate an anomalous data point much sooner than nominal ones.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/" target="_blank" rel="noreferrer noopener">Stock Market Prediction using Multivariate Time Series Data</a> </p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4616" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-40-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png" data-orig-size="1106,461" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-40" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-40-1024x427.png" alt="Isolation Tree and Isolation Forest (Tree Ensemble), Outlier detection using random isolation forests" class="wp-image-4616" width="754" height="314" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-40.png 1106w" sizes="(max-width: 754px) 100vw, 754px" /><figcaption class="wp-element-caption">Isolation Tree and Isolation Forest (Tree Ensemble)</figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-how-the-isolation-forest-algorithm-works">How the Isolation Forest Algorithm Works</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The illustration below shows exemplary training of an Isolation Tree on univariate data, i.e., with only one feature. The algorithm has already split the data at five random points between the minimum and maximum values of a random sample. The isolated points are colored in purple. The example below has taken two partitions to isolate the point on the far left. The other purple points were separated after 4 and 5 splits.</p>



<p class="wp-block-paragraph">The partitioning process ends when the algorithm has isolated all points from each other or when all remaining points have equal values. The algorithm has calculated and assigned an outlier score to each point at the end of the process, based on how many splits it took to isolate it.</p>



<p class="wp-block-paragraph">When using an isolation forest model on unseen data to detect outliers, the algorithm will assign an anomaly score to the new data points. These scores will be calculated based on the ensemble trees we built during model training.</p>



<p class="wp-block-paragraph">So how does this process work when our dataset involves multiple features? For multivariate anomaly detection, partitioning the data remains almost the same. The significant difference is that the algorithm selects a random feature in which the partitioning will occur before each partitioning. Consequently, multivariate isolation forests split the data along multiple dimensions (features).</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4615" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-39-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" data-orig-size="685,430" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-39" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png" alt="Exemplary partitioning process of an isolation tree (5 Steps); Outlier detection using random isolation forests;  Outlier detection using isolation forests" class="wp-image-4615" width="375" height="235" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 685w, https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-39.png 80w" sizes="(max-width: 375px) 100vw, 375px" /><figcaption class="wp-element-caption">Exemplary partitioning process of an isolation tree (5 Steps)</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading" id="h-credit-card-fraud-detection-using-isolation-forests">Credit Card Fraud Detection using Isolation Forests</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Monitoring transactions has become a crucial task for financial institutions. In 2019 alone, more than 271,000 cases of credit card theft were reported in the U.S., causing billions of dollars in losses and making credit card fraud one of the most common types of identity theft. The vast majority of fraud cases are attributable to organized crime, which often specializes in this particular crime.</p>



<p class="wp-block-paragraph">Anything that deviates from the customer&#8217;s normal payment behavior can make a transaction suspicious, including an unusual location, time, or country in which the customer conducted the transaction. Credit card providers use similar anomaly detection systems to monitor their customers&#8217; transactions and look for potential fraud attempts. They can halt the transaction and inform their customer as soon as they detect a fraud attempt. We train an Isolation Forest algorithm for credit card fraud detection using Python in the following.</p>



<p class="wp-block-paragraph">Now that we have established the context for our machine learning problem, we can begin implementing an anomaly detection model in Python.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_60179c-c7"><a class="kb-button kt-button button kb-btn_ef16c1-1b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/04%20Anomaly%20Detection/026%20Credit%20Card%20Fraud%20Detection%20with%20Isolation%20Forests.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_adf11b-23 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="512" height="496" data-attachment-id="12451" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" data-orig-size="512,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" src="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png" alt="" class="wp-image-12451" srcset="https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png 512w, https://www.relataly.com/wp-content/uploads/2023/02/credit-card-data-analytics-machine-learning-cyber-crime-fraud-detection.png 300w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Each application of a credit card creates a new data point to review. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> and <a data-type="URL" data-id="https://seaborn.pydata.org/" href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn</a> for visualization. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-dataset-credit-card-transactions">Dataset: Credit Card Transactions</h3>



<p class="wp-block-paragraph">In the following, we will work with a public dataset containing anonymized credit card transactions made by European cardholders in September 2013. You can download the dataset from <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud" target="_blank" rel="noreferrer noopener">Kaggle.com.</a> </p>



<p class="wp-block-paragraph">The dataset contains 28 features (V1-V28) obtained from the source data using Principal Component Analysis (PCA). In addition, the data includes the date and the amount of the transaction. </p>



<p class="wp-block-paragraph">Transactions are labeled fraudulent or genuine, with 492 fraudulent cases out of 284,807 transactions. The positive class (frauds) accounts for only 0.172% of all credit card transactions, so the classes are highly unbalanced.</p>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p class="wp-block-paragraph">In the following, we will go through several steps of training an Anomaly detection model for credit card fraud. We will carry out several activities, such as:</p>



<ol class="wp-block-list">
<li>Loading and preprocessing the data: this involves cleaning, transforming, and preparing the data for analysis, in order to make it suitable for use with the isolation forest algorithm.</li>



<li>Feature engineering: this involves extracting and selecting relevant features from the data, such as transaction amounts, merchant categories, and time of day, in order to create a set of inputs for the anomaly detection algorithm.</li>



<li>Model training: We will train several machine learning models on different algorithms (incl. the isolation forest) on the preprocessed and engineered data. The models will learn the normal patterns and behaviors in credit card transactions. This activity includes hyperparameter tuning.</li>



<li>Model evaluation and testing: this involves evaluating the performance of the trained model on a test dataset in order to assess its accuracy, precision, recall, and other metrics and to identify any potential issues or improvements. As part of this activity, we compare the performance of the isolation forest to other models.</li>
</ol>



<p class="wp-block-paragraph">We begin by setting up imports and loading the data into our Python project. Then we&#8217;ll quickly verify that the dataset looks as expected. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from datetime import date, timedelta, datetime
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor, KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, plot_confusion_matrix

# The Data can be downloaded from Kaggle.com: https://www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv
path = 'data/credit-card-transactions/'
df = pd.read_csv(f'{path}creditcard.csv')
df</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		Time	V1			V2			V3			V4			V5			V6			V7			V8			V9			...	V21	V22	V23	V24	V25	V26	V27	V28	Amount	Class
0		0.0		-1.359807	-0.072781	2.536347	1.378155	-0.338321	0.462388	0.239599	0.098698	0.363787	...	-0.018307	0.277838	-0.110474	0.066928	0.128539	-0.189115	0.133558	-0.021053	149.62	0
1		0.0		1.191857	0.266151	0.166480	0.448154	0.060018	-0.082361	-0.078803	0.085102	-0.255425	...	-0.225775	-0.638672	0.101288	-0.339846	0.167170	0.125895	-0.008983	0.014724	2.69	0
2		1.0		-1.358354	-1.340163	1.773209	0.379780	-0.503198	1.800499	0.791461	0.247676	-1.514654	...	0.247998	0.771679	0.909412	-0.689281	-0.327642	-0.139097	-0.055353	-0.059752	378.66	0
3		1.0		-0.966272	-0.185226	1.792993	-0.863291	-0.010309	1.247203	0.237609	0.377436	-1.387024	...	-0.108300	0.005274	-0.190321	-1.175575	0.647376	-0.221929	0.062723	0.061458	123.50	0
4		2.0		-1.158233	0.877737	1.548718	0.403034	-0.407193	0.095921	0.592941	-0.270533	0.817739	...	-0.009431	0.798278	-0.137458	0.141267	-0.206010	0.502292	0.219422	0.215153	69.99	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
284802	172786.0	-11.881118	10.071785	-9.834783	-2.066656	-5.364473	-2.606837	-4.918215	7.305334	1.914428	...	0.213454	0.111864	1.014480	-0.509348	1.436807	0.250034	0.943651	0.823731	0.77	0
284803	172787.0	-0.732789	-0.055080	2.035030	-0.738589	0.868229	1.058415	0.024330	0.294869	0.584800	...	0.214205	0.924384	0.012463	-1.016226	-0.606624	-0.395255	0.068472	-0.053527	24.79	0
284804	172788.0	1.919565	-0.301254	-3.249640	-0.557828	2.630515	3.031260	-0.296827	0.708417	0.432454	...	0.232045	0.578229	-0.037501	0.640134	0.265745	-0.087371	0.004455	-0.026561	67.88	0
284805	172788.0	-0.240440	0.530483	0.702510	0.689799	-0.377961	0.623708	-0.686180	0.679145	0.392087	...	0.265245	0.800049	-0.163298	0.123205	-0.569159	0.546668	0.108821	0.104533	10.00	0
284806	172792.0	-0.533413	-0.189733	0.703337	-0.506271	-0.012546	-0.649617	1.577006	-0.414650	0.486180	...	0.261057	0.643078	0.376777	0.008797	-0.473649	-0.818267	-0.002415	0.013649	217.00	0</pre></div>



<p class="wp-block-paragraph">Everything should look good so that we can continue.</p>



<h3 class="wp-block-heading" id="h-step-2-data-exploration">Step #2: Data Exploration</h3>



<p class="wp-block-paragraph">The purpose of data exploration in anomaly detection is to gain a better understanding of the data and the underlying patterns and trends that it contains. This can help to identify potential anomalies or outliers in the data and to determine the appropriate approaches and algorithms for detecting them.</p>



<p class="wp-block-paragraph">In the following, we will create histograms that visualize the distribution of the different features. </p>



<h4 class="wp-block-heading" id="h-2-1-features">2.1 Features</h4>



<p class="wp-block-paragraph">First, we will create a series of frequency histograms for our dataset&#8217;s features (V1 &#8211; V28). We will subsequently take a different look at the Class, Time, and Amount so that we can drop them at the moment. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create histograms on all features
df_hist = df_base.drop(['Time','Amount', 'Class'], 1)
df_hist.hist(figsize=(20,20), bins = 50, color = &quot;c&quot;, edgecolor='black')
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9013" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output1-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png" data-orig-size="1170,1131" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output1-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output1-1-1024x990.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Feature frequency distributions on credit card data" class="wp-image-9013" width="1011" height="976" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output1-1.png 1170w" sizes="(max-width: 1011px) 100vw, 1011px" /></figure>



<p class="wp-block-paragraph">Next, we will look at the correlation between the 28 features. We expect the features to be uncorrelated due to the use of PCA. Let&#8217;s verify that by creating a heatmap on their correlation values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># feature correlation
f_cor = df_hist.corr()
sns.heatmap(f_cor)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9014" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output2-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" data-orig-size="782,251" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output2-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png" alt="" class="wp-image-9014" width="796" height="255" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 782w, https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output2-1.png 768w" sizes="(max-width: 796px) 100vw, 796px" /></figure>



<p class="wp-block-paragraph">As we expected, our features are uncorrelated. </p>



<h4 class="wp-block-heading" id="h-2-2-class-labels">2.2 Class Labels</h4>



<p class="wp-block-paragraph">Next, let&#8217;s print an overview of the class labels to understand better how balanced the two classes are.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot the balance of class labels
fig1, ax1 = plt.subplots(figsize=(14, 7))
plt.pie(df[['Class']].value_counts(), explode=[0,0.1], labels=[0,1], autopct='%1.2f%%', shadow=True, startangle=45)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9015" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" data-orig-size="394,394" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output6.png" alt="balance of class labels in the credit card fraud dataset" class="wp-image-9015" width="494" height="494" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output6.png 394w, https://www.relataly.com/wp-content/uploads/2022/07/output6.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output6.png 150w" sizes="(max-width: 494px) 100vw, 494px" /></figure>



<p class="wp-block-paragraph">We see that the data set is highly unbalanced. While this would constitute a problem for traditional classification techniques, it is a predestined use case for outlier detection algorithms like the Isolation Forest.</p>



<h4 class="wp-block-heading" id="h-2-3-time-and-amount">2.3 Time and Amount</h4>



<p class="wp-block-paragraph">Finally, we will create some plots to gain insights into time and amount. Let&#8217;s first have a look at the time variable. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot istribution of the Time variable, which contains transaction data for two days
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(14, 7))
sns.histplot(data=df_base[df_base['Class'] == 0], x='Time', kde=True, ax=ax[0])
sns.histplot(data=df_base[df_base['Class'] == 1], x='Time', kde=True, ax=ax[1])
plt.show()</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9016" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" data-orig-size="842,423" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output3.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: histogram of credit card transactions" class="wp-image-9016" width="933" height="469" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output3.png 842w, https://www.relataly.com/wp-content/uploads/2022/07/output3.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output3.png 768w" sizes="(max-width: 933px) 100vw, 933px" /></figure>



<p class="wp-block-paragraph">The time frame of our dataset covers two days, which reflects the distribution graph well. We can see that most transactions happen during the day &#8211; which is only plausible.</p>



<p class="wp-block-paragraph">Next, let&#8217;s examine the correlation between transaction size and fraud cases. To do this, we create a scatterplot that distinguishes between the two classes.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot time against amount
x = df_base['Amount']
y = df_base['Time']
fig, ax = plt.subplots(figsize=(20, 7))
ax.set(xlim=(0, 1500))
sns.scatterplot(data=df_base[df_base['Class']==0][::15], x=x, y=y, hue=&quot;Class&quot;, palette=[&quot;#BECEE9&quot;], alpha=.5, ax=ax)
sns.scatterplot(data=df_base[df_base['Class']==1][::15], x=x, y=y, hue=&quot;Class&quot;, palette=[&quot;#EF1B1B&quot;], zorder=100, ax=ax)
plt.legend(['no fraud', 'fraud'], loc='lower right')
fig.suptitle('Transaction Amount over Time split by Class')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9019" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/output4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" data-orig-size="1184,474" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="identifying anomalous datapoints in two dimensions, credit card fraud detection" data-image-description="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-image-caption="&lt;p&gt;identifying anomalous datapoints in two dimensions, credit card fraud detection&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/output4.png" src="https://www.relataly.com/wp-content/uploads/2022/07/output4-1024x410.png" alt="Transaction Amount over Time splot by Class; Credit card fraud detection with isolation forests" class="wp-image-9019" width="1053" height="422" srcset="https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/output4.png 1184w" sizes="(max-width: 1053px) 100vw, 1053px" /></figure>



<p class="wp-block-paragraph">The scatterplot provides the insight that suspicious amounts tend to be relatively low. In other words, there is some inverse correlation between class and transaction amount. </p>



<h3 class="wp-block-heading" id="h-step-3-preprocessing">Step #3: Preprocessing</h3>



<p class="wp-block-paragraph">Now that we have a rough idea of the data, we will prepare it for training the model. For the training of the isolation forest, we drop the class label from the base dataset and then divide the data into separate datasets for training (70%) and testing (30%). We do not have to normalize or standardize the data when using a decision tree-based algorithm. </p>



<p class="wp-block-paragraph">We will use all features from the dataset. So our model will be a multivariate anomaly detection model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Separate the classes from the train set
df_classes = df_base['Class']
df_train = df_base.drop(['Class'], axis=1)

# split the data into train and test 
X_train, X_test, y_train, y_test = train_test_split(df_train, df_classes, test_size=0.30, random_state=42)</pre></div>



<h3 class="wp-block-heading" id="h-step-4-model-training">Step #4: Model Training</h3>



<p class="wp-block-paragraph">Once we have prepared the data, it&#8217;s time to start training the Isolation Forest. However, to compare the performance of our model with other algorithms, we will train several different models. In total, we will prepare and compare the following five outlier detection models:</p>



<ul class="wp-block-list">
<li>Isolation Forest (default)</li>



<li>Isolation Forest (hypertuned)</li>



<li>Local Outlier Factor (default)</li>



<li>K Neared Neighbour (default)</li>



<li>K Nearest Neighbour (hypertuned)</li>
</ul>



<p class="wp-block-paragraph">For hyperparameter tuning of the models, we use Grid Search.</p>



<h4 class="wp-block-heading" id="h-4-1-train-an-isolation-forest">4.1 Train an Isolation Forest</h4>



<p class="wp-block-paragraph">Next, we train our isolation forest algorithm. An isolation forest is a type of machine learning algorithm for anomaly detection. It is a variant of the random forest algorithm, which is a widely-used ensemble learning method that uses multiple decision trees to make predictions.</p>



<p class="wp-block-paragraph">The isolation forest algorithm works by randomly selecting a feature and a split value for the feature, and then using the split value to divide the data into two subsets. This process is repeated for each decision tree in the ensemble, and the trees are combined to make a final prediction.</p>



<p class="wp-block-paragraph">The isolation forest algorithm is designed to be efficient and effective for detecting anomalies in high-dimensional datasets. It has a number of advantages, such as its ability to handle large and complex datasets, and its high accuracy and low false positive rate. It is widely used in a variety of applications, such as fraud detection, intrusion detection, and anomaly detection in manufacturing.</p>



<h5 class="wp-block-heading" id="h-4-1-1-isolation-forest-baseline">4.1.1 Isolation Forest (baseline)</h5>



<p class="wp-block-paragraph">First, we train a baseline model. A baseline model is a simple or reference model used as a starting point for evaluating the performance of more complex or sophisticated models in machine learning. It provides a baseline or benchmark for comparison, which allows us to assess the relative performance of different models and to identify which models are more accurate, effective, or efficient.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># train the model on the nominal train set
model_isf = IsolationForest().fit(X_train)</pre></div>



<p class="wp-block-paragraph">We create a function to measure the performance of our baseline model and illustrate the results in a confusion matrix. Later, when we go into hyperparameter tuning, we can use this function to objectively compare the performance of more sophisticated models.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def measure_performance(model, X_test, y_true, map_labels):
    # predict on testset
    df_pred_test = X_test.copy()
    #df_pred_test['Class'] = y_test
    df_pred_test['Pred'] = model.predict(X_test)
    if map_labels:
        df_pred_test['Pred'] = df_pred_test['Pred'].map({1: 0, -1: 1})
    #df_pred_test['Outlier_Score'] = model.decision_function(X_test)

    # measure performance
    #y_true = df_pred_test['Class']
    x_pred = df_pred_test['Pred'] 
    matrix = confusion_matrix(x_pred, y_true)

    sns.heatmap(pd.DataFrame(matrix, columns = ['Actual', 'Predicted']),
                xticklabels=['Regular [0]', 'Fraud [1]'], 
                yticklabels=['Regular [0]', 'Fraud [1]'], 
                annot=True, fmt=&quot;d&quot;, linewidths=.5, cmap=&quot;YlGnBu&quot;)
    plt.ylabel('Predicted')
    plt.xlabel('Actual')
    
    print(classification_report(x_pred, y_true))
    
    model_score = score(x_pred, y_true,average='macro')
    print(f'f1_score: {np.round(model_score[2]*100, 2)}%')
    
    return model_score

model_name = 'Isolation Forest (baseline)'
print(f'{model_name} model')

map_labels = True
model_score = measure_performance(model_isf, X_test, y_test, map_labels)

performance_df = pd.DataFrame().append({'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}, ignore_index=True)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4585" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-31-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" data-orig-size="668,699" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-31" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png" alt="confusion matrix for our credit card fraud detection algorithm" class="wp-image-4585" width="490" height="514" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-31.png 668w, https://www.relataly.com/wp-content/uploads/2021/06/image-31.png 287w" sizes="(max-width: 490px) 100vw, 490px" /></figure>



<h5 class="wp-block-heading" id="h-4-1-2-isolation-forest-hypertuning">4.1.2 Isolation Forest (Hypertuning)</h5>



<p class="wp-block-paragraph">Next, we will train another Isolation Forest Model using grid search hyperparameter tuning to test different parameter configurations. </p>



<p class="wp-block-paragraph">The hyperparameters of an isolation forest include:</p>



<ul class="wp-block-list">
<li><strong>n_estimators</strong>: The number of decision trees in the forest.</li>



<li><strong>max_samples</strong>: The number of samples to draw from the dataset to train each decision tree.</li>



<li><strong>contamination: </strong>The expected proportion of anomalies in the dataset.</li>



<li><strong>max_features: </strong>The number of features to consider when choosing the split points in the decision trees.</li>



<li><strong>bootstrap: </strong>Whether or not to use bootstrap sampling when drawing samples to train the decision trees.</li>
</ul>



<p class="wp-block-paragraph">These hyperparameters can be adjusted to improve the performance of the isolation forest. The optimal values for these hyperparameters will depend on the specific characteristics of the dataset and the task at hand, which is why we require several experiments.</p>



<p class="wp-block-paragraph">The code below will evaluate the different parameter configurations based on their f1_score and automatically choose the best-performing model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define the parameter grid
n_estimators=[50, 100]
max_features=[1.0, 5, 10]
bootstrap=[True]
param_grid = dict(n_estimators=n_estimators, max_features=max_features, bootstrap=bootstrap)

# Build the gridsearch
model_isf = IsolationForest(n_estimators=n_estimators, 
                            max_features=max_features, 
                            contamination=contamination_rate, 
                            bootstrap=False, 
                            n_jobs=-1)

# Define an f1_scorer
f1sc = make_scorer(f1_score, average='macro')

grid = GridSearchCV(estimator=model_isf, param_grid=param_grid, cv = 3, scoring=f1sc)
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print(&quot;Best: {0}, using {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'KNN (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = True # if True - maps 1 to 0 and -1 to 1 - not required for scikit-learn knn models
model_score = measure_performance(best_model, X_test, y_test, map_labels)
performance_df = performance_df.append({'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}, ignore_index=True)
results_df</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Best: [0.61083219 0.55718259 0.55912644 0.52670328 0.5317127 ], using {'n_neighbors': 1}
KNN (tuned) model
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85385
           1       0.21      0.48      0.29        58

    accuracy                           1.00     85443
   macro avg       0.60      0.74      0.64     85443
weighted avg       1.00      1.00      1.00     85443

f1_score: 64.39%</pre></div>



<h4 class="wp-block-heading" id="h-4-2-lof-model">4.2 LOF Model</h4>



<p class="wp-block-paragraph">We train the Local Outlier Factor Model using the same training data and evaluation procedure. The local outlier factor (LOF) is a measure of the local deviation of a data point with respect to its neighbors. It is used to identify points in a dataset that are significantly different from their surrounding points and that may therefore be considered outliers. </p>



<p class="wp-block-paragraph">The LOF is a useful tool for detecting outliers in a dataset, as it considers the local context of each data point rather than the global distribution of the data. This makes it more robust to outliers that are only significant within a specific region of the dataset. However, isolation forests can often outperform LOF models.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a tuned local outlier factor model
model_lof = LocalOutlierFactor(n_neighbors=3, contamination=contamination_rate, novelty=True)
model_lof.fit(X_train)

# Evaluate model performance
model_name = 'LOF (baseline)'
print(f'{model_name} model')

map_labels = True 
model_score = measure_performance(model_lof, X_test, y_test, map_labels)
performance_df = performance_df.append({'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}, ignore_index=True)</pre></div>



<h4 class="wp-block-heading" id="h-4-3-knn-model">4.3 KNN Model</h4>



<p class="wp-block-paragraph">Next, we train the KNN models. KNN is a type of machine learning algorithm for classification and regression. It is a type of instance-based learning, which means that it stores and uses the training data instances themselves to make predictions, rather than building a model that summarizes or generalizes the data.</p>



<p class="wp-block-paragraph">Below we add two K-Nearest Neighbor models to our list. We use the default parameter hyperparameter configuration for the first model. The second model will most likely perform better because we optimize its hyperparameters using the grid search technique.</p>



<h5 class="wp-block-heading" id="h-4-3-1-knn-default">4.3.1 KNN (default)</h5>



<p class="wp-block-paragraph">First, we train the default model using the same training data as before. By experimenting with different values of this parameter, you can try to identify the optimal number of neighbors that maximize the model&#8217;s performance on the given dataset. This approach could help to achieve better results compared to the default settings of the KNN algorithm, which may not be the most appropriate for the specific dataset we are working with.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Train a KNN Model
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X=X_train, y=y_train)

# Evaluate model performance
model_name = 'KNN (baseline)'
print(f'{model_name} model')

map_labels = False # if True - maps 1 to 0 and -1 to 1 - set to False for classification models (e.g., KNN)
model_score = measure_performance(model_knn, X_test, y_test, map_labels)
performance_df = performance_df.append({'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}, ignore_index=True)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4595" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-35-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" data-orig-size="533,532" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-35" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Confusion matrix for our credit card fraud detection algorithm" class="wp-image-4595" width="586" height="585" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 533w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 150w, https://www.relataly.com/wp-content/uploads/2021/06/image-35.png 120w" sizes="(max-width: 586px) 100vw, 586px" /></figure>



<h5 class="wp-block-heading" id="h-4-3-1-knn-hypertuned">4.3.1 KNN (hypertuned)</h5>



<p class="wp-block-paragraph">In the next step, we will train a second KNN model to improve its performance by fine-tuning its hyperparameters. Despite having only a few parameters, hyperparameter tuning can enhance the model&#8217;s ability to make accurate predictions. In this case, we will concentrate on optimizing the number of nearest neighbors considered in the KNN algorithm. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define hypertuning parameters
n_neighbors=[1, 2, 3, 4, 5]
param_grid = dict(n_neighbors=n_neighbors)

# Build the gridsearch
model_knn = KNeighborsClassifier(n_neighbors=n_neighbors)
grid = GridSearchCV(estimator=model_knn, param_grid=param_grid, cv = 5)
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print(&quot;Best: {0}, using {1}&quot;.format(grid_results.cv_results_['mean_test_score'], grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'KNN (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = False # if True - maps 1 to 0 and -1 to 1 - set to False for classification models (e.g., KNN)
model_score = measure_performance(best_model, X_test, y_test, map_labels)
performance_df = performance_df.append({'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}, ignore_index=True)
results_df</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4596" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-36-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" data-orig-size="379,262" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-36" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png" alt="confusion matrix for our credit card fraud detection algorithm" class="wp-image-4596" width="521" height="360" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-36.png 379w, https://www.relataly.com/wp-content/uploads/2021/06/image-36.png 300w" sizes="(max-width: 521px) 100vw, 521px" /></figure>



<h3 class="wp-block-heading" id="h-step-5-measuring-and-comparing-performance">Step #5: Measuring and Comparing Performance</h3>



<p class="wp-block-paragraph">Finally, we will compare the performance of our models with a bar chart that shows the f1_score, precision, and recall. If you want to learn more about classification performance, <a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener">this tutorial discusses the different metrics in more detail</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">print(performance_df)

performance_df = performance_df.sort_values('model_name')

fig, ax = plt.subplots(figsize=(12, 4))
tidy = performance_df.melt(id_vars='model_name').rename(columns=str.title)
sns.barplot(y='Model_Name', x='Value', hue='Variable', data=tidy, ax=ax, palette='nipy_spectral', linewidth=1, edgecolor=&quot;w&quot;)
plt.title('Model Outlier Detection Performance (Macro)')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4679" data-permalink="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/image-54-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png" data-orig-size="1446,650" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-54" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-54-1024x460.png" alt="Multivariate Anomaly Detection on Time-Series Data in Python: Performance comparison between different algorithms for credit card fraud detection" class="wp-image-4679" width="982" height="441" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-54.png 1446w" sizes="(max-width: 982px) 100vw, 982px" /></figure>



<p class="wp-block-paragraph">All three metrics play an important role in evaluating performance because, on the one hand, we want to capture as many fraud cases as possible, but we also don&#8217;t want to raise false alarms too frequently. </p>



<ul class="wp-block-list">
<li>As we can see, the optimized Isolation Forest performs particularly well-balanced.</li>



<li>The default Isolation Forest has a high f1_score and detects many fraud cases but frequently raises false alarms.</li>



<li>The opposite is true for the KNN model. Only a few fraud cases are detected here, but the model is often correct when noticing a fraud case.</li>



<li>The default LOF model performs slightly worse than the other models. Compared to the optimized Isolation Forest, it performs worse in all three metrics.</li>
</ul>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">Credit card fraud detection is important because it helps to protect consumers and businesses, to maintain trust and confidence in the financial system, and to reduce financial losses. It is a critical part of ensuring the security and reliability of credit card transactions. </p>



<p class="wp-block-paragraph">This article has shown how to use Python and the Isolation Forest Algorithm to implement a credit card fraud detection system. We developed a multivariate anomaly detection model to spot fraudulent credit card transactions. You learned how to prepare the data for testing and training an isolation forest model and how to validate this model. Finally, we have proven that the Isolation Forest is a robust algorithm for anomaly detection that outperforms traditional techniques.</p>



<p class="wp-block-paragraph">I hope you enjoyed the article and can apply what you learned to your projects. Have a great day!</p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<div style="display: inline-block;">
  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1999579577&amp;asins=1999579577&amp;linkId=91d862698bf9010ff4c09539e4c49bf4&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1839217715&amp;asins=1839217715&amp;linkId=356ba074068849ff54393f527190825d&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/">Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4233</post-id>	</item>
	</channel>
</rss>
