Multivariate Anomaly Detection on Time-Series Data in Python: Using Isolation Forests to Detect Credit Card Fraud

Credit card fraud has become one of the most common use cases for anomaly detection systems. The number of fraud attempts has risen sharply in recent years, resulting in billions of dollars in losses. Early detection of fraud attempts with machine learning is therefore becoming increasingly important. In this article, we take on the fight against international credit card fraud and develop a multivariate anomaly detection model in Python that spots fraudulent payment transactions. Our model will use the Isolation Forest algorithm – an ensemble model, which is currently one of the most effective techniques for detecting outliers.

The remainder of this article is structured as follows: We start with a brief introduction to anomaly detection and take a closer look at the Isolation Forest algorithm. Equipped with these theoretical foundations, we will then turn to the practical part, in which we train and validate an isolation forest that detects credit card fraud. We use an unsupervised learning approach, where the model itself learns to distinguish regular from suspicious card transactions. We will train our model on a public dataset from Kaggle that contains credit card transactions. Finally, we will compare the performance of our model against two nearest neighbor algorithms (LOF and KNN).

The challenge is to detect outliers among a set of regular data points: Here illustrated in two dimensions/features.

Multivariate Anomaly Detection

Before we take a closer look at the use case and our unsupervised approach, let’s briefly discuss what anomaly detection is. Anomaly detection deals with finding data points that deviate markedly from the bulk of the data, for example, points that lie far from the mean or median of a distribution. In the context of machine learning, the term is often used synonymously with outlier detection.

Some anomaly detection models work with a single feature (univariate data), for example, in monitoring electronic signals. However, most anomaly detection models use multivariate data, which means they have two (bivariate) or more (multivariate) features. They find a wide range of applications, including the following:

  • Predictive Maintenance and Detection of Malfunctions and Decay
  • Detection of Retail Bank Credit Card Fraud
  • Detection of Pricing Errors
  • Cyber Security, for example, Network Intrusion Detection
  • Detecting Fraudulent Market Behavior in Investment Banking

Unsupervised Algorithms for Anomaly Detection

Outlier detection is basically a classification problem. However, the field is more diverse as outlier detection is a problem that we can approach with supervised and unsupervised machine learning techniques. It would go beyond the scope of this article to explain the multitude of outlier detection techniques. Still, the following chart provides a good overview of standard algorithms that learn unsupervised.

A prerequisite for using supervised learning is that we have the information available as to which data points are outliers and which belong to regular data. In credit card fraud detection, this information is available because banks can validate with their customers whether a suspicious transaction is actually a fraud or not. In many other outlier detection cases, it remains unclear which outliers are legitimate and which are just noise or other uninteresting events in the data.

Whether we know the classes in our dataset or not impacts our choice of potential algorithms that we could employ to solve the outlier detection problem. If the class labels are unavailable, the unsupervised learning techniques are a natural choice. And if the class labels are available, we could use both unsupervised and supervised learning algorithms.

In the following, we will focus on Isolation Forests.

The Isolation Forest (“iForest”) Algorithm

Isolation forests (sometimes called iForests) are currently among the most powerful techniques for identifying anomalies in a dataset. They belong to the group of so-called ensemble models. This means their predictions do not rely on a single model and instead combine the results of a set of multiple independent models (decision trees). Nevertheless, isolation forests are not to be confused with traditional random forests. While random forests predict given class labels (supervised learning), isolation forests learn to distinguish outliers from inliers (regular data) in an unsupervised learning process.

Isolation Tree and Isolation Forest (Tree Ensemble)

An Isolation Forest contains multiple independent isolation trees. To build an Isolation Tree, the algorithm recursively divides the training data at random split points until the data points are isolated from each other. The number of partitions required to isolate a point tells us whether it is an anomalous point or a regular point. The underlying assumption is that random splits will isolate an anomalous point much sooner than nominal points.

How the Isolation Forest Algorithm Works

The illustration below shows exemplary training of an Isolation Tree on univariate data, i.e., with only one feature. As shown, the algorithm has already split the data at five random points between the minimum and maximum values of a random sample. The points that were already isolated are colored in purple. In the example below, it has taken two partitions to isolate the point on the far left. The other purple points were isolated after 4 and 5 splits.

The partitioning process ends when the algorithm has isolated all points from each other or when all remaining points have equal values. At the end of the process, the algorithm assigns an outlier score to each point based on how many splits it took to isolate it: the fewer splits, the more anomalous the point.

Exemplary partitioning process of an isolation tree (5 Steps)

When we use an isolation forest model on unseen data to detect outliers, the algorithm will assign an anomaly score to the new data points. These scores will be calculated based on the ensemble trees that were previously built during model training.
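To make the scoring of unseen data more tangible, here is a minimal sketch using scikit-learn's IsolationForest on synthetic data (not the credit card data). The variable names and the toy points are assumptions chosen for illustration. The sketch also shows how the model output (1 for inliers, -1 for outliers) can be mapped to the 0/1 labels we use later.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_train_toy = rng.normal(0, 1, size=(500, 2))   # synthetic "regular" data
X_new = np.array([[0.0, 0.0],                   # close to the bulk of the data
                  [8.0, 8.0]])                  # far away from it

# The trees are built once during training
model = IsolationForest(n_estimators=100, random_state=42).fit(X_train_toy)

# Unseen points are scored with the previously built trees
scores = model.score_samples(X_new)          # the lower, the more anomalous
labels = model.predict(X_new)                # 1 = inlier, -1 = outlier
fraud_flags = np.where(labels == -1, 1, 0)   # map to 0 = regular, 1 = suspicious

print(scores, labels, fraud_flags)
```

The distant point receives the lower score and is flagged as an outlier, while the central point is treated as regular.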

So how does this process work when our dataset involves multiple features? Well, for multivariate anomaly detection, the process of partitioning the data remains almost the same. The major difference is that before each partitioning, the algorithm also selects a random feature in which the partitioning will occur. Consequently, multivariate isolation forests split the data along multiple dimensions (features).
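The partitioning described in this section can be sketched in a few lines of NumPy. This is a simplified illustration under our own assumptions, not how scikit-learn implements the algorithm: the helper isolation_depth is hypothetical, it only follows the branch that contains one point of interest, and it uses all points instead of a random subsample.

```python
import numpy as np

def isolation_depth(X, point, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` (a row of X)."""
    # Features that still vary among the remaining points can be split on
    candidates = np.flatnonzero(X.max(axis=0) > X.min(axis=0))
    if len(X) <= 1 or len(candidates) == 0 or depth >= max_depth:
        return depth
    # 1) pick a random feature, 2) pick a random split value within its range
    feature = rng.choice(candidates)
    split = rng.uniform(X[:, feature].min(), X[:, feature].max())
    # Keep only the points on the same side of the split as `point`
    same_side = (X[:, feature] < split) == (point[feature] < split)
    return isolation_depth(X[same_side], point, rng, depth + 1, max_depth)

rng = np.random.default_rng(0)
regular = rng.normal(0, 1, size=(200, 2))     # synthetic regular points
X_demo = np.vstack([regular, [[6.0, 6.0]]])   # plus one obvious outlier

def mean_depth(point, n_trees=100):
    # Average over many random trees, as the forest does
    return np.mean([isolation_depth(X_demo, point, rng) for _ in range(n_trees)])

depth_regular = mean_depth(X_demo[0])
depth_outlier = mean_depth(X_demo[-1])
print(f"regular point: {depth_regular:.1f} splits, outlier: {depth_outlier:.1f} splits")
```

On average, the outlier is isolated after far fewer splits than a point in the middle of the data, which is exactly the signal the anomaly score is built on.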

Credit Card Fraud Detection using Isolation Forests

In the following, we will train an Isolation Forest algorithm for credit card fraud detection using Python. Credit card providers use similar anomaly detection systems to monitor their customers’ transactions and look for potential fraud attempts. In general, anything that deviates from the customer’s normal payment behavior can make a transaction suspicious, including an unusual location, time, or country in which the transaction is conducted. As soon as they detect a fraud attempt, they can stop the transaction and inform their customer.

Monitoring transactions has become a crucial task for financial institutions. In 2019 alone, more than 271,000 cases of credit card theft were reported in the U.S., causing billions of dollars in losses and making credit card fraud one of the most common types of identity theft. The vast majority of fraud cases are attributable to organized crime, which often specializes in this particular crime.

Each use of a credit card creates a new data point to review.

Now that we have established the context for our machine learning problem, we are ready to begin with the implementation. As always, you can find the code in the relataly GitHub repository:

Prerequisites

Before we start the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, consider the Anaconda Python environment. To set it up, you can follow the steps in this tutorial.

Also, make sure you install all required packages. In this tutorial, we will be working with the standard packages NumPy, pandas, and Matplotlib.

In addition, we will be using the machine learning library scikit-learn and seaborn for visualization.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)

Dataset: Credit Card Transactions

In the following, we will be working with a public dataset that contains anonymized credit card transactions made by European cardholders in September 2013. You can download the dataset from Kaggle.com.

The dataset contains 28 features (V1-V28) obtained from the source data using Principal Component Analysis (PCA). In addition, the data includes the time of each transaction (the Time column, in seconds elapsed since the first transaction in the dataset) and the transaction amount.

Transactions are labeled as fraudulent or genuine, with 492 frauds out of 284,807 transactions. The positive class (frauds) accounts for only 0.172% of all transactions, so that the classes are highly unbalanced.
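A quick calculation shows why this imbalance matters for evaluation. The sketch below only uses the two counts stated above; the "naive model" is a thought experiment, not something we train in this article.

```python
n_total = 284_807   # transactions in the dataset
n_fraud = 492       # transactions labeled as fraud

fraud_share = n_fraud / n_total

# A naive "model" that labels every transaction as genuine
# is correct for every non-fraudulent row and wrong for every fraud
naive_accuracy = (n_total - n_fraud) / n_total

print(f"fraud share:    {fraud_share:.3%}")
print(f"naive accuracy: {naive_accuracy:.3%}")
```

Such a model would reach an accuracy of more than 99.8% without detecting a single fraud case, which is why we later rely on precision, recall, and the f1_score instead of accuracy.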

Step #1: Load the Data

We begin by setting up imports and loading the data into our Python project. Then we’ll quickly verify that the dataset looks as expected.

import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from datetime import date, timedelta, datetime
import seaborn as sns
from sklearn.ensemble import IsolationForest
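from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import LocalOutlierFactor, KNeighborsClassifier
# plot_confusion_matrix was removed in scikit-learn 1.2 and is not needed below;
# precision_recall_fscore_support is imported as score, the name used in measure_performance
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             f1_score, make_scorer, precision_recall_fscore_support as score)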

# The Data can be downloaded from Kaggle.com: https://www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv
path = 'data/credit-card-transactions/'
df = pd.read_csv(f'{path}creditcard.csv')
df
credit card transactions

If the DataFrame shows 284,807 rows and 31 columns, everything looks good, and we can continue.

Step #2: Data Exploration

Moving on, we will create some histograms that visualize the distribution of the different features. In this way, we develop a better understanding of the data.

2.1 Features

First, we will create a series of frequency histograms for the features (V1 – V28) in our dataset. We will subsequently take a separate look at the Class, Time, and Amount, so for the moment, we can drop them.

# create histograms on all features
df_base = df.copy()  # work on a copy so that the raw data stays untouched
df_hist = df_base.drop(['Time','Amount', 'Class'], axis=1)
df_hist.hist(figsize=(20,20), bins = 50, color = "c", edgecolor='black')
plt.show()
Feature frequency distributions on credit card data

Next, we will look at the correlation between the 28 features. We expect the features to be uncorrelated due to the use of PCA. Let’s verify that by creating a heatmap on their correlation values.

# feature correlation
f_cor = df_hist.corr()
sns.heatmap(f_cor)
The features of credit card transactions are uncorrelated due to the use of PCA

As we have expected, our features are uncorrelated.
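Why does PCA produce uncorrelated features? The following sketch illustrates the effect on synthetic data (not the credit card data, whose original features are not public). It builds four strongly correlated columns, applies scikit-learn's PCA, and compares the largest off-diagonal correlation before and after. The helper max_offdiag_corr is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: four noisy copies of the same signal, i.e., strongly correlated columns
base = rng.normal(size=(1000, 1))
X_raw = np.hstack([base + 0.1 * rng.normal(size=(1000, 1)) for _ in range(4)])

# Project the data onto its principal components
components = PCA().fit_transform(X_raw)

def max_offdiag_corr(a):
    # Largest absolute correlation between two different columns
    corr = np.corrcoef(a, rowvar=False)
    return np.abs(corr - np.eye(corr.shape[0])).max()

print(f"raw columns:          {max_offdiag_corr(X_raw):.4f}")
print(f"principal components: {max_offdiag_corr(components):.4f}")
```

The raw columns are almost perfectly correlated, while the correlation between the principal components is zero up to numerical precision.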

2.2 Class Labels

Next, let’s print an overview of the class labels. This is to get an idea of the balance between the two class labels.

# Plot the balance of class labels
df_base = df.copy()
nominal_count = (df_base['Class'] == 0).sum()
outlier_count = (df_base['Class'] == 1).sum()
print(f'size of class 1 (outliers): {outlier_count}, size of class 0: {nominal_count}')

plt.figure(figsize=(15,2))
fig = sns.countplot(y="Class", data=df_base, color='b')
plt.show()

We see that the data set is highly unbalanced. While this would constitute a problem for traditional classification techniques, it is a natural use case for outlier detection algorithms like the Isolation Forest.

2.3 Time and Amount

Finally, we will create some plots to gain insights into time and amount. Let’s first have a look at the time variable.

# Plot distribution of the Time variable, which contains transaction data for two days
# (displot creates its own figure, so no separate plt.figure() call is needed)
fig = sns.displot(df_base['Time'], kde=False, color="c", height=4, aspect=11.7/4.27)
plt.show()

The dataset covers a time frame of two days, which the distribution graph reflects well. We can see that most transactions happen during the day, which is plausible.
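To read the day and night pattern more easily, the Time values can be converted into hours. The sketch below assumes that Time holds the seconds elapsed since the first transaction in the dataset and uses a small hypothetical sample instead of the real column. Note that hour 0 marks the start of the recording, which is not necessarily midnight.

```python
import pandas as pd

# Hypothetical sample of the 'Time' column (seconds since the first transaction)
time_seconds = pd.Series([0, 3_600, 45_000, 90_000, 172_000])

# Full hours elapsed, folded into a 24-hour cycle
hour_in_cycle = (time_seconds // 3600) % 24

print(pd.DataFrame({"seconds": time_seconds, "hour_in_cycle": hour_in_cycle}))
```

Applied to the real data, the same expression on df_base['Time'] would yield a column that can be plotted as a histogram with 24 bins.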

Next, let’s look at the correlation between transaction size and fraud cases. To do this, we create a scatterplot that distinguishes between the two classes.

# Plot time against amount
x = df_base['Time']
y = df_base['Amount']
rp = sns.relplot(data=df_base, x=x, y=y, col="Class", kind="scatter", hue="Class")
rp.fig.subplots_adjust(top=0.9)
rp.fig.suptitle('Transaction Amount over Time split by Class')

The scatterplot suggests that fraudulent transactions tend to involve relatively low amounts. In other words, there is some inverse relationship between class and transaction amount.

Step #3: Preprocessing

Now that we have a rough idea of the data, we will prepare it for training the model. For the training of the isolation forest, we separate the class label from the base dataset and then divide the data into separate datasets for training (70%) and test (30%). Because the Isolation Forest is a decision-tree-based algorithm, there is no need to normalize or standardize the data.

We will use all features from the base dataset, which will result in a multivariate anomaly detection model.
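The claim that tree-based models do not need feature scaling can be checked with a small experiment. The sketch below is an illustration on synthetic data under our own assumptions: it trains two isolation forests with the same random seed, once on the raw data and once after multiplying one feature by a constant, and compares the predictions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(1000, 3))   # synthetic data with three features

# Rescale one feature by a constant factor (as if it were recorded in another unit)
X_scaled = X_toy.copy()
X_scaled[:, 0] *= 1024

# Same seed, so both forests draw the same random features and split positions
pred_raw = IsolationForest(random_state=0).fit(X_toy).predict(X_toy)
pred_scaled = IsolationForest(random_state=0).fit(X_scaled).predict(X_scaled)

agreement = (pred_raw == pred_scaled).mean()
print(f"share of identical predictions: {agreement:.3f}")
```

Because each split value is drawn relative to the minimum and maximum of a feature, rescaling a feature moves the split along with the data and leaves the resulting partitions essentially unchanged.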

# Separate the classes from the train set
df_classes = df_base['Class']
df_train = df_base.drop(['Class'], axis=1)

# split the data into train and test 
X_train, X_test, y_train, y_test = train_test_split(df_train, df_classes, test_size=0.30, random_state=42)

Step #4: Model Training

Once we have prepared the data, it’s time to start training the Isolation Forest. However, to compare the performance of our model with other algorithms, we will train several different models. In total, we will prepare and compare the following five outlier detection models:

  • Isolation Forest (default)
  • Isolation Forest (hypertuned)
  • Local Outlier Factor (default)
  • K Nearest Neighbour (default)
  • K Nearest Neighbour (hypertuned)

For hyperparameter tuning of the models, we use Grid Search.
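Grid search simply trains and evaluates one model for every combination of the candidate values. As a small illustration, the sketch below uses scikit-learn's ParameterGrid to enumerate the combinations of the Isolation Forest grid that we define in section 4.1.2. The variable names candidates and cv_folds are our own.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values, as used for the Isolation Forest in section 4.1.2
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [1.0, 5, 10],
    "bootstrap": [True],
}

# Every combination of the values above is one candidate configuration
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)

cv_folds = 3  # each candidate is trained and scored once per cross-validation fold
print(f"{len(candidates)} candidates x {cv_folds} folds = {len(candidates) * cv_folds} model fits")
```

The number of fits grows multiplicatively with every additional parameter, which is why we keep the grids in this article small.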

4.1 Train an Isolation Forest

4.1.1 Isolation Forest (baseline)

# train the model on the train set (unsupervised: the class labels are not used)
model_isf = IsolationForest().fit(X_train)

def measure_performance(model, X_test, y_true, map_labels):
    # predict on the test set
    y_pred = pd.Series(model.predict(X_test), index=X_test.index)
    if map_labels:
        # outlier detectors return 1 (inlier) and -1 (outlier); map to 0 (regular) and 1 (fraud)
        y_pred = y_pred.map({1: 0, -1: 1})

    # confusion matrix: rows = actual class, columns = predicted class
    matrix = confusion_matrix(y_true, y_pred)
    sns.heatmap(matrix,
                xticklabels=['Regular [0]', 'Fraud [1]'], 
                yticklabels=['Regular [0]', 'Fraud [1]'], 
                annot=True, fmt="d", linewidths=.5, cmap="YlGnBu")
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
    
    # scikit-learn metrics expect the true labels first and the predictions second
    print(classification_report(y_true, y_pred))
    
    # score is precision_recall_fscore_support and returns (precision, recall, f1, support)
    precision, recall, f1, _ = score(y_true, y_pred, average='macro')
    print(f'f1_score: {np.round(f1*100, 2)}%')
    
    # returned in the order (f1_score, precision, recall) used for the performance table
    return f1, precision, recall

model_name = 'Isolation Forest (baseline)'
print(f'{model_name} model')

map_labels = True
model_score = measure_performance(model_isf, X_test, y_test, map_labels)

# DataFrame.append was removed in pandas 2.0, so we build the table from a list of records
performance_df = pd.DataFrame([{'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}])

4.1.2 Isolation Forest (Hypertuning)

Next, we will train another Isolation Forest Model using grid search hyperparameter tuning to test different parameter configurations. The code below will evaluate the different parameter configurations based on their f1_score and automatically choose the best-performing model.

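# Expected share of outliers, estimated here from the training labels
# (assumption: the original definition of contamination_rate is not shown)
contamination_rate = float(y_train.mean())

# Define the parameter grid
param_grid = dict(n_estimators=[50, 100], 
                  max_features=[1.0, 5, 10], 
                  bootstrap=[True])

# Build the base model; the grid search sets the parameters listed in param_grid
model_isf = IsolationForest(contamination=contamination_rate, n_jobs=-1)

# Define an f1 scorer that first maps the model output (1 = inlier, -1 = outlier)
# to the dataset labels (0 = regular, 1 = fraud)
def f1_mapped(y_true, y_pred):
    return f1_score(y_true, np.where(y_pred == -1, 1, 0), average='macro')
f1sc = make_scorer(f1_mapped)

grid = GridSearchCV(estimator=model_isf, param_grid=param_grid, cv = 3, scoring=f1sc)
grid_results = grid.fit(X=X_train, y=y_train)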

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'Isolation Forest (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = True # if True - maps 1 to 0 and -1 to 1 - required for outlier detectors such as the Isolation Forest
model_score = measure_performance(best_model, X_test, y_test, map_labels)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}])], ignore_index=True)
results_df

4.2 LOF Model

We train the Local Outlier Factor Model using the same training data as before and the same evaluation procedure.

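# Train a local outlier factor model; novelty=True is required to call predict() on unseen data
# contamination_rate: expected share of outliers, estimated here from the training labels
# (assumption: the original definition of contamination_rate is not shown)
contamination_rate = float(y_train.mean())
model_lof = LocalOutlierFactor(n_neighbors=3, contamination=contamination_rate, novelty=True)
model_lof.fit(X_train)

# Evaluate model performance
model_name = 'LOF (baseline)'
print(f'{model_name} model')

map_labels = True 
model_score = measure_performance(model_lof, X_test, y_test, map_labels)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}])], ignore_index=True)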
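The novelty=True argument deserves a short explanation. By default, scikit-learn's LocalOutlierFactor only labels the data it was fitted on; to score unseen data with predict(), as we do with our test set, the model has to be created in novelty mode. The following sketch shows this on synthetic data; the toy points and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
X_train_toy = rng.normal(0, 1, size=(300, 2))   # synthetic "regular" data
X_new = np.array([[0.0, 0.0],                   # inside the dense region
                  [8.0, 8.0]])                  # far away from all training points

# novelty=True: fit on the training data, then call predict() on unseen points
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train_toy)

lof_labels = lof.predict(X_new)         # 1 = inlier, -1 = outlier
lof_scores = lof.score_samples(X_new)   # the lower, the more anomalous

print(lof_labels, lof_scores)
```

As with the Isolation Forest, the output uses 1 and -1, so the same label mapping applies before comparing the predictions with the class labels.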

4.3 KNN Model

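Below we add two K-Nearest Neighbor models to our list. Unlike the Isolation Forest and LOF, KNN is a supervised classifier, so we train it on the class labels. The first model uses the default parameter configuration. The second model has been slightly optimized with hyperparameter tuning.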

4.3.1 KNN (default)

First, we train the default model using the same training data as before.

# Train a KNN Model
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X=X_train, y=y_train)

# Evaluate model performance
model_name = 'KNN (baseline)'
print(f'{model_name} model')

map_labels = False # if True - maps 1 to 0 and -1 to 1 - set to False for classification models (e.g., KNN)
model_score = measure_performance(model_knn, X_test, y_test, map_labels)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}])], ignore_index=True)

4.3.2 KNN (hypertuned)

Next, we will train a second KNN model that is slightly optimized using hyperparameter tuning. KNN models have only a few parameters. Therefore, we limit ourselves to optimizing the number of neighboring points considered.

# Define hypertuning parameters
n_neighbors=[1, 2, 3, 4, 5]
param_grid = dict(n_neighbors=n_neighbors)

# Build the gridsearch; n_neighbors is set by the grid search from param_grid
# (no scoring argument is given, so candidates are ranked by accuracy, the classifier default)
model_knn = KNeighborsClassifier()
grid = GridSearchCV(estimator=model_knn, param_grid=param_grid, cv = 5)
grid_results = grid.fit(X=X_train, y=y_train)

# Summarize the results in a readable format
print("Best: {0}, using {1}".format(grid_results.best_score_, grid_results.best_params_))
results_df = pd.DataFrame(grid_results.cv_results_)

# Evaluate model performance
model_name = 'KNN (tuned)'
print(f'{model_name} model')

best_model = grid_results.best_estimator_
map_labels = False # if True - maps 1 to 0 and -1 to 1 - set to False for classification models (e.g., KNN)
model_score = measure_performance(best_model, X_test, y_test, map_labels)
performance_df = pd.concat([performance_df, pd.DataFrame([{'model_name':model_name, 
                                    'f1_score': model_score[0], 
                                    'precision': model_score[1], 
                                    'recall': model_score[2]}])], ignore_index=True)
results_df

Step #5: Measuring and Comparing Performance

Finally, we will take a look at the model performance. We do this by plotting a comparison chart that shows the f1_score, precision, and recall of the five different models.

print(performance_df)

performance_df = performance_df.sort_values('model_name')

fig, ax = plt.subplots(figsize=(12, 4))
tidy = performance_df.melt(id_vars='model_name').rename(columns=str.title)
sns.barplot(y='Model_Name', x='Value', hue='Variable', data=tidy, ax=ax, palette='nipy_spectral', linewidth=1, edgecolor="w")
plt.title('Model Outlier Detection Performance (Macro)')

All three metrics play an important role in evaluating performance because, on the one hand, we want to capture as many of the fraud cases as possible, but we also don’t want to raise false alarms too frequently.

  • As we can see, the optimized Isolation Forest delivers a particularly well-balanced performance.
  • The default Isolation Forest has a high f1_score and detects many fraud cases, but it also raises false alarms very frequently.
  • The opposite is true for the KNN model. Here, only a few fraud cases are detected, but the model is comparatively often correct when it detects a fraud case.
  • The default LOF model performs slightly worse than the other models. Compared to the optimized Isolation Forest, it performs worse in all three metrics.
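To see how the three metrics relate to caught frauds and false alarms, consider a small worked example with made-up numbers (they are not results from our models). Out of 100 transactions, 5 are fraudulent; a hypothetical model raises 10 alarms, of which 4 are real frauds.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground truth: 95 regular and 5 fraudulent transactions
y_true = [0] * 95 + [1] * 5
# Hypothetical predictions: 6 false alarms, 89 correct "regular", 4 frauds caught, 1 fraud missed
y_pred = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

# Note the argument order: true labels first, predictions second
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"fraud class: precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")

# Macro averaging gives both classes the same weight, regardless of their size
macro = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"macro avg:   precision={macro[0]:.2f}, recall={macro[1]:.2f}, f1={macro[2]:.2f}")
```

For the fraud class, recall is 4/5 because four of the five frauds were caught, while precision is only 4/10 because six of the ten alarms were false. The f1_score is the harmonic mean of the two and therefore penalizes a model that is strong in only one of them.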

Summary

In this article, we have developed a multivariate anomaly detection model to spot fraudulent credit card transactions. You learned how to prepare the data for testing and training an isolation forest model and how to validate this model. Finally, we have seen that, on this dataset, the Isolation Forest is a powerful algorithm for anomaly detection that compares favorably with the nearest neighbor models we tested.

I hope you enjoyed the article and can apply what you learned to your projects. Have a nice week!

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.
