Decision Trees Archives - relataly.com

Using Fairlearn to Build Fair Machine Machine Learning Models with Python: Step-by-Step Towards More Responsible AI

Florian Follonier — Thu, 09 Mar 2023 23:45:16 +0000

As we enter an era where intelligent systems are increasingly relied upon to make key decisions, responsible AI has become more critical than ever before. It’s not enough to simply rely on data-driven decision-making; we must also ensure that these systems are fair and just. But how can we do this when the same bias and unfairness that exists in our society is reflected in these very algorithms?

The consequences of biased algorithms are far-reaching and can be devastating. They can perpetuate discriminatory practices, deny individuals opportunities or services, and lead to a range of negative outcomes. It’s no longer just a moral obligation to avoid perpetuating inequalities; it’s a necessity to prevent negative consequences such as reputational damage and legal repercussions.

This article aims to drive home the importance of fairness in AI and provide practical tips for ensuring that algorithms are unbiased and free from discrimination. We will delve into the fairlearn library, a powerful open-source toolkit, to assess and mitigate bias in machine learning. Using fairlearn, we will implement a classification model in Python to predict income on the adult consensus set and evaluate its fairness.

Think about it: just as we strive to ensure fairness in our daily lives, we must also do the same with AI. We must do our best to eliminate bias and discrimination from our algorithms, just as we work to eliminate these issues from our society. By using fairlearn, we can ensure that our models are as fair and unbiased as possible, minimizing the risk of perpetuating discrimination and inequality. We can take steps to ensure that our algorithms are not influenced by factors such as race, gender, or age, and that they are making predictions based on merit alone.

What is Responsible AI?

Responsible AI is an approach that emphasizes fairness, transparency, accountability, and ethical considerations in the development and deployment of AI systems. It involves identifying potential sources of bias in AI and taking steps to address them proactively. To ensure responsible AI, organizations must develop policies and processes that prioritize fairness, transparency, and accountability throughout the AI development lifecycle.

How Real-World Bias Manifests in Biased Data

As we continue to develop increasingly sophisticated AI models like GPT-4, it is crucial to acknowledge that real-world data is not always objective and unbiased. Training models on biased data can lead to the perpetuation and even amplification of existing biases and disparities. Failing to address these issues can cement and exacerbate existing inequalities, with far-reaching consequences for society and individuals alike.

Bias in AI models often stems from the data they are trained on. Historical data, social norms, and cultural practices can all introduce bias into datasets, which AI models may then unwittingly learn and propagate. This can lead to AI systems making biased decisions, producing biased content, or perpetuating stereotypes and prejudices.

Ways in which AI systems can be biased

There are many ways in which AI systems can be biased and thus behave unfairly, including the following:

Gender Bias in Hiring: The data used by companies to train their hiring algorithms may be biased against certain genders or demographics. As a result, these algorithms may perpetuate existing gender inequalities by disproportionately rejecting female candidates.
Racial Bias in Healthcare: Medical data may be biased against certain racial or ethnic groups, which can lead to misdiagnosis, mistreatment, or under-treatment of certain populations. For instance, studies have shown that Black patients are less likely to receive appropriate pain management than white patients.
Economic Bias in Credit Scoring: Financial data used to determine credit scores may be biased against low-income individuals or marginalized communities, making it harder for them to access credit or loans. The result is a cycle of poverty, where individuals cannot access credit and are thus unable to build wealth.
Age Bias in Predictive Policing: Predictive policing algorithms may be biased against certain age groups, such as younger people, who may be more likely to be falsely flagged as potential criminals. This can lead to increased surveillance and potential harassment of certain populations.

These examples highlight the importance of identifying and addressing bias in real-world data. By doing so, we can ensure that our models and algorithms are fair, just, and equitable for all individuals, regardless of their background or demographic.

One of many biases is that in many industries, Women still earn less than their male colleagues. Image created with Midjourney.

Model Fairness and Performance

In machine learning models, there is often a delicate balance between fairness and performance. For instance, optimizing a model for high accuracy might inadvertently lead to discrimination against certain groups of people. To avoid these adverse effects, we must actively work to reduce the influence of unfairness and bias in our models.

Achieving this balance involves multiple strategies, such as gathering diverse and representative data, identifying and addressing sources of bias within the data, and devising techniques to counteract the effects of bias. Moreover, it’s essential to consistently monitor and evaluate our models to ensure they maintain fairness and impartiality over time.

Conversely, when prioritizing fairness, some degree of accuracy may be sacrificed. Striking the right balance between fairness and performance depends on the specific use case and consideration of the potential consequences of any decisions made. The ultimate goal is to create a model that is both fair and effective in performing the task it was designed for.

In summary, to create a well-balanced machine learning model, it’s crucial to acknowledge the trade-off between fairness and performance. By employing various strategies, such as using diverse data, addressing bias, and continuously monitoring the models, we can work towards developing models that are not only accurate but also fair and equitable for all users.

Organizations must take the necessary steps to ensure their algorithms are unbiased and not perpetuate the same inequalities still plaguing our society. Image created with Midjourney.

Assessing Model Fairness with the Fairlearn Library

Fairlearn is an open-source Python package, developed by Microsoft, that offers tools for assessing and mitigating unfairness in machine learning models. The library empowers developers and data scientists to pinpoint and address potential sources of bias in their data and models, ensuring that predictions and decisions are made fairly and transparently.

Fairlearn encompasses a variety of algorithms and metrics for evaluating and mitigating unfairness, including:

Group Fairness: This metric measures the difference in a model’s performance across various subgroups of the population, such as race or gender.
Demographic Parity: This metric works to ensure that the model’s predictions are distributed fairly across different subgroups of the population.
Equalized Odds: This metric guarantees that the model’s predictions maintain equal accuracy across different subgroups of the population.

In addition to these metrics, Fairlearn also offers various techniques for mitigating unfairness. These include reweighing data to balance different subgroups, adjusting the model’s predictions to meet specific fairness criteria, and enforcing fairness constraints during the training process. By utilizing Fairlearn, developers and data scientists can be confident that their machine learning models are more equitable and unbiased.

Assessing Model Fairness with the Fairlearn Library. Image created with Midjourney.

Mitigating Unfairness in Machine Learning

There are two options to address unfairness in machine learning models: consider fairness during training or add a post-processing step to mitigate unfairness as an additional layer. Both methods have their merits and drawbacks. The ideal approach depends on the context and objectives of the problem you’re tackling.

Considering fairness during training involves incorporating fairness constraints into the learning process. This approach can lead to more interpretable models, as fairness is embedded from the start. However, it may require specialized algorithms and could result in reduced performance.

On the other hand, adding a post-processing step to mitigate unfairness involves adjusting the model’s predictions after training. This method offers flexibility, as it can be applied to any pre-trained model. It may also preserve performance better. However, it might not provide as much interpretability, as fairness is addressed separately from the original model.

Ultimately, the choice depends on your specific goals, the model’s purpose, and the importance of interpretability versus performance.

Considering Fairness during Training with Fairness Constraints

Considering fairness during training involves incorporating fairness constraints or objectives into the model’s training process. This can be done in different ways:

by balancing the training dataset to ensure equal representation of different groups.
by adjusting the model’s loss function to penalize unfair predictions,
or by adding fairness constraints to the optimization problem.

The main advantage of this approach is that it can lead to models that are inherently fair without requiring any additional post-processing steps. However, it can be challenging to define and operationalize fairness. In addition, the fairness objectives may conflict with other objectives, such as accuracy or generalization.

There are several fairness constraints that can be used to ensure that machine learning models and decision-making processes do not discriminate against certain groups. Some of the most commonly used constraints include:

Demographic Parity: This constraint ensures that the proportion of positive outcomes is equal across different demographic groups.
Equalized Odds: This constraint ensures that the true positive rate and false positive rate are equal across different demographic groups.
Disparate Impact: This constraint ensures that the ratio of positive outcomes to the total number of outcomes is the same across different demographic groups.

For a full list of constraints, please view the library documentation.

The choice of fairness constraint depends on the specific context and goals of the decision-making process, and the constraints may need to be customized or combined to achieve the desired level of fairness.

Fairness Mitigation as a Postprocessing Layer

Another approach is to add a post-processing step to mitigate the unfairness of existing models. This method involves applying a fairness algorithm to the model’s outputs after training. Examples include reweighting predictions to ensure equal treatment of different groups, calibrating the model’s scores to remove bias, or using a fairness metric to adjust predictions.

The primary advantage of this approach is its applicability to existing models without the need for retraining. However, it might not tackle the root causes of unfairness in the model, and the fairness algorithm could introduce its own biases or trade-offs.

We can consider fairness towards sensitive features during the training of a machine learning model by using the fairness constraints. Image created with Midjourney.

Building A More Inclusive Machine Learning Model using FairLearn and Python

Let’s explore building a fair machine learning model using the census dataset, which includes demographic attributes to predict if a person earns more or less than $50k per year. As income prediction is sensitive, we’ll ensure fairness in our model.

We’ll create an initial fairness-unaware model, evaluate its fairness, and then use grid search to identify alternative models with varying performance and unfairness. This process consists of:

Load the data: Load the census dataset containing demographic attributes like age, education level, race, and gender.
Initial preprocessing and visualization: Preprocess the data by removing missing values, encoding categorical variables, scaling numerical features, and visualizing the data for insights.
Splitting and scaling the data: Split the data into training and testing sets and scale the features to make them comparable.
Training a fairness-unaware model and assessing fairness: Train a baseline model unaware of unfairness and evaluate its fairness using metrics like disparate impact and statistical parity difference.
Mitigating bias by training fairness-aware models with fairness constraints: Train models using demographic parity as a fairness constraint to mitigate bias and improve fairness.
Comparing the performance of the models: Compare the models’ performance in accuracy, fairness, and other metrics like false positives and false negatives.

By following these steps, we’ll build a fair machine learning model that is accurate, equitable, and just.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Responsible AI, and fairness in machine learning will gain further importance as the role of this exciting technology continues to become more influential in society.

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. This will ensure a seamless learning experience and prevent any potential roadblocks or issues that may arise due to an improperly configured environment.

If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Make sure you install all required packages. In this tutorial, we will be working with the following packages:

Pandas
NumPy
Matplotlib
Seaborn
Plotly
Fairlearn

You can install the Fairlearn library and other packages using console commands:

pip install 
conda install  (if you are using the anaconda packet manager)

About the Adult Consensus Dataset

The Adult Consensus Dataset is a highly regarded and extensively used resource in machine learning and statistics, offering a comprehensive view of over 32,000 individuals across the United States. This invaluable dataset features 14 unique attributes, such as age, education, occupation, work class, marital status, and gender, among others. It also includes a binary target variable, identifying individuals earning over $50,000 annually.

However, the dataset presents challenges, as it contains sensitive demographic features like race and gender. If not handled properly, these features may lead to biased predictions and unfair outcomes. As a result, models trained on this dataset require careful design and evaluation to prevent perpetuating or amplifying existing biases and discrimination.

Despite its complexities, the Adult Consensus Dataset remains a popular choice for training and testing binary classification models. Its value extends further, serving as an excellent benchmark for testing and showcasing fair AI models. The dataset raises important questions about fairness, bias, and discrimination in machine learning, making it an essential resource for creating ethical and unbiased AI solutions.

Step #1 Load Preprocess the Data

We begin by loading the data using the fetch_openml function, making the necessary imports, and performing some initial preprocessing. The dataset is loaded into two data frames, X_df and y_df, where X_df contains the features and y_df contains the binary target variable, which is income with classes “>50k” and “<=50k”. The code below encodes these classes with 0 and 1.

Assessing model fairness is always done with respect to one or several sensitive variables. Although there are several features in the dataset that could be considered sensitive, optimizing the model to consider all of these features simultaneously would lead to significant complexity and increase training times. Therefore, in this tutorial, we will focus specifically on assessing model fairness with respect to sex and race. The sex feature contains two classes, “male” and “female,” while the race feature contains classes for “Black,” “White,” and several further classes.

# A tutorial for this file will be shortly available at www.relataly.com
# Tested with python 3.9.13, matplotlib 3.6.2, numpy 1.23.3, seaborn=0.12.1, fairlearn=0.8.0, plotly 5.11.0

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics as skm
from fairlearn.metrics import MetricFrame, selection_rate, selection_rate_difference, count, plot_model_comparison
from fairlearn.reductions import DemographicParity, ErrorRate, GridSearch
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# read the adult data
# predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.
# https://archive.ics.uci.edu/ml/datasets/adult

from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1590, parser='auto', as_frame=True)

X_df = data.data.copy()
data.target = data.target.replace({' <=50K': 0, ' >50K': 1})
y_df = data.target

# Declare sensitive features
sensitive_variables = ["sex", 'race']

# join X_df and y_df and name y_df as 'income'
X_df['salary >50k'] = y_df

# show df that contains the target column
X_df.head()

	age	workclass	fnlwgt	education	race	sex		...		hours-per-week	salary >50k
0	25	4			226802	1			Black	Male	...		40				0
1	38	4			89814	11			White	Male	...		50				0
2	28	2			336951	7			White	Male	...		40				1
3	44	4			160323	15			Black	Male	...		40				1
4	18	0			103497	15			White	Female	...		30				0

Step #2 Initial Preprocessing and Visualization

Let’s continue with some initial preprocessing and data visualization. First, we encode the categorical features in the X_df data frame using the LabelEncoder function from scikit-learn. This is important, as the fairlearn gridsearch technique that we will use in this tutorial only works with numeric features.

Next, to simplify the tutorial and reduce model training time, we will filter the dataset to only include people of races “Black” and “White.” This will result in four sensitive feature permutations:

Black, Woman
Black, Man
White, Woman
White, Man

Finally, we create two data frames. The X_df data frame contains the features we will be using to train our model, but it does not contain the target label anymore. The other data frame, df_visu, does contain the target label and will be used for visualization purposes. Finally, we use df_visu to visualize how the target variable is distributed with respect to our sensitive features.

# encode the categorical features
X_df['workclass'] = X_df['workclass'].cat.codes
X_df['education'] = X_df['education'].cat.codes
X_df['marital-status'] = X_df['marital-status'].cat.codes
X_df['occupation'] = X_df['occupation'].cat.codes
X_df['relationship'] = X_df['relationship'].cat.codes

# filter some values to reduce tutorial complexity and make it easier to visualize
X_df = X_df.loc[X_df["race"].astype(str).isin([" White", " Black"])]
X_df.race = X_df.race.cat.remove_unused_categories()

# ensure that the target column has the same filter applied
y_df = X_df['salary >50k']

# create a copy of the df for visualization
df_visu = X_df.copy()

# drop the target column from X_df
X_df.drop(['salary >50k'], axis=1, inplace=True)

# visualize the distributions of the cateogical variables
fig, ax = plt.subplots(1, 3, figsize=(16, 5))
sns.countplot(x='sex', hue='salary >50k', data=df_visu, ax=ax[0])
sns.countplot(x='race', hue='salary >50k', data=df_visu, ax=ax[1])
sns.kdeplot(x='age', hue='salary >50k', data=df_visu, ax=ax[2])

# rotate x_labels
plt.sca(ax[1])
plt.xticks(rotation=30)

All three plots reflect inequality in our society:

The percentage of women with a yearly income above 50k is much lower than that of men.
With respect to race, this bias is even more extreme, with way more white people earning more than 50k than black people.
Although we won’t use age as a sensitive label, it is noteworthy that the percentage of people earning more than 50k declines with growing age.

Step #3 Splitting and Scaling the Data

The code performs several preprocessing steps. First, the data is split into the features (X_df) and the target variable (y_df). Next, the categorical features in X_df are encoded using the LabelEncoder function from scikit-learn. The data is then split into training and test sets. Records are stratified based on the target variable to ensure that the proportion of each class is maintained in both sets.

It is important to note that the sensitive variables, sex and race, are included in the splitting process to ensure that the data is divided in a fair manner. Specifically, we create a new data frame A containing only the sensitive variables. We then split the data into training and test sets using the stratify parameter based on Y_encoded (the encoded version of y_df) and A.

Finally, the features in X_df are scaled using the StandardScaler function from scikit-learn to ensure that all features are on the same scale. The resulting scaled features are then converted back into a pandas data frame, X_scaled.

# create a dataframe with sensitive features
A = X_df[sensitive_variables]
X = X_df.drop(labels=sensitive_variables, axis=1)
X = pd.get_dummies(X)

# Scale the data
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Encode the target variable
le = LabelEncoder()
Y_encoded = le.fit_transform(y_df)

# We split the data into training and test sets:
X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(
    X_scaled, Y_encoded, A, test_size=0.4, random_state=0, stratify=Y_encoded
)

# Work around indexing bug
X_train = X_train.reset_index(drop=True)
A_train = A_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
A_test = A_test.reset_index(drop=True)

Step #4 Train a Fairness-Unaware Model and Assess its Fairness

Now it’s time to train our first model. This initial model will be entirely fairness unaware, meaning its predictions will reflect any patterns of unfairness present in the data.

Model training

We’ll begin by training a baseline fairness unaware decision tree classifier with default settings. The model will be trained on the training set (X_train and Y_train) and then tested on the test set (X_test and Y_test).

# Define evaluation metrics for performance and fairness
performance_metric = skm.accuracy_score
fairness_metric = selection_rate_difference

unmitigated_predictor = DecisionTreeClassifier(random_state=0)
unmitigated_predictor.fit(X_train, Y_train)

Evaluation and Visualization

To evaluate the model’s performance, we’ll use two crucial metrics: the accuracy score and the “selection rate difference”. The latter is a vital fairness metric that measures the difference between selection rates of two distinct groups for a specific outcome or prediction.

In machine learning, this metric often helps determine the fairness of a model, especially with sensitive features like race, gender, or age. A selection rate difference of zero indicates that the model is making predictions fairly and without bias, as both groups have equal selection rates.

However, we must pay attention to a positive or negative selection rate difference. It implies that the model is favoring one group over the other, which could signal bias or discrimination. Carefully monitoring this metric is critical to ensure the model is performing fairly and justly.

Lastly, we’ll visualize the results in a bar chart using the accuracy and selection rate by group. This visualization will provide a comprehensive view of the model’s performance and fairness, allowing us to make informed decisions about potential improvements or adjustments needed to strike the right balance between accuracy and fairness.

metric_frame = MetricFrame(
    metrics={
        "accuracy": performance_metric,
        "selection_rate": fairness_metric,
        "count": count,
    },
    sensitive_features=A_test,
    y_true=Y_test,
    y_pred=unmitigated_predictor.predict(X_test),
)

# plot the metrics
print(metric_frame.overall)
print(metric_frame.by_group)
metric_frame.by_group.plot.bar(
    subplots=True,
    layout=[1, 3],
    legend=False,
    figsize=[12, 4],
    title="Accuracy and selection rate by group",
)

accuracy              0.854136
selection_rate        0.197373
count             18579.000000
dtype: float64
                accuracy  selection_rate    count
sex     race                                     
 Female  Black  0.948665        0.053388    974.0
         White  0.921083        0.083873   5246.0
 Male    Black  0.877378        0.161734    946.0
         White  0.813371        0.264786  11413.0
array([[,
        ,
        ]],
      dtype=object)

When we look at the selection rate, we can see that the model is intrinsically biased. To be explicit, if this model would go into production, a Black Woman would have a significantly lower chance (5%) to be attributed a +50k income than a White Man (above 25%), even if all other attributes (age, education, and so on) are the same.

Step #5 Train Fairness-Aware Models with Gridsearch

We perform a grid search with demographic parity as a fairness constraint to find the best decision tree classifier model that is both accurate and fair. We evaluate different models with different hyperparameters for the decision tree classifier. The best predictors with the lowest error rates and lowest disparities (in terms of demographic parity) are selected as the dominant models.

We call these models “dominant” because they stand out as superior solutions in multi-objective optimization. “Dominant” implies that these models outshine others across multiple evaluation metrics without falling short in any of the metrics. Dominant models are preferred for their optimal balance of performance and fairness.

During model training, we’ll incorporate demographic parity as a fairness constraint. Demographic parity seeks to ensure equal proportions of positive outcomes (e.g., being hired, approved for a loan, etc.) across various demographic groups (e.g., gender, race, or age) in a population. In simpler terms, it requires the decision-making process not to discriminate against any group based on their demographic characteristics. Mathematically, it’s expressed as P(prediction=1|sensitive_feature=0) = P(prediction=1|sensitive_feature=1), where sensitive_feature is a binary variable indicating membership in a specific demographic group. Achieving demographic parity helps reduce discrimination and promote fairness in decision-making processes. However, depending on the context and goals of the decision-making process, demographic parity might not always be the most appropriate fairness constraint.

Depending on the hardware you are using, training can take several minutes.

# Define and run GridSearch with DemographicParity 
sweep = GridSearch(
    DecisionTreeClassifier(random_state=0),
    constraints=DemographicParity(), # DemographicParity() is the default
    grid_size=100, # number of models to evaluate
)
sweep.fit(X_train, Y_train, sensitive_features=A_train)

# The best predictor is the one with the lowest error rate
predictors = sweep.predictors_

errors, disparities = [], []
for p in predictors:
    
    def classifier(X):
        return p.predict(X)

    # Calculate the error rate of the classifier
    error = ErrorRate()
    error.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train)
    errors.append(error.gamma(classifier)[0])

    # Calculate the disparity in the predictions
    disparity = DemographicParity()
    disparity.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train)
    disparities.append(disparity.gamma(classifier).max())

# Consolidate the results into a single dataframe
grid_results = pd.DataFrame(
    {"predictor": predictors, 
     "error": errors, 
     "disparity": disparities}
)

# Loop through the grid and select the relevant models 
non_dominated_models = []
for model in grid_results.itertuples():
    # only select models with lower disparity compared to the other models
    if model.error < grid_results["error"][grid_results["disparity"] < model.disparity].min():
        non_dominated_models.append(model.predictor)

That’s it; now we have trained several fairness-aware models. Next, let’s look at the performance and fairness of these models.

Step #6 Comparing Dominant and Unmitigated Models

Finally, we’ll evaluate and compare our models’ performance. We’ll assess the dominant models and the unmitigated model using evaluation metrics for performance and fairness. The performance metric we’ll use is the accuracy score, and for fairness, we’ll use the selection rate difference. To do this, we’ll predict on the test data using the unmitigated model, then add predictions for all relevant dominant models.

Next, we’ll create a scatterplot for the models based on their accuracy and selection rate. We’ll use the x-axis for performance metrics and the y-axis for fairness metrics. We’ll also include point labels and display the plot. This visualization will help us understand the trade-offs between performance and fairness in our models.

# We can evaluate the dominant models along with the unmitigated model.

# Use the unmitigated model to predict on the test data
predictions = {"unmitigated_model": unmitigated_predictor.predict(X_test)}

# Add predictions for all relevant dominant models
metric_frames = {"unmitigated_model": metric_frame}
for i in range(len(non_dominated_models)):
    key = "dominant_model_{0}".format(i)
    predictions[key] = non_dominated_models[i].predict(X_test)

    metric_frames[key] = MetricFrame(
        metrics={
            "accuracy": performance_metric,
            "selection_rate": selection_rate,
            "count": count,
        },
        sensitive_features=A_test,
        y_true=Y_test,
        y_pred=predictions[key],
    )

# scatterplot for the models along their accuracy and selection rate
plot_model_comparison(
    x_axis_metric=performance_metric,
    y_axis_metric=fairness_metric,
    y_true=Y_test,
    y_preds=predictions,
    sensitive_features=A_test,
    point_labels=True,
    show_plot=True,
)

Upon examination, it becomes apparent that we are presented with multiple dominant models that have a marginal difference in their selection rates as well as accuracy levels. These models, while exhibiting slightly lower accuracy, are less influenced by race or gender than the fairness-unaware model. As such, these models may be considered viable candidates for deployment in the model selection process.

It is worth noting, however, that while these models appear to be less prone to bias based on sensitive features, they may still exhibit unfairness towards other features that have not been declared sensitive.

Summary

The significance of responsible AI and fairness in machine learning cannot be overstated. With AI systems becoming increasingly integrated into our lives, it’s essential that we proactively address issues of bias and discrimination.

In this article, we’ve taken a small but vital step towards a more equitable world by developing machine learning models that are fair with respect to sensitive features. Our efforts have been focused on ensuring that our models are less biased and more equitable, using techniques such as reweighting and threshold optimization.

The negative consequences of biased models cannot be ignored. They can lead to discrimination and reputational damage, making it even more important to ensure that our AI systems are fair and just. As AI technology continues to advance, there is an increased need for responsible AI and fairness in AI systems, particularly with the development of more powerful generative models that have the potential to revolutionize various industries.

By taking steps to address issues of bias and discrimination in AI systems, we can ensure that these systems promote positive outcomes and benefit everyone equally. We must strive towards a more equitable and just world through responsible AI and fairness in machine learning.

When used wisely, AI and machine learning can help create a fairer and more equitable world. Image created with Midjourney.

Sources and Further Reading

Books on Responsible AI and Applied Machine Learning

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Using Fairlearn to Build Fair Machine Machine Learning Models with Python: Step-by-Step Towards More Responsible AI appeared first on relataly.com.

Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python

Florian Follonier — Sun, 07 Mar 2021 16:16:19 +0000

In this tutorial, we’ll be using machine learning to predict and map out crime in San Francisco. We’ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the powerful Scikit-Learn library, we’ll train a classification model using the XGboost algorithm to predict 39 types of crimes based on when and where it occurred. We’ll then use the Plotly library to visualize the results on a map of the city, highlighting areas with higher rates of certain crimes. This type of prediction and mapping is similar to what the San Francisco Police Department uses in their practice of predictive policing, where they allocate resources to at-risk areas in an effort to prevent crime.

As we embark on this thrilling journey, we’ll start by downloading and preprocessing the San Francisco crime data. Next, we’ll channel the data to train two distinct classification models. The first model will utilize a standard Random Forest Classifier, while the second will leverage the exceptional XGBoost package. We’ll experiment with various models that boast different hyperparameters. Ultimately, we’ll visualize our predictions on a striking SF crime map and assess the performance of our diverse models. So, buckle up and let’s dive into the exhilarating world of crime prediction and mapping!

Predictive policing can make police work much more efficient and effective. Image generated using Midjourney.

What is Predictive Policing?

The use case we are looking at in this article falls into predictive policing. Predictive policing uses data, algorithms, and other technological tools to predict where and when crimes are likely to occur. The goal of predictive policing is to help law enforcement agencies better allocate their resources and focus their efforts on areas where crime is likely to happen, with the ultimate goal of reducing crime and improving public safety. This approach to policing is based on the idea that by using data and other tools to identify patterns and trends, law enforcement agencies can better anticipate where crimes are likely to occur and take steps to prevent them from happening.

The benefits of predictive policing include the ability to allocate law enforcement resources better, the potential to reduce crime and improve public safety, and the ability to identify trends and patterns that may not be immediately obvious to law enforcement officers. Additionally, by using data and other tools to anticipate where crimes are likely to occur, law enforcement agencies can take proactive steps to prevent those crimes from happening, which can save time and money.

Creating a Crime Map for Predictive Policing using XGBoost in Python

In this practical tutorial, we’ll construct an XGBoost multi-label classifier to predict crime types in San Francisco. Urban crime, such as in San Francisco, is a dynamic and multifaceted issue that can dramatically vary based on location, time, and other factors. Our aim is to develop a predictive algorithm capable of forecasting specific crime types based on a given location and time parameters. The end product is an interactive San Francisco crime map providing a snapshot of crime hotspots throughout the city.

Law enforcement agencies, like the San Francisco Police Department, use similar maps for strategic resource allocation to curb crime rates effectively. Additionally, this SF crime map will underscore crime clusters – areas notorious for particular types of crime incidents. By the end of this tutorial, you’ll have a deeper understanding of using machine learning in practical scenarios and aiding real-world decision-making.

The code is available on the GitHub repository.

View on GitHub Relataly Github Repo

Crime doesn’t sleep in San Francisco. That’s why predictive policing can make a real impact. Image generated with Midjourney

Prerequisites

Before starting the Python coding part, ensure that you have set up your Python 3 environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy
matplotlib
Seaborn

In addition, we will be using XGBoost (‘xgboost’) and the machine learning library scikit-learn.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Load the Data

We begin by downloading the San Francisco crime challenge data on kaggle.com. Once you have downloaded the dataset, place the CSV files (train.csv) into your Python working folder.

The dataset was collected by the SFO police department between 2003 and 2015. According to the data description from the SF crime challenge, the dataset contains the following variables:

Dates: timestamp of the crime incident
Category: Category of the crime incident (only in train.csv) that we will use as the target variable
Descript: detailed description of the crime incident (only in train.csv)
DayOfWeek: the day of the week
PdDistrict: the name of the Police Department District
Resolution: how the crime incident was resolved (only in train.csv)
Address: the approximate street address of the crime incident
X: Longitude
Y: Latitude

The next step is to load the data into a dataframe. Then we use the head() command to print the first five lines and ensure you can see the data.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import plotly.express as px

# The Data is part of the Kaggle Competition: https://www.kaggle.com/c/sf-crime/data
df_base = pd.read_csv("data/crime/sf-crime/train.csv")

print(df_base.describe())
df_base.head()

		X              Y
count  	878049.000000  878049.000000
mean     -122.422616      37.771020
std         0.030354       0.456893
min      -122.513642      37.707879
25%      -122.432952      37.752427
50%      -122.416420      37.775421
75%      -122.406959      37.784369
max      -120.500000      90.000000

	Dates				Category	Descript			DayOfWeek	PdDistrict	Resolution	Address				X			Y
0	2015-05-13 23:53:00	WARRANTS	WARRANT ARREST		Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
1	2015-05-13 23:53:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
2	2015-05-13 23:33:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	VANNESS AV... ST	-122.424363	37.800414
3	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT...	Wednesday	NORTHERN	NONE		1500 Block... ST	-122.426995	37.800873
4	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT ...	Wednesday	PARK		NONE		100 Block... ST		-122.438738	37.771541

If the data was loaded correctly, you should see the first five records of the dataframe, as shown above.

Step #2 Explore the Data

At the beginning of a new project, we usually don’t understand the data well and need to acquire that understanding. Therefore, next, we will explore the data and familiarize ourselves with its characteristics.

The following examples will help us better understand our data’s characteristics. For example, you can use whisker charts and a correlation matrix to understand better the correlation between variables, such as between weekdays and prediction categories. Feel free to create more charts.

2.1 Prediction Labels

Running the code below shows a bar plot of the prediction labels. The plot shows the frequency in which the class labels occur in the data.

# print the value counts of the categories
plt.figure(figsize=(15,5))
ax = sns.countplot(x = df_base['Category'], orient='v', order = df_base['Category'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)

As shown above, our class labels are highly imbalanced, affecting model accuracy. When we evaluate the performance of our model, we need to consider this.

2.2 When a Crime Occured – Considering Dates and Time

We assume that when a crime occurs impacts the type of crime. For this reason, we look at how crimes distribute across different days of the week and times of the day. First, we look at crime numbers per weekday.

# Print Crime Counts per Weekday
plt.figure(figsize=(6,3))
ax = sns.countplot(y = df_base['DayOfWeek'], orient='h', order = df_base['DayOfWeek'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)

Fewer crimes happen on Sundays, and most are on Fridays. So it seems that even criminals like to have a weekend. For the sake of clarity, we thereby limit the categories. Let’s take a look at the time when certain crimes are reported.

# Convert the time to minutes
df_base['Hour_Min'] = pd.to_datetime(df_base['Dates']).dt.hour  + pd.to_datetime(df_base['Dates']).dt.minute / 60

# Print Crime Counts per Time and Category
df_base_filtered = df_base[df_base['Category'].isin([
    'PROSTITUTION', 
    'VEHICLE THEFT', 
    'DRUG/NARCOTIC', 
    'WARRENTS', 
    'BURGLERY', 
    'FRAUD', 
    'ASSAULT',
    'LARCENY/THEFT',
    'VANDALISM'])]

plt.figure(figsize=(16,10))
ax = sns.displot(x = 'Hour_Min', hue="Category", data = df_base_filtered, kind="kde", height=8, aspect=1.5)

In addition, the time when a crime happens affects the likelihood of certain types. For example, we can see that FRAUD rarely occurs at night and usually during the day. We can see that criminals often go to work in the afternoon and at midnight. On the other hand, certain crimes, such as VEHICLE THEFT, mainly occur at night and late afternoon but less often in the morning.

If you want to gain an overview of additional features, you can use the pair plot function. Because our dataset is large, we reduce the computation time by plotting 1/100 of the data.

sns.pairplot(data = df_base_filtered[0::100], height=4, aspect=1.5, hue='Category')

2.3 Where a Crime Occured – Considering Address

Next, we look at the address information, from which we can often extract additional information. We do this by printing some sample address values.

# Extracting information from the streetnames
for i in df_base['Address'][0:10]:
    print(i)

OAK ST / LAGUNA ST
OAK ST / LAGUNA ST
VANNESS AV / GREENWICH ST
1500 Block of LOMBARD ST
100 Block of BRODERICK ST
0 Block of TEDDY AV
AVALON AV / PERU AV
KIRKWOOD AV / DONAHUE ST
600 Block of 47TH AV
JEFFERSON ST / LEAVENWORTH ST

The street names alone are not so helpful. However, the address data does provide additional information. For example, it tells us whether the location is a street intersection or not. In addition, it contains the type of street. This information is valuable because now we can extract parts of the text and use them as separate features.

We could do a lot more, but we’ve got a good enough idea of the data.

Step #3 Data Preprocessing

Probably the most exciting and important aspect of model development is feature engineering. Compared to model parameterization, the right features can often achieve more significant leaps in performance.

3.1 Remarks on Data Preprocessing for XGBoost

When preprocessing the data, it is helpful to know which algorithms to use because some algorithms are picky about the shape of the data. We will prepare the data to train a gradient-boosting model (XGBoost). This algorithm uses a random forest ensemble, which can only handle integer and Boolean values, but no categorical data. Therefore we need to encode our values. We also need to map the categorical labels to integer values.

We don’t need to scale the continuous feature variables because gradient boosting and decision trees, generally, are not sensitive to variables that have different scales.

3.2 Feature Engineering

Based on the data exploration that we have done in the previous section, we create three feature types:

Date & Time: When a crime happens is essential. For example, when there is a lot of traffic on the street, there is a higher likelihood of traffic-related crimes. For example, when it is Saturday, more people will usually come to the nightlife district, which attracts certain crimes, e.g., drug-related. Therefore, we will create different features for the time, the day, the month, and the year.
Address: As mentioned, we will extract additional features from the address column. First, we create different features for the street type (for example, ST, AV, WY, TR, DR). In addition, we check whether the address contains the word “Block.” In addition, we will let our model know whether the address is a street crossing.
Latitude & Longitude: We will transform the latitude and longitude values into polar coordinates. We will also remove some outliers from the dataset whose latitude is far off the grid. Above all, this will make it easier for our model to make sense of the location.

Considering these features, the primary input to our crime-type prediction model is the information on when and where a crime occurs.

# Processing Function for Features
def cart2polar(x, y):
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

def preprocessFeatures(dfx):
    
    # Time Feature Engineering
    df = pd.get_dummies(dfx[['DayOfWeek' , 'PdDistrict']])
    df['Hour_Min'] = pd.to_datetime(dfx['Dates']).dt.hour + pd.to_datetime(dfx['Dates']).dt.minute / 60
    # We add a feature that contains the expontential time
    df['Hour_Min_Exp'] = np.exp(df['Hour_Min'])
    
    df['Day'] = pd.to_datetime(dfx['Dates']).dt.day
    df['Month'] = pd.to_datetime(dfx['Dates']).dt.month
    df['Year'] = pd.to_datetime(dfx['Dates']).dt.year

    month_one_hot_encoded = pd.get_dummies(pd.to_datetime(dfx['Dates']).dt.month, prefix='Month')
    df = pd.concat([df, month_one_hot_encoded], axis=1, join="inner")
    
    # Convert Carthesian Coordinates to Polar Coordinates
    df[['X', 'Y']] = dfx[['X', 'Y']] # we maintain the original coordindates as additional features
    df['dist'], df['phi'] = cart2polar(dfx['X'], dfx['Y'])
  
    # Extracting Street Types
    df['Is_ST'] = dfx['Address'].str.contains(" ST", case=True)
    df['Is_AV'] = dfx['Address'].str.contains(" AV", case=True)
    df['Is_WY'] = dfx['Address'].str.contains(" WY", case=True)
    df['Is_TR'] = dfx['Address'].str.contains(" TR", case=True)
    df['Is_DR'] = dfx['Address'].str.contains(" DR", case=True)
    df['Is_Block'] = dfx['Address'].str.contains(" Block", case=True)
    df['Is_crossing'] = dfx['Address'].str.contains(" / ", case=True)
    
    return df

# Processing Function for Labels
def encodeLabels(dfx):
    df = pd.DataFrame (columns = [])
    factor = pd.factorize(dfx['Category'])
    return factor

# Remove Outliers by Longitude
df_cleaned = df_base[df_base['Y']<70]

# Encode Labels as Integer
factor = encodeLabels(df_cleaned)
y_df = factor[0]
labels = list(factor[1])
# for val, i in enumerate(labels):
#     print(val, i)

We could also try to further improve our features by using additional data sources, such as weather data. However, there is no guarantee that this will improve the model results, and it did not in the case of criminal records. Therefore, we have omitted this part.

Step #4 Visualize Crime Types on a Map of San Francisco

Next, we create a San Francisco crime map using the cartesian coordinates indicating where a crime has occurred. First, we only plot the data without a geographical map. Later we will use these spatial data to create a dot plot and overlay it with a map of San Francisco. Visualizing the crime types on a map helps us understand how crime types distribute across the city.

4.1 Plot Crime Types using a Scatter Plot

Next, we want to gain an overview of possible spatial patterns and hotspots. We expect to see streets and neighborhoods where certain crimes are more common than in the more expensive areas of the city. In addition, we expect to see places in the city where certain crime types occur relatively rarely. To gain an overview of the crime distribution in San Francisco, we use a scatter plot to display the crime coordinates on a blank chart.

Running the code below creates the crime map of San Francisco with all crime types. Depending on the speed of your machine, the creation of the map may take several minutes.

# Plot Criminal Activities by Lat and Long
df_filtered = df_cleaned.sample(frac=0.05)  
#df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05) # to filter 

groups = df_filtered.groupby('Category')

fig, ax = plt.subplots(sharex=False, figsize=(20, 12))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name, alpha=0.9)
ax.legend()
plt.show()

The plot shows that certain streets in San Francisco are more prone to specific crime types than others. It is also clear that there are certain crime hotspots in the city, especially in the center. We can also see that few crimes are reported in public park areas.

4.2 Create a Crime Map of San Francisco using Plotly

Next, we will create a San Francisco crime map using the Plotly Python library. Because the plugin can handle a limited amount of data simultaneously, we will reduce our data to a fraction of 1% and a few selected crime types.

Running the code below opens a _map.html file in your browser that displays the SF crime map. The result is a zoomable geographic map of San Francisco that shows how the selected crime types distribute across the city.

# 4.2 Create a Crime Map of San Francisco using Plotly
# Limit the data to a fraction and selected categories
df_filtered = df_cleaned.sample(frac=0.01) 
fig = px.scatter_mapbox(df_filtered, lat="Y", lon="X", hover_name="Category", color='Category', hover_data=["Y", "X"], zoom=12, height=800)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

The SF crime map shows different types of crimes, including prostitution, vehicle theft, and fraud. The interactive map allows you to change zoom levels and filter the type of crime displayed on the map. For example, if you filter DRUG/NARCOTIC-related crimes, you can see that these crimes mainly occur in the city center near the financial district and the nightlife area.

Step #5 Split the Data

Before training our predictive model, we will split our data into separate datasets for training and testing. For this purpose, we use the train_test_split function of scikit-learn and configure a split ratio of 70%. Then we output the data, which we employ in the next step to train and validate a model.

# Create train_df & test_df
x_df = preprocessFeatures(df_cleaned).copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train

		DayOfWeek_Friday	DayOfWeek_Monday	DayOfWeek_Saturday	DayOfWeek_Sunday	DayOfWeek_Thursday	DayOfWeek_Tuesday	DayOfWeek_Wednesday	PdDistrict_BAYVIEW	PdDistrict_CENTRAL	PdDistrict_INGLESIDE	...	Y			dist		phi			Is_ST	Is_AV	Is_WY	Is_TR	Is_DR	Is_Block	Is_crossing
276998	0					0					0					0					0					1					0					0					0					0						...	37.785023	128.110900	2.842200	True	False	False	False	False	True		False
81579	0					0					0					0					0					1					0					0					0					0						...	37.748470	128.185052	2.842677	False	True	False	False	False	True		False
206676	0					0					0					1					0					0					0					0					0					0						...	37.762744	128.113657	2.842389	True	False	False	False	False	True		False
732006	0					0					0					0					0					0					1					0					0					0						...	37.784140	128.109653	2.842204	True	False	False	False	False	False		True
796194	1					0					0					0					0					0					0					0					0					0						...	37.791333	128.125982	2.842185	True	False	False	False	False	True		False
5 rows × 45 columns

Step #6 Train a Random Forest Classifier

We can train the predictive models now that we have prepared the data. We train a basic model based on the Random Forest algorithm in the first step. The Random Forest is a robust algorithm that can handle regression and classification problems. One of our recent articles provides more information on Random Forests and how you can find the optimal configuration of their hyperparameters. In this tutorial, we use the Random Forest to establish a baseline against which we can measure the performance of our XGboost model. We, therefore, use the Random Forest with a simple parameter configuration without tuning the hyperparameters.

# Train a single random forest classifier - parameters are a best guess
clf = RandomForestClassifier(max_depth=100, random_state=0, n_estimators = 200)
clf.fit(x_train, y_train.ravel())
y_pred = clf.predict(x_test)

results_log = classification_report(y_test, y_pred)
print(results_log)

Output exceeds the size limit. Open the full output data in a text editor
              precision    recall  f1-score   support

           0       0.15      0.10      0.12     12657
           1       0.29      0.35      0.32     37898
           2       0.38      0.63      0.47     52237
           3       0.46      0.40      0.43     16136
           4       0.16      0.08      0.10     13426
           5       0.25      0.21      0.23     27798
           6       0.10      0.04      0.06      6850
           7       0.23      0.22      0.23     23087
           8       0.19      0.12      0.15      2586
           9       0.20      0.13      0.15     10942
          10       0.08      0.03      0.05      9559
          11       0.00      0.00      0.00      1300
          12       0.20      0.10      0.14      3200
          13       0.37      0.43      0.40     16282
          14       0.02      0.02      0.02      1350
          15       0.01      0.00      0.00      2912
          16       0.05      0.03      0.04      2217
          17       0.61      0.52      0.56      7865
          18       0.11      0.06      0.08      4954
          19       0.04      0.03      0.03       723
          20       0.28      0.19      0.23       581
          21       0.05      0.02      0.03       708
          22       0.25      0.13      0.17      1333
...
    accuracy                           0.31    263395
   macro avg       0.15      0.12      0.13    263395
weighted avg       0.28      0.31      0.28    263395

The baseline model is a random forest classifier with 31% percent accuracy on the test dataset.

Step #7 Train an XGBoost Classifier

Now that we have a baseline model, we can train our gradient boosting classifier using the XGBoost package. We expect this model to perform better than the baseline.

7.1 About Gradient Boosting

XGBoost is an implementation of a gradient-boosting algorithm that uses a decision-tree-based ensemble machine learning algorithm. The algorithm searches for an optimal ensemble of trees. In this process, the algorithm iteratively adds trees to the model or removes them to reduce the prediction error of the previous tree constellation. The algorithm repeats these steps until it can make no further improvements. Thus, training does not optimize the model against the predictions but the previous model’s residuals (prediction errors).

But XGBoost does more! It is an extreme version of gradient boosting that uses additional optimization techniques to achieve the best result with minimal effort. In contrast to the random decision forest, the XGBoost classification algorithm determines an optimal number of trees in the training process. We do not have to specify this number in advance.

A disadvantage of XGBoost is that it tends to overfit the data. Therefore, testing against unseen data is essential. This tutorial will test only against a single test sample for simplicity, but using cross-validation would be a better choice.

7.2 Train the XGBoost Classifier

Various Gradient Boosting Algorithms are available for Python, including one from scikit-learn. However, scikit-learn does not support multi-threading, which makes the training process slower than necessary. For this reason, we will use the gradient boosting classifier from the XGBoost package.

# Configure the XGBoost model
param = {'booster': 'gbtree', 
         'tree_method': 'gpu_hist',
         'predictor': 'gpu_predictor',
         'max_depth': 140, 
         'eta': 0.3, 
         'objective': '{multi:softmax}', 
         'eval_metric': 'mlogloss', 
         'num_round': 30,
         'feature_selector ': 'cyclic'
        }

xgb_clf = XGBClassifier(param)
xgb_clf.fit(x_train, y_train.ravel())
score = xgb_clf.score(x_test, y_test.ravel())
print(score)

# Create predictions on the test dataset
y_pred = xgb_clf.predict(x_test)

# Print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)

Output exceeds the size limit. Open the full output data in a text editor
0.30852142219859907
              precision    recall  f1-score   support

           0       0.17      0.01      0.02     12657
           1       0.30      0.42      0.35     37898
           2       0.33      0.72      0.46     52237
           3       0.31      0.27      0.29     16136
           4       0.21      0.03      0.05     13426
           5       0.24      0.18      0.21     27798
           6       0.17      0.01      0.01      6850
           7       0.21      0.19      0.20     23087
           8       0.26      0.01      0.02      2586
           9       0.22      0.08      0.12     10942
          10       0.13      0.00      0.00      9559
          11       0.07      0.00      0.01      1300
          12       0.20      0.08      0.11      3200
          13       0.34      0.43      0.38     16282
          14       0.00      0.00      0.00      1350
          15       0.12      0.00      0.01      2912
          16       0.15      0.02      0.03      2217
          17       0.57      0.34      0.43      7865
          18       0.19      0.03      0.05      4954
          19       0.00      0.00      0.00       723
          20       0.50      0.24      0.32       581
          21       0.10      0.01      0.01       708
...
    accuracy                           0.31    263395
   macro avg       0.18      0.11      0.11    263395
weighted avg       0.27      0.31      0.25    263395

Now that we have trained our classification model, let’s see how it performs. For this purpose, we will generate predictions (y_pred) on the test dataset (x_test). Afterward, we use the predictions and the valid values (y_test) to create a classification report.

Our model achieves an accuracy score of 31%. At first hand, this might not look so good, but considering that we have 39 categories and only sparse information available, this performance is quite impressive.

Step #8 Measure Model Performance

So how well does our XGboost model perform? To measure the performance of our model, we create a confusion matrix that visualizes the performance of the XGboost classifier. If you want to learn more about measuring the performance of classification models, check out this tutorial on measuring classification performance.

Running the code below creates the confusion matrix that shows the number of correct and false predictions for each crime category.

# Print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test.reshape(-1), y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index = np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (16,12))
plt.tight_layout()
sns.set(font_scale=1.4) #for label size
sns.heatmap(df_cm, cbar=True, cmap= "inferno", annot=False, fmt='.0f' #, annot_kws={"size": 13}
           )

The confusion matrix shows that our model frequently predicts crime category two and neglects the other crime types. The reason is the uneven distribution of crime types in the training data. As a result, when we evaluate the model, we need to pay attention to the importance of the different crime types. For example, we might train the model to predict certain crime types accurately, although this might come at a lower accuracy when predicting other crime types. However, such optimizations depend on the technical context and the goals one wants to achieve with the prediction model.

Summary

This tutorial has presented the machine learning use case “Predictive Policing” and showed how to implement it in Python. We have trained an XGBoost model that predicts crime types in San Francisco based on the information on when and where specific crimes have occurred. We also illustrated our data on an interactive crime map of San Francisco with the Plotly Python library. The Crime Map is an intuitive way of visualizing crime in a city and highlighting particular hotspots. Finally, we have used the prediction model to make test predictions and evaluate the model performance against other algorithms, such as a classic Random Decision Forest. The XGBoost model achieves a prediction accuracy of about 31%—a respectable performance, considering that the prediction problem involves 39 crime classes.

We hope this tutorial was helpful. If you have any questions or suggestions on what we could improve, feel free to post them in the comments. We appreciate your feedback.

Predictive policing with machine learning – Crime map of San Francisco, created with Python and Plotly

Sources and Further Reading

Looking for more esciting map vizualizations? Consider the relataly tutorial on visualizing COVID-19 data on geographic heatmaps using GeoPandas.

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python appeared first on relataly.com.

Flight Delay Prediction using Azure Machine Learning

Florian Follonier — Tue, 15 Oct 2019 18:10:23 +0000

If you travel a lot, you’ve probably already experienced this – you’re in a hurry on your way to the airport trying to catch a flight, only to find out that your flight is delayed. Wasn’t it great to know when a flight will be delayed in advance? We can use past flight delay data to develop a classification model that predicts whether a flight will be on time or delayed in the future. This tutorial shows how this works by creating a flight-delay prediction model in the Azure Machine Learning Studio workbench.

The model uses a decision tree algorithm to predict whether a flight will be more or less than 30 minutes late. In this approach, the algorithm searches for patterns in the data of past flight connections and applies these patterns to classify flights into two classes: delayed and not delayed. Travel service providers use similar models to warn customers when a flight is likely to be delayed.

What is Flight Delay Prediction?

Flight delay prediction is the use of data and analytics to forecast whether a flight is likely to be delayed. This information can be used by airlines, airports, and passengers to plan and prepare for potential delays and to make informed decisions about travel plans. Models for predicting flight punctuality typically consider a range of factors, including historical data on past flight delays, weather conditions, the performance of the airline and the specific aircraft involved, and other relevant information. Machine learning algorithms are often used to analyze and process this data, and to generate predictions about the likelihood and duration of flight delays. These predictions can be made in real-time, as the flight is in progress, or in advance, based on the scheduled departure time and other factors.

Developing a Flight Prediction Model in Azure Machine Learning

In the following hands-on tutorial, we will develop a flight prediction model in Azure Machine Learning. To develop the model, we will carry out the following steps:

Gather and prepare the data: We will load data on past flights, including information on their departure and arrival times, the airlines and aircraft involved, the airports and routes, and other relevant factors. This data needs to be cleaned and preprocessed to remove any errors or inconsistencies and to format it in a way that can be used by the machine learning model.
Explore and analyze the data: This involves using tools and techniques, such as data visualization and statistical analysis, to understand the patterns and trends in the data and to identify any factors that may be associated with flight delays.
Train a machine learning algorithm: We train a decision forest algorithm on the data to learn the patterns and relationships that exist in the data.
Evaluate the model: We test our trained model on a separate set of data to see how accurately it can predict flight delays. This can be done using a range of metrics, such as accuracy, precision, and recall, to measure the performance of the model and to identify any areas for improvement.
Deploy and use the model: We don’t cover this step explicitly. However, we broadly discuss how we could deploy the model into production.

Prerequisites: Signing Up for a Free Azure Account

In order to use the azure machine learning studio, you require access to a valid Azure subscription. Don’t worry, you can sign up for a free Azure account that offers beginners a sufficient number of credits. Simply follow these steps:

Go to the Azure website (https://azure.microsoft.com/).
Click on the “Try azure for free” button in the middle of the page. Then select “start free” on the next page.
Enter your email address and password, and create a new Microsoft account, if you don’t already have one.
Follow the on-screen instructions to complete the sign-up process, including verifying your email address and phone number.
Once your account has been created, you can log in to the Azure portal (https://portal.azure.com/) using your Microsoft account credentials.
From the Azure portal, you can access a range of services and resources, including machine learning, data analytics, and other cloud-based tools and services.

Note that the free Azure account includes a limited amount of free credit and services, which you can use to try out Azure and learn more about its capabilities. Once your free credit has been used up, you can choose to upgrade to a paid subscription to continue using Azure.

Step #1 Access Microsoft Azure Machine Learning Studio

This article uses Microsoft’s data science workbench Azure Machine Learning Studio (classic). There is also a new version of the Machine Learning Studio available, which provides more functionality but works in a similar way. The classic workbench provides comprehensive functions such as creating data pipelines, training and testing machine learning models, and publishing trained models as a web service via an API.

The studio is available via free trial access. You can create a free test account (8h valid) via “Sign up here” on Azure Machine Learning Studio or log in with an existing Microsoft Live account. After the successful login, you will see the experiments section:

Experiments section of the Azure Machine Learning Studio

Step #2 Importing Training Data into Azure ML

This tutorial will work with the CSV dataset FlightDelayData.csv

After downloading the dataset, you can import it into Azure Machine Learning Studio. Navigate to “Experiments” and click on “+ New” at the bottom left. On the following page, select “Dataset” on the left and then “Upload Local File.” Select the file FlightDelayData and confirm the upload.

Uploading a new dataset

Confirm the dialog to access the experiment workspace. Here, you will find a list of different modules (highlighted in light blue) to the left of the workspace. The modules provide all central functions in Azure Machine Learning Studio, such as transforming and exploring the data and using them in machine learning.

The module tab of Azure Machine Learning

Step #3 Exploring the Data

Now that the data set is available in Azure Machine Learning, we will prepare it for its use in our flight delay prediction model. First, we will drag and drop the FlightDelayData dataset from “Saved Datasets” into the grey workspace of the experiment. Next, we will visualize the data by right-clicking on FlightDelayData –> “dataset” –> “Visualize” in the grey work area.

Clicking on the individual columns will give you an overview of the data sets’ characteristics and distribution. In the upper left corner, you can see that the dataset contains 135970 entries for flight connections. Each entry or line represents one flight. All flights took place in 2013. Furthermore, the data includes the departure and arrival locations of flights, time and day of departure and arrival, the airline, and the deviation from the planned take-off and landing time.

Step #4 Creating a Data Pipeline

Before we can train the model, we need to split the data into two parts: train and test. We will use the first part of the data to train the Machine Learning model and the second to evaluate its predictions. This approach is known as supervised learning. Search for the “Split Data module” in the search list on the left and drag and drop it into the grey workspace to split the data. After this, you can connect the two modules by clicking on the output of the data set (FlightDelayData) and dragging it to the input of the “Split Data module” (see screenshot).

Next, we configure the Split Data module. Click on the module and make the following settings on the right side under “Properties”: Fraction of rows in the first output dataset: 0.7 and Random seed: 123.

In this way, we divide the data randomly in a 70/30 ratio. You can leave the others as they are.

Splitting the data into train and test

(In practice, the compilation and preparation of the data are much more complex. I have already carried out some steps in advance.)

Step #5 Creating a Classification Model

Now we will create a classification model. Therefore, we will pull further models into the grey area of the workbench. Our model will use a boosted decision tree classifier. We can use this algorithm by dragging the module “Two-Class Boosted Decision Tree” into the grey workspace below the other modules. You can leave the settings of the module unchanged.

Next, we select the module “Train Model” and drag it into the grey workspace under the other modules. In the workspace, connect the output of the “Two-Class Boosted Decision Tree” module to the left input of the “Train Model” module.

Remember, we want to predict whether flights will be more or less than 15 minutes late. Select “Train Model” in the grey workspace and click on “Launch Column Selector” under Properties on the right. In the Column Selector, enter “ArrDel15” under “Column Name”. This column contains the so-called “prediction label,” which is the information on whether flights were more or less than 15 minutes late. Don’t forget to connect the left output of the Split Data module to the right input of the Train Model.

To evaluate the model’s predictions later, we will add a “Score Model.” We do this by selecting the module “Score Model” and dragging it into the workspace below the other modules. Finally, we need to create two connections. First, connect the left input of the “Score Model” with the left output of the “Train Model.” Second, connect the input node on the right of the “Score Model” to the output on the right of “Split Data,” which is the 30% of the original data set we use to test the model.

Selecting the prediction label

Step #6 Training the Model

Before we can train the model, we add the module “Evaluate Model” by searching it in the module tab and dragging it into the workspace. Finally, we connect the (left) input of the “Evaluate Model” with the output of the “Score Model.”

Creating a machine learning model

Once we have created the data transformation pipelines, we are ready to train the model. Start the training process by clicking the “Run button” in the dark bar at the bottom. It may take a few minutes until the process has finished. Meanwhile, you can monitor the processing progress by the green checkmarks shown on the modules.

Model after successful training

Step #7 Evaluating Model Performance

So far, we have built a statistical model for flight delay prediction. Of course, we want to know how often our model is right or wrong with the predictions. Evaluating the performance of prediction models is thus an essential step in their development. To evaluate the model performance, right-click on “Evaluate Model” -> “Evaluation results” -> “Visualize.” Below you will find the receiver operating characteristic (ROC) of the trained model:

Metrics used to evaluate the performance of a classification model

Let’s look at the different metrics at the bottom.

The test data set contains 40791 flights, 30% of the original data.
The model correctly predicted for 2098 flights that, they would have more than 15 delays (true positives).
The model was wrong in 1310 cases (false positives).
According to the model’s prediction, six thousand eight hundred twenty-five flights were more than 15 minutes delayed (false negatives).
The model was correct in 30558 cases, estimating that these flights will have less than 15 minutes delays.
Overall, the model is correct in about 80% of the cases (Accuracy = 0.801).

Finally, we look at the ROC curve. The curve illustrates the reliability of the model depending on the prediction threshold. The larger the area under the curve, the better the prediction model. The curve is sloped upwards and lies above the grey line, which means that the model works better than random assumptions. The gray diagonal line corresponds to a 50% chance to lie correctly, i.e., easy to guess. With a perfect correct model for every flight, the area would be 1.0.

Step #8 Deploying the Model as a Web Service

Now that we have our model available, of course, we would like to make it accessible to others. Let’s quickly discuss how we could deploy our model as a web service from the designer.

We won’t go into too much detail here and will only cover the basic steps for deploying a model. To deploy a machine learning pipeline in Azure Machine Learning, you need to convert the training pipeline into an inference pipeline. The pipeline can process prediction requests in real time or in batch mode. Deploying the model from the designer involves removing the training components, such as the Train Model and Split Data modules, and adding web service inputs and outputs to handle user requests. When you create an inference pipeline, azure ml stores the trained model as a Dataset component in the component palette. From there, you can access it under “My Datasets.” The saved trained model is then added to the pipeline, along with the web service input and output components. These components define the points where user data enters the pipeline and where the pipeline returns the results. Once you have deployed the model, other applications and systems can access it via a REST API.

Summary

We have created a flight delay prediction model in Azure Machine Learning Studio in this article. The model can predict with 80% certainty whether flights on specific routes will be more or less than 15 minutes late. This can help airlines, airports, and passengers to plan and prepare for potential delays and to make informed decisions about travel plans. We have also discussed how we can use the studio designer to deploy our model as web services to provide predictions in real-time or batch mode.

The prediction model is only the first version and still offers room for optimization. One option to further improve the model would be to add features such as the weather, the aircraft type, etc. Another option would be to test different algorithms and hyperparameters.

I hope this article was helpful. If you have remarks or questions, please write them in the comments.

The post Flight Delay Prediction using Azure Machine Learning appeared first on relataly.com.