Forecasting Criminal Activity in San Francisco using XGBoost and Python

I recently came across an interesting Kaggle contest that involves predicting different types of criminal activity in San Francisco. As you can imagine, in a massive city like San Francisco, numerous crimes happen every day. Crime statistics distinguish between 39 different types of crimes. The most commonly reported ones include vehicle theft, assault, and drug-related activities. The goal of the Kaggle competition is to predict these types of crimes based on when and where they are reported. This article presents our approach to this prediction challenge based on Python and the powerful XGBoost classification algorithm. In addition, we use the mplleaflet Python library to visualize crime types on a geographic map of San Francisco.

The remainder of this article proceeds with the code implementation. We describe how we preprocess the data and how we configure the parameters of the XGBoost model. Our model achieves a prediction accuracy of about 31%, a respectable performance considering that the prediction problem involves a total of 39 different crime classes.

Python Implementation: Predicting Crime Types with XGBoost

In the following, we develop a gradient-boosting multi-class classifier (XGBoost) that predicts crime types in San Francisco. As usual, you can find the code in the relataly GitHub Repo.

Prerequisites

Before we start the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the standard packages pandas, NumPy, and matplotlib.

In addition, we will be using the XGBoost package, the machine learning library scikit-learn, and seaborn. Finally, we will be using the mplleaflet Python library for visualization.

You can install packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the anaconda packet manager)
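For example, the non-standard packages used in this tutorial can be installed with a single command (package names as listed on PyPI; exact versions may differ on your system):

pip install pandas numpy matplotlib seaborn scikit-learn xgboost mplleaflet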

Step #1 Load the Data

We begin by downloading the data from the SF crime challenge on kaggle.com. Once you have downloaded the dataset, place the CSV file (train.csv) into your Python working folder.

The dataset was collected by the San Francisco Police Department between 2003 and 2015. According to the data description from the SF crime challenge, it contains the following variables:

  • Dates – timestamp of the crime incident
  • Category – category of the crime incident (only in train.csv). This is the target variable you are going to predict.
  • Descript – detailed description of the crime incident (only in train.csv)
  • DayOfWeek – the day of the week
  • PdDistrict – name of the Police Department District
  • Resolution – how the crime incident was resolved (only in train.csv)
  • Address – the approximate street address of the crime incident 
  • X – Longitude
  • Y – Latitude

The next step is to load the data into a dataframe. Then we use the head() command to print the first five lines and make sure that everything has been loaded correctly.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
from datetime import datetime, timedelta
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import mplleaflet

# The Data is part of the Kaggle Competition: https://www.kaggle.com/c/sf-crime/data
df_base = pd.read_csv("data/sf-crime/train.csv")

print(df_base.describe())
df_base.head()

If the data was loaded correctly, you should see the first five records of the dataframe, as shown above.

Step #2 Explore the Data

At the beginning of a new project, we usually don’t understand the data well and need to acquire that understanding. Therefore, the next step after loading the data is typically to explore it and familiarize ourselves with its characteristics.

The following examples will help us to understand the characteristics of our data better. Feel free to create more charts. For example, you can use whisker charts and a correlation matrix to get a more detailed understanding of the correlation between variables such as between weekdays and prediction categories.
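As a sketch of such an analysis, the following snippet cross-tabulates weekdays and crime categories and visualizes the result as a heatmap. It assumes the dataframe df_base and the imports from Step #1; normalizing each category column makes the weekday patterns of frequent and rare crimes comparable.

# Cross-tabulate weekdays and crime categories
ct = pd.crosstab(df_base['DayOfWeek'], df_base['Category'])

# Normalize each category column so that frequent and rare crimes are comparable
ct_norm = ct / ct.sum()

# Visualize the weekday patterns per category
plt.figure(figsize=(16, 4))
sns.heatmap(ct_norm, cmap='viridis')
plt.show()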

2.1 Prediction Labels

We print an overview of the prediction labels to see if the labels are imbalanced.

# print the value counts of the categories
plt.figure(figsize=(15,5))
sns.countplot(x = df_base['Category'], orient='v', order = df_base['Category'].value_counts().index)
plt.xticks(rotation=75)
crime types in San Francisco

As shown above, our class labels are highly imbalanced, which can affect model accuracy. When we evaluate the performance of our model, we need to take this into account.
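To quantify the imbalance, we can look at the relative frequency of each category. The share of the most frequent class also serves as a naive baseline accuracy that any model should beat (a quick sketch based on df_base; the exact numbers depend on the dataset version you downloaded):

# Relative frequency of the crime categories
category_share = df_base['Category'].value_counts(normalize=True)
print(category_share.head())

# A classifier that always predicts the most frequent category reaches this accuracy
print('Majority-class baseline accuracy:', round(category_share.iloc[0], 3))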

2.2 Dates and Time

We assume that the time at which a crime occurs affects the type of crime. For this reason, we look at how crimes are distributed across different days of the week and times of the day. First, we look at crime counts per weekday.

# Print Value Counts per Weekday
plt.figure(figsize=(6,3))
sns.countplot(y = df_base['DayOfWeek'], orient='h', order = df_base['DayOfWeek'].value_counts().index)
plt.xticks(rotation=75)

The fewest crimes are reported on Sundays and the most on Fridays. So it seems that even criminals like to have a weekend. 😉

Let’s take a look at the time when certain crimes are reported. For the sake of clarity, we limit the chart to a selection of categories.

# Convert the time to minutes
df_base['Hour_Min'] = pd.to_datetime(df_base['Dates']).dt.hour  + pd.to_datetime(df_base['Dates']).dt.minute / 60

# Print Crime Counts per Time and Category
df_base_filtered = df_base[df_base['Category'].isin([
    'PROSTITUTION', 
    'VEHICLE THEFT', 
    'DRUG/NARCOTIC', 
    'WARRANTS', 
    'BURGLARY', 
    'FRAUD', 
    'ASSAULT',
    'LARCENY/THEFT',
    'VANDALISM'])]

sns.displot(x = 'Hour_Min', hue="Category", data = df_base_filtered, kind="kde", height=8, aspect=1.5)
plt.show()
different crime types in San Francisco and how often they occur during the day

The time when a crime is committed clearly affects the likelihood of certain types. For example, we can see that FRAUD rarely occurs at night and is mostly reported during the day. Overall, criminals most often seem to go to work in the afternoon and around midnight. On the other hand, certain crimes such as VEHICLE THEFT mainly occur at night and in the late afternoon, but less often in the morning.

2.3 Address

Next, we look at the address information, from which we can often extract additional information. We do this by printing some sample address values.

# let's explore what information we can extract from the streetnames
for i in df_base['Address'][0:10]:
    print(i)

The street names alone are not so helpful. However, the address data does provide additional information. For example, it tells us whether the location is a street intersection or not. In addition, it contains the type of street. This information is valuable because now we can extract parts of the text and use them as separate features.
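To get a feeling for how useful these address patterns are, we can count how often they occur. The snippet below is a rough sketch that relies on the string patterns visible in the sample addresses ("Block of" for block addresses, " / " for intersections, and common street-type suffixes); these patterns are an assumption based on the printed examples.

# Share of addresses that refer to a block vs. a street intersection
print('Block addresses:', df_base['Address'].str.contains(' Block of ').mean())
print('Intersections:  ', df_base['Address'].str.contains(' / ').mean())

# Share of addresses per common street-type suffix
for suffix in [' ST', ' AV', ' WY', ' TR', ' DR']:
    print(suffix, df_base['Address'].str.contains(suffix).mean())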

There’s a lot more we could do, but for now, we’ve got a good enough idea of the data.

Step #3 Preprocessing

Probably the most exciting and important aspect of model development is feature engineering. Compared to model parameterization, the right features can often achieve greater leaps in performance.

3.1 Remarks on data preprocessing for XGBoost

When preprocessing the data, it is helpful to know which algorithms to use because some algorithms are picky about the shape of the data. Therefore, we have to do some transformations.

We intend to prepare the data so that we can use it to train a gradient boosting model (XGBoost). This algorithm builds an ensemble of decision trees and can only handle numeric and Boolean inputs, but no string-based categorical data. Therefore, we need to encode the categorical features. We also need to map the categorical labels to integer values.

We don’t need to scale the continuous feature variables because gradient boosting and decision trees, in general, are not sensitive to variables that have different scales.
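As a minimal illustration of these two encoding steps, pd.get_dummies one-hot encodes a categorical feature, while pd.factorize maps the target categories to integer codes. The values below are made up for demonstration purposes:

# One-hot encode a categorical feature column (toy example)
toy = pd.DataFrame({'PdDistrict': ['MISSION', 'CENTRAL', 'MISSION']})
print(pd.get_dummies(toy['PdDistrict']))

# Map categorical labels to integer codes and keep the code-to-label mapping
codes, uniques = pd.factorize(pd.Series(['ASSAULT', 'FRAUD', 'ASSAULT']))
print(codes)          # [0 1 0]
print(list(uniques))  # ['ASSAULT', 'FRAUD']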

3.2 Feature Engineering

Based on the data exploration that we have done in the previous section, we create three types of features:

  • Date & Time: When a crime is committed is important. For example, when there is a lot of traffic on the street, there is a higher likelihood of traffic-related crimes. As another example, when it is Saturday, more people will usually come to the nightlife district, which attracts certain crimes, e.g. drug-related. Therefore, we will create separate features for the time, the day, the month, and the year.
  • Address: As mentioned before, we will extract additional features from the address column. First, we create separate features for the type of street (for example, ST, AV, WY, TR, DR). We also check whether the address contains the word “Block”, and we let our model know whether the address is a street crossing.
  • Latitude & location: We will transform the latitude and longitude values into polar coordinates. Above all this will make it easier for our model to make sense of the location. We will also remove some outliers from the dataset whose latitude is far off the grid.

Considering these features, the primary input to our crime type prediction model is the information when and where a crime occurs.

# Processing Functions for Features
def cart2polar(x, y):
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

def preprocessFeatures(dfx):
    
    # Time Feature Engineering
    df = pd.get_dummies(dfx[['DayOfWeek' , 'PdDistrict']])
    df['Hour_Min'] = pd.to_datetime(dfx['Dates']).dt.hour + pd.to_datetime(dfx['Dates']).dt.minute / 60
    df['Day'] = pd.to_datetime(dfx['Dates']).dt.day
    df['Month'] = pd.to_datetime(dfx['Dates']).dt.month
    df['Year'] = pd.to_datetime(dfx['Dates']).dt.year

    month_one_hot_encoded = pd.get_dummies(pd.to_datetime(dfx['Dates']).dt.month, prefix='Month')
    df = pd.concat([df, month_one_hot_encoded], axis=1, join="inner")
    
    # Convert Carthesian Coordinates to Polar Coordinates
    df[['X', 'Y']] = dfx[['X', 'Y']]
    df['dist'], df['phi'] = cart2polar(df['X'], df['Y'])
  
    # Extracting Street Types
    df['Is_ST'] = dfx['Address'].str.contains(" ST", case=True)
    df['Is_AV'] = dfx['Address'].str.contains(" AV", case=True)
    df['Is_WY'] = dfx['Address'].str.contains(" WY", case=True)
    df['Is_TR'] = dfx['Address'].str.contains(" TR", case=True)
    df['Is_DR'] = dfx['Address'].str.contains(" DR", case=True)
    df['Is_Block'] = dfx['Address'].str.contains(" Block", case=True)
    df['Is_crossing'] = dfx['Address'].str.contains(" / ", case=True)
    
    return df

# Processing Function for Labels
def encodeLabels(dfx):
    # pd.factorize returns the integer codes and the list of unique categories
    factor = pd.factorize(dfx['Category'])
    return factor

# Remove outliers whose latitude is far outside San Francisco
df_cleaned = df_base[df_base['Y'] < 70]

# Encode Labels as Integer
factor = encodeLabels(df_cleaned)
y_df = pd.DataFrame()
y_df['f_Category'] = factor[0]
labels = list(factor[1])
# for i, label in enumerate(labels):
#     print(i, label)

We could also try to further improve our features by using additional data sources, such as weather data. However, there is no guarantee that this will improve the model results, and in the case of the criminal records, it did not. Therefore, we have omitted this part.

Step #4 Visualize the Data on a Map

The location where a crime occurs is given as Cartesian coordinates (longitude and latitude). We can use this spatial data to create a dot plot and overlay it with a map of San Francisco. In this way, we can better understand how different crime types are distributed across the city.

4.1 Simple plot

First, we plot the coordinates on a blank chart. This allows us to see different patterns. For example, we see streets and neighborhoods where certain types of crimes are most common. In addition, we can see areas where certain types of crimes occur relatively rarely.

# Plot Criminal Activities by Lat and Long
df_filtered = df_cleaned.sample(frac=0.05)  
#df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05) # to filter 

groups = df_filtered.groupby('Category')

fig, ax = plt.subplots(sharex=False, figsize=(20, 10))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name, alpha=0.1)
ax.legend()
plt.show()
Crime map of San Francisco

We can clearly see that certain streets in San Francisco are more prone to certain types of crime than others. It is also clear that there are certain crime hotspots in the city and especially in the center.

4.2 Create a Map of San Francisco

We can also display the crimes on a map of San Francisco. We use the mplleaflet Python plugin, which renders matplotlib plots on an interactive map of San Francisco. Running the code below creates a _map.html file that opens in your browser to display the map. Since the map can only handle a limited amount of data, we will only use a small fraction of the data and limit the crimes to selected categories.

# Limit the data to selected categories and a small sample
# df_filtered = df_cleaned.sample(frac=0.01)  # alternative: a 1% sample across all categories
df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05)
groups = df_filtered.groupby('Category')

# Visualize Criminal Activities on a Geo Map
fig, ax = plt.subplots(sharex=False, figsize=(20, 10))
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name)
mplleaflet.show()
Crime map of San Francisco created with the mplleaflet Python library

We can create this map for all types of crimes or limit the information to specific crimes. Once we filter the data, we can see that some crimes have their hotspots in the city. Examples are DRUG/NARCOTIC-related crimes, which occur mainly in the city center.

Step #5 Split the Data

Before we start training our predictive model, we will split our data into separate train and test datasets. For this purpose, we use the train_test_split function of scikit-learn and configure a split ratio of 70%. Then we output the data, which we employ in the next step to train and validate a model.

# Create the feature matrix from the cleaned data
x_df = preprocessFeatures(df_cleaned).copy()

# Split the data into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train

Step #6 Train a Random Forest Classifier

Now that we have prepared the data, we can start training predictive models. In the first step, we train a basic model based on the Random Forest algorithm. The Random Forest is a powerful algorithm that can handle regression and classification problems. If you want to learn more about Random Forests and the optimal configuration of their hyperparameters, read my last article on churn prediction. In this article, we will use the Random Forest with a simple parameter configuration. The performance of this model is the baseline against which we can measure the performance of our XGBoost model.

# Train a single random forest classifier - parameters are a best guess
clf = RandomForestClassifier(max_depth=100, random_state=0, n_estimators=200)
clf.fit(x_train, y_train.values.ravel())
y_pred = clf.predict(x_test)

results_log = classification_report(y_test.values.ravel(), y_pred)
print(results_log)

Our random forest classifier scores an accuracy of about 31% on the test dataset.

Step #7 Train an XGBoost Classifier

In this section, we will train our gradient boosting classifier using the XGBoost package.

7.1 About Gradient Boosting

XGBoost is an implementation of gradient boosting, an ensemble machine learning technique based on decision trees. The algorithm builds an ensemble of trees step by step: in each iteration, it adds a tree that reduces the prediction errors of the current ensemble. This continues until no further improvement is achieved or a maximum number of trees is reached. Thus, training does not fit each tree against the original targets but against the previous ensemble’s residuals (prediction errors).

In contrast to a random decision forest, the trees are not built independently: each new tree builds on the errors of the previous ones, and with early stopping, the number of trees can be determined during training rather than fixed upfront. But XGBoost does more: it is an optimized (“extreme”) implementation of gradient boosting that adds regularization and numerous engineering optimizations to achieve strong results with minimal effort.
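To make the residual-fitting idea tangible, the following toy sketch boosts shallow regression trees by hand on a synthetic regression problem. It is a simplification for illustration only; XGBoost additionally uses second-order gradients, regularization, and many engineering optimizations.

from sklearn.tree import DecisionTreeRegressor

# Synthetic data: learn y = x^2 from noisy samples
rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = X_toy.ravel() ** 2 + rng.normal(0, 0.5, size=200)

learning_rate = 0.3
prediction = np.zeros_like(y_toy)

for _ in range(30):
    residuals = y_toy - prediction                       # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)    # add the new tree's contribution

print('Mean squared error after boosting:', np.mean((y_toy - prediction) ** 2))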

A disadvantage of XGBoost is that it tends to overfit the data. Therefore, testing against unseen data is essential. In this tutorial, for the sake of simplicity, we test against a single holdout sample. However, we note that cross-validation would be a better choice, as shown in the sketch below.
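A sketch of such a cross-validated evaluation could look as follows. It uses scikit-learn’s cross_val_score with a deliberately small XGBoost model; the parameters are chosen only for illustration, and running this on the full dataset takes a while.

from sklearn.model_selection import cross_val_score

# 3-fold cross-validation of a small XGBoost model (illustrative parameters)
xgb_cv = XGBClassifier(n_estimators=30, max_depth=8, learning_rate=0.3,
                       objective='multi:softmax', eval_metric='mlogloss')
cv_scores = cross_val_score(xgb_cv, x_df, y_df.values.ravel(), cv=3, scoring='accuracy')
print('Cross-validated accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))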

7.2 Train XGBoost

There are various gradient boosting implementations available for Python, including the GradientBoostingClassifier from scikit-learn. However, that implementation does not support multi-threaded training, which makes the training process slower than necessary. For this reason, we will use the gradient boosting classifier from the XGBoost package.

# Configure the XGBoost model
param = {'booster': 'gbtree', 
         'tree_method': 'gpu_hist',        # requires a GPU; use 'hist' for CPU-only training
         'predictor': 'gpu_predictor',     # requires a GPU as well
         'max_depth': 140, 
         'learning_rate': 0.3, 
         'objective': 'multi:softmax', 
         'eval_metric': 'mlogloss', 
         'n_estimators': 30
        }

# Pass the parameters as keyword arguments
xgb_clf = XGBClassifier(**param)
xgb_clf.fit(x_train, y_train.values.ravel())
score = xgb_clf.score(x_test, y_test.values.ravel())
print(score)

# Create predictions on the test dataset
y_pred = xgb_clf.predict(x_test)

# Print a classification report
results_log = classification_report(y_test.values.ravel(), y_pred)
print(results_log)

Now that we have trained our classification model, let’s take a look at how it performs. For this purpose, we generate predictions (y_pred) on the test dataset (x_test). Afterward, we use the predictions and the true values (y_test) to create a classification report.

Our model achieves an accuracy score of about 31%. At first glance, this might not seem like much, but considering that the model has to distinguish 39 categories based on sparse information, this performance is quite respectable.

Step #8 Measure Model Performance

Finally, we create a confusion matrix to visualize the performance of the XGBoost classifier. The matrix below shows the number of correct and false predictions for each crime category.

# Print a multi-class confusion matrix
class_ids = np.unique(np.concatenate([y_test.values.ravel(), y_pred]))
cnf_matrix = confusion_matrix(y_test.values.ravel(), y_pred, labels=class_ids)
df_cm = pd.DataFrame(cnf_matrix, columns=class_ids, index=class_ids)
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize=(16,12))
plt.tight_layout()
sns.set(font_scale=1.4)  # label size
sns.heatmap(df_cm, cbar=True, cmap="inferno", annot=False, fmt='.0f')
Evaluating the performance of our XGboost classifier

The matrix shows that our model frequently predicts crime category two and tends to neglect the other crime types. The reason for this is the uneven distribution of crime types in the training data. As a result, when we evaluate the model, we need to pay attention to the importance of the different crime types. For example, we might train the model to predict certain crime types particularly accurately, although this might come at the expense of a lower accuracy when predicting other crime types. However, such optimizations depend on the technical context and the goals one wants to achieve with the prediction model. And we can currently only speculate about these goals.
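One way to steer the model toward the rarer crime types is to weight the training samples inversely to their class frequency. The following sketch uses scikit-learn’s compute_sample_weight together with the XGBoost parameters from above; whether this actually improves the metrics that matter for your use case needs to be verified.

from sklearn.utils.class_weight import compute_sample_weight

# Weight each training sample inversely proportional to its class frequency
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train.values.ravel())

# Retrain the XGBoost classifier with the sample weights
xgb_clf_weighted = XGBClassifier(**param)
xgb_clf_weighted.fit(x_train, y_train.values.ravel(), sample_weight=sample_weights)
print(classification_report(y_test.values.ravel(), xgb_clf_weighted.predict(x_test)))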

Summary

Congratulations, you made it to the end of this tutorial! You’ve learned a lot along the way. We have configured and trained an XGBoost model that predicts crime types in San Francisco. In addition, you have learned how to work with spatial data and plot coordinates on a map with the mplleaflet Python library. Finally, we have used the model to make test predictions and compared its performance against a classic Random Forest baseline.

We hope this tutorial was helpful. If you have any questions or suggestions on what we could improve, feel free to post them in the comments. We appreciate your feedback.

Authors

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal of sharing my experiences and creating a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.

  • Hi, I am a student at the Technical University of Munich, currently pursuing a Master’s degree in Electrical Engineering and Information Technology. I am very passionate about Machine Learning, Software Development, and Signal Processing.
