Forecasting Criminal Activity in San Francisco using XGBoost and Python

I recently came across an interesting Kaggle contest that involves predicting different types of criminal activity in San Francisco. Not surprisingly, numerous crimes occur daily in a city as large as San Francisco. Among the most commonly reported are vehicle theft, assault, and drug-related activity. Predicting these crimes can help the police distribute their forces across the city and respond more quickly. The Kaggle competition asks participants to predict 39 different types of crime based on location and time, which makes this a multi-class classification problem. In this blog post, we share our approach to the challenge and describe how we developed a predictive model using XGBoost, which achieves a prediction accuracy of about 31%.

Python Implementation


Before we start the coding part, make sure that you have set up your Python environment and the required packages. In this tutorial, we will be working with the following packages: pandas, scikit-learn, numpy, matplotlib. For visualization purposes, we will also use seaborn and mplleaflet.

You can install these packages using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)

If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment.

1 Load the Data

We begin by downloading the data from the SF crime challenge on Kaggle. Once you have downloaded the dataset, place the csv file (train.csv) into your Python working folder.

The dataset was collected by the San Francisco Police Department between 2003 and 2015. According to the data description from the SF crime challenge, it contains the following data fields:

  • Dates – timestamp of the crime incident
  • Category – category of the crime incident (only in train.csv). This is the target variable you are going to predict.
  • Descript – detailed description of the crime incident (only in train.csv)
  • DayOfWeek – the day of the week
  • PdDistrict – name of the Police Department District
  • Resolution – how the crime incident was resolved (only in train.csv)
  • Address – the approximate street address of the crime incident 
  • X – Longitude
  • Y – Latitude

The next step is to load the data into a dataframe. Then we use the head() command to print the first five lines and make sure that everything has been loaded correctly.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
from datetime import datetime, timedelta
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import mplleaflet

# The data is part of the Kaggle competition on San Francisco crime classification
df_base = pd.read_csv("data/sf-crime/train.csv")
df_base.head()


If the data was loaded correctly, you should see the first five records of the dataframe as shown above.

2 Explore the Data

At the beginning of a new project, we usually don’t understand the data well and need to acquire that understanding. Therefore, the next step after loading the data is typically to explore the data and familiarize ourselves with its characteristics.

The following examples show what we can do to better understand the data. Feel free to create more charts, such as whisker charts, and explore additional relationships between the data such as the relation between weekdays and prediction categories.

2.1 Prediction Labels

We print an overview of the prediction labels to see if the labels are imbalanced.

# print the value counts of the categories
sns.countplot(x = df_base['Category'], orient='v', order = df_base['Category'].value_counts().index)

As we can see above, we are dealing with an imbalanced dataset. In other words, we have different numbers of samples for the prediction labels, which is not desirable. Prediction models will typically perform better when the data is balanced.
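To quantify the imbalance, we can compare relative class frequencies directly. The snippet below is a minimal, self-contained sketch on a hand-made toy column; on the real data you would run the same computation on df_base['Category']:

```python
import pandas as pd

# Toy stand-in for df_base['Category'] - the real column comes from train.csv
categories = pd.Series(['LARCENY/THEFT'] * 6 + ['ASSAULT'] * 3 + ['FRAUD'])

counts = categories.value_counts()
share = counts / counts.sum()  # relative frequency of each label
print(share)
```

A model that always predicts the majority class would already reach its share as accuracy, which is why accuracy alone can be misleading on imbalanced data.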

2.2 Dates and Time

We assume that there are large differences in which day of the week and time of day certain crimes are committed. For this reason, we look at how crime rates are distributed across different days of the week and times of day. First, we look at crime counts per weekday.

# Print Value Counts per Weekday
sns.countplot(y = df_base['DayOfWeek'], orient='h', order = df_base['DayOfWeek'].value_counts().index)

The fewest crimes are reported on Sundays and the most on Fridays. So it seems that even criminals like to have a weekend. 😉

Let’s take a look at the time when certain crimes are reported. For the sake of clarity, we limit the chart to a few categories.

# Convert the time to minutes
df_base['Hour_Min'] = pd.to_datetime(df_base['Dates']).dt.hour  + pd.to_datetime(df_base['Dates']).dt.minute / 60

# Print Crime Counts per Time and Category
df_base_filtered = df_base[df_base['Category'].isin(['FRAUD', 'VEHICLE THEFT'])]

sns.displot(x = 'Hour_Min', hue="Category", data = df_base_filtered, kind="kde", height=8, aspect=1.5)

We can see that criminals most often go to work in the afternoon and at midnight. In addition, time affects the likelihood of certain types of crimes being committed. For example, we can see that FRAUD rarely occurs at night and is usually reported during the day – which seems plausible. The opposite is the case for VEHICLE THEFT, which mostly occurs at night and in the late afternoon, but less often in the morning.

2.3 Address

Next we look at the address information, from which we often can extract additional information. We do this by printing some sample address values, so that we can see how the address data is stored.

# let's explore what information we can extract from the streetnames
for i in df_base['Address'][0:10]:
    print(i)

The street names alone are not that useful. However, the address data does provide additional information. For example, it tells us whether the location is a street intersection or not. In addition, it contains the type of the street. This information is valuable, because we can extract these parts of the text and use them as separate features.
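The kind of extraction we have in mind can be sketched on a few hand-written sample strings in the style of the dataset (the addresses below are made up for illustration):

```python
import pandas as pd

# Hypothetical addresses mimicking the format of the Address column
addresses = pd.Series([
    '800 Block of BRYANT ST',
    'OAK ST / LAGUNA ST',
    '100 Block of GOLDEN GATE AV',
])

is_block = addresses.str.contains(' Block')   # block-style address?
is_crossing = addresses.str.contains(' / ')   # street intersection?
is_st = addresses.str.contains(' ST')         # "ST" street-type suffix?
print(pd.DataFrame({'Block': is_block, 'Crossing': is_crossing, 'ST': is_st}))
```

Each of these Boolean columns can later serve as a separate feature.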

There’s a lot more we could do, but for now we’ve gotten a good enough idea of what the data looks like.

3 Preprocessing

Probably the most exciting and important aspect of model development is feature engineering. Compared to model parameterization, the right features can often achieve greater leaps in performance.

3.1 Remarks on data preprocessing for XGBoost

When preprocessing the data, it is useful to already have an idea of which algorithms to use, because some algorithms are picky about the shape of the data. Therefore, we have to do some transformations.

Our intention is to prepare the data in a way that we can use it to train a gradient boosting model (XGBoost). This algorithm builds an ensemble of decision trees, which can only handle numeric and Boolean inputs, but no categorical strings. Therefore, we need to encode our categorical feature values. We also need to map the categorical labels to integer values.
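As a minimal sketch of the two encodings we need (one-hot encoding for categorical features, integer codes for the labels), using a toy frame with made-up values:

```python
import pandas as pd

# Toy frame standing in for the real training data
toy = pd.DataFrame({'PdDistrict': ['MISSION', 'CENTRAL', 'MISSION'],
                    'Category': ['ASSAULT', 'FRAUD', 'ASSAULT']})

# One-hot encode a categorical feature column
features = pd.get_dummies(toy[['PdDistrict']])

# Map categorical labels to integer codes; factorize also returns the label index
codes, labels = pd.factorize(toy['Category'])
print(features.columns.tolist(), codes, list(labels))
```

The same two calls, get_dummies and factorize, are what our preprocessing functions below rely on.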

We don’t need to scale the continuous feature variables, because gradient boosting, and decision trees in general, are not sensitive to variables that have different scales.

3.2 Feature Engineering

Based on the data exploration that we have done in the previous section, we can create three types of features.

  • Date & Time: The time when a crime is committed makes some important features. Just as an example, when there is a lot of traffic on the street, there is a higher likelihood of traffic-related crimes. As another example, when it is Saturday, more people will usually come to the nightlife district, which attracts certain crimes, e.g. drug-related. Therefore, we will create separate features for the time, the day, the month and the year.
  • Address: As mentioned before, we will extract additional features from the address column. First, we create separate features for the street type (for example, ST, AV, WY, TR, DR). We also check whether the address contains the word “Block”. Finally, we let our model know whether the address is a street crossing.
  • Latitude & Longitude: We will transform the latitude and longitude values into polar coordinates, which makes it easier for our model to make sense of the location. We will also remove some outliers from the dataset whose latitude is far off the grid.
# Processing Function for Features
def cart2polar(x, y):
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

def preprocessFeatures(dfx):
    # Time Feature Engineering
    df = pd.get_dummies(dfx[['DayOfWeek' , 'PdDistrict']])
    df['Hour_Min'] = pd.to_datetime(dfx['Dates']).dt.hour + pd.to_datetime(dfx['Dates']).dt.minute/60
    df['Day'] = pd.to_datetime(dfx['Dates'])
    df['Month'] = pd.to_datetime(dfx['Dates']).dt.month
    df['Year'] = pd.to_datetime(dfx['Dates']).dt.year

    month_one_hot_encoded = pd.get_dummies(pd.to_datetime(dfx['Dates']).dt.month, prefix='Month')
    df = pd.concat([df, month_one_hot_encoded], axis=1, join="inner")
    # Convert Carthesian Coordinates to Polar Coordinates
    df[['X', 'Y']] = dfx[['X', 'Y']]
    df['dist'], df['phi'] = cart2polar(df['X'], df['Y'])
    # Extracting Street Types
    df['Is_ST'] = dfx['Address'].str.contains(" ST", case=True)
    df['Is_AV'] = dfx['Address'].str.contains(" AV", case=True)
    df['Is_WY'] = dfx['Address'].str.contains(" WY", case=True)
    df['Is_TR'] = dfx['Address'].str.contains(" TR", case=True)
    df['Is_DR'] = dfx['Address'].str.contains(" DR", case=True)
    df['Is_Block'] = dfx['Address'].str.contains(" Block", case=True)
    df['Is_crossing'] = dfx['Address'].str.contains(" / ", case=True)
    return df

# Processing Function for Labels
def encodeLabels(dfx):
    # factorize returns integer codes and the corresponding unique labels
    factor = pd.factorize(dfx['Category'])
    return factor

# Remove Outliers by Latitude
df_cleaned = df_base[df_base['Y'] < 70]

# Encode Labels as Integer
factor = encodeLabels(df_cleaned)
y_df = pd.DataFrame()
y_df['f_Category'] = factor[0]
labels = list(factor[1])
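As a quick sanity check of the cart2polar helper, a 3-4-5 triangle should yield a distance of 5 from the origin (the function is repeated here so the snippet runs on its own):

```python
import numpy as np

def cart2polar(x, y):
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

# Point (3, 4): distance from the origin should be exactly 5
dist, phi = cart2polar(3.0, 4.0)
print(dist, phi)
```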

We could also try to further improve our features by using additional data sources, such as weather data. However, there is no guarantee that this will improve the results of the model, and in the case of the crime records, it did not. Therefore, we have omitted this part.

4 Visualize the Data on a Map

The location where a crime occurred is indicated with cartesian coordinates. There are many reasons to use these spatial data to create a dot plot. Above all, it helps us develop a better understanding of how crime types distribute across the city. First, we create a simple dot plot, before we plot the data onto a real geographical map using mplleaflet.

4.1 Simple plot

First, we plot the coordinates on a blank chart. This allows us to see different patterns. For example, we see streets and neighborhoods where certain types of crimes are most common. In addition, we can see areas where certain types of crimes occur rather rarely.

# Plot Criminal Activities by Lat and Long
df_filtered = df_cleaned.sample(frac=0.05)  
#df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05) # to filter 

groups = df_filtered.groupby('Category')

fig, ax = plt.subplots(sharex=False, figsize=(20, 10))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name, alpha=0.1)

We can clearly see that certain streets in San Francisco are more prone to certain types of crime than others. It is also clear that there are certain crime hotspots in the city and especially in the center.

4.2 Create a Map of San Francisco

We can also display the crimes on a map of San Francisco. For this we use the mplleaflet package, which provides the map data. When you run the code, a _map.html file should open in your browser and display the map. Since the map can only handle a certain amount of data, we will use only a 1% fraction of the data and limit the crimes to selected categories.

# Limit the data to a fraction and selected categories
df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.01)
groups = df_filtered.groupby('Category')

# Visualize Criminal Activities on a Geo Map
fig, ax = plt.subplots(figsize=(20, 10))
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name)
mplleaflet.show(fig=fig)  # writes _map.html and opens it in the browser

We can create this map for all types of crime, or for specific ones. Once we filter the data, we can see that some crimes have their own hotspots in the city. An example is DRUG/NARCOTIC, which is predominantly reported in the city center.

5 Split the Data

Before we start training our predictive model, we split our data into separate train and test datasets. For this purpose, we use the train_test_split function of scikit-learn with a train share of 70%. The resulting datasets are used in the next steps to train and validate the models.

# Create train_df & test_df
x_df = preprocessFeatures(df_cleaned).copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)

6 Train a Random Forest Classifier

Once we have split the data, we establish a performance baseline. We do this by training a Random Forest Classifier. In short, this is a powerful prediction algorithm that can deal with both regression and classification problems. It is based on the idea of creating an ensemble of decision trees, in which each tree generates a separate prediction and votes on the final outcome. Check out my recent post to learn more about decision trees and how to identify an optimal configuration of their hyperparameters.

The idea of the baseline model is not to achieve great performance, but to establish a reference against which we can compare our real model later when we evaluate and optimize it. Therefore, we use a simple best-guess configuration for the hyperparameters of the baseline model.

# Train a single random forest classifier - parameters are a best guess
clf = RandomForestClassifier(max_depth=100, random_state=0, n_estimators=200)
clf.fit(x_train, y_train.values.ravel())
y_pred = clf.predict(x_test)

results_log = classification_report(y_test, y_pred)
print(results_log)

Our random forest classifier scores 31% accuracy on the test dataset.

7 Train an XGBoost Classifier

In this section, we will train our gradient boosting classifier, using the XGBoost package.

7.1 About Gradient Boosting

XGBoost is an implementation of gradient boosting, a decision-tree-based ensemble machine learning algorithm. The algorithm builds an ensemble of trees iteratively: in each round, a new tree is added with the goal of reducing the prediction error of the current ensemble, until no further improvement can be made. Thus, the trees are not trained on the actual target values directly; instead, each new tree is trained on the residuals (prediction errors) of the previous ensemble.
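The residual-fitting idea can be illustrated with a toy numeric sketch. Here each "tree" is just the mean of the current residuals, which is enough to show how adding learners step by step drives the ensemble error down (a constant learner can only converge to the mean of the targets):

```python
import numpy as np

# Toy targets; pred holds the running prediction of the growing ensemble
y = np.array([3.0, 5.0, 8.0])
pred = np.zeros_like(y)
learning_rate = 0.5

for _ in range(20):
    residual = y - pred               # error of the current ensemble
    stump = residual.mean()           # stand-in for a tree fit to the residuals
    pred += learning_rate * stump     # add the new learner's contribution

print(pred)  # converges towards the mean of y
```

Real gradient boosting replaces the mean with a regression tree fit to the residuals, which lets the ensemble model much richer structure than a single constant.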

In contrast to a random forest, the number of trees is not fixed from the beginning; an optimal number of trees is determined during the training process. But XGBoost does more: it is an extreme version of gradient boosting that uses additional optimization techniques, such as regularization and parallelization, to achieve strong results with minimal effort.

A disadvantage of XGBoost is that it tends to overfit the data. Therefore, testing against unseen data is important. In this tutorial, for the sake of simplicity, we will test against a single hold-out test set; however, cross-validation would be a better choice.
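As a sketch of what cross-validation would look like, shown on synthetic data with a random forest (running it on the full crime dataset is expensive, and the model choice here is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for our feature matrix and multi-class labels
X, y = make_classification(n_samples=300, n_classes=3,
                           n_informative=5, random_state=0)

# 5-fold cross-validation: every sample is used for testing exactly once
scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                         X, y, cv=5)
print(scores.mean().round(3))
```

Averaging over five folds gives a more robust performance estimate than a single train/test split.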

7.2 Train XGBoost

There are various gradient boosting implementations available for Python, including one from scikit-learn. However, scikit-learn’s implementation does not support multi-threading, which makes the training process slower than necessary. For this reason, we will use the gradient boosting classifier from the XGBoost package.

# Configure the XGBoost model
param = {'booster': 'gbtree',
         'tree_method': 'gpu_hist',  # requires a GPU; use 'hist' to train on CPU
         'predictor': 'gpu_predictor',
         'max_depth': 140,
         'eta': 0.3,
         'objective': 'multi:softmax',
         'eval_metric': 'mlogloss',
         'n_estimators': 30}

xgb_clf = XGBClassifier(**param)
xgb_clf.fit(x_train, y_train.values.ravel())
score = xgb_clf.score(x_test, y_test.values.ravel())

# Create predictions on the test dataset
y_pred = xgb_clf.predict(x_test)

# Print a classification report
results_log = classification_report(y_test, y_pred)

Now that we have trained our classification model, let’s take a look at how it performs. We generate predictions (y_pred) on the test dataset (x_test) and then use the predictions together with the true values (y_test) to generate a classification report.

Our model achieves an accuracy score of 31%. At first glance, this might not look like much, but considering that we have 39 categories and only sparse information available, it is quite respectable.
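One way to put the 31% into perspective is to compare it against a majority-class baseline. The sketch below uses a hypothetical label array in place of y_test; on the real data, the baseline would be the share of the most frequent crime category:

```python
import numpy as np

# Hypothetical integer labels standing in for y_test
y_toy = np.array([0, 0, 0, 2, 1, 0, 2, 0])

values, counts = np.unique(y_toy, return_counts=True)
baseline_acc = counts.max() / len(y_toy)  # always predict the most frequent class
print(baseline_acc)
```

Any useful model should clearly beat this trivial baseline.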

8 Measure Model Performance

Finally, we create a confusion matrix to visualize the performance of the XGBoost classifier. The matrix shows the number of correct and incorrect predictions per category.

# Print a multi-class confusion matrix
cnf_matrix = confusion_matrix(y_test.values.reshape(-1), y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index=np.unique(y_test)) = 'Actual' = 'Predicted'
plt.figure(figsize=(16, 12))
sns.set(font_scale=1.4)  # label size
sns.heatmap(df_cm, cbar=True, cmap="inferno", annot=False, fmt='.0f')


Congratulations, you made it to the end of this tutorial! You’ve learned a lot along the way, for example how to work with spatial data and plot coordinates on a map using mplleaflet. You have also learned how to configure and train an XGBoost model and to use this model to predict one of 39 different crime categories.

We hope, you found this tutorial useful and would really appreciate your feedback. If you have any questions or suggestions on what we could improve, feel free to post them in the comments.


Hi, my name is Florian! I am a Zurich-based Data Scientist with a passion for Artificial Intelligence and Machine Learning. After completing my PhD in Business Informatics at the University of Bremen, I started working as a Machine Learning Consultant for the Swiss consulting firm ipt. When I'm not working on use cases for our clients, I work on my own analytics projects and report on them in this blog.
