Geographic Maps Archives - relataly.com

Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python

Florian Follonier — Sun, 07 Mar 2021 16:16:19 +0000

In this tutorial, we’ll be using machine learning to predict and map out crime in San Francisco. We’ll be working with a dataset from Kaggle that contains information on 39 different types of crimes, including everything from vehicle theft to drug offenses. Using Python and the scikit-learn library, we’ll train a classification model with the XGBoost algorithm to predict the crime type based on when and where a crime occurred. We’ll then use the Plotly library to visualize the results on a map of the city, highlighting areas with higher rates of certain crimes. This type of prediction and mapping is similar to what the San Francisco Police Department uses in their practice of predictive policing, where they allocate resources to at-risk areas in an effort to prevent crime.

As we embark on this thrilling journey, we’ll start by downloading and preprocessing the San Francisco crime data. Next, we’ll channel the data to train two distinct classification models. The first model will utilize a standard Random Forest Classifier, while the second will leverage the exceptional XGBoost package. We’ll experiment with various models that boast different hyperparameters. Ultimately, we’ll visualize our predictions on a striking SF crime map and assess the performance of our diverse models. So, buckle up and let’s dive into the exhilarating world of crime prediction and mapping!

Predictive policing can make police work much more efficient and effective. Image generated using Midjourney.

What is Predictive Policing?

The use case we are looking at in this article falls into predictive policing. Predictive policing uses data, algorithms, and other technological tools to predict where and when crimes are likely to occur. The goal of predictive policing is to help law enforcement agencies better allocate their resources and focus their efforts on areas where crime is likely to happen, with the ultimate goal of reducing crime and improving public safety. This approach to policing is based on the idea that by using data and other tools to identify patterns and trends, law enforcement agencies can better anticipate where crimes are likely to occur and take steps to prevent them from happening.

The benefits of predictive policing include the ability to allocate law enforcement resources better, the potential to reduce crime and improve public safety, and the ability to identify trends and patterns that may not be immediately obvious to law enforcement officers. Additionally, by using data and other tools to anticipate where crimes are likely to occur, law enforcement agencies can take proactive steps to prevent those crimes from happening, which can save time and money.

Creating a Crime Map for Predictive Policing using XGBoost in Python

In this practical tutorial, we’ll construct an XGBoost multi-class classifier to predict crime types in San Francisco. Urban crime, such as in San Francisco, is a dynamic and multifaceted issue that can vary dramatically based on location, time, and other factors. Our aim is to develop a predictive algorithm capable of forecasting specific crime types based on a given location and time. The end product is an interactive San Francisco crime map providing a snapshot of crime hotspots throughout the city.

Law enforcement agencies, like the San Francisco Police Department, use similar maps for strategic resource allocation to curb crime rates effectively. Additionally, this SF crime map will underscore crime clusters – areas notorious for particular types of crime incidents. By the end of this tutorial, you’ll have a deeper understanding of using machine learning in practical scenarios and aiding real-world decision-making.

The code is available on the GitHub repository.


Crime doesn’t sleep in San Francisco. That’s why predictive policing can make a real impact. Image generated with Midjourney

Prerequisites

Before starting the Python coding part, ensure that you have set up your Python 3 environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy
matplotlib
Seaborn

In addition, we will be using XGBoost (‘xgboost’), Plotly (‘plotly’), and the machine learning library scikit-learn.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)
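
To check quickly which of the packages are already available, a short script like the following can help. It is only a sketch; the list of import names is an assumption based on the imports used later in this tutorial.

```python
import importlib.util

# Import names of the packages used in this tutorial (assumed from the import statements)
required = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "xgboost", "plotly"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("Missing packages:", missing or "none")
```

Any name reported as missing can then be installed with one of the two commands above.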

Step #1 Load the Data

We begin by downloading the San Francisco crime challenge data on kaggle.com. Once you have downloaded the dataset, place the CSV files (train.csv) into your Python working folder.

The dataset was collected by the San Francisco Police Department between 2003 and 2015. According to the data description from the SF crime challenge, the dataset contains the following variables:

Dates: timestamp of the crime incident
Category: Category of the crime incident (only in train.csv) that we will use as the target variable
Descript: detailed description of the crime incident (only in train.csv)
DayOfWeek: the day of the week
PdDistrict: the name of the Police Department District
Resolution: how the crime incident was resolved (only in train.csv)
Address: the approximate street address of the crime incident
X: Longitude
Y: Latitude

The next step is to load the data into a dataframe. Then we use the head() command to print the first five lines and ensure you can see the data.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import plotly.express as px

# The Data is part of the Kaggle Competition: https://www.kaggle.com/c/sf-crime/data
df_base = pd.read_csv("data/crime/sf-crime/train.csv")

print(df_base.describe())
df_base.head()

                   X              Y
count  878049.000000  878049.000000
mean     -122.422616      37.771020
std         0.030354       0.456893
min      -122.513642      37.707879
25%      -122.432952      37.752427
50%      -122.416420      37.775421
75%      -122.406959      37.784369
max      -120.500000      90.000000

	Dates				Category	Descript			DayOfWeek	PdDistrict	Resolution	Address				X			Y
0	2015-05-13 23:53:00	WARRANTS	WARRANT ARREST		Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
1	2015-05-13 23:53:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	OAK ST / ...		-122.425892	37.774599
2	2015-05-13 23:33:00	OTHER ...	TRAFFIC ...			Wednesday	NORTHERN	ARREST, 	VANNESS AV... ST	-122.424363	37.800414
3	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT...	Wednesday	NORTHERN	NONE		1500 Block... ST	-122.426995	37.800873
4	2015-05-13 23:30:00	LARCENY/THEFT	GRAND THEFT ...	Wednesday	PARK		NONE		100 Block... ST		-122.438738	37.771541

If the data was loaded correctly, you should see the first five records of the dataframe, as shown above.

Step #2 Explore the Data

At the beginning of a new project, we usually don’t understand the data well and need to acquire that understanding. Therefore, next, we will explore the data and familiarize ourselves with its characteristics.

The following examples will help us better understand our data’s characteristics. For example, you can use whisker charts and a correlation matrix to understand better the correlation between variables, such as between weekdays and prediction categories. Feel free to create more charts.

2.1 Prediction Labels

Running the code below shows a bar plot of the prediction labels. The plot shows the frequency in which the class labels occur in the data.

# print the value counts of the categories
plt.figure(figsize=(15,5))
ax = sns.countplot(x = df_base['Category'], orient='v', order = df_base['Category'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)

As shown above, our class labels are highly imbalanced, affecting model accuracy. When we evaluate the performance of our model, we need to consider this.
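
To put a number on the imbalance, the relative class frequencies can be computed directly. The sketch below uses a small made-up label series (the counts are illustrative and not taken from the dataset). With the real data, df_base['Category'] would take its place.

```python
import pandas as pd

# Hypothetical toy labels standing in for df_base['Category']
categories = pd.Series(
    ["LARCENY/THEFT"] * 6 + ["ASSAULT"] * 3 + ["FRAUD"] * 1
)

# Share of each class, sorted from most to least frequent
class_share = categories.value_counts(normalize=True)
print(class_share)

# Ratio between the most and the least frequent class
imbalance_ratio = class_share.iloc[0] / class_share.iloc[-1]
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```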

2.2 When a Crime Occurred – Considering Dates and Time

We assume that when a crime occurs impacts the type of crime. For this reason, we look at how crimes distribute across different days of the week and times of the day. First, we look at crime numbers per weekday.

# Print Crime Counts per Weekday
plt.figure(figsize=(6,3))
ax = sns.countplot(y = df_base['DayOfWeek'], orient='h', order = df_base['DayOfWeek'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)

Fewer crimes happen on Sundays, and most are on Fridays. So it seems that even criminals like to have a weekend. Next, let’s take a look at the time of day when certain crimes are reported. For the sake of clarity, we limit the plot to a few selected categories.

# Convert the time of day to decimal hours (e.g., 23:30 becomes 23.5)
df_base['Hour_Min'] = pd.to_datetime(df_base['Dates']).dt.hour  + pd.to_datetime(df_base['Dates']).dt.minute / 60

# Print Crime Counts per Time and Category
df_base_filtered = df_base[df_base['Category'].isin([
    'PROSTITUTION', 
    'VEHICLE THEFT', 
    'DRUG/NARCOTIC', 
    'WARRANTS', 
    'BURGLARY', 
    'FRAUD', 
    'ASSAULT',
    'LARCENY/THEFT',
    'VANDALISM'])]

# displot is a figure-level function that creates its own figure; height and aspect control its size
g = sns.displot(x = 'Hour_Min', hue="Category", data = df_base_filtered, kind="kde", height=8, aspect=1.5)

The time of day also affects the likelihood of certain crime types. For example, we can see that FRAUD rarely occurs at night and usually during the day. We can also see that criminals often go to work in the afternoon and around midnight. Certain crimes, such as VEHICLE THEFT, mainly occur at night and in the late afternoon but less often in the morning.

If you want to gain an overview of additional features, you can use the pair plot function. Because our dataset is large, we reduce the computation time by plotting 1/100 of the data.

sns.pairplot(data = df_base_filtered[0::100], height=4, aspect=1.5, hue='Category')

2.3 Where a Crime Occurred – Considering Address

Next, we look at the address information, from which we can often extract additional information. We do this by printing some sample address values.

# Extracting information from the streetnames
for i in df_base['Address'][0:10]:
    print(i)

OAK ST / LAGUNA ST
OAK ST / LAGUNA ST
VANNESS AV / GREENWICH ST
1500 Block of LOMBARD ST
100 Block of BRODERICK ST
0 Block of TEDDY AV
AVALON AV / PERU AV
KIRKWOOD AV / DONAHUE ST
600 Block of 47TH AV
JEFFERSON ST / LEAVENWORTH ST

The street names alone are not so helpful. However, the address data does provide additional information. For example, it tells us whether the location is a street intersection or not. In addition, it contains the type of street. This information is valuable because now we can extract parts of the text and use them as separate features.
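
As a small illustration of this idea, the sketch below derives two features from a few of the sample addresses printed above. Taking the last token as the street type is an assumption made for this example. The preprocessing function in Step #3 uses substring checks instead.

```python
import pandas as pd

# Sample addresses in the format printed above
addresses = pd.Series([
    "OAK ST / LAGUNA ST",
    "1500 Block of LOMBARD ST",
    "0 Block of TEDDY AV",
    "AVALON AV / PERU AV",
])

# An intersection lists two streets separated by " / "
is_crossing = addresses.str.contains(" / ", regex=False)

# Assumption for this sketch: the street type is the last token (ST, AV, ...)
street_type = addresses.str.split().str[-1]

features = pd.DataFrame({
    "Address": addresses,
    "Is_crossing": is_crossing,
    "Street_type": street_type,
})
print(features)
```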

We could do a lot more, but we’ve got a good enough idea of the data.

Step #3 Data Preprocessing

Probably the most exciting and important aspect of model development is feature engineering. Compared to model parameterization, the right features can often achieve more significant leaps in performance.

3.1 Remarks on Data Preprocessing for XGBoost

When preprocessing the data, it is helpful to know which algorithm we will use because some algorithms are picky about the shape of the data. We will prepare the data to train a gradient-boosting model (XGBoost). This algorithm builds an ensemble of decision trees and expects numeric input, so it cannot process text-valued categorical columns directly. Therefore, we need to encode the categorical features. We also need to map the categorical labels to integer values.

We don’t need to scale the continuous feature variables because gradient boosting and decision trees, generally, are not sensitive to variables that have different scales.
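
The two encoding steps can be tried out on a tiny example first. The mini-sample below is made up for illustration. The functions pd.get_dummies and pd.factorize are the same ones used in the preprocessing code of this tutorial.

```python
import pandas as pd

# Hypothetical mini-sample with one categorical feature and the label column
sample = pd.DataFrame({
    "DayOfWeek": ["Wednesday", "Friday", "Wednesday"],
    "Category": ["WARRANTS", "LARCENY/THEFT", "WARRANTS"],
})

# One-hot encode the categorical feature: one indicator column per weekday
x_encoded = pd.get_dummies(sample[["DayOfWeek"]])

# Map each label to an integer code; `uniques` allows mapping codes back to names
codes, uniques = pd.factorize(sample["Category"])

print(x_encoded.columns.tolist())
print(codes, list(uniques))
```

Keeping the list of unique names around is useful later, because the classification report and the confusion matrix only show the integer codes.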

3.2 Feature Engineering

Based on the data exploration that we have done in the previous section, we create three feature types:

Date & Time: When a crime happens is essential. For example, when there is a lot of traffic on the street, there is a higher likelihood of traffic-related crimes. Likewise, on Saturdays, more people usually visit the nightlife district, which attracts certain crimes, e.g., drug-related ones. Therefore, we will create different features for the time, the day, the month, and the year.
Address: As mentioned, we will extract additional features from the address column. First, we create different features for the street type (for example, ST, AV, WY, TR, DR). We also check whether the address contains the word “Block” and let our model know whether the address is a street crossing.
Latitude & Longitude: We will transform the latitude and longitude values into polar coordinates. We will also remove some outliers from the dataset whose latitude is far off the grid. We hope this makes it easier for our model to make sense of the location.

Considering these features, the primary input to our crime-type prediction model is the information on when and where a crime occurs.

# Processing Function for Features
def cart2polar(x, y):
    dist = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return dist, phi

def preprocessFeatures(dfx):
    
    # Time Feature Engineering
    df = pd.get_dummies(dfx[['DayOfWeek' , 'PdDistrict']])
    df['Hour_Min'] = pd.to_datetime(dfx['Dates']).dt.hour + pd.to_datetime(dfx['Dates']).dt.minute / 60
    # We add a feature that contains the exponential of the time of day
    df['Hour_Min_Exp'] = np.exp(df['Hour_Min'])
    
    df['Day'] = pd.to_datetime(dfx['Dates']).dt.day
    df['Month'] = pd.to_datetime(dfx['Dates']).dt.month
    df['Year'] = pd.to_datetime(dfx['Dates']).dt.year

    month_one_hot_encoded = pd.get_dummies(pd.to_datetime(dfx['Dates']).dt.month, prefix='Month')
    df = pd.concat([df, month_one_hot_encoded], axis=1, join="inner")
    
    # Convert Cartesian Coordinates to Polar Coordinates
    df[['X', 'Y']] = dfx[['X', 'Y']] # we maintain the original coordinates as additional features
    df['dist'], df['phi'] = cart2polar(dfx['X'], dfx['Y'])
  
    # Extracting Street Types
    df['Is_ST'] = dfx['Address'].str.contains(" ST", case=True)
    df['Is_AV'] = dfx['Address'].str.contains(" AV", case=True)
    df['Is_WY'] = dfx['Address'].str.contains(" WY", case=True)
    df['Is_TR'] = dfx['Address'].str.contains(" TR", case=True)
    df['Is_DR'] = dfx['Address'].str.contains(" DR", case=True)
    df['Is_Block'] = dfx['Address'].str.contains(" Block", case=True)
    df['Is_crossing'] = dfx['Address'].str.contains(" / ", case=True)
    
    return df

# Processing Function for Labels
def encodeLabels(dfx):
    # factorize returns a tuple: (integer codes, unique category names)
    factor = pd.factorize(dfx['Category'])
    return factor

# Remove Outliers by Latitude (Y = 90 is far outside San Francisco)
df_cleaned = df_base[df_base['Y']<70]

# Encode Labels as Integer
factor = encodeLabels(df_cleaned)
y_df = factor[0]
labels = list(factor[1])
# for val, i in enumerate(labels):
#     print(val, i)
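
To make the coordinate transformation concrete, the following sketch applies the same formulas as cart2polar to the coordinates of the first record shown in Step #1. The function name to_polar is chosen for this example only.

```python
import numpy as np

def to_polar(x, y):
    """Convert Cartesian (x, y) to polar (distance, angle in radians)."""
    return np.sqrt(x**2 + y**2), np.arctan2(y, x)

# First record of the dataset: X = longitude, Y = latitude
dist, phi = to_polar(-122.425892, 37.774599)
print(round(dist, 2), round(phi, 3))
```

Because all locations lie within one city, both values vary only slightly between records: the distance stays close to 128 and the angle close to 2.84.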

We could also try to further improve our features by using additional data sources, such as weather data. However, there is no guarantee that this will improve the model results, and for this crime data it did not. Therefore, we have omitted this part.

Step #4 Visualize Crime Types on a Map of San Francisco

Next, we create a San Francisco crime map using the cartesian coordinates indicating where a crime has occurred. First, we only plot the data without a geographical map. Later we will use these spatial data to create a dot plot and overlay it with a map of San Francisco. Visualizing the crime types on a map helps us understand how crime types distribute across the city.

4.1 Plot Crime Types using a Scatter Plot

Next, we want to gain an overview of possible spatial patterns and hotspots. We expect to see streets and neighborhoods where certain crimes are more common than in the more expensive areas of the city. In addition, we expect to see places in the city where certain crime types occur relatively rarely. To gain an overview of the crime distribution in San Francisco, we use a scatter plot to display the crime coordinates on a blank chart.

Running the code below creates the crime map of San Francisco with all crime types. Depending on the speed of your machine, the creation of the map may take several minutes.

# Plot Criminal Activities by Lat and Long
df_filtered = df_cleaned.sample(frac=0.05)  
#df_filtered = df_cleaned[df_cleaned['Category'].isin(['PROSTITUTION', 'VEHICLE THEFT', 'FRAUD'])].sample(frac=0.05) # to filter 

groups = df_filtered.groupby('Category')

fig, ax = plt.subplots(sharex=False, figsize=(20, 12))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group['X'], group['Y'], marker='.', linestyle='', label=name, alpha=0.9)
ax.legend()
plt.show()

The plot shows that certain streets in San Francisco are more prone to specific crime types than others. It is also clear that there are certain crime hotspots in the city, especially in the center. We can also see that few crimes are reported in public park areas.

4.2 Create a Crime Map of San Francisco using Plotly

Next, we will create a San Francisco crime map using the Plotly Python library. Because the map can only handle a limited amount of data at once, we will reduce our data to a random sample of 1%. To focus on a few selected crime types, you can additionally filter by category, as in the commented-out line in section 4.1.

Running the code below opens a _map.html file in your browser that displays the SF crime map. The result is a zoomable geographic map of San Francisco that shows how the selected crime types distribute across the city.

# Create a Crime Map of San Francisco using Plotly
# Limit the data to a 1% random sample
df_filtered = df_cleaned.sample(frac=0.01) 
fig = px.scatter_mapbox(df_filtered, lat="Y", lon="X", hover_name="Category", color='Category', hover_data=["Y", "X"], zoom=12, height=800)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

The SF crime map shows different types of crimes, including prostitution, vehicle theft, and fraud. The interactive map allows you to change zoom levels and filter the type of crime displayed on the map. For example, if you filter DRUG/NARCOTIC-related crimes, you can see that these crimes mainly occur in the city center near the financial district and the nightlife area.

Step #5 Split the Data

Before training our predictive model, we will split our data into separate datasets for training and testing. For this purpose, we use the train_test_split function of scikit-learn and configure a split ratio of 70%. Then we output the data, which we employ in the next step to train and validate a model.

# Create train_df & test_df
x_df = preprocessFeatures(df_cleaned).copy()

# Split the data into x_train and y_train data sets
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=0)
x_train

		DayOfWeek_Friday	DayOfWeek_Monday	DayOfWeek_Saturday	DayOfWeek_Sunday	DayOfWeek_Thursday	DayOfWeek_Tuesday	DayOfWeek_Wednesday	PdDistrict_BAYVIEW	PdDistrict_CENTRAL	PdDistrict_INGLESIDE	...	Y			dist		phi			Is_ST	Is_AV	Is_WY	Is_TR	Is_DR	Is_Block	Is_crossing
276998	0					0					0					0					0					1					0					0					0					0						...	37.785023	128.110900	2.842200	True	False	False	False	False	True		False
81579	0					0					0					0					0					1					0					0					0					0						...	37.748470	128.185052	2.842677	False	True	False	False	False	True		False
206676	0					0					0					1					0					0					0					0					0					0						...	37.762744	128.113657	2.842389	True	False	False	False	False	True		False
732006	0					0					0					0					0					0					1					0					0					0						...	37.784140	128.109653	2.842204	True	False	False	False	False	False		True
796194	1					0					0					0					0					0					0					0					0					0						...	37.791333	128.125982	2.842185	True	False	False	False	False	True		False
5 rows × 45 columns
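
Because the crime categories are heavily imbalanced, a purely random split can leave rare classes under-represented in one of the partitions. One option, which the code above does not use, is the stratify argument of train_test_split. The sketch below shows its effect on made-up labels. Note that stratification requires at least two samples per class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the 80/20 class ratio in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

print(np.bincount(y_train), np.bincount(y_test))
```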

Step #6 Train a Random Forest Classifier

We can train the predictive models now that we have prepared the data. In the first step, we train a basic model based on the Random Forest algorithm. The Random Forest is a robust algorithm that can handle regression and classification problems. One of our recent articles provides more information on Random Forests and how you can find the optimal configuration of their hyperparameters. In this tutorial, we use the Random Forest to establish a baseline against which we can measure the performance of our XGBoost model. We, therefore, use the Random Forest with a simple parameter configuration without tuning the hyperparameters.

# Train a single random forest classifier - parameters are a best guess
clf = RandomForestClassifier(max_depth=100, random_state=0, n_estimators = 200)
clf.fit(x_train, y_train.ravel())
y_pred = clf.predict(x_test)

results_log = classification_report(y_test, y_pred)
print(results_log)

              precision    recall  f1-score   support

           0       0.15      0.10      0.12     12657
           1       0.29      0.35      0.32     37898
           2       0.38      0.63      0.47     52237
           3       0.46      0.40      0.43     16136
           4       0.16      0.08      0.10     13426
           5       0.25      0.21      0.23     27798
           6       0.10      0.04      0.06      6850
           7       0.23      0.22      0.23     23087
           8       0.19      0.12      0.15      2586
           9       0.20      0.13      0.15     10942
          10       0.08      0.03      0.05      9559
          11       0.00      0.00      0.00      1300
          12       0.20      0.10      0.14      3200
          13       0.37      0.43      0.40     16282
          14       0.02      0.02      0.02      1350
          15       0.01      0.00      0.00      2912
          16       0.05      0.03      0.04      2217
          17       0.61      0.52      0.56      7865
          18       0.11      0.06      0.08      4954
          19       0.04      0.03      0.03       723
          20       0.28      0.19      0.23       581
          21       0.05      0.02      0.03       708
          22       0.25      0.13      0.17      1333
...
    accuracy                           0.31    263395
   macro avg       0.15      0.12      0.13    263395
weighted avg       0.28      0.31      0.28    263395

The baseline model is a random forest classifier with 31% accuracy on the test dataset.
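
To judge such a score on imbalanced data, it helps to compare it with a trivial model that always predicts the most frequent class. In the report above, the largest class (label 2) accounts for 52,237 of the 263,395 test records, so this trivial strategy would reach roughly 20%. The sketch below shows how to obtain such a reference value with scikit-learn’s DummyClassifier. The labels are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: class 0 makes up 60% of the samples
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.zeros((len(y), 1))  # the features are ignored by this strategy

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
dummy_accuracy = dummy.score(X, y)
print(dummy_accuracy)
```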

Step #7 Train an XGBoost Classifier

Now that we have a baseline model, we can train our gradient boosting classifier using the XGBoost package. We expect this model to perform better than the baseline.

7.1 About Gradient Boosting

XGBoost is an implementation of gradient boosting, an ensemble method based on decision trees. The algorithm builds the ensemble iteratively: in each round, it adds a new tree that is trained to reduce the prediction error of the trees built so far. This continues for a set number of boosting rounds, or until early stopping detects no further improvement on a validation set. Thus, each new tree is not fitted to the target values directly but to the residuals (prediction errors) of the current ensemble.
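
The core idea can be sketched in a few lines. The example below is a simplified illustration for a regression problem with squared error and is not how XGBoost is implemented internally (XGBoost additionally uses second-order gradient information and regularization). The data and all parameter values are assumptions chosen for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_per_round = []

for _ in range(20):
    residuals = y - prediction                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                          # the new tree learns the residuals
    prediction += learning_rate * tree.predict(X)   # add its damped correction
    mse_per_round.append(np.mean((y - prediction) ** 2))

print(f"MSE after round 1: {mse_per_round[0]:.4f}, after round 20: {mse_per_round[-1]:.4f}")
```

Each round shrinks the training error a little further, which also hints at why boosting can overfit when too many rounds are used.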

But XGBoost does more! It is an optimized implementation of gradient boosting that adds regularization and several performance optimizations, such as parallelized tree construction. Like the random decision forest, XGBoost takes the number of trees as a hyperparameter (n_estimators in the scikit-learn wrapper). With early stopping on a validation set, training can halt before this maximum is reached.

A disadvantage of XGBoost is that it tends to overfit the data. Therefore, testing against unseen data is essential. This tutorial will test only against a single test sample for simplicity, but using cross-validation would be a better choice.
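
A cross-validated evaluation could look roughly like the sketch below. It runs on synthetic data and uses scikit-learn’s GradientBoostingClassifier as a lightweight stand-in. Both are assumptions for this example. An XGBClassifier follows the same estimator interface and could be passed to cross_val_score in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; the crime features would take its place
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Stand-in model for this sketch (not the model trained in this tutorial)
model = GradientBoostingClassifier(n_estimators=20, random_state=0)

# Five stratified folds: every fold keeps the class proportions of the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

The spread of the fold scores gives an impression of how much a single train/test split can vary.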

7.2 Train the XGBoost Classifier

Various gradient boosting implementations are available for Python, including one from scikit-learn. However, scikit-learn’s classic GradientBoostingClassifier builds its trees on a single thread, which makes the training process slower than necessary. For this reason, we will use the gradient boosting classifier from the XGBoost package.

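# Configure the XGBoost model (parameter names follow the scikit-learn wrapper)
param = {'booster': 'gbtree', 
         'tree_method': 'hist',        # with XGBoost >= 2.0 and a CUDA GPU, additionally set 'device': 'cuda'
         'max_depth': 140, 
         'learning_rate': 0.3,         # called 'eta' in the native API
         'objective': 'multi:softmax', 
         'eval_metric': 'mlogloss', 
         'n_estimators': 30            # number of boosting rounds
        }

# Pass the settings as keyword arguments so that they are actually applied
xgb_clf = XGBClassifier(**param)
xgb_clf.fit(x_train, y_train.ravel())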
score = xgb_clf.score(x_test, y_test.ravel())
print(score)

# Create predictions on the test dataset
y_pred = xgb_clf.predict(x_test)

# Print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)

0.30852142219859907
              precision    recall  f1-score   support

           0       0.17      0.01      0.02     12657
           1       0.30      0.42      0.35     37898
           2       0.33      0.72      0.46     52237
           3       0.31      0.27      0.29     16136
           4       0.21      0.03      0.05     13426
           5       0.24      0.18      0.21     27798
           6       0.17      0.01      0.01      6850
           7       0.21      0.19      0.20     23087
           8       0.26      0.01      0.02      2586
           9       0.22      0.08      0.12     10942
          10       0.13      0.00      0.00      9559
          11       0.07      0.00      0.01      1300
          12       0.20      0.08      0.11      3200
          13       0.34      0.43      0.38     16282
          14       0.00      0.00      0.00      1350
          15       0.12      0.00      0.01      2912
          16       0.15      0.02      0.03      2217
          17       0.57      0.34      0.43      7865
          18       0.19      0.03      0.05      4954
          19       0.00      0.00      0.00       723
          20       0.50      0.24      0.32       581
          21       0.10      0.01      0.01       708
...
    accuracy                           0.31    263395
   macro avg       0.18      0.11      0.11    263395
weighted avg       0.27      0.31      0.25    263395

Now that we have trained our classification model, let’s see how it performs. The code above generates predictions (y_pred) on the test dataset (x_test) and compares them with the actual values (y_test) in a classification report.

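Our model achieves an accuracy score of 31%, which is on par with the Random Forest baseline. At first glance, this might not look so good. However, with 39 categories and only sparse information available, the task is hard: always predicting the most frequent class (label 2, with 52,237 of 263,395 test records) would reach only about 20%.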

Step #8 Measure Model Performance

So how well does our XGBoost model perform? To measure the performance of our model, we create a confusion matrix that visualizes the performance of the XGBoost classifier. If you want to learn more about measuring the performance of classification models, check out this tutorial on measuring classification performance.

Running the code below creates the confusion matrix that shows the number of correct and false predictions for each crime category.

# Print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test.reshape(-1), y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index = np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (16,12))
plt.tight_layout()
sns.set(font_scale=1.4) #for label size
sns.heatmap(df_cm, cbar=True, cmap= "inferno", annot=False, fmt='.0f' #, annot_kws={"size": 13}
           )

The confusion matrix shows that our model frequently predicts crime category 2 (LARCENY/THEFT, the most frequent class) and neglects the other crime types. The reason is the uneven distribution of crime types in the training data. As a result, when we evaluate the model, we need to pay attention to the importance of the different crime types. For example, we might train the model to predict certain crime types accurately, although this might come at a lower accuracy when predicting other crime types. However, such optimizations depend on the technical context and the goals one wants to achieve with the prediction model.
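
With imbalanced classes, absolute counts in a confusion matrix are dominated by the large classes. A row-normalized matrix can make the per-class behavior easier to read. The sketch below uses made-up labels. With the tutorial’s data, y_test and y_pred would take their place.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: class 0 dominates and attracts most predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_hat = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0])

# normalize="true" divides each row by its class size,
# so the diagonal holds the recall of each class
cm_recall = confusion_matrix(y_true, y_hat, normalize="true")
print(np.round(cm_recall, 2))
```

In this toy example, the rare class 2 has a recall of zero even though the overall accuracy is 60%, which mirrors the pattern described above.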

Summary

This tutorial has presented the machine learning use case “Predictive Policing” and showed how to implement it in Python. We have trained an XGBoost model that predicts crime types in San Francisco based on the information on when and where specific crimes have occurred. We also illustrated our data on an interactive crime map of San Francisco with the Plotly Python library. The Crime Map is an intuitive way of visualizing crime in a city and highlighting particular hotspots. Finally, we have used the prediction model to make test predictions and evaluate the model performance against other algorithms, such as a classic Random Decision Forest. The XGBoost model achieves a prediction accuracy of about 31%—a respectable performance, considering that the prediction problem involves 39 crime classes.

We hope this tutorial was helpful. If you have any questions or suggestions on what we could improve, feel free to post them in the comments. We appreciate your feedback.

Predictive policing with machine learning – Crime map of San Francisco, created with Python and Plotly

Sources and Further Reading

Looking for more exciting map visualizations? Consider the relataly tutorial on visualizing COVID-19 data on geographic heatmaps using GeoPandas.


The post Predictive Policing: Preventing Crime in San Francisco using XGBoost and Python appeared first on relataly.com.

Geographic Heat Maps with GeoPandas: Visualizing COVID-19 Data in Python

Florian Follonier — Wed, 08 Apr 2020 22:03:00 +0000

The spreading of COVID-19 has led to an increased interest in displaying region and country-specific information on geographic heat maps. Geographic heat maps use color shadings to visualize data that includes a spatial component and refers, for example, to countries, cities, towns, mountains, etc. The color shades are defined in a color palette and determined by numerical values on a scale. In this way, geographic heat maps give the viewer a quick overview of what is happening in different regions. This tutorial shows how to create geographic heat maps in Python using the GeoPandas library. We will work with COVID-19 data and visualize it using various color-coded maps.

The rest of this article proceeds as follows: We begin by going through the steps to visualize COVID-19 data on a geographic heat map. We will be using the GeoPandas library to plot the maps. GeoPandas is an open-source project for working with geospatial data in Python. Our heat map will use color shades to visualize growth rates and total cases of COVID-19 in different countries. In addition, we will zoom in on specific map regions.


What are Geographic Heat Maps?

Geographic heat maps are visual representations of data that use color or other visual encodings to show the density or intensity of data points in a geographic region. They are commonly used to represent data that is associated with a geographic location, such as population data, economic data, or weather data.

Geographic heat maps are typically created by overlaying a grid or mesh on a map and then assigning a color or other visual encoding to each grid cell based on the density or intensity of data points in that cell. The resulting heat map shows the distribution or pattern of data points across the geographic region and can provide valuable insights and information about the data.
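The grid-based approach described above can be illustrated with a minimal sketch. It uses only the Python standard library, and the sample points and the cell size are made up for illustration. Each point is assigned to a grid cell, and the count per cell is the value that a heat map would translate into a color.

```python
from collections import Counter


def bin_points(points, cell_size):
    """Count points per grid cell; a cell is identified by its (column, row) index."""
    counts = Counter()
    for lon, lat in points:
        # Floor division maps a coordinate to the index of the cell that contains it
        cell = (int(lon // cell_size), int(lat // cell_size))
        counts[cell] += 1
    return counts


# Hypothetical sample points as (longitude, latitude) pairs
sample_points = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (-0.5, 2.5)]
cell_counts = bin_points(sample_points, cell_size=1.0)
print(cell_counts)
```

The maps in this tutorial do not use a regular grid. They color whole countries instead, so each country polygon plays the role of a cell.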

Geographic heat map showing COVID-19 growth rates in different countries of the world. In this Python tutorial we will create similar maps.

Also: Color-Coded Cryptocurrency Price Charts in Python

What are Geographic Heat Maps used for?

Geographic heat maps are used for a variety of purposes, such as:

Visualizing data: geographic heat maps can provide a clear and intuitive way to visualize data that is associated with a geographic location, allowing analysts and users to quickly and easily understand the data and identify patterns, trends, and relationships.
Identifying spatial patterns: geographic heat maps can help to identify spatial patterns or trends in the data, such as clusters, outliers, or trends over time. This can provide valuable insights and information about the data and can help to inform decision-making and analysis.
Analyzing and comparing data: geographic heat maps can be used to compare and contrast different datasets or to analyze the relationship between different variables or data sources. This can help to identify correlations, trends, or patterns that may not be immediately apparent from the raw data.

What are the Potential Pitfalls of Using Geographic Heat Maps?

While geographic heat maps are useful for all kinds of purposes, there are a few potential limitations and pitfalls to consider when using heat maps:

Choosing an appropriate color scale: It’s essential to choose a color scale that accurately reflects the data being represented and is easy for viewers to interpret. If the color scale is not well-suited to the data, it can be difficult for viewers to understand the patterns being shown accurately.
Overloading the map with too much data: It’s possible to add too much data to a heat map, which can make it difficult to interpret and potentially obscure important patterns. It’s important to balance the need for detail with the need for clarity when creating a heat map.
Visual distortion: When working with large or irregularly shaped regions, it can be challenging to depict the data using a heat map accurately. This can lead to visual distortion, where the map does not accurately reflect the actual distribution of the data.
Misinterpretation of the data: Heat maps are a visual representation of data, which can be subject to misinterpretation. It’s important to carefully consider how the data is represented and provide clear context and explanations for the presented patterns.
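To make the first pitfall more concrete, here is a small sketch that compares a linear and a logarithmic color scale. The case totals are hypothetical, and the two helper functions are written for illustration only. With one extreme outlier, a linear scale squeezes most values into the bottom of the color range, while a logarithmic scale spreads them out.

```python
import math


def linear_scale(value, vmin, vmax):
    """Map value linearly to the range [0, 1]."""
    return (value - vmin) / (vmax - vmin)


def log_scale(value, vmin, vmax):
    """Map value to [0, 1] on a logarithmic scale (requires positive inputs)."""
    span = math.log10(vmax) - math.log10(vmin)
    return (math.log10(value) - math.log10(vmin)) / span


# Hypothetical case totals with one extreme outlier
totals = [10, 100, 1000, 100000]
linear_positions = [round(linear_scale(v, 10, 100000), 3) for v in totals]
log_positions = [round(log_scale(v, 10, 100000), 3) for v in totals]

print(linear_positions)
print(log_positions)
```

In Matplotlib, a comparable effect can be achieved by passing a matplotlib.colors.LogNorm instance as the norm instead of a linear Normalize.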

Let’s keep these potential pitfalls in mind during the following tutorial.

What is GeoPandas?

GeoPandas is a Python package that provides tools for working with geospatial data. It extends the popular pandas package, which provides data manipulation and analysis tools, to include support for geographic data. GeoPandas allows users to manipulate and analyze geospatial data in a familiar pandas DataFrame structure and includes functions for reading and writing spatial data in various formats, as well as tools for visualizing and mapping data.

GeoPandas is built on top of other popular packages, such as Shapely and Fiona, and is a popular choice for working with geospatial data in Python.

With GeoPandas, users can:

Read and write spatial data in various formats, such as Shapefile, GeoJSON, and GeoPackage.
Perform geometric operations on spatial data, such as buffering, intersection, and union.
Create maps and visualize spatial data using matplotlib, a popular Python plotting library.
Analyze and manipulate spatial data in a pandas DataFrame structure, allowing users to use the powerful data manipulation and analysis tools provided by pandas.

Creating Geographic Heat Maps with Python and GeoPandas

In this tutorial, we will learn how to create geographic heat maps using Python and the GeoPandas package. Geographic heat maps are visualizations that show the intensity of data at different locations on a map. They are commonly used to represent the distribution of a variable across a geographic area and can be useful for identifying patterns, trends, and anomalies in the data. In the following, we will walk through the steps of loading, manipulating, and visualizing spatial data with GeoPandas and demonstrate how to create geographic heat maps for COVID-19.

The code is available on the GitHub repository.


Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow the steps in this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. We will be working with standard packages such as pandas, Matplotlib, and requests, plus country_converter for translating country codes.

You can install packages using console commands:

pip install [package name]
conda install [package name] (if you are using the Anaconda package manager)

We will create geographic heat maps with the GeoPandas Python library. You can install GeoPandas via the console by using the following command:

conda install --channel conda-forge geopandas
pip install geopandas

Update (2020-09-23): With the release of Python 3.8, there is a new install procedure:

conda create -n geo_env
conda activate geo_env
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
conda install python=3 geopandas

Download the Geographic Map Data From Naturalearthdata

First, we will get the map with the geospatial data. To render maps, GeoPandas needs geometry data, which we will load from a shapefile. A shapefile is a file format for vector geodata that stores geometries (points, lines, or polygons) together with attribute data. Once loaded into GeoPandas, it behaves like a DataFrame with an additional geometry column. For instance, some shapefiles show cities, countries, continents, or maps of the entire world. In our case, the shapefile contains a list of countries, and each country has its graphical representation in polygons. The example presented in this tutorial will use a world map.
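To get a feeling for what "a country with its graphical representation in polygons" means, consider the following simplified sketch. The record, its name, and its coordinates are hypothetical, and real country outlines consist of many more vertices. The point is that each row combines attributes with a list of (longitude, latitude) vertices, from which properties such as the bounding box can be derived.

```python
def polygon_bounds(ring):
    """Return (min_lon, min_lat, max_lon, max_lat) for a list of (lon, lat) vertices."""
    lons = [lon for lon, _ in ring]
    lats = [lat for _, lat in ring]
    return (min(lons), min(lats), max(lons), max(lats))


# Hypothetical record: attribute values plus a simplified rectangular outline
country_record = {
    "country": "Exampleland",
    "country_code": "EXL",
    "geometry": [(10.0, 40.0), (14.0, 40.0), (14.0, 43.0), (10.0, 43.0), (10.0, 40.0)],
}

bounds = polygon_bounds(country_record["geometry"])
print(bounds)
```

GeoPandas stores such outlines as geometry objects in the geometry column, so we do not have to handle raw coordinate lists ourselves.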

Various sources on the web provide shapefiles for different geographical regions and in varying detail. For example, naturalearthdata.com provides a map of the world. To download the map, go to the naturalearthdata webpage and click the green button to download version 4.1.0.

Once the download is complete, unpack the files into the folder of your Python notebook or a subfolder in the folder of your Python notebook (e.g., data/shapefiles/worldmap/).

naturalearthdata.com

Step #1 Loading the COVID-19 Data

Next, we retrieve the COVID-19 data for all countries via the statworx API. If you want to learn more about using REST APIs, check out this tutorial on accessing data sources via REST APIs.

Also: Accessing Remote Data Sources via REST APIs in Python

# Setting up Packages
import json
import country_converter as coco
from datetime import datetime, timedelta
import requests
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# Getting the data
PAYLOAD = {'code': 'ALL'}
URL = 'https://api.statworx.com/covid'
RESPONSE = requests.post(url=URL, data=json.dumps(PAYLOAD))
# Convert the response to a data frame
covid_df = pd.DataFrame.from_dict(json.loads(RESPONSE.text))
covid_df.head(3)

   date        day  month  year  cases  deaths  country      code  population  continent  cases_cum  deaths_cum
0  2019-12-31  31   12     2019  0      0       Afghanistan  AF    38041757.0  Asia       0          0
1  2020-01-01  1    1      2020  0      0       Afghanistan  AF    38041757.0  Asia       0          0
2  2020-01-02  2    1      2020  0      0       Afghanistan  AF    38041757.0  Asia       0          0

We continue by preparing the COVID-19 data for visualizing them on a heat map.
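Since the API call depends on a remote service, it can help to check the response before working with it. The following sketch does this without any network access. The helper name payload_to_frame, the set of required columns, and the miniature sample response are assumptions for illustration, and the sketch assumes a column-oriented JSON layout like the one pd.DataFrame.from_dict expects in the code above.

```python
import json

import pandas as pd

# Columns that the later steps of this tutorial rely on
REQUIRED_COLUMNS = {"date", "code", "cases", "cases_cum"}


def payload_to_frame(text):
    """Parse a column-oriented JSON string into a DataFrame and check its columns."""
    frame = pd.DataFrame.from_dict(json.loads(text))
    missing = REQUIRED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"Response is missing columns: {sorted(missing)}")
    return frame


# Hypothetical miniature response in a column-oriented layout
sample_text = json.dumps({
    "date": ["2020-06-27", "2020-06-28"],
    "code": ["ID", "ID"],
    "cases": [1240, 1385],
    "cases_cum": [51427, 52812],
})

sample_df = payload_to_frame(sample_text)
print(sample_df)
```

With a real response, you would pass RESPONSE.text to the helper instead of the sample string.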

Step #2 Specifying a Shapefile

Next, we use the Geopandas library to read in a shapefile at “data/shapefiles/worldmap/ne_10m_admin_0_countries.shp”. We then select the columns “ADMIN,” “ADM0_A3”, and “geometry” from the shapefile and store them in a GeoDataFrame called “geo_df.” Finally, we display the first three rows of the GeoDataFrame.

# Setting the path to the shapefile
SHAPEFILE = 'data/shapefiles/worldmap/ne_10m_admin_0_countries.shp'
# Read shapefile using Geopandas
geo_df = gpd.read_file(SHAPEFILE)[['ADMIN', 'ADM0_A3', 'geometry']]
# Rename columns.
geo_df.columns = ['country', 'country_code', 'geometry']
geo_df.head(3)

   country    country_code  geometry
0  Indonesia  IDN           MULTIPOLYGON (((117.70361 4.16341, 117.70361 4...
1  Malaysia   MYS           MULTIPOLYGON (((117.70361 4.16341, 117.69711 4...
2  Chile      CHL           MULTIPOLYGON (((-69.51009 -17.50659, -69.50611...

We have created a dataframe with three columns, as you can see above. The column geometry contains the graphical representation of countries. Now that we have prepared the data, we can plot our first geographic map. We create the map by using the GeoPandas plot function.

# Drop row for 'Antarctica'. It takes a lot of space in the map and is not of much use
geo_df = geo_df.drop(geo_df.loc[geo_df['country'] == 'Antarctica'].index)
# Print the map
geo_df.plot(figsize=(20, 20), edgecolor='white', linewidth=1, color='lightblue')

If you get an error: “ImportError: The Descartes package is required for plotting polygons in GeoPandas.” you first have to install the Descartes package. You can do this by typing in your console: conda install descartes

Step #3 Bringing It All Together

Next, we need to ensure that the country codes in our two datasets match. The dataframe with the geospatial data of the world map contains three-letter ISO 3166-1 alpha-3 country codes (ISO3). However, our COVID-19 data uses two-letter alpha-2 codes (ISO2). Luckily, the country_converter package does this job for us. In the code below, it derives the ISO2 code from each country name:

# Collect the country names from the geo dataframe
country_names = geo_df['country'].to_list()
# Convert the country names to ISO2 codes
iso2_codes_list = coco.convert(names=country_names, to='ISO2', not_found='NULL')
# Add the list with ISO2 codes to the dataframe
geo_df['iso2_code'] = iso2_codes_list
# There are some countries for which the converter could not find a country code.
# We will drop these countries.
geo_df = geo_df.drop(geo_df.loc[geo_df['iso2_code'] == 'NULL'].index)

We now have a dataframe with the names (country), ISO3 codes (country_code), and ISO2 codes (iso2_code) of all nations. An additional column holds the geometry of each country.
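Conceptually, the conversion is a lookup with a fallback value for entries that cannot be resolved. The sketch below illustrates this idea with a tiny, hand-written excerpt of a code table. The function to_iso2 and the dictionary are hypothetical and only mimic the behavior of the not_found argument; they are not how country_converter is implemented.

```python
# Hypothetical excerpt of an ISO 3166-1 alpha-3 to alpha-2 lookup table
ISO3_TO_ISO2 = {"IDN": "ID", "MYS": "MY", "CHL": "CL"}


def to_iso2(iso3_codes, not_found="NULL"):
    """Translate ISO3 codes to ISO2 codes; unknown codes become the not_found marker."""
    return [ISO3_TO_ISO2.get(code, not_found) for code in iso3_codes]


converted = to_iso2(["IDN", "CHL", "XXX"])
print(converted)
```

Rows that end up with the marker value are the ones we drop in the code above, because they cannot be joined to the COVID-19 data.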

Step #4 Preprocessing

Our COVID-19 data so far contains historical Covid-19 cases. We want to drop these historical cases and only get the data from the last day. Then we merge the data frames.
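Filtering on yesterday's date assumes that the data source has already published figures for that day. As a hedged alternative, the following sketch picks the most recent date that is actually present in the data. The helper latest_report_date and the sample dates are assumptions for illustration and are not part of the original code.

```python
from datetime import date


def latest_report_date(date_strings):
    """Return the most recent ISO-formatted date (YYYY-MM-DD) found in the data."""
    return max(date.fromisoformat(text) for text in date_strings).isoformat()


# Hypothetical date values as they might appear in the 'date' column
latest = latest_report_date(["2020-06-27", "2020-06-28", "2020-06-26"])
print(latest)
```

In the tutorial code, the result could replace date_yesterday, for example by calling the helper with covid_df['date'].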

Before we plot the heat map, we have to specify a variable that determines the color of the countries on the map. Our goal is to color the countries depending on the growth rate of COVID-19 cases per day. We calculate the growth rate as the number of new cases on a given day divided by the cumulative number of cases reported up to that day.

# We want to drop the history and only get the data from the last day
d = datetime.today()-timedelta(days=1)
date_yesterday = d.strftime("%Y-%m-%d")
# Preparing the data
covid_df = covid_df[covid_df['date'] == date_yesterday]
# Merge the two dataframes
merged_df = pd.merge(left=geo_df, right=covid_df, how='left', left_on='iso2_code', right_on='code')
# Delete some columns that we won't use
df = merged_df.drop(['day', 'month', 'year', 'country_y', 'code'], axis=1)
# Create the indicator values
df['case_growth_rate'] = round(df['cases'] / df['cases_cum'], 2)
# Countries without reported data get a growth rate of 0
df['case_growth_rate'] = df['case_growth_rate'].fillna(0)
df.head(3)

   country_x  country_code  geometry                                           iso2_code  date        cases   deaths  population   continent  cases_cum  deaths_cum  case_growth_rate
0  Indonesia  IDN           MULTIPOLYGON (((117.70361 4.16341, 117.70361 4...  ID         2020-06-28  1385.0  37.0    270625567.0  Asia       52812.0    2720.0      0.03
1  Malaysia   MYS           MULTIPOLYGON (((117.70361 4.16341, 117.69711 4...  MY         2020-06-28  10.0    0.0     31949789.0   Asia       8616.0     121.0       0.00
2  Chile      CHL           MULTIPOLYGON (((-69.51009 -17.50659, -69.50611...  CL         2020-06-28  4406.0  279.0   18952035.0   America    267766.0   5347.0      0.02
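We can verify the growth rates in the table by hand. The sketch below recomputes them from the cases and cases_cum values shown above. The function case_growth_rate is a hypothetical helper that additionally guards against a division by zero for countries without any reported cases.

```python
def case_growth_rate(new_cases, cumulative_cases):
    """Return new cases divided by cumulative cases, rounded to two decimals."""
    if not cumulative_cases:
        # No cases reported so far, so we define the growth rate as 0
        return 0.0
    return round(new_cases / cumulative_cases, 2)


# Values taken from the three rows of the table above
rates = {
    "Indonesia": case_growth_rate(1385, 52812),
    "Malaysia": case_growth_rate(10, 8616),
    "Chile": case_growth_rate(4406, 267766),
}
print(rates)
```

The results match the case_growth_rate column: 0.03 for Indonesia, 0.0 for Malaysia, and 0.02 for Chile.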

Step #5 Creating a Geographic Heat Map

In the previous step, we set up the data for our map. Next, we create the geographical heat map for the world.

We first set the title, the column to plot, the source note, the value range for the choropleth, and the colormap. Then we create a figure and axes for Matplotlib, remove the axis, and plot the choropleth using the ‘case_growth_rate’ column of the ‘df’ dataframe, setting the edgecolor, linewidth, and cmap. We also add a title to the map and an annotation for the data source. Finally, we create a colorbar as a legend using Matplotlib’s ScalarMappable class and add it to the figure at a specified position.

# Print the map
# Set the range for the choropleth
title = 'Daily COVID-19 Growth Rates'
col = 'case_growth_rate'
source = 'Source: relataly.com \nGrowth Rate = New cases / All previous cases'
vmin = df[col].min()
vmax = df[col].max()
cmap = 'viridis'
# Create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(20, 8))
# Remove the axis
ax.axis('off')
df.plot(column=col, ax=ax, edgecolor='0.8', linewidth=1, cmap=cmap)
# Add a title
ax.set_title(title, fontdict={'fontsize': '25', 'fontweight': '3'})
# Create an annotation for the data source
ax.annotate(source, xy=(0.1, .08), xycoords='figure fraction', horizontalalignment='left', 
            verticalalignment='bottom', fontsize=10)
            
# Create colorbar as a legend
sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap=cmap)
# Empty array for the data range
sm._A = []
# Add the colorbar to the figure
cbaxes = fig.add_axes([0.15, 0.25, 0.01, 0.4])
cbar = fig.colorbar(sm, cax=cbaxes)

Geographic heat map showing COVID-19 growth rates in different countries of the world

As shown in the map above, countries in Central Asia and Africa currently report the highest COVID-19 growth rates.
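Because the colorbar code is repeated for every map in this tutorial, it can be convenient to wrap it in a small function. The following is a minimal sketch under the assumption that a standalone legend is all we need. The helper name add_colorbar and the example value range are made up for illustration, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize


def add_colorbar(fig, cmap, vmin, vmax, position):
    """Attach a standalone colorbar to fig at position [left, bottom, width, height]."""
    mappable = ScalarMappable(norm=Normalize(vmin=vmin, vmax=vmax), cmap=cmap)
    mappable.set_array([])  # no data array is needed, only the value range
    colorbar_axes = fig.add_axes(position)
    return fig.colorbar(mappable, cax=colorbar_axes)


# Empty figure standing in for a map; the value range is hypothetical
fig, ax = plt.subplots(1, figsize=(8, 4))
ax.axis('off')
legend = add_colorbar(fig, 'viridis', vmin=0.0, vmax=0.12,
                      position=[0.15, 0.25, 0.01, 0.4])
```

The value range passed to the helper should be the range of the data that is actually plotted. Otherwise, the legend and the map colors no longer correspond.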

There are different color palettes. You can use them by altering the cmap variable. Below is a sample of ready-to-use color scales. You can find more color scales on the Matplotlib page.

Colormaps

Step #6 Zooming in on Specific Regions

We have observed that many African countries are currently reporting rising case numbers, so next we create a geographic map specifically for Africa. We can zoom in on a continent or a country by filtering our dataframe. The code below filters the dataframe to African countries using a list of country codes and plots the heat map for this new dataframe. We set the title of the map to ‘COVID-19 Growth Rate per Day in Africa’ and add a source annotation to the bottom left corner of the map.

# The map shows that many African countries are currently reporting increasing case numbers
# Next we create a new df based on a filter for African countries
africa_country_list = ['ZM', 'BF', 'TZ', 'EG', 'UG', 'TN', 'TG', 'SZ', 'SD', 
                       'EH', 'SS', 'ZW', 'ZA', 'SO', 'SL', 'SC', 'SN', 'ST', 
                       'SH', 'RW', 'RE', 'GW', 'NG', 'NE', 'NA', 'MZ', 'MA', 
                       'MU', 'MR', 'ML', 'MW', 'MG', 'LY', 'LR', 'LS', 'KE', 
                       'CI', 'GN', 'GH', 'GM', 'GA', 'DJ', 'ER', 'ET', 'GQ', 
                       'BJ', 'CD', 'CG', 'YT', 'KM', 'TD', 'CF', 'CV', 'CM', 
                       'BI', 'BW', 'AO', 'DZ']
africa_map_df = df[df['iso2_code'].isin(africa_country_list)]
# Plot the map for Africa
title = 'COVID-19 Growth Rate per Day in Africa'
col = 'case_growth_rate'
source = 'Source: relataly.com \nGrowth Rate = New cases / All previous cases'
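# Use the value range of the African subset so that the colorbar matches the plotted colors
vmin = africa_map_df[col].min()
vmax = africa_map_df[col].max()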
fig, ax = plt.subplots(1, figsize=(20, 9))
ax.axis('off')
africa_map_df.plot(column=col, ax=ax, edgecolor='0.8', linewidth=1, cmap=cmap)
ax.set_title(title, fontdict={'fontsize': '25', 'fontweight': '3'})
ax.annotate(source, xy=(0.24, .08), xycoords='figure fraction',
            horizontalalignment='left',
            verticalalignment='bottom', fontsize=10)
sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap=cmap)
cbaxes = fig.add_axes([0.35, 0.25, 0.01, 0.5])
cbar = fig.colorbar(sm, cax=cbaxes)

Geographic heat map of Africa showing COVID-19 growth rates in different countries

In case you encounter an error with the mapclassify-package, you can try the following command to reinstall it: conda install -c conda-forge mapclassify

Voilà, now we only see the African continent. In terms of total case numbers, the African countries that currently report the most cases are South Africa, Algeria, Morocco, Cameroon, and Egypt.
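The same filtering pattern works for any region. As a sketch, the hypothetical helper filter_region below selects rows either by a list of country codes or by the continent column that comes with the COVID-19 data. The small sample dataframe and its values are made up for illustration.

```python
import pandas as pd


def filter_region(frame, column, allowed_values):
    """Return a copy of the rows whose value in column is one of allowed_values."""
    return frame[frame[column].isin(allowed_values)].copy()


# Hypothetical sample with the column names used in this tutorial
sample = pd.DataFrame({
    'iso2_code': ['ZA', 'EG', 'CL', 'ID'],
    'continent': ['Africa', 'Africa', 'America', 'Asia'],
    'case_growth_rate': [0.05, 0.02, 0.02, 0.03],
})

by_code = filter_region(sample, 'iso2_code', ['ZA', 'EG'])
by_continent = filter_region(sample, 'continent', ['Africa'])
print(by_continent)
```

One difference is worth keeping in mind: countries without a matching COVID-19 record have no continent value after the left merge, so a filter on the continent column drops them, while a list of country codes keeps them on the map.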

Let’s take a look at the total cases per country in Africa:

# Work on a copy so that the growth-rate dataframe stays unchanged
africa_map_df2 = africa_map_df.copy()
# Optional alternative indicator: cumulative cases as a percentage of the population
# africa_map_df2['cases_population'] = round(africa_map_df2['cases_cum'] / africa_map_df2['population'] * 100)
# Replace missing values with 0
africa_map_df2['cases_cum'] = africa_map_df2['cases_cum'].fillna(0)
# Show the data
africa_map_df2.head()
# Plot the map
title = 'Total COVID-19 Cases on the African Continent'
col = 'cases_cum'
source = 'Source: relataly.com '
vmin = africa_map_df2[col].min()
vmax = africa_map_df2[col].max()
fig, ax = plt.subplots(1, figsize=(20, 9))
ax.axis('off')
africa_map_df2.plot(column=col, ax=ax, edgecolor='1', linewidth=1, cmap=cmap)
ax.set_title(title, fontdict={'fontsize': '25', 'fontweight' : '3'})
ax.annotate(
    source, xy=(0.24, .08), xycoords='figure fraction', horizontalalignment='left', 
    verticalalignment='bottom', fontsize=10)
sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap=cmap)
cbaxes = fig.add_axes([0.35, 0.25, 0.01, 0.5])
cbar = fig.colorbar(sm, cax=cbaxes)

In terms of growth rate, by contrast, the highest value was reported by South Sudan, followed by Botswana and Niger.
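Total case numbers favor populous countries, which is why relating cases to the population can give a fairer comparison. The following sketch computes cases per 100,000 inhabitants. The helper cases_per_100k is hypothetical, and the input values are the cases_cum and population figures for Indonesia and Chile from the table in Step #4.

```python
def cases_per_100k(cumulative_cases, population):
    """Return cumulative cases per 100,000 inhabitants, or None without population data."""
    if not population:
        return None
    return round(cumulative_cases / population * 100_000, 1)


# cases_cum and population values taken from the table in Step #4
indonesia = cases_per_100k(52812, 270625567)
chile = cases_per_100k(267766, 18952035)
print(indonesia, chile)
```

Although Indonesia and Chile have growth rates of a similar magnitude in that table, their case numbers relative to the population differ considerably.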

Step #7 Saving a Geo-Heat Map to PNG

If you want to save the map, you can do this with the following command.

# Save the map to a PNG file
fig.savefig('map_export.png', dpi=300)
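If the exported image has large white margins, the bbox_inches argument of savefig can help. The sketch below saves a placeholder figure to the system's temporary folder. The file name, the placeholder figure, and the chosen dpi are assumptions for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder figure standing in for one of the maps created above
fig, ax = plt.subplots(1, figsize=(4, 2))
ax.axis('off')
ax.set_title('Placeholder map')

# bbox_inches='tight' trims surplus whitespace around the figure content
export_path = os.path.join(tempfile.gettempdir(), 'map_export_sketch.png')
fig.savefig(export_path, dpi=150, bbox_inches='tight')
plt.close(fig)
```

A higher dpi value produces a sharper image at the cost of a larger file.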

Summary

This article showed how to create geographic heat maps using the GeoPandas library in Python. We read in a shapefile, prepared the spatial data, and color-coded a choropleth map using COVID-19 data from a dataframe. In addition, we filtered the DataFrame to zoom in on a specific region, in this case Africa, and saw how to alter the color style using different color maps.

Geographic heat maps can provide valuable insights into the distribution of data and help to identify patterns and trends. The technique of creating heat maps with Geopandas is a powerful tool for data visualization and can be applied to a wide range of geographical data.

I hope this article was helpful. If you have any questions or remarks, please write them in the comments.

Looking for more exciting map visualizations? Consider this relataly tutorial on predicting and visualizing crimes on a map of San Francisco.

Sources and Further Reading

https://geopandas.org/en/stable/getting_started.html

The post Geographic Heat Maps with GeoPandas: Visualizing COVID-19 Data in Python appeared first on relataly.com.