Data Visualization Archives - relataly.com

Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python

Florian Follonier — Sun, 08 Jan 2023 20:34:44 +0000

Predictive maintenance is a game-changer for the modern industry. Still, it is based on a simple idea: By using machine learning algorithms, businesses can predict equipment failures before they happen. This approach can help businesses improve their operations by reducing the need for reactive, unplanned maintenance and by enabling them to schedule maintenance activities during planned downtime. In this article, we’ll explore the use of machine learning algorithms to predict machine failures using the robust XGBoost algorithm in Python. By the end of this tutorial, you’ll have the knowledge and skills to start implementing predictive maintenance in your organization. So, let’s get started!

We begin by discussing the concept of predictive maintenance and show different ways to implement it. Then we will turn to the coding part in python and implement the prediction model based on machine sensor data. We train a classification model that predicts different types of machine failure using XGBoost.

Predictive maintenance is a game-changer for the modern industry. Image generated with Midjourney.

What is Predictive Maintenance?

Predictive maintenance is a data-driven approach that uses predictive modeling to assess the state of equipment and determine the optimal timing for maintenance activities. This technique is particularly beneficial in industries that heavily rely on equipment for their operations, such as manufacturing, transportation, energy, and healthcare. Depending on the requirements and challenges of an organization, predictive maintenance may contribute to one or several of the following goals:

Improve equipment reliability: By proactively identifying and addressing potential problems with equipment, predictive maintenance can help improve the reliability of the equipment, reducing the risk of unexpected downtime or failure.
Increase efficiency: Predictive maintenance can help improve the efficiency of equipment by identifying and fixing problems before they cause equipment failure or downtime. This can help reduce maintenance costs and increase productivity.
Improve safety: Predictive maintenance can help improve safety by identifying and addressing potential problems with equipment before they occur. This can help prevent accidents and injuries caused by equipment failure.
Reduce maintenance costs: By proactively identifying and fixing potential problems with equipment, predictive maintenance can help reduce the overall cost of maintenance by minimizing the need for unscheduled downtime.
Improve asset management: Predictive maintenance can help improve asset management by providing data and insights into the condition and performance of equipment. This can help organizations decide when to replace or upgrade equipment.

Next, we look at the different ways organizations can implement predictive maintenance.

Utilities and manufacturing are only two of the many industries that use predictive maintenance. Image generated with Midjourney.

Approaches to Predictive Maintenance

There are several approaches to implementing a predictive maintenance solution, depending on the type of equipment being monitored and the resources available. These approaches include:

Condition-based monitoring: This involves continuously monitoring the condition of the equipment using sensors. When certain thresholds or conditions are met, an alert is triggered, or corrective measures are launched. The goal is to reduce the risk of failure. For example, if the temperature of a motor exceeds a certain level, this may indicate that the motor is about to fail.
Predictive modeling: This approach involves using machine learning algorithms to analyze historical lifetime data about the equipment to identify patterns that may indicate an impending failure. This can be done using data from sensors, as well as operational data and maintenance records. When historical or failure data is not available, a degradation model can be created to estimate failure times based on a threshold value. This approach is often used when there is limited data available.
Prognostic algorithms: By using data from sensors and other sources, prognostic algorithms can predict the remaining useful life of a piece of equipment. This information can help organizations determine the likelihood of a breakdown and plan for replacements or maintenance activities. By understanding the equipment better, organizations can potentially extend maintenance cycles, which can reduce costs for replacements and maintenance.

It is important to choose an approach that is appropriate for the specific equipment and maintenance challenges faced by the organization.

Data Requirements

When implementing predictive maintenance, it is important to consider that each approach comes with its own set of data requirements. Types of data include the following:

Current condition data includes information about the state of the equipment, such as its temperature, pressure, vibration, and other physical parameters.
Operating data includes information about how the equipment is being used, such as its load, speed, and other operating parameters.
Maintenance history data includes information about past maintenance activities that have been performed on the equipment.
Failure history data includes information about past equipment failures, such as the date of the failure, the cause of the failure, and the impact on operations.

Collecting these data requires investing in sensors and other data collection infrastructure and ensuring that data collection is accurate and storage is proper. By combining various data types, organizations can create a comprehensive view of equipment condition and performance and use it to predict maintenance requirements.

The specific types of data needed will depend on the implementation approach. Organizations must ensure they have access to the necessary data to implement the selected approach effectively. Some specific data requirements for each approach include the following:

Approach	Data Requirements
Condition-based monitoring	Sensor data from the equipment being monitored.
Predictive modeling	A combination of sensor data, operational data, and maintenance records.
Prognostic algorithms	Sensor data, as well as data about past failures and maintenance events.

Data requirements per implementation approach

Predictive maintenance – Machine learning can make maintenance cycles more cost-efficient. Image generated using Midjourney

Predicting Failures in Milling Machines using XGBoost in Python

Now that we have a basic understanding of predictive maintenance, it’s time to get hands-on with Python. We will use sensor data and machine learning to predict failures in milling machines. But why do these machines break down in the first place? Milling machines have many moving parts that can suffer from wear and tear over time, leading to failures. Additionally, improper maintenance can cause issues with machine operation and lead to costly damage. Efficient maintenance can be challenging due to the varying loads that milling machines are subjected to. However, by implementing a predictive maintenance solution with Python, we can proactively identify and address issues to prevent costly downtime and ensure the smooth operation of our milling machines. Our goal is to predict one of five failure types, which corresponds to a predictive modeling approach. Let’s get started on building our predictive maintenance solution.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Image of a CNC milling machine. Image created with Midjourney

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages.

Python Environment

Before diving into the FairLearn Python tutorial, it is important to take the necessary steps to ensure that your Python environment is properly set up and that you have all the required packages installed. This will ensure a seamless learning experience and prevent any potential roadblocks or issues that may arise due to an improperly configured environment.

If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Python Packages

Make sure you install all required packages. In this tutorial, we will be working with the following packages:

Pandas
NumPy
Matplotlib
Seaborn
Plotly

In addition, we will be using the machine learning library Scikit-learn and the XGBoost library, which is a popular library for training gradient-boosting models.

You can install packages using console commands:

pip install 
conda install  (if you are using the anaconda packet manager)

About the Sensor Dataset

In this tutorial, we will work with a synthetic sensor dataset from the UCL ML archives that simulates the typical life cycle of a milling machine. The dataset contains the following fields:

The dataset consists of 10 000 data points stored as rows with 14 features in columns:

UID: unique identifier ranging from 1 to 10000
productID: consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number
air temperature [K]
process temperature [K]
rotational speed [rpm]
torque [Nm]
tool wear [min]
machine failure. A label that indicates whether the machine has failed or not
Failure type (prediction label). The label contains five failure types: tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), random failures (RNF)

Source: UCL ML Repository

You can download the dataset from Kaggle.com. Unzip the file predictive_maintenance.csv and save it under the following file path: “/data/iot/classification/”

Step #1 Load the Data

We begin by importing the required libraries. This also includes the XGBoost library, which is a popular library for training gradient-boosting models. In addition, we will load the dataset using the pandas library. Then we define our target variable as Failure Type. The dataset contains a second target column, which only contains the binary information of machine failures. We will drop this column, as our goal is to predict the specific type of failure. Then we print the first three rows of the loaded dataset.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.9.13, Matplotlib 3.6.2, Scikit-learn 1.2, Seaborn 0.12.1, numpy 1.21.5, xgboost 1.7.2

import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np
import seaborn as sns
import plotly.express as px
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support as score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from sklearn.utils import compute_sample_weight
from xgboost import XGBClassifier

# load the train data
path = '/data/iot/classification/'
df = pd.read_csv(path + "predictive_maintenance.csv") 

# define the target
target_name='Failure Type'

# drop a redundant columns
df.drop(columns=['Target'], inplace=True)

# print a summary of the train data
print(df.shape[0])
df.head(3)

	UDI	Product ID	Type	Air temperature [K]	Process temperature [K]	Rotational speed [rpm]	Torque [Nm]	Tool wear [min]	Failure Type
0	1	M14860		M		298.1				308.6				1551						42.8		0				No Failure
1	2	L47181		L		298.2				308.7				1408						46.3		3				No Failure
2	3	L47182		L		298.1				308.5				1498						49.4		5				No Failure

Step #2 Clean the Data

Next, we quickly check the data quality of our dataset. The following code block checks if there are any missing values in our dataset. If there are missing values, it creates a barplot showing the number of missing values for each column, along with the percentage of missing values. If there are no missing values, it prints a message saying “no missing values.”

The function then drops any columns with more than 5% missing values from the DataFrame. Finally, it prints the names of the remaining columns in the DataFrame. This function can be used to identify and handle missing values in a dataset before applying machine learning algorithms to it.

# check for missing values
def print_missing_values(df):
    null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
    fig = plt.subplots(figsize=(16, 6))
    ax = sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue')
    pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
    ax.set_title('Overview of missing values')
    ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)

if df.isna().sum().sum() > 0:
    print_missing_values(df)
else:
    print('no missing values')

# drop all columns with more than 5% missing values
for col_name in df.columns:
    if df[col_name].isna().sum()/df.shape[0] > 0.05:
        df.drop(columns=[col_name], inplace=True) 

df.columns

no missing values
Index(['UDI', 'Product ID', 'Type', 'Air temperature [K]',
       'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
       'Tool wear [min]', 'Failure Type'],
      dtype='object')

Next, we will drop two unnecessary columns and rename the remaining ones to make them easier to work with. The original column names are quite long and contain special characters that could cause errors during the training process. Once the columns are renamed, we will print the updated DataFrame to verify the changes.

# drop id columns
df_base = df.drop(columns=['Product ID', 'UDI'])

# adjust column names
df_base.rename(columns={'Air temperature [K]': 'air_temperature', 
                        'Process temperature [K]': 'process_temperature', 
                        'Rotational speed [rpm]':'rotational_speed', 
                        'Torque [Nm]': 'torque', 
                        'Tool wear [min]': 'tool_wear'}, inplace=True)
df_base.head()

	Type	air_temperature	process_temperature	rotational_speed	torque	tool_wear	Failure Type
0	M		298.1			308.6				1551				42.8	0			No Failure
1	L		298.2			308.7				1408				46.3	3			No Failure
2	L		298.1			308.5				1498				49.4	5			No Failure
3	L		298.2			308.6				1433				39.5	7			No Failure
4	L		298.2			308.7				1408				40.0	9			No Failure

Everything looks as expected: Our dataset contains six features and the target column with the five failure types.

Step #3 Explore the Data

Next, let’s explore the dataset.

Target Class Distribution

The following code uses the plotly express library to create a histogram showing the class distribution of the “Failure Type” column in a DataFrame called “df_base.” The histogram will have one bar for each unique value in the “Failure Type” column, and the height of each bar will represent the number of occurrences of that value in the column. This can be useful for understanding the imbalance in the distribution of classes in a classification problem.

# display class distribution of the target variable
px.histogram(df_base, y="Failure Type", color="Failure Type")

Our dataset is highly imbalanced, with the vast majority of cases having a “No Failure” label. If the dataset is highly imbalanced, with a disproportionate number of cases in one class compared to the others, it can impact the performance of machine learning models. This is because imbalanced datasets can lead to models that are biased towards the majority class, and may not perform well on the minority class. In order to improve model performance on imbalanced datasets, we will later adjust the model hyperparameters accordingly.

Feature Pairplots

Next, let’s construct pair plots to explore feature relations with the target variable. Pair plots, also known as scatter plots, are a type of plot that shows the relationship between two variables. In the context of a predictive maintenance dataset, pair plots can be useful for exploring the relationships between different features and the target variable (e.g., the likelihood of a machine failure). By creating pair plots and visualizing the relationships between different features and the target variable, you can gain insights into which features might be most useful for building a predictive model.

# pairplots on failure type
sns.pairplot(df_base, height=2.5, hue='Failure Type')

The pair plots reveal valuable patterns in our features that can inform the predictions of our model. For instance, we see that Power Failures tend to be correlated with torque values that are either close to the maximum or minimum. Such patterns should allow our predictive model to make solid predictions.

Feature Correlation

Next, we will look at feature correlation. The following code block creates a heatmap using the seaborn library that shows the correlation between all pairs of columns in a DataFrame called “df_base”. The heatmap is plotted using a color scale, with warmer colors indicating stronger correlations and cooler colors indicating weaker correlations. The correlation values are also displayed in the cells of the heatmap, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). By creating a heatmap, you can quickly see which variables are positively or negatively correlated with each other, and to what degree. This can be helpful for identifying which features might be most useful for building a predictive model.

# correlation plot
plt.figure(figsize=(6,4))
sns.heatmap(df_base.corr(), cbar=True, fmt='.1f', vmax=0.8, annot=True, cmap='Blues')

From the table, it looks like there is a strong positive correlation between “air_temperature” and “process_temperature” (0.87). This makes sense since a high process temperature will naturally also heat up the air around the machine. In addition, there is a strong negative correlation between “rotational_speed” and “torque” (-0.87). The other correlations are weaker and closer to 0, indicating weaker relationships.

Understanding the correlations between different variables in a dataset can be helpful for building predictive models, as it can give you an idea of which features might be most important for predicting a given target. It can also help you identify any redundant features that might not add much value to your model. Since our dataset only contains six features, we will keep all of them.

Feature Boxplots

Box plots are a useful visualization tool for understanding the distribution of values in a dataset. They show the minimum, first quartile, median, third quartile, and maximum values for each group, as well as any outliers. By creating box plots separated by a categorical variable, you can compare the distributions of values between different groups and see if there are any significant differences. This can be useful for identifying trends or patterns in the data that might be useful for building a predictive model.

If there are significant differences between the boxplots for different categories, it could be a good sign for building a predictive model. For example, if the boxplots for one category tend to have higher values for a particular feature than the boxplots for another category, it could indicate that the feature is related to the target variable and could be useful for making predictions.

# create histograms for feature columns separated by target column
def create_histogram(column_name):
    plt.figure(figsize=(16,6))
    return px.box(data_frame=df_base, y=column_name, color='Failure Type', points="all", width=1200)

create_histogram('air_temperature')

Feature boxplot for process_temperature.

create_histogram('process_temperature')

Feature boxplot for rotational speed.

create_histogram('rotational_speed')

Feature boxplot for torque.

create_histogram('torque')

Feature boxplot for tool wear.

create_histogram('tool_wear')

Now that we have a good understanding of our dataset, we can prepare the data for model training.

Step #4 Data Preparation

To prepare the data for model training, we will need to split our dataset and make additional modifications.

The following code block contains a reusable function called data_preparation. The purpose of this function is to prepare the data in a way that is suitable for building and evaluating machine learning models. It performs several preprocessing steps, such as encoding categorical variables and splitting the data into training and test sets.

def data_preparation(df_base, target_name):
    df = df_base.dropna()

    df['target_name_encoded'] = df[target_name].replace({'No Failure': 0, 'Power Failure': 1, 'Tool Wear Failure': 2, 'Overstrain Failure': 3, 'Random Failures': 4, 'Heat Dissipation Failure': 5})
    df['Type'].replace({'L': 0, 'M': 1, 'H': 2}, inplace=True)
    X = df.drop(columns=[target_name, 'target_name_encoded'])
    y = df['target_name_encoded'] #Prediction label

    # split the data into x_train and y_train data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

    # print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test

# remove target from training data
X, y, X_train, X_test, y_train, y_test = data_preparation(df_base, target_name)

Step #5 Model Training

Now that we have prepared the dataset, we can train the XGBoost classification model. The basic idea behind XGBoost is to train a series of weak models, such as decision trees, and then combine their predictions using gradient boosting. During training, XGBoost uses an optimization algorithm to adjust the weight of each model in the ensemble in order to improve the overall prediction accuracy. XGBoost also includes a number of additional features and techniques that help to improve the performance of the model, such as regularization, feature selection, and handling missing values.

XGboost provides several configuration options that we can use to finetune performance and adjust the training process to our dataset. For a complete list of hyperparameters, please see the library documentation.

Remember that our class labels are imbalanced. Therefore, we will provide the model with sample weights. The following code creates a weight array for the training and test sets using the “compute_sample_weight” function from scikit-learn. We calculate the weight array based on the “balanced” mode. This means that the weights are calculated such that the class distribution in the sample is balanced. This can be useful when working with imbalanced datasets, as it helps to mitigate the effects of class imbalance on the model.

weight_train = compute_sample_weight('balanced', y_train)
weight_test = compute_sample_weight('balanced', y_test)

xgb_clf = XGBClassifier(booster='gbtree', 
                        tree_method='gpu_hist', 
                        sampling_method='gradient_based', 
                        eval_metric='aucpr', 
                        objective='multi:softmax', 
                        num_class=6)
# fit the model to the data
xgb_clf.fit(X_train, y_train.ravel(), sample_weight=weight_train)

We can see that the blue box summarizes the configuration of our model and indicates that the training process has been successful. Now that we have the classifier, we can use it to make predictions on new data.

Step #6 Model Evaluation

Finally, we will evaluate the model’s performance. This will involve three steps:

Model scoring
Cross-validation
Confusion matrix

Model Scoring

First, we calculate the accuracy of the classifier on the test set using the “score” method. To account for the imbalance of class labels, we pass in the weight array for the test set as an additional parameter. This returns the fraction of correct predictions made by the classifier. Next, the code uses the classifier to make predictions on the test set using the “predict” method. It then generates a classification report using the “classification_report” function from scikit-learn. The report displays a summary of the model’s performance in terms of various evaluation metrics such as precision, recall, and f1-score.

# score the model with the test dataset
score = xgb_clf.score(X_test, y_test.ravel(), sample_weight=weight_test)

# predict on the test dataset
y_pred = xgb_clf.predict(X_test)

# print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)

precision    recall  f1-score   support

           0       0.99      0.98      0.99      2903
           1       0.64      0.88      0.74        24
           2       0.04      0.08      0.06        12
           3       0.77      0.89      0.83        27
           4       0.00      0.00      0.00         4
           5       0.76      0.97      0.85        30

    accuracy                           0.98      3000
   macro avg       0.53      0.63      0.58      3000
weighted avg       0.98      0.98      0.98      3000

The classification report shows the performance of our XGBoost classifier on the test dataset. The model appears to perform well, with a high accuracy of 0.98 and a high weighted average f1-score of 0.98.

However, there are a few classes where the model’s performance is not as strong. Class 1 has a relatively low precision of 0.64 and a low f1-score of 0.74, while class 2 has a very low precision of 0.04 and a low f1-score of 0.06. Class 4 has a precision and f1-score of 0.00, which suggests that the model is not making any correct predictions for this class.

It is also worth noting that the support for some classes is much lower than for others. Class 1 has a support of 24, while class 0 has a support of 2903. This is due to the fact that there are relatively few instances of class 1 in the test dataset compared to class 0, which affects the model’s performance on class 1.

Confusion Matrix

Next, we create a confusion matrix. We input the true labels of the test set (y_test) and the predicted labels produced by the model (y_pred) to generate the matrix. The matrix shows us the number of correct and incorrect predictions made by the model for each class.

We then create a DataFrame from the confusion matrix and use the seaborn library to visualize the matrix as a heatmap. The heatmap allows us to easily see which classes are being predicted correctly and which are being misclassified.

# create predictions on the test dataset
y_pred = xgb_clf.predict(X_test)

# print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index=np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (8, 5))
sns.set(font_scale=1.1) #for label size
sns.heatmap(df_cm, cbar=True, cmap= "inferno", annot=True, fmt='.0f')

The color scale of the heatmap indicates the magnitude of the values in the matrix. In this case, the darker the color, the higher the number of predictions. This visualization helps us to understand the performance of the model and identify areas for improvement.

Here are a few things that we can learn from this matrix:

The model made a total of 2902 correct predictions and 67 incorrect predictions.
For the “No Failure” class, the model made 2854 correct predictions and 29 incorrect predictions. The majority of the incorrect predictions were false negatives.
For the “Power Failure” class, the model made 21 correct predictions and three incorrect predictions.
For the “Tool Wear Failure” class, the model made 1 correct prediction and 1 incorrect prediction.
For the “Overstrain Failure” class, the model made 24 correct predictions and 2 incorrect predictions.
For the “Random Failures” class, the model made 29 correct predictions and 4 incorrect predictions.
For the “Heat Dissipation Failure” class, the model made 29 correct predictions and 1 incorrect prediction.

Overall, the model seems to be performing relatively well, but it is making a lot of false negatives for some classes.

Cross Validation

Finally, we perform cross-validation on the training set using the “cross_validate” function from scikit-learn. Cross-validation is a technique for evaluating the performance of a machine learning model by training it on different subsets of the data and evaluating it on the remaining data.

In this case, we will train and evaluate our model 10 times using different splits of the data (specified by the “cv” parameter). We also specify that the evaluation metric should be the weighted f1-score (specified by the “scoring” parameter). We then pass the weight array for the training set to the classifier.

The “cross_validate” function returns a dictionary containing various evaluation metrics for each fold of the cross-validation. We will convert the dictionary to a DataFrame and create a bar plot using the plotly express library to visualize the results. This helps us to understand the consistency and stability of the model’s performance.

# cross validation
scores  = cross_validate(xgb_clf, X_train, y_train, cv=10, scoring="f1_weighted", fit_params={ "sample_weight" :weight_train})
scores_df = pd.DataFrame(scores)
px.bar(x=scores_df.index, y=scores_df.test_score, width=800)

The model performance remains consistent across all folds.

Summary

In this article, we have presented the concept of predictive maintenance and demonstrated how organizations can use this approach to improve their maintenance cycles. The second part of the article provided a hands-on tutorial showing how to implement a predictive maintenance solution for predicting different failure types of a milling machine. We trained a classification model using the XGBoost algorithm and sensor data from the machine.

While the model demonstrated good performance overall, we observed that it was not able to predict all classes with the same level of accuracy. This suggests that there may be opportunities to improve the model’s performance. One potential approach is to balance the dataset by up or down-sampling the data to achieve a more even distribution of classes. By doing so, we can mitigate the effects of class imbalance and potentially improve the model’s predictions for all classes.

By implementing such a predictive maintenance approach, organizations can improve their operational efficiency and ensure the smooth running of their machinery.

I hope this article was helpful. If you have any questions or feedback, let me know in the comments.

Predictive maintenance also plays an essential role in a smart factory. Image created with Midjourney.

Sources and Further Reading

There are many books available on the topics of IoT and predictive maintenance. Here are a few recommendations:

An Introduction to Predictive Maintenance by R Keith Mobley
Predictive Analytics: The Secret to Predicting Future Events Using Big Data and Data Science Techniques Such as Data Mining, Predictive Modelling, Statistics, Data Analysis, and Machine by Richard Hurley
Stephan Matzka, Explainable Artificial Intelligence for Predictive Maintenance Applications, Third International Conference on Artificial Intelligence for Industries (AI4I 2020)
David Forsyth (2019) Applied Machine Learning Springer
ChatGPT was used to revise certain parts of this article
Images created using Midjourney and OpenAI Dall-E

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python appeared first on relataly.com.

How to Use Hierarchical Clustering For Customer Segmentation in Python

Florian Follonier — Thu, 22 Dec 2022 18:50:14 +0000

Have you ever found yourself wondering how you can better understand your customer base and target your marketing efforts more effectively? One solution is to use hierarchical clustering, a method of grouping customers into clusters based on their characteristics and behaviors. By dividing your customers into distinct groups, you can tailor your marketing campaigns and personalize your marketing efforts to meet the specific needs of each group. This can be especially useful for businesses with large customer bases, as it allows them to target their marketing efforts to specific segments rather than trying to appeal to everyone at once. Additionally, hierarchical clustering can help businesses identify common patterns and trends among their customers, which can be useful for targeting future marketing efforts and improving the overall customer experience. In this tutorial, we will use Python and the scikit-learn library to apply hierarchical (agglomerative) clustering to a dataset of customer data.

The rest of this tutorial proceeds in two parts. The first part will discuss hierarchical clustering and how we can use it to identify clusters in a set of customer data. The second part is a hands-on Python tutorial. We will explore customer health insurance data and apply an agglomerative clustering approach to group the customers into meaningful segments. Finally, we will use a tree-like diagram called a dendrogram, which is helpful for visualizing the structure of the data. The resulting segments could inform our marketing strategies and help us better understand our customers. So let’s get started!

Customer segmentation is a typical use case for clustering. Image generated with Midjourney.

What is Hierarchical Clustering?

So what is hierarchical clustering? Hierarchical clustering is a method of cluster analysis that aims to build a hierarchy of clusters. It creates a tree-like diagram called a dendrogram, which shows the relationships between clusters. There are two main types of hierarchical clustering: agglomerative and divisive.

Agglomerative hierarchical clustering: This is a bottom-up approach in which each data point is treated as a single cluster at the outset. The algorithm iteratively merges the most similar pairs of clusters until all data points are in a single cluster.
Divisive hierarchical clustering: This is a top-down approach in which all data points are treated as a single cluster at the outset. The algorithm iteratively splits the cluster into smaller and smaller subclusters until each data point is in its own cluster.

Agglomerative Clustering

In this article, we will apply the agglomerative clustering approach, which is a bottom-up approach to clustering. The idea is to initially treat each data point in a dataset as its own cluster and then combine the points with other clusters as the algorithm progresses. The process of agglomerative clustering can be broken down into the following steps:

Start with each data point in its own cluster.
Calculate the similarity between all pairs of clusters.
Merge the two most similar clusters.
Repeat steps 2 and 3 until all the data points are in a single cluster or until a predetermined number of clusters is reached.

There are several ways to calculate the similarity between clusters, including using measures such as the Euclidean distance, cosine similarity, or the Jaccard index. The specific measure used can impact the results of the clustering algorithm.

For details on how the clustering approach works, see the Wikipedia page.

Hierarchical clustering is an unsupversied way to classify things.

" data-image-caption="

Hierarchical clustering is an unsupversied way to classify things.

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min-430x512.png" alt="Hierarchical clustering is an unsupversied way to classify things. " class="wp-image-13027" srcset="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 430w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 252w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 506w" sizes="(max-width: 430px) 100vw, 430px" />

Hierarchical clustering is an unsupervised technique to classify things based on patterns in their data. Image created with Midjourney.

Hierarchical Clustering vs. K-means

In a previous article, we have already discussed the popular clustering approach k-means. So how are k-means and hierarchical clustering different? Hierarchical clustering and k-means are both clustering algorithms that can be used to group similar data points together. However, there are several key differences between these two approaches:

The number of clusters: In k-means, the number of clusters must be specified in advance, whereas in hierarchical clustering, the number of clusters is not specified. Instead, hierarchical clustering creates a hierarchy of clusters, starting with each data point as its own cluster and then merging the most similar clusters until all data points are in a single cluster.
Cluster shape: K-means produces clusters that are spherical, while hierarchical clustering produces clusters that can have any shape. This means that k-means is better suited for data that is well-separated into distinct, spherical clusters, while hierarchical clustering is more flexible and can handle more complex cluster shapes.
Distance measure: K-means uses a distance measure, such as the Euclidean distance, to calculate the similarity between data points, while hierarchical clustering can use a variety of distance measures. This means that k-means is more sensitive to the scale of the features, while hierarchical clustering is less sensitive to the feature scale.
Computational complexity: K-means is generally faster than hierarchical clustering, especially for large datasets. This is because k-means only requires a single pass through the data to assign data points to clusters, while hierarchical clustering requires multiple passes to merge clusters.
Visualization: Hierarchical clustering produces a tree-like diagram called a “dendrogram.” The dendrogram shows the relationships between clusters. This can be useful for visualizing the structure of the data and understanding how clusters are related.

Next, let’s look at how we can implement a hierarchical clustering model in Python.

Customer Segmentation using Hierarchical Clustering in Python

In this comprehensive guide, we explore the application of hierarchical clustering for effective customer segmentation using a customer dataset. This data-driven segmentation method enables businesses to identify distinct customer clusters based on various factors, including demographics, behaviors, and preferences.

Customer segmentation is a strategic approach that splits a customer base into smaller, more manageable groups with similar characteristics. It aims to better understand the diverse needs and wants of different customer segments to enhance marketing strategies and product development.

Applying customer segmentation through hierarchical clustering allows businesses to personalize their marketing messages, design targeted campaigns, and tailor products to meet the unique needs of each segment. This proactive approach can stimulate increased customer loyalty and sales.

We begin by loading the customer data and selecting the relevant features we want to use for clustering. We then standardize the data using the StandardScaler from scikit-learn. Next, we apply hierarchical clustering using the AgglomerativeClustering method, specifying the number of clusters we want to create. Finally, we add the predictions to the original data as a new column and view the resulting segments by calculating the mean of each feature for each segment.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

The future of healthcare will see a tight collaboration between humans and AI. Image generated using Midjourney

About the Customer Health Insurance Dataset

In this tutorial, we will work with a public dataset on health_insurance_customer_data from kaggle.com. Download the CSV file from Kaggle and copy it into the following path, starting from the folder with your python notebook: data/customer/

The dataset is relatively simple and contains 1338 rows of insured customers. It includes the insurance charges, as well as demographic and personal information such as Age, Sex, BMI, Number of Children, Smoker, and Region. The dataset does not have any undefined or missing values.

Prerequisites

Before we start the coding part, ensure that you have set up your Python 3 environment and the required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy
matplotlib
scikit-learn

You can install packages using console commands:

pip install  
conda install  (if you are using the anaconda packet manager)

Step #1 Load the Data

To begin, we need to load the required packages and the data we want to cluster. We will load the data by reading the CSV file via the pandas library.

# import necessary libraries
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder
from pandas.api.types import is_string_dtype
import pandas as pd
import math
import seaborn as sns

# load customer data
customer_df = pd.read_csv("data/customer/customer_health_insurance.csv")
customer_df.head(3)

	age	sex		bmi		children	smoker	region		charges
0	19	female	27.90	0			yes		southwest	16884.9240
1	18	male	33.77	1			no		southeast	1725.5523
2	28	male	33.00	3			no		southeast	4449.4620

Step #2 Explore the Data

Next, it is a good idea to explore the data and get a sense of its structure and content. This can be done using a variety of methods, such as examining the shape of the dataframe, checking for missing values, and plotting some basic statistics. For example, the following plots will explore the relationships between some of the variables. We won’t go into too much detail here.

def make_kdeplot(df, column_name, target_name):
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis="x", rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()

# make kde plot for ext_color 
make_kdeplot(customer_df, 'smoker', 'charges')

# make kde plot for ext_color 
make_kdeplot(customer_df, 'sex', 'charges')

sns.lmplot(x="charges", y="age", hue="smoker", data=customer_df, aspect=2)
plt.show()

def make_boxplot(customer_df, x,y,h):
    fig, ax = plt.subplots(figsize=(10,4))
    box = sns.boxplot(x=x, y=y, hue=h, data=customer_df)
    box.set_xticklabels(box.get_xticklabels())
    fig.subplots_adjust(bottom=0.2)
    plt.tight_layout()

make_boxplot(customer_df, "smoker", "charges", "sex")

make_boxplot(customer_df, "region", "charges", "sex")

make_boxplot(customer_df, "children", "bmi", "sex")

Next, let’s prepare the data for model training.

Step #3 Prepare the Data

Before we can train a model on the data, we must prepare it for modeling. This typically involves selecting the relevant features, handling missing values, and scaling the data. However, we are using a very simple dataset that already has good data quality. Therefore we can limit our data preparation activities to encoding the labels and scaling the data.

To encode the categorical values, we will use label encoder from the scikit-learn library.

# encode categorical features
label_encoder = LabelEncoder()

for col_name in customer_df.columns:
    if (is_string_dtype(customer_df[col_name])):
        customer_df[col_name] = label_encoder.fit_transform(customer_df[col_name])
customer_df.head(3)

Next, we will scale the numeric variables. While scaling the data is an essential preprocessing step for many machine learning algorithms to work effectively, it is generally not necessary for hierarchical clustering. This is because hierarchical clustering is not sensitive to the scale of the features. However, when you use certain distance measures, such as Euclidean distance, scaling the data might still be useful when performing hierarchical clustering. Scaling the data can help to ensure that all of the features are given equal weight. This can be useful if you want to avoid giving more weight to features with larger scales.

# select features
X = customer_df # we will select all features

# standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled.head(3)

array([[-1.43876426, -1.0105187 , -0.45332   , ...,  1.34390459,
         0.2985838 ,  1.97058663],
       [-1.50996545,  0.98959079,  0.5096211 , ...,  0.43849455,
        -0.95368917, -0.5074631 ],
       [-0.79795355,  0.98959079,  0.38330685, ...,  0.43849455,
        -0.72867467, -0.5074631 ],
       ...,
       [-1.50996545, -1.0105187 ,  1.0148781 , ...,  0.43849455,
        -0.96159623, -0.5074631 ],
       [-1.29636188, -1.0105187 , -0.79781341, ...,  1.34390459,
        -0.93036151, -0.5074631 ],
       [ 1.55168573, -1.0105187 , -0.26138796, ..., -0.46691549,
         1.31105347,  1.97058663]])

Step #4 Train the Hierarchical Clustering Algorithm

To train a hierarchical clustering model using scikit-learn, we can use the AgglomerativeClustering or Ward class. The main parameters for these classes are:

n_clusters: The number of clusters to form. This parameter is required for AgglomerativeClustering but is not used for Ward.
affinity: The distance measure used to calculate the similarity between pairs of samples. This can be any of the distance measures implemented in scikit-learn, such as the Euclidean distance or the cosine similarity.
linkage: The method used to calculate the distance between clusters. This can be one of “ward,” “complete,” “average,” or “single.”
distance_threshold: The maximum distance between two clusters that allows them to be merged. This parameter is only used in the AgglomerativeClustering class.

To train the model, we specify the desired parameters and fit the model to the data using the fit_predict method. This method will fit the model to the data and generate predictions in one step.

# apply hierarchical clustering 
model = AgglomerativeClustering(affinity='euclidean')
predicted_segments = model.fit_predict(X_scaled)

Now we have a trained clustering model also predicted the segments for our data.

Step #5 Visualize the Results

After the model is trained, we can visualize the results to get a better understanding of the clusters that were formed. There is a wide range of plots and tools to visualize clusters. In this tutorial, we will use a scatterplot and a dendrogram.

5.1 Scatterplot

For this, we can use the lmplot function in Seaborn. The lmplot creates a 2D scatterplot with an optional overlay of a linear regression model. The plot visualizes the relationship between two variables and fits a linear regression model to the data that can highlight differences. In the following, we use this linear regression model to highlight the differences between our two cluster segments and the age of the customers.

# add predictions to data as a new column
customer_df['segment'] = predicted_segments

# create a scatter plot of the first two features, colored by segment
sns.lmplot(x="charges", y="age", hue="segment", data=customer_df, aspect=2)
plt.show()

We can see that our model has determined two clusters in our data. The clusters seem to correspond well with the smoker category, which indicates that this attribute is decisive in forming relevant groups.

5.2 Dendrogram

The hierarchical clustering approach lets us visualize relationships between different groups in our dataset in a dendrogram. A dendrogram is a graphical representation of a hierarchical structure, such as the relationships between different groups of objects or organisms. It is typically used in biology to show the relationships between different species or taxonomic groups, but it can also be used in other fields to represent the hierarchical structure of any set of data. In a dendrogram, the objects or groups being studied are represented as branches on a tree-like diagram. The branches are usually labeled with the names of the objects or groups, and the lengths of the branches represent the distances or dissimilarities between the objects or groups. The branches are also arranged in a hierarchical manner, with the most closely related objects or groups being placed closer together and the more distantly related ones being placed farther apart.

# Visualize data similarity in a dendogram
def plot_dendrogram(model, **kwargs):
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, orientation='right',**kwargs)


plt.title("Hierarchical Clustering Dendrogram")
# plot the top three levels of the dendrogram
plot_dendrogram(cluster_model, truncate_mode="level", p=4)
plt.xlabel("Euclidean Distance")
plt.ylabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

Source: This code block is based on code from the scikit-learn page

Summary

In conclusion, hierarchical clustering is a powerful tool for customer segmentation that can help businesses better understand their customer base and target their marketing efforts more effectively. By grouping customers into clusters based on their characteristics and behaviors, companies can create targeted campaigns and personalize their marketing efforts to better meet the needs of each group. Using Python and the scikit-learn library, we were able to apply an agglomerative clustering approach to a dataset of customer data and identify two distinct segments. We can then use these segments to inform our marketing strategies and get a better understanding of our customers.

By the way, customer segmentation is an area where real-world data can be prone to bias and unfairness. If you’re concerned about this, check out our latest article on addressing fairness in machine learning with fairlearn.

I hope this article was useful. If you have any feedback, please write your thoughts in the comments.

Sources and Further Reading

Articles

https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html
Images generated with OpenAI Dall-E and Midjourney.

Books on Clustering

“Data Clustering: Algorithms and Applications” by Charu C. Aggarwal: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.
“Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.

Books on Machine Learning

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Relataly articles on clustering and machine learning

Simple Clustering using K-means in Python: This article gives an overview of cluster analysis with k-means.
Clustering crypto markets using affinity propagation in Python: This article applies cluster analysis to crypto markets and creates a market map for various cryptocurrencies.
Addressing fairness in machine learning with the fairlearn library

The post How to Use Hierarchical Clustering For Customer Segmentation in Python appeared first on relataly.com.

Feature Engineering and Selection for Regression Models with Python and Scikit-learn

Florian Follonier — Mon, 26 Sep 2022 22:20:29 +0000

Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients – in this case, carefully selected input features – you can create a model that’s both accurate and powerful. This is where feature engineering comes in. It’s the process of exploring, creating, and selecting the most relevant and useful features to use in your model. And just like a chef experimenting with different spices and flavors, the process of feature engineering is iterative and tailored to the problem at hand. In this guide, we’ll walk you through a step-by-step process using Python and Scikit-learn to create a strong set of features for a regression problem. By the end, you’ll have the skills to tackle any feature engineering challenge that comes your way.

The remainder of this article proceeds as follows: We begin with a brief intro to feature engineering and describe valuable techniques. We then turn to the hands-on part, in which we develop a regression model for car sales. We apply various techniques that show how to handle outliers and missing values, perform correlation analysis, and discover and manipulate features. You will also find information about common challenges and helpful sklearn functions. Finally, we will compare our regression model to a baseline model that uses the original dataset.

Also: Sentiment Analysis with Naive Bayes and Logistic Regression in Python

What is Feature Engineering?

Feature engineering is the process of using domain knowledge of the data to create features (variables) that make machine learning algorithms work. This is an important step in the machine learning pipeline because the choice of good features can greatly affect the performance of the model. The goal is to identify features, tweak them, and select the most promising ones into a smaller feature subset. We can break this process down into several action items.

Data Scientists can easily spend 70% to 80% of their time on feature engineering. The time is well spent, as changes to input data have a direct impact on performance. This process is often iterative and requires repeatedly revisiting the various tasks as understanding the data and the problem evolves. Knowing techniques and associated challenges helps in adequate feature engineering.

Also: Mastering Prompt Engineering for ChatGPT for Business Use

Feature engineering is about carefully choosing features instead of taking all the features at once. Image created with Midjourney.

Core Tasks

The goal of feature engineering is to create a set of features that are representative of the underlying data and that can be used by the machine learning algorithm to make accurate predictions. Several tasks are commonly performed as part of the feature engineering process, including:

Data discovery: To solve real-world problems with analytics, it is crucial to understand the data. Once you have gathered your data, describing and visualizing the data are means to familiarize yourself with it and develop a general feel for the data.
Data structuring: The data needs to be structured into a unified and usable format. Variables may have a wrong datatype, or the data is distributed across different data frames and must first be merged. In these cases, we first need to bring the data together and into the right shape.
Data cleansing: Besides being structured, data needs to be cleaned. Records may be redundant or contaminated with errors and missing values that can hinder our model from learning effectively. The same goes for outliers that can distort statistics.
Data transformation: We can increase the predictive power of our input features by transforming them. Activities may include applying mathematical functions, removing specific data, or grouping variables into bins. Or we create entirely new features out of several existing ones.
Feature selection: Only some may contain valuable information from the many available variables. By sorting variables that are less relevant and selecting the most promising features, we can create models that are less complex and yield better results.

Exploratory Feature Engineering Toolset

Exploratory analysis for identifying and assessing relevant features knows several tools:

Data Cleansing
Descriptive statistics
Univariate Analysis
Bi-variate Analysis
Multivariate Analysis

Data Cleansing

Educational data is often remarkably perfect, without any errors or missing values. However, it is important to recognize that most real-world data has data quality issues. Some reasons for data quality issues are

Standardization issues because the data was recorded from different peoples, sensor types, etc.
Sensor or system outages can lead to gaps in the data or create erroneous data points.
Human errors

An important part of feature engineering is to inspect the data and ensure its quality before use. This is what we understand as “data cleansing.” It includes several tasks that aim to improve the data quality, remove erroneous data points and bring the data into a more useful form.

Cleaning errors, missing values, and other issues.
Handling possible imbalanced data
Removing obvious outliers
Standardisation, e.g., dates or adresses

Accomplishing these tasks requires a good understanding of the data. We, therefore, carry out data cleansing activities closely intertwined with other exploratory tasks, e.g., univariate and bivariate data analysis. Also, remember that visualizations can aid in the process, as they can greatly enhance your ability to analyze and understand the data.

Descriptive Statistics

One of the first steps in familiarizing oneself with a new dataset is to use descriptive statistics. Descriptive statistics help understand the data and how the sample represents the real-world population. We can use several statistical measures to analyze and describe a dataset, including the following:

Measures of Central Tendency represent a typical value of the data.
- The mean: The average-based adds together all values in the sample and divides them by the number of samples.
- The median: The median is the value that lies in the middle of the range of all sample values
- The mode: is the most occurring value in a sample set (for categorical variables)
Measures of Variability tell us something about the spread of the data.
- Range: The difference between the minimum and maximum value
- Variance: This is the average of the squared difference of the mean.
- Standard Deviation: The square root of the variance.
and Measures of Frequency inform us how often we can expect a value to be present in the data, e.g., value counts

Univariate Analysis

As “uni” suggests, the univariate analysis focuses on a single variable. Rather than examining the relationships between the variables, univariate analysis employs descriptive statistics and visualizations to understand individual columns better.

Which illustrations and measures we use depends on the type of the variable.

Categorical variables (incl. binary)

Descriptive measures include counts in percent and absolute values
Visualizations include pie charts, bar charts (count plots)

Continuous variables

Descriptive measures include min, max, median, mean, variance, standard deviation, and quantiles.
Visualizations include box plots, line plots, and histograms.

Normal distribution, univariate analysis

" data-image-caption="

Normal distribution, univariate analysis

" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output.png" alt="" class="wp-image-9261" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output.png 838w, https://www.relataly.com/wp-content/uploads/2022/09/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output.png 768w" sizes="(max-width: 838px) 100vw, 838px" />

Normal distribution, univariate analysis

Bi-variate Analysis

Bi-variate (two-variate) analysis is a kind of statistical analysis that focuses on the relationship between two variables, for example, between a feature column and the target variable. In the case of machine learning projects, bivariate analysis can help to identify features that are potentially predictive of the label or the regression target.

Model performance will benefit from strong linear dependencies. In addition, we are also interested in examining the relationships among the features used to train the model. Different types of relations exist that can be examined using various plots and statistical measures:

Numerical/Numerical

Both variables have numerical values. We can illustrate their relation using lineplots or dot plots. We can examine such relations with correlation analysis.

The ideal feature subset contains features that are not correlated with each other but are heavily correlated with the target variable. We can use dimensionality reduction to reduce a dataset with many features to a lower-dimensional space in which the remaining features are less correlated.

Traditional correlation analysis (e.g., Pearson) cannot consider non-linear relations. We can identify such a relation manually by visualizing the data, for example, using line plots. Once we denote a non-linear relation, we could try to apply mathematical transformations to one of the variables to make their relation more linear.

For pairwise analysis, we must understand which variables we deal with. We can differentiate between three categories:

Numerical/Categorical
Numerical/Numerical
Categorical/Categorical

Heatmaps illustrate the relation between features and a target variable.

Numerical/Categorical

Plots that visualize the relationship between a categorical and a numerical variable include barplots and lineplots.

Especially helpful are histograms (count plots). They can highlight differences in the distribution of the numerical variable for different categories.

A specific subcase is a numerical/date relation. Such relations are typically visualized using line plots. In addition, we want to look out for linear or non-linear dependencies.

Line charts are useful when examining trends.

Categorical/Categorical

The relation between two categorical variables can be studied, including density plots, histograms, and bar plots.

For example, with car types (attributes: sedan and coupe) and colors (characteristics: red, blue, yellow), we can use a barplot to see if sedans are more often red than coupes. Differences in the distribution of characteristics can be a starting point for attempts to manipulate the features and improve model performance.

Bar and column charts are a great way to compare numeric values for discrete categories visually.

Multivariate Analysis

Multivariate analysis encompasses the simultaneous analysis of more than two variables. The approach can uncover multi-dimensional dependencies and is often used in advanced feature engineering. For example, you may find that two variables are weakly correlated with the target variable, but when combined, their relation intensifies. So you might try to create a new feature that uses the two variables as input. Plots that can visualize relations between several variables include dot plots and violin plots.

In addition, multivariate analysis refers to techniques to reduce the dimensionality of a dataset. For example, principal component analysis (PCA) or factor analysis can condense the information in a data set into a smaller number of synthetic features.

Now that we have a good understanding of what feature selection techniques are available, we can start the practical part and apply them.

Scatter charts are useful when you want to compare two numeric quantities and see a relationship or correlation between them.

Also: Color-Coded Cryptocurrency Price Charts in Python

Feature Engineering for Car Price Regression with Python and Scikit-learn

The value of a car on the market depends on various factors. The distance traveled with the vehicle and the year of manufacture is obvious dependencies. But beyond that, we can use many other factors to train a machine learning model that predicts the selling price of the used car market. The following hands-on Python tutorial will create such a model. We will work with a dataset containing used cars’ characteristics in the following. For marketing, it is crucial to understand what car characteristics determine the price of a vehicle. Our goal is to model the car price from the available independent variables. We aim to build a model that performs well on a small but powerful input subset.

Exploring and creating features varies between different application domains. For example, feature engineering in computer vision will differ greatly from feature engineering for regression or classification models or NLP models. So the example provided in this article is just for regression models.

We follow an exploratory process that includes the following steps:

Loading the data
Cleaning the data
Univariate analysis
Bivariate analysis
Selecting features
Data preparation
Model training
Measuring performance

Finally, we compare the performance of our model, which was trained on a minimal set of features, to a model that uses the original data.

Yes, you can judge by the length of the beard that this guy is a legendary feature engineer. Image created with Midjourney.

The Python code is available in the relataly GitHub repository.

View on GitHub Relataly Github Repo

Prerequisites

Before you proceed, ensure that you have set up your Python environment (3.8 or higher) and the required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy
matplotlib
Seaborn
Scikit-learn

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

About the Dataset

In this tutorial, we will be working with a dataset containing listings for 111763 used cars. The data includes 13 variables, including the dependent target variable

prod_date: The year of production
maker: The manufacturer’s name
model: The car edition
trim: Different versions of the model
body_type: The body style of a vehicle
transmission_type: The way the power is brought to the wheels
state: The state in which the car is auctioned
condition: The condition of the cars
odometer: The distance the car has traveled since manufactured
exterior_color: Exterior color
interior_color: Interior color
sale_price (target variable): The price a car was sold
sale_date: The date on which the car has been sold

The dataset is available for download from Kaggle.com, but you can execute the code below and load the data from the relataly GitHub repository.

Car price prediction is a solid use case for machine learning. Image created with Midjourney.

Step #1 Load the Data

We begin by importing the necessary libraries and downloading the dataset from the relataly GitHub repository. Next, we will read the dataset into a pandas DataFrame. In addition, we store the name of our regression target variable to ‘price_usd,’ which is one of the columns in the initial dataset. The “.head ()” function displays the first records of our DataFrame.

# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5
from codecs import ignore_errors
import math
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from pandas.api.types import is_string_dtype, is_numeric_dtype 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.inspection import permutation_importance
from sklearn.model_selection import ShuffleSplit
# Original Data Source: 
# https://www.kaggle.com/datasets/tunguz/used-car-auction-prices
# Load train and test datasets
df = pd.read_csv("https://raw.githubusercontent.com/flo7up/relataly_data/main/car_prices2/car_prices.csv")
df.head(3)

	prod_year	maker			model		trim		body_type		transmission_type	state	condition	odometer	exterior_color	interior	sellingprice	date
0	2015		Kia				Sorento		LX			SUV				automatic			ca		5.0			16639.0		white			black		21500	2014-12-16
1	2015		Nissan			Altima		2.5 S		Sedan			automatic			ca		1.0			5554.0		gray			black		10900	2014-12-30
2	2014		Audi			A6	3.0T 	Prestige 	quattro	Sedan	automatic			ca		4.8			14414.0		black			black		49750	2014-12-16

We now have a dataframe that contains 12 columns and the dependent target variable we want to predict.

Step #2 Data Cleansing

Now that we have loaded the data, we begin with the exploratory analysis. First, we will put it into shape.

2.1 Check Names and Datatypes

If the names in a dataset are not self-explaining, it is easy to get confused with all the data. Therefore, will rename some of the columns and provide clearer names. There is no default naming convention, but striving for consistency, simplicity, and understandability is generally a good idea.

The following code line renames some of the columns.

# rename some columns for consistency
df.rename(columns={'exterior_color': 'ext_color', 
                   'interior': 'int_color', 
                   'sellingprice': 'sale_price'}, inplace=True)
df.head(1)

	prod_year	maker	model	trim	body_type	transmission_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			automatic			ca		5.0			16639.0		white		black		21500		2014-12-16

Next, we will check and remove possible duplicates.

# check and remove dublicates
print(len(df))
df = df.drop_duplicates()
print(len(df))

OUT: 111763, 111763

There were no duplicates in the data, which is good.

# check datatypes
df.dtypes

prod_year              int64
maker                 object
model                 object
trim                  object
body_type             object
transmission_type     object
state                 object
condition            float64
odometer             float64
ext_color             object
int_color             object
sale_price             int64
date                  object
dtype: object

We compare the datatypes to the first records we printed in the previous section. Be aware that categorical variables (e.g., of type “string”) are shown as “objects.” The data types look as expected.

Finally, we define our target variable’s name, “sale_price.” The target variable will be our regression target, and we will use its name often.

# consistently define the target variable
target_name = 'sale_price'

2.2 Checking Missing Values

Some machine learning algorithms are sensitive to missing values. Handling missing values is, therefore a crucial step in exploratory feature engineering.

Let’s first gain an overview of null values. With a larger DataFrame, it would be inefficient to review all the rows and columns individually for missing values. Instead, we use the sum function and visualize the results to get a quick overview of missing data in the DataFrame.

# check for missing values
null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
fig = plt.subplots(figsize=(16, 6))
ax = sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue')
pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)
ax.set_title('Overview of missing values')

overview of missing values in the car price regression dataset

" data-image-caption="

overview of missing values in the car price regression dataset

" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png" src="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression-1024x384.png" alt="overview of missing values in the car price regression dataset" class="wp-image-9365" srcset="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 1026w" sizes="(max-width: 1024px) 100vw, 1024px" />

The bar chart shows that there are several variables with missing values. Variables with many missing values can negatively affect model performance, which is why we should try to treat them.

2.3 Overview of Techniques for Handling Missing Values

There are various ways to handle missing data. The most common options to handle missing values are:

Custom substitution value: Sometimes, the information that a value is missing can be important information to a predictive model. We can substitute missing values with a placeholder value such as “missing” or “unknown.” The approach works particularly well for variables with many missing values.
Statistical filling: We can fill in a statistically chosen measure, such as the mean or median for numeric variables, or the mode for categorical variables.
Replace using Probabilistic PCA: PCA uses a linear approximation function that tries to reconstruct the missing values from the data.
Remove entire rows: It is crucial to ensure that we only use data we know is correct. In those cases, we can drop an entire row if it contains a missing value. This also solves the problem but comes at the cost of losing potentially important information – especially if the data quantity is small.
Remove the entire column: It is another alternative way of resolving missing values. This is typically the least option, as we lose an entire feature.

How we handle missing values can dramatically affect our prediction results. To find the ideal method, it is often necessary to experiment with different techniques. Sometimes, the information that a value is missing can also be important. This occurs when the missing values are not randomly distributed in the data and show a pattern. In such a case, you should create an additional feature that states whether values are missing.

2.4 Handle Missing Values

In this example, we will use the median value to fill in the missing values of our numeric variables and the mode to replace the missing values of categorical variables. When we check again, we can see that odometer and condition have no more missing values.

# fill missing values with the mean for numeric columns
for col_name in df.columns:
    if (is_numeric_dtype(df[col_name])) and (df[col_name].isna().sum() > 0):
        df[col_name].fillna(df[col_name].median(), inplace=True) # alternatively you could also drop the columns with missing values using .drop(columns=['engine_capacity']) 
print(df.isna().sum())

prod_year                0
maker                 2078
model                 2096
trim                  2157
body_type             2641
transmission_type    13135
state                    0
condition                0
odometer                 0
ext_color              173
int_color              173
sale_price               0
date                     0
dtype: int64

Next, we handle the missing values of transmission_type by filling them with the mode.

# check the distribution of missing values for transmission type
print(df['transmission_type'].value_counts())
# fill values with the mode
df['transmission_type'].fillna(df['transmission_type'].mode()[0], inplace=True)
print(df['transmission_type'].isna().sum())

automatic    108198
manual         3565
Name: transmission_type, dtype: int64
0

We handle body_type analogs as transmission_type and fill the missing values with the mode. The mode is the value that appears most often in the data. The mode of transmission_type is “Sedan.” However, this value is not that prevalent, as half of the cars have other body types, e.g., “SUV.” Therefore, we will replace the missing values with “Unknown.”

# check the distribution of missing values for body type
print(df['body_type'].value_counts())
# fill values with 'Unknown'
df['body_type'].fillna("Unknown", inplace=True)
print(df['body_type'].isna().sum())

Sedan                 39955
SUV                   23836
sedan                  8377
suv                    4934
Hatchback              4241
                      ...  
cts-v coupe               2
Ram Van                   1
Transit Van               1
CTS Wagon                 1
beetle convertible        1
Name: body_type, Length: 74, dtype: int64
0

Now we have handled most of the missing values in our data. However, some variables are still left, with a few missing values. We will make things easy and simply drop all remaining records with missing values. Considering that we have more than 100k records and only a few variables, we can afford to do this without fear of a severe impact on our model performance.

# remove all other records with missing values
df.dropna(inplace=True)
print(df.isna().sum())

prod_year            0
maker                0
model                0
trim                 0
body_type            0
transmission_type    0
state                0
condition            0
odometer             0
ext_color            0
int_color            0
sale_price           0
date                 0
dtype: int64

Finally, we check again for missing values and see that everything has been filled. Now, we have a cleansed dataset with 13 columns.

2.3 Save a Copy of the Cleaned Data

Before exploring the features, let’s make a copy of the cleaned data. We will later use this “full” dataset to compare the performance of our model with a baseline model.

# Create a copy of the dataset with all features for comparison reasons
df_all = df.copy()

Step #3 Getting started with Statistical Univariate Analysis

Now it’s time to analyze the data and explore potential useful features for our subset. Although the process follows a linear flow in this example, you may notice in practice that you must go back and forth between different steps of the feature exploration and engineering process.

First, we will look at the variance of the features in the initial dataset. Machine learning models can only learn from variables that have adequate variance. So, low-variance features are often candidates to exclude from the feature subset.

We use the .describe() method to display univariate descriptive statistics about the numerical columns in our dataset.

# show statistics for numeric variables
print(df.columns)
df.describe()

Next, we check the categorical variables. All variables seem to have a good variance. We can measure the variance with statistical measures or observe it manually using bar charts and scatterplots.

We can use histplots to visualize the distributions of the numeric variables. The example below shows the histplot for our target variable sale_price.

# Explore the variance of the target variable
variable_name = 'sale_price'
fig, ax = plt.subplots(figsize=(14,5))
sns.histplot(data=df[[variable_name]].dropna(), ax=ax, color='royalblue', kde=True)
ax.get_legend().remove()
ax.set_title(variable_name + ' Distribution')
ax.set_xlim(0, df[variable_name].quantile(0.99))

The histplot shows that sale prices are skewed to the left. This means there are many cheap cars and fewer expensive ones, which makes sense.

Next, we create bar plots for categorical values.

# 3.2 Illustrate the Variance of Numeric Variables 
f_list_numeric = [x for x in df.columns if (is_numeric_dtype(df[x]) and df[x].nunique() > 2)]
f_list_numeric
# box plot design
PROPS = {
    'boxprops':{'facecolor':'none', 'edgecolor':'royalblue'},
    'medianprops':{'color':'coral'},
    'whiskerprops':{'color':'royalblue'},
    'capprops':{'color':'royalblue'}
    }
sns.set_style('ticks', {'axes.edgecolor': 'grey',  
                        'xtick.color': '0',
                        'ytick.color': '0'})
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(len(f_list_numeric) / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(14, nrows*1))
for i, ax in enumerate(fig.axes):
    if i < len(f_list_numeric):
        column_name = f_list_numeric[i]
        sns.boxplot(data=df[column_name], orient="h", ax = ax, color='royalblue', flierprops={"marker": "o"}, **PROPS)
        ax.set(yticklabels=[column_name])
        fig.tight_layout()

We can observe two things: First, the variance of transmission type is low, as most cars have an automatic transmission. So transmission_type is the first variable that we exclude from our feature subset.

# Drop features with low variety
df = df.drop(columns=['transmission_type'])
df.head(2)

	prod_year	maker	model	trim	body_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			ca		5.0			16639.0		white		black		21500		2014-12-16
1	2015		Nissan	Altima	2.5 S	Sedan		ca		1.0			5554.0		gray		black		10900		2014-12-30

Second, int_color and ext_color have many categorical values. By grouping some of these values that hardly ever occur, we can help the model to focus on the most relevant patterns. However, before we do that, we need to take a closer look at how the target variable differs between the categories.

Step #4 Bi-variate Analysis

Now that we have a general understanding of our dataset’s individual variables, let’s look at pairwise dependencies. We are particularly interested in the relationship between features and the target variables. Our goal is to keep features whose dependence on the target variable shows some pattern – linear or non-linear. On the other hand, we want to exclude features whose relationship with the target variable looks arbitrary.

Visualizations have to take the datatypes of our variables into account. To illustrate the relation between categorical features and the target, we create boxplots and kdeplots. For numeric (continuous) features, we use scatterplots.

4.1 Analyzing the Relation between Features and the Target Variable

We begin by taking a closer look at the int_color and ext_color. We use kdeplots to highlight the distribution of prices depending on different colors.

def make_kdeplot(column_name):
    fig, ax = plt.subplots(figsize=(20,8))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis="x", rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()
    
make_kdeplot('ext_color')

make_kdeplot('int_color')

In both cases, a few colors are prevalent and account for most observations. Moreover, distributions of the car price differ for these prevalent colors. These differences look promising as they may help our model to differentiate cheaper cars from more expensive ones. To simplify things, we group the colors that hardly occur into a color category called “other.”

# Binning features
df['int_color'] = [x if  x in(['black', 'gray', 'white', 'silver', 'blue', 'red']) else 'other' for x in df['int_color']]
df['ext_color'] = [x if  x in(['black', 'gray', 'white', 'silver', 'blue', 'red']) else 'other' for x in df['ext_color']]

Next, we create plots for all remaining features.

# Vizualising Distributions
f_list = [x for x in df.columns if ((is_numeric_dtype(df[x])) and x != target_name) or (df[x].nunique() < 50)]
f_list_len = len(f_list)
print(f'numeric features: {f_list_len}')
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(f_list_len / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(18, nrows*5))
for i, ax in enumerate(fig.axes):
    if i < f_list_len:
        column_name = f_list[i]
        print(column_name)
        # If a variable has more than 8 unique values draw a scatterplot, else draw a violinplot 
        if df[column_name].nunique() > 100 and is_numeric_dtype(df[column_name]):
            # Draw a scatterplot for each variable and target_name
            sns.scatterplot(data=df, y=target_name, x=column_name, ax = ax)
        else: 
            # Draw a vertical violinplot (or boxplot) grouped by a categorical variable:
            myorder = df.groupby(by=[column_name])[target_name].median().sort_values().index
            sns.boxplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
            #sns.violinplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
        ax.tick_params(axis="x", rotation=90, labelsize=10, length=0)
        ax.set_title(column_name)
    fig.tight_layout()

Again, for categorical variables, we want to see differences in the distribution of the categories. Based on the boxplot’s median and the quantiles, we can denote that prod_year, int_color, and condition show adequate variance. The scatterplot for the odometer value also looks good. So we want to keep these features. In contrast, the differences between “state” and “ext_color” are rather weak. Therefore, we exclude these variables from our subset.

# drop columns with low variance
df.drop(columns=['state', 'ext_color'], inplace=True)

Finally, if you want to take a more detailed look at the numeric features, you can use jointplots. These are scatterplots with additional information about the distributions. The example below shows the jointplot for the odometer value vs price.

# detailed univariate and bivariate analysis of 'odometer' using a jointplot 
def make_jointplot(feature_name):
    p = sns.jointplot(data=df, y=feature_name, x=target_name, height=6, ratio=6, kind='reg', joint_kws={'line_kws':{'color':'coral'}})
    p.fig.suptitle(feature_name + ' Distribution')
    p.ax_joint.collections[0].set_alpha(0.3)
    p.ax_joint.set_ylim(df[feature_name].min(), df[feature_name].max())
    p.fig.tight_layout()
    p.fig.subplots_adjust(top=0.95)
make_jointplot ('odometer')
# Alternatively you can use hex_binning
# def make_joint_hexplot(feature_name):
#     p = sns.jointplot(data=df, y=feature_name, x=target_name, height=10, ratio=1, kind="hex")
#     p.ax_joint.set_ylim(0, df[feature_name].quantile(0.999))
#     p.ax_joint.set_xlim(0, df[target_name].quantile(0.999))
#     p.fig.suptitle(feature_name + ' Distribution')

Here is another example of a jointplot for the variable ‘condition.’

# detailed univariate and bivariate analysis of 'condition' using a jointplot 
make_jointplot('condition')

The graphs show a linear relationship between the price for the condition and the odometer value.

4.2 Correlation Matrix

Correlation analysis is a technique to quantify the dependency between numeric features and a target variable. Different ways exist to calculate the correlation coefficient. For example, we can use Pearson correlation (linear relation), Kendall correlation (ordinal association), or Spearman (monotonic dependence).

The example below uses Pearson correlation, which concentrates on the linear relationship between two variables. The Pearson correlation score lies between -1 and 1. General interpretations of the absolute value of the correlation coefficient are:

.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”

More information on the Pearson correlation can be found here and in this article on the correlation between covid-19 and the stock market.

We will calculate a correlation matrix that provides the correlation coefficient for all features in our subset, incl. sale_price.

# 4.1 Correlation Matrix
# correlation heatmap allows us to identify highly correlated explanatory variables and reduce collinearity
plt.figure(figsize = (9,8))
plt.yticks(rotation=0)
correlation = df.corr()
ax =  sns.heatmap(correlation, cmap='GnBu',square=True, linewidths=.1, cbar_kws={"shrink": .82},annot=True,
            fmt='.1',annot_kws={"size":10})
sns.set(font_scale=0.8)
for f in ax.texts:
        f.set_text(f.get_text())

All our remaining numeric features strongly correlate with price (positive or negative). However, this is not all that matters. Ideally, we want to have features that have a low correlation with each other. We can see that prod_year and condition are moderately correlated (coefficient: 0.5). Because prod_year is more correlated with price (coefficient: 0.6) than condition (coefficient: 0.5), we drop the condition variable.

df.drop(columns='condition', inplace=True)

Step #5 Data Preprocessing

Now our subset contains the following variables:

prod_year
maker
model
trim
body_type
odometer
int_color
sale_price

Next, we prepare the data for use as input to train a regression model. Before we train the model, we need to make a few final preparations. For example, we use a label encoder to replace the strong_values of the categorical variables with numeric values.

# encode categorical variables 
def encode_categorical_variables(df):
    # create a list of categorical variables that we want to encode
    categorical_list = [x for x in df.columns if is_string_dtype(df[x])]
    le = LabelEncoder()
    # apply the encoding to the categorical variables
    # because the apply() function has no inplace argument,  we use the following syntax to transform the df
    df[categorical_list] = df[categorical_list].apply(LabelEncoder().fit_transform)
    return df
df_final_subset = encode_categorical_variables(df)
df_all_ = encode_categorical_variables(df_all)
# create a copy of the dataframe but without the target variable
df_without_target = df.drop(columns=[target_name])
df_final_subset.head()

	prod_year	maker	model	trim	body_type	odometer	int_color	sale_price	date
0	2015		23		594		794		31			16639.0		0			21500		8
1	2015		34		59		98		32			5554.0		0			10900		17
2	2014		2		46		180		32			14414.0		0			49750		8
3	2015		34		59		98		32			11398.0		0			14100		13
4	2015		7		325		789		32			14538.0		0			7200		158

Step #6 Splitting the Data and Training the Model

To ensure that our regression model does not know the target variable, we separate car price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.

Once the split function has prepared the datasets, we the regression model. Our model uses the Random Decision Forest algorithm from the scikit learn package. As a so-called ensemble model, the Random Forest is a robust Machine Learning algorithm. It considers predictions from a set of multiple independent estimators.

The Random Forest algorithm has a wide range of hyperparameters. While we could optimize our model further by testing various configurations (hyperparameter tuning), this is not the focus of this article. Therefore, we will use the default hyperparameters for our model as defined by scikit-learn. Please visit one of my recent articles on hyperparameter tuning, if you want to learn more about this topic.

For comparison reasons, we train two models—one model with our subset of selected features. The second model uses all features, cleansed but without any further manipulations.

We use shuffled cross-validation (cv=5) to evaluate our model’s performance on different data folds.

def splitting(df, name):
    # separate labels from training data
    X = df.drop(columns=[target_name])
    y = df[target_name] #Prediction label
    # split the data into x_train and y_train data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
    # print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
    print(name + '')
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test
# train the model
def train_model(X, y, X_train, y_train):
    estimator = RandomForestRegressor() 
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    scores = cross_val_score(estimator, X, y, cv=cv)
    estimator.fit(X_train, y_train)
    return scores, estimator
# train the model with the subset of selected features
X_sub, y_sub, X_train_sub, X_test_sub, y_train_sub, y_test_sub = splitting(df_final_subset, 'subset')
scores_sub, estimator_sub = train_model(X_sub, y_sub, X_train_sub, y_train_sub)
    
# train the model with all features
X_all, y_all, X_train_all, X_test_all, y_train_all, y_test_all = splitting(df_all_, 'fullset')
scores_all, estimator_all = train_model(X_all, y_all, X_train_all, y_train_all)

subset
train:  (76592, 8) (76592,)
test:  (32826, 8) (32826,)

Step #7 Comparing Regression Models

Finally, we want to see how the model performs and how its performance compares against the model that uses all variables.

7.1 Model Scoring

We use different regression metrics to measure the performance. Then we create a barplot that compares the performance scores across the different validation folds (due to cross-validation).

# 7.1 Model Scoring 
def create_metrics(scores, estimator, X_test, y_test, col_name):
    scores_df = pd.DataFrame({col_name:scores})
    # predict on the test set
    y_pred = estimator.predict(X_test)
    y_df = pd.DataFrame(y_test)
    y_df['PredictedPrice']=y_pred
    # Mean Absolute Error (MAE)
    MAE = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))
    # Mean Absolute Percentage Error (MAPE)
    MAPE = mean_absolute_percentage_error(y_test, y_pred)
    print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')
    
    # calculate the feature importance scores
    r = permutation_importance(estimator, X_test, y_test, n_repeats=30, random_state=0)
    data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
    data_im['feature_names'] = X_test.columns
    data_im = data_im.sort_values('feature_permuation_score', ascending=False)
    
    return scores_df, data_im
scores_df_sub, data_im_sub = create_metrics(scores_sub, estimator_sub, X_test_sub, y_test_sub, 'subset')
scores_df_all, data_im_all = create_metrics(scores_all, estimator_all, X_test_all, y_test_all, 'fullset')
scores_df = pd.concat([scores_df_sub, scores_df_all],  axis=1)
# visualize how the two models have performed in each fold
fig, ax = plt.subplots(figsize=(10, 6))
scores_df.plot(y=["subset", "fullset"], kind="bar", ax=ax)
ax.set_title('Cross validation scores')
ax.set(ylim=(0, 1))
ax.tick_params(axis="x", rotation=0, labelsize=10, length=0)

Mean Absolute Error (MAE): 1643.39
Mean Absolute Percentage Error (MAPE): 24.36 %
Mean Absolute Error (MAE): 1813.78
Mean Absolute Percentage Error (MAPE): 25.23 %

The subset model achieves an absolute percentage error of around 24%, which is not so bad. But more importantly, our model performs better than the model that uses all features. However, the subset model is less complex as it only uses eight features instead of 12. So it is easier to understand and less costly to train.

7.2 Feature Permutation Importance Scores

Next, we calculate feature importance scores. In this way, we can determine which features attribute the most to the predictive power of our model. Feature importance scores are a useful tool in the feature engineering process, as they provide insights into how the features in our subset contribute to the overall performance of our predictive model. Features with low importance scores can be eliminated from the subset or replaced with other features.

Again we will compare our subset model to the model that uses all available features from the initial dataset.

# compare the feature importance scores of the subset model to the fullset model
fig, axs = plt.subplots(1, 2, figsize=(20, 8))
sns.barplot(data=data_im_sub, y='feature_names', x="feature_permuation_score", ax=axs[0])
axs[0].set_title("Feature importance scores of the subset model")
sns.barplot(data=data_im_all, y='feature_names', x="feature_permuation_score", ax=axs[1])
axs[1].set_title("Feature importance scores of the fullset model")

In the subset model, most features are relevant to the model’s performance. Only date and int_color do not seem to have a significant impact. For the full set model, five out of 12 features hardly contribute to the model performance (date, int_color, ext_color, state, transmission_type).

Once you have a strong subset of features, you can automate the feature selection process using different techniques, e.g., forward or backward selection. Automated feature selection techniques will test different model variants with varying feature combinations to determine the best input dataset. This step is often done at the end of the feature engineering process. However, this is something for another article.

Conclusions

That’s it for now! This tutorial has presented an exploratory approach to feature exploration, engineering, and selection. You have gained an overview of tools and graphs that are useful in identifying and preparing features. The second part was a Python hands-on tutorial. We followed an exploratory feature engineering process to build a regression model for car prices. We used various techniques to discover and sort features and make a vital feature subset. These techniques include data cleansing, descriptive statistics, and univariate and bivariate analysis (incl. correlation). We also used some techniques for feature manipulation, including binning. Finally, we compared our subset model to one that uses all available data.

If you take away one learning from this article, remember that in machine learning, less is often more. So training classic machine learning models on carefully curated feature subsets likely outperforms models that use all available information.

I hope this article was helpful. I am always trying to improve and learn from my audience. So, if you have any questions or suggestions, please write them in the comments.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Stock-market prediction is a typical regression problem. To learn more about feature engineering for stock-market prediction, check out this article on multivariate stock-market forecasting.

The post Feature Engineering and Selection for Regression Models with Python and Scikit-learn appeared first on relataly.com.

Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python

Florian Follonier — Mon, 02 May 2022 18:34:02 +0000

Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos.

By analyzing historical price data, affinity propagation groups coins into clusters based on their past price fluctuations. Such a cluster analysis enables crypto investors to identify promising entry and exit points, ultimately helping them make smarter investment decisions.

To use this technique effectively, it’s important to understand essential concepts such as covariance, lasso regression, and affinity propagation. Once you understand these concepts, you can apply them to analyze price time series data and identify hidden patterns.

Finally, visualizing the results in two and three dimensions can better understand the relationships between coins and their respective clusters. The resulting crypto market map can be a powerful tool for investors to gain insight into the market’s structure and make informed investment decisions.

Disclaimer

This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.

What is Stock Market Clustering?

Clustering stock markets refers to grouping stocks based on their similarities or common characteristics. This can be done using various clustering algorithms, which analyze the data and assign each stock market to a cluster based on its similarity to other stock markets in the same cluster. In this article, we will run a cluster analysis on historical time series data. This approach involves grouping stocks into clusters based on their historical performance over a certain period of time.

Clustering stock market data can be useful for a variety of purposes, such as identifying patterns or trends in the data, comparing the performance of different stocks or sectors, or generating investment recommendations. However, it’s important to keep in mind that clustering is just one tool among many for analyzing stock market data, and it’s important to consider a range of factors when making investment decisions. It can also be used to compare the performance of different stock markets and identify potential risks or correlations between them.

Also: Color-Coded Cryptocurrency Price Charts in Python

neural network machine learning python affinity propagation midjourney relataly crypto-min

" data-image-caption="

neural network machine learning python affinity propagation midjourney relataly crypto-min

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" alt="neural network machine learning python midjourney relataly crypto market map" class="wp-image-12694" srcset="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 509w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 140w" sizes="(max-width: 509px) 100vw, 509px" />

We can use a crypto market map to illustrate the price correlation between cryptocurrencies.

What’s the Problem with Prototype-based Clustering?

Clustering is an unsupervised learning technique that groups similar objects into clusters and separates them from different ones. One of the most popular clustering techniques is k-means. K-means belongs to the so-called prototype-based clustering techniques, which divide data points into a predefined number of groups (in the case of k-means, the groups are of equal variance).

The prototype-based clustering approach works great if the number of clusters in a dataset is known and the clusters have similar despair. However, when we deal with real-world problems, we often encounter more complex data for which the optimal number of clusters is unknown and difficult or even impossible to guess. In such a case, affinity propagation has a significant advantage because it can automatically estimate the number of clusters.

Affinity Propagation: What it is and How it Works

The idea of affinity propagation is to identify clusters by measuring the similarity of data points relative to one another. The algorithm chooses data points as cluster centers that best represent other data points near them.

We can imagine the process of identifying these representative data points as an election. Each data point (i) is a voter who casts votes and a candidate (k) who can receive votes from other voters. Votes are a measure of the similarity of data points. A voter who gives many votes to a candidate expresses that this data point is similar to him and therefore is suitable for representing him as a cluster center. The voting process continues until the algorithm reaches a consensus and selects a set number of cluster candidates.

Affinity Propagation: Data points cast votes for candidates and receive votes from other data points

The clustering process involves many separate steps (This article provides a detailed description of the steps involved) and works with several matrices:

The similarity matrix assesses the suitability of data points (candidates) to act as cluster centers.
The availability matrix (or responsibility matrix) collects the support of the data points for the candidates (potential cluster centers) and their suitability to represent them.
The criterion matrix sums up the results and defines the clusters. Data points with equal scores in the criterion matrix are considered part of the same cluster.

Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster

Time Series Clustering using Affinity Propagation – Visualizing Cryptocurrency Market Structures in Python

Ready to implement affinity propagation in Python to analyze the crypto market structure and create a visual representation of price similarity? Let’s dive in!

First, we define a portfolio of cryptocurrencies and download their historical price quotes from coinmarketcap. We then visualize the time series on separate line charts to ensure that the data has been loaded successfully. After preparing and cleaning the data, we can move on to clustering the cryptocurrencies into groups with similar price movements using Affinity Propagation.

Unlike other clustering algorithms, we don’t set the number of clusters in advance. Instead, we let affinity propagation determine the optimal number of clusters for our portfolio. Finally, we calculate the covariance matrix between clusters and arrange the cryptocurrencies on a 2D map into clusters. We create a network overlay based on covariance to better understand the relationships between different clusters.

With affinity propagation, we can identify hidden patterns in the crypto market and group coins into clusters based on their past price fluctuations. This process allows us to identify promising entry and exit points, ultimately helping us make smarter investment decisions. Plus, the 2D map and network overlay help us visualize the relationships between different clusters and coins.

We can use affinity propagation to cluster financial assets and visualize them on a map.

The Python code for this tutorial is available in the relataly repository on GitHub.

View on GitHub Relataly GitHub Repo

Prerequisites

Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. Consider Anaconda if you don’t have a Python environment set up yet. To set it up, you can follow the steps in this tutorial. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

Please also make sure you have the Cmcscaper package installed. We will be using it to download past crypto prices from coinmarketcap.

You can install these packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1: Load the Stock Market Data

We start by loading historical crypto price data from Coinmarketcap. To download the data, we use Cmcscraper, a Python library that allows us to collect Coinmarketcap data without signing up for the official API.

The download returns a dataframe with daily price quotes (Close, Open, Avg) for cryptocurrencies between 2016 and today. You can use the dictionary (“symbol_dict”) to control which cryptos you want to include in the data. We limit the data we use in our cluster analysis to the last 50 days. In this way, we let the correlation consider earlier price developments. But it’s up to you to specify a different period. In addition, instead of using absolute price values, we will use daily percentage fluctuations.

Also: Requesting Crypto Price Data from the Gate.io REST API in Python

Loading the data can take several minutes, depending on how many cryptocurrencies we include in the request. So it makes sense not to load the data every time you run the code. Therefore, the code below stores the historical prices in a CSV file.

The script will check if the data already exists if you run the code below. If it does, it will use the data from the CSV file. Otherwise, it will load a fresh copy of the data from coinmarketcap.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5

from cryptocmd import CmcScraper
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn import cluster, covariance, manifold
import requests
import json


#get a dictionary of the top 100 coin symbols and names from an API
def get_symbol_dict():
    url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=50&sortBy=market_cap&sortType=desc&convert=USD&cryptoType=all&tagType=all&audited=false'
    response = requests.get(url)
    data = json.loads(response.text)
    df = pd.DataFrame(data['data']['cryptoCurrencyList'])

    # exclude stable coins
    df = df[~df['symbol'].isin(['USDT', 'USDC', 'BUSD', 'DAI', 'TUSD', 'PAX', 'GUSD', 'HUSD', 'USDK', 'USDS', 'USDP', 'USDN', 'USDSB', 'USDX', 'USD++', 'BIDR', 'IDRT', 'VAI', 'BGBP'])]
    df = df[['symbol', 'name']]
    df = df.set_index('symbol')
    df = df.to_dict()
    df = df['name']
    return df

symbol_dict = get_symbol_dict()


# Download historic crypto prices via CmcScraper
def load_fresh_data_and_save_to_disc(symbol_dict, save_path):
    # Extract symbols and names from the symbol_dict
    symbols, names = np.array(sorted(symbol_dict.items())).T
    
    # Initialize an empty DataFrame for storing the prices
    df_crypto = pd.DataFrame()

    # Download and process the price data for each symbol
    for symbol in symbols:
        print(f"Fetching prices for {symbol}...")
        
        # Download the price data using CmcScraper
        scraper = CmcScraper(symbol)
        df_coin_prices = scraper.get_dataframe()

        # Process the price data and add it to df_crypto
        df = pd.DataFrame({
            f"{symbol}_Open": df_coin_prices["Open"],
            f"{symbol}_Close": df_coin_prices["Close"],
            f"{symbol}_Avg": (df_coin_prices["Close"] + df_coin_prices["Open"]) / 2,
            f"{symbol}_p": (df_coin_prices["Open"] - df_coin_prices["Close"]) / df_coin_prices["Open"]
        })
        df_crypto = pd.concat([df_crypto, df], axis=1)

    # Save the price data to a CSV file
    X_df_filtered = df_crypto.filter(like="_p")
    X_df_filtered.to_csv(save_path + "historical_crypto_prices.csv")

    return names, symbols, X_df_filtered
        

# If set to False the data will only be downloaded when you execute the code
# Set to True, if you want a fresh copy of the data.  
fetch_new_data = True 
save_path = '' # path where the price data will be stored in a csv file

# Fetch fresh data via the scraping package, or use data from the csv file on disk
if fetch_new_data == False:
    try:
        print('loading from disk')
        X_df_filtered = pd.read_csv(save_path + 'historical_crypto_prices.csv')
        if 'Unnamed: 0' in X_df_filtered.columns: 
            X_df_filtered = X_df_filtered.drop(['Unnamed: 0'], axis=1)
            symbols, names = np.array(sorted(symbol_dict.items())).T
        print(list(X_df_filtered.columns))
    except:
        print('no existing price data found - loading fresh data from coinmarketcap and saving them to disk')
        names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
        print(list(symbols))
else:
       print('loading fresh data from coinmarketcap and saving them to disk')
       names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
       print(list(symbols))

# Limit the price data to the last t days
t= 14 # in days
X_df_filtered = X_df_filtered[:t]
X_df_filtered.head()

	ACM_p		ADA_p		ARK_p		ATM_p		ATOM_p		AVAX_p		BAT_p		BCH_p		BLZ_p		BNB_p		...	THETA_p		UNI_p		USDT_p		VET_p		WAVES_p		XLM_p		XMR_p		XRP_p		ZIL_p		ZRX_p
0	0.031987	-0.037645	-0.005702	0.030928	-0.005897	-0.012404	-0.012262	-0.022529	0.008072	-0.007111	...	-0.021994	-0.023758	-0.000103	-0.021024	-0.015416	-0.004096	-0.022988	-0.027397	-0.016659	-0.012255
1	0.028192	0.065034	0.122306	0.010310	0.093558	0.106811	0.082863	0.075567	0.062105	0.054733	...	0.067264	0.081040	0.000136	0.077203	0.092987	0.078562	0.111519	0.071696	0.076484	0.085094
2	0.040771	0.016097	-0.133345	0.018963	0.011304	-0.033328	-0.007616	0.011458	-0.019993	0.005134	...	-0.005104	-0.024190	0.000077	0.002218	0.008920	0.004139	-0.031822	-0.012107	-0.003906	-0.021170
3	-0.027698	0.005129	-0.031516	-0.002639	0.022235	-0.008117	0.003969	0.019119	0.015403	0.005920	...	0.007992	0.027203	0.000003	0.000701	0.010739	0.005324	-0.007914	0.007168	0.004556	-0.003786
4	-0.021129	-0.019053	0.003273	-0.008121	0.002883	-0.004927	0.002548	-0.000599	0.028492	-0.012181	...	0.000198	-0.025817	-0.000047	-0.002800	-0.051515	-0.004861	0.015134	-0.000596	-0.010343	0.004530

The data looks good, so let’s continue.

Step #2 Plotting Crypto Price Charts

Now that the data is available, we can visualize it in various line graphs. The visualization helps us better understand what kind of data we are dealing with and check if the download was successful.

# Create Prices Charts for all Cryptocurrencies
list_length = X_df_filtered.shape[1]
ncols = 10
nrows = int(round(list_length / ncols, 0))
height = list_length/3 if list_length > 30 else 4
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True, figsize=(20, height))
for i, ax in enumerate(fig.axes):
        if i < list_length:
            sns.lineplot(data=X_df_filtered, x=X_df_filtered.index, y=X_df_filtered.iloc[:, i], ax=ax)
            ax.set_title(X_df_filtered.columns[i])
plt.show()

We can see the lineplots for all cryptocurrencies and everything looks as expected.

Step #3 Clustering Cryptocurrencies using Affinity Propagation

Next, we must prepare the data and run the affinity propagation algorithm. For some cryptocurrencies, we may encounter data that contains NaN values. Because clustering is sensitive to missing values, we must ensure good data quality. In addition, the Python code below will convert the DataFrame into a NumPy array and transpose it into a form where we have crypto assets as records and the days as columns.

Running the code below returns a dictionary of clusters with the cryptocurrencies assigned to them by the affinity propagation algorithm.

# Drop NaN values
X_df = pd.DataFrame(np.array(X_df_filtered)).dropna()
# Transpose the data to structure prices along columns
X = X_df.copy()
X /= X.std(axis=0)
X = np.array(X)
# Define an edge model based on covariance
edge_model = covariance.GraphicalLassoCV()
# Standardize the time series
edge_model.fit(X)
# Group cryptos to clusters using affinity propagation
# The number of clusters will be determined by the algorithm
cluster_centers_indices , labels = cluster.affinity_propagation(edge_model.covariance_, random_state=1)
cluster_dict = {}
n_labels = labels.max()
print(f"{n_labels} Clusters")
for i in range(n_labels + 1):
    clusters = ', '.join(names[labels == i])
    print('Cluster %i: %s' % ((i + 1), clusters))
    cluster_dict[i] = (clusters)

9 Clusters
Cluster 1: Binance Coin, Cake Defi
Cluster 2: Bitcoin Cash, Bitcoin, BitTorrent, Decred, EOS, Ethereum Classic, Ethereum, Ampleforth, Komodo, Solana, Sys Coin, DOT
Cluster 3: Celsius
Cluster 4: Doge Coin
Cluster 5: Cardano, ATOM, Avalance, Enjin, Internet Computer, Link, Loopring, Polygon, IOTA, NEO, Synthetix, Theta, Vechain
Cluster 6: Litecoin
Cluster 7: ACM Token, Atletico Madrid Token, Chilliz, Juventus Turin Token, PSG Token
Cluster 8: LRC
Cluster 9: Tether
Cluster 10: ARK, Battoken, BLZ, Digibyte, AS Rom Token, WAVES, Stellar Lumen, Monero, Ripple, Zilliqa, Zer0

We can see that the algorithm has identified 13 different clusters in the data and a couple of clusters with only a single member. You will most likely encounter different results depending on when you run it.

Step #4 Create a 2D Positioning Model based on the Graph Structure

In addition to clusters, we want to show the covariance between cryptocurrencies in our Crypto Market map. We need a graph-like structure that contains the covariance and position data of the cryptocurrencies for each crypto pair.

In addition, we use a node position model that calculates their relative position on a 2D plane from the covariance of the cryptocurrencies. However, the positions are only relative, so the absolute axes have no meaning.

# Create a node_position_model that find the best position of the cryptos on a 2D plane
# The number of components defines the dimensions in which the nodes will be positioned
node_position_model = manifold.LocallyLinearEmbedding(n_components=2, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are x and y coordindates for all cryptocurrencies
pd.DataFrame(embedding)
# Create an edge_model that represents the partial correlations between the nodes
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
# Only consider partial correlations above a specific threshold (0.02)
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)
# Convert the Positioning Model into a DataFrame
data = pd.DataFrame.from_dict({"embedding_x":embedding[0],"embedding_y":embedding[1]})
# Add the labels to the 2D positioning model
data["labels"] = labels
print(data.shape)
data.head()

(48, 3)
	embedding_x	embedding_y	labels
0	0.400590	-0.136473	6
1	-0.081908	-0.086039	4
2	-0.033982	-0.038526	9
3	0.416745	0.076849	6
4	-0.041938	0.031966	4

The next step is to create a graph of the partial correlations.

Step #5 Visualize the Crypto Market Structure

Our goal is to visualize differences in the covariance between crypto pairs by varying the connection strengths. We calculate the line strength by normalizing the covariance of the crypto pairs. In addition, we visualize the distribution of the covariance.

# Create an array with the segments for connecting the data points
start_idx, end_idx = np.where(non_zero) 
segments = [[np.array([embedding[:, start], embedding[:, stop]]).T, start, stop] for start, stop in zip(start_idx, end_idx)]
# Create a normalized representation of partial correlation between crypto currencies
# We can later use covariance to vizualize the strength of the connections
pc = np.abs(partial_correlations[non_zero])
normalized = (pc-min(pc))/(max(pc)-min(pc))
# plot the distribution of covariance between the cryptocurrencies
sns.histplot(pc)

The hist plot shows that the covariance between the crypto pairs is mostly below 0.005.

Finally, it is time to map cryptocurrencies on a 2D plane. To do this, we first define the cryptocurrencies using their relative position data with a scatterplot. We set the color of the points based on their clusters so that points in the same cluster are colored the same. Subsequently, we connect the points to the data from the edge model. The covariance between the crypto pairs determines the strength of their connections.

We also define the color of the connections as follows.

The map only shows connections with a covariance greater than 0.002.
Connections with a covariance greater than 0.05 are colored red.
Otherwise, connections between points within a cluster are shown in the cluster’s color.
We color connections in grey that are between points of different clusters.

Last but not least, we add the labels of the cryptocurrencies.

# Visualization
plt.figure(1, facecolor='w', figsize=(20, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])

# Plot the nodes using the coordinates of our embedding
sc = sns.scatterplot(
    data=data,
    x="embedding_x",
    y="embedding_y",
    zorder=1,
    s=350 * d ** 2,
    c=labels,
    cmap=plt.cm.nipy_spectral,
    alpha=.9,
    #palette="muted",
)

# Plot the covariance edges between the nodes (scatter points)
line_strength = 3.2
    
for index, ((x, y), start, stop) in enumerate(segments):     
    norm_partial_correlation = normalized[index]
    if list(data.iloc[[start]]['labels'])[0] == list(data.iloc[[stop]]['labels'])[0]:
        if norm_partial_correlation > 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = plt.cm.nipy_spectral(list(data.iloc[[start]]['labels'])[0] / float(n_labels)); linestyle='solid'
    else:
        if norm_partial_correlation > 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = 'grey'; linestyle='dashed'
    # Plot the edges
    # if x and y larger than 0
    if x[0] > 0 and y[0] > 0:
        plt.plot(x, y, alpha=.4, zorder=0, linewidth=normalized[index]*line_strength, color=color, linestyle=linestyle)

    
# Labels the nodes and position the labels to avoid overlap with other labels
for name, label, (x, y) in zip(names, labels, embedding.T):
    color = plt.cm.nipy_spectral(label / float(n_labels))
    ax.annotate(
        name,
        xy=(x, y),
        xytext=(5, 2),
        textcoords='offset points',
        ha='right',
        va='bottom',
        fontsize=10,
        color='black',
        bbox=dict(facecolor='w', edgecolor="w", alpha=.0),
     )

Note that you will likely see a different map when you run the code on your machine. Differences result from changes in market prices and covariance that lead to other graph structures.

Let’s see what the crypto market map tells us.

Interpreting the Cryptomarket Map

The 2D crypto market map tells us several things:

Most cryptos fall into the light green and dark green clusters corresponding to different types of crypto (Decentralized Finance Coins, NFT/Metaverse Coins).
There is a significant covariance between large-cap players in the crypto space, such as Cardano and Loopring and Ethereum and Bitcoin, which is plausible considering recent price movements. Some results are surprising, for example, the partial correlation between NEO and Ethereum Classic.
Some clusters are isolated and contain only a single member, for example, Tether, Komodo, AC Milan token, Wave token, and Dogecoin). The reason is that the prices of these coins/tokens have developed independently of the market.
- Tether is a stablecoin that does not change in price. It, therefore, strongly differs from the other cryptocurrencies on our map.
- Komodo has been trading sideways without following the general market trend.
- And the MCM token is a soccer token that has recently outperformed the market.
Soccer tokens are colored in dark blue. These tokens’ prices correlate with how the soccer clubs performed during the current season. It, therefore, makes perfect sense that these tokens are grouped into a cluster. An exception is the AC Milan token, which recently performed better than the other soccer tokens.

Step #6 Creating a 3D Representation

Instead of a 2D representation of the data points, we can also use a 3D node positioning model. For this purpose, the node positioning model distributes the affinity values over three dimensions.

# Find the best position of the cryptos on a 3D plane
node_position_model = manifold.LocallyLinearEmbedding(n_components=3, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are x and y coordindates for all cryptocurrencies
pd.DataFrame(embedding)
# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)
data = pd.DataFrame.from_dict({"embedding_x":embedding[0],"embedding_y":embedding[1],"embedding_z":embedding[1]})
data["labels"] = labels
data["names"] = names
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(projection='3d')
xs = data["embedding_x"]
ys = data["embedding_y"]
zs = data["embedding_z"]
sc = ax.scatter(xs, ys, zs, c=labels, s=100)
    
for i in range(len(data)):
    x = xs[i]
    y = ys[i]
    z = zs[i]
    label = data["names"][i]
    ax.text(x, y, z, label)
    
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 1), loc=2)
plt.show()

Summary

Affinity propagation is a powerful technique for clustering items when the optimal number of clusters is unknown. In this article, we’ve demonstrated how to apply affinity propagation to analyze the cryptocurrency market and identify groups of assets based on similar price fluctuations.

In our example, we identified 13 groups of cryptocurrencies without specifying the number of clusters in advance. We also visualized the market structure on a 2D and 3D map using a node distribution technique. This approach can be extended to analyze and cluster stock markets, highlighting complex price patterns among multiple financial assets.

Once you’ve identified clusters, you can dive deeper into individual groups. Sometimes, outliers that temporarily break out of their usual pattern indicate interesting investment opportunities. These outliers can eventually return to the price pattern of their group, or they may represent forerunners of their group, indicating broader market movements.

By using affinity propagation, we can visualize financial assets in a new and exciting way. If you have any questions or comments about this approach, please let me know.

Sources and Further Reading

This article modifies some of the code from Scikit-learn and adapts it from the stock market to cryptocurrencies.

Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python
Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
David Forsyth (2019) Applied Machine Learning Springer
Andriy Burkov (2020) Machine Learning Engineering
Images are created using Midjourney, an AI that creates images from text.

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python appeared first on relataly.com.

Leveraging Distributed Computing for Weather Analytics with PySpark

Florian Follonier — Sun, 03 Apr 2022 13:49:00 +0000

Apache Spark is a popular distributed computing framework for Big Data processing and analytics. In this tutorial, we will work hands-on with PySpark, Spark’s Python-specific interface. We built on the conceptual knowledge gained in a previous tutorial: Introduction to BigData Analytics with Apache Spark, in which we learned about the essential concepts behind Apache Spark and its distributed architecture.

The PySpark library provides access to Apache Spark APIs and other cool features like SQL, DataFrame, Streaming, Spark Core, and MLlib for machine learning. Several of these features will help us to prepare and analyze a historical dataset gathered by a weather station in Zurich. You will gain an overview of essential PySpark functions for transforming and querying data in a local computing environment.

The rest of this tutorial proceeds as follows: In the first sections, we will ingest, join, clean, and transform the data using a broad set of PySpark’s DataFrame functions. In addition, we will touch on the topic of user-defined functions to perform custom transformations. Finally, we will analyze and visualize the data using the Seaborn Python library. As part of the analysis, we work with PySpark SQL, which allows us to register datasets as tables and query them using PySpark SQL syntax.

We can use PySpark and Seaborn to create dot plots like the one above.

What is Weather Analytics?

Weather analytics is the process of using data and analytics to understand and predict weather patterns and their impacts. This can involve analyzing historical weather data to identify trends and patterns, using mathematical models to simulate and forecast future weather conditions, and using machine learning algorithms to make more accurate predictions. Weather analytics can be used for a wide range of applications, such as predicting the likelihood of severe weather events, improving agricultural planning and decision-making, and managing energy consumption and supply. By providing insights into future weather conditions, weather analytics can help businesses and organizations to make more informed decisions and mitigate the potential impacts of extreme weather.

Analyzing Historical Weather Data using PySpark

Weather and climate data are interesting resources, not only for weather forecasting but also for dozens of other purposes across different industries. A central area is certainly understanding climate change. Data scientists analyze climate data collected from weather stations around the globe to forecast how climate trends and events unfold. Because the amounts of data are often large, climate analytics is a common application area for distributed computing.

This tutorial guides us through a good use case for distributed data processing. We will process and analyze a set of historical weather data collected from a weather station in Zurich. We aim to explore the relations between climate variables such as temperate, wind, snowfall, and precipitation. However, the primary purpose of this tutorial is to familiarize ourselves with the essential functions of PySpark. Some topics covered in this part are:

reading and writing DataFrames
selecting, renaming, and manipulating columns
filtering, dropping, sorting, and aggregating rows
joining and transforming data
working with UDFs and Spark SQL functions
While we implement these functions, we will also look into what PySpark does under the hood.

PySpark can run on a simulated distributed environment on your local machine. So there is no need for expensive hardware.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Zurich weather knows all seasons.

Prerequisites

This tutorial assumes that you have set up your Python 3 environment. You can follow this tutorial to prepare your Anaconda environment if you have not set it up yet. I recommend using Anaconda or Visual Studio Code, but any other Python environment will do.

It is also assumed that you have the following packages installed: pandas, matplotlib, and seaborn for visualization. You can install the packages with the console command: pip install or, if you use the anaconda packet manager, conda install . In addition, you need to install the PySpark library.

pip install conda

install (if you are using the anaconda packet manager)

Download the Weather Data

We will train our model with a public weather dataset that contains daily weather information from Zurich, Switzerland. The data was collected at a weather station between 1979-01-01 and 2021-01-01 and has been divided into two sets, “A” and “B.” After downloading the files, place them under the following directory: root-folder/data/weather/

We will work with a set of weather data from Zurich. Image created with Midjourney.

Download File A

date in format YYYY-mm-dd
the wind direction in degrees
max snow depth in mm
precipitation in mm/m²
air pressure in hPa
max wind speed in km/h
average wind speed in km/h
daily sunshine in minutes

Download File B

date string in format YYYY-MM-DD
the minimum temperature in °C
maximum temperature in °C
avergage temperature in °C

Step #1 Initialize SparkSession

The first step of any Spark application is to create a SparkSession. Older versions of Spark were using contexts for accessing Spark’s different modules. Since Spark 2.0, SparkSession has replaced SparkContext and has become the unified entry point for accessing Spark APIs and communicating with the cluster.

We can create new SparkSessions by using the getOrCreate() method. This method checks if there is an existing SparkSession, and if it does not find a current Session, the process creates a new instance. Since we are working with a local PySpark installation, this will make a new SparkSession. If you have a SparkCluster available, you can use this cluster by providing the function with the cluster address.

PySpark is the Python-specific interface of Apache Spark

# A tutorial for this file is available at www.relataly.com

# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, isnan, when, count, udf, year, month, to_date, mean
import pyspark.sql.functions as F
import seaborn as sns
import matplotlib.pyplot as plt

# Create my_spark
spark = SparkSession.builder.getOrCreate()
print(spark)

Step #2 Load the Data to a Spark DataFrame

Next, we want to load two CSV files by invoking the PySpark read function from the SparkSession instance. The function returns a Spark DataFrame, an abstraction layer for interacting with tabular data.

2.1 Reading File A

Let’s read the first file by invoking the read method on the dataframe. Spark will evaluate the read function lazily when we run the read function in the code below. So the read operation does not trigger computation and waits until we call an action.

After running the printSchema method in the code below, the Spark engine performs the “read” operation and shows us the result. We would not see any computations if we only used the read function (a transformation) and not any actions. However, we also call summary() to compute some basic statistics on the DataFrame. These action methods trigger Spark to read the data and create a new RDD that contains the data.

Spark supports many formats for reading data, including CSV, JSON, Delta, Parquet, TXT, AVRO, etc. Our data files are in CSV format, separated by a comma. Therefore, we use the .csv function. The code below reads a CSV file into a Spark DataFrame.

DataFrame Action	Description
show()	Displays the top n rows of a DataFrame
count()	Counts the number of rows in a DataFrame
collect()	Collects the data from the worker nodes and returns them in an array
summary(), describe()	Show statistics for string and numeric columns
first(), head()	Returns the first row, or several first rows
take()	Collects the n first rows and returns them in an array

Actions Methods on the Spark DataFrame API

# Read File A
spark_weather_df_a = spark.read \
    .option("header", False) \
    .option("sep", ",") \
    .option("inferSchema", True) \
    .csv(path=f'data/weather/zurich_weather_a')
    
spark_weather_df_a.printSchema()
spark_weather_df_a.describe().show()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: string (nullable = true)

+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+
|summary|               _c0|       _c1|              _c2|               _c3|              _c4|              _c5|               _c6|               _c7|               _c8| _c9|
+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+
|  count|             15384|     15384|            15363|             15384|            15243|              885|               897|               897|               897|   0|
|   mean|182.61966978679146|      null|9.552613421857709|3.0395995839833567|7.952896411467559|156.8146892655367| 7.334894091415833|27.502229654403596|1017.9076923076926|null|
| stddev|105.74046684963878|      null|7.476849234845438| 6.527361918363559|31.45995738637225|96.54941299923699|4.8985850971655704|16.091181159085917| 8.236341299054192|null|
|    min|                 0|1979-01-01|            -18.0|               0.0|              0.0|              0.0|               2.2|               7.4|             984.7|null|
|    max|               366|2021-01-01|             28.0|              97.8|            550.0|            358.0|              40.4|             113.0|            1042.8|null|
+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+

2.2 Reading File B

In the code below, you can see the option for header, separate, and inferShema. We are using Spark’s schema inference option to read the CSV data. However, we must use this function carefully because inferring schema can affect performance. Spark is more sensitive to data types, and inferring the schema does not always work. When we are unsure how the data looks, it is a better solution to make the data types explicit to Spark. In our case, inferSchema has been tested before to ensure it works.

# Read File B
spark_weather_df_b = spark.read \
    .option("header", False) \
    .option("sep", ",") \
    .option("inferSchema", True) \
    .csv(path=f'data/weather/zurich_weather_b')

spark_weather_df_b.printSchema()
spark_weather_df_b.describe().show()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)

+-------+------------------+----------+------------------+-----------------+
|summary|               _c0|       _c1|               _c2|              _c3|
+-------+------------------+----------+------------------+-----------------+
|  count|             15383|     15383|             15374|            15368|
|   mean|182.63121627770917|      null|13.705762976453755| 6.06695731389901|
| stddev|105.73420482794731|      null|  8.83091914731549|6.620985806040894|
|    min|                 0|1979-01-01|             -15.2|            -20.8|
|    max|               366|2021-01-01|              36.0|             22.5|
+-------+------------------+----------+------------------+-----------------+

Step #3 Join the Data

Next, we will join the data from the two files and merge them into a single Spark DataFrame. Whenever we run a transformation on a DataFrame, whether via the SQL or the DataFrame API, we submit a query over to the query plan of the Spark query execution engine. The execution engine will then optimize and execute the query plan. The result is a new Resilient Distributed Dataset (RDD) with the transformed data. Keep this in mind, and imagine what happens underneath the hood as we proceed.

We now have two files containing specific columns of our weather data. To make sense of the data and prepare them for analytics, we will merge the two files into a single DataFrame. In addition, we perform several transformations to bring the data into shape.

Join the two datasets using the Join function
Remove some columns that we won’t need via the drop functions
Define new column names
Select and rename columns using Select() with Alias()

First, we drop column_c0 because we will not need it.

# Drop unused Columns
spark_weather_df_a = spark_weather_df_a.drop("_c0")
spark_weather_df_b = spark_weather_df_b.drop("_c0")

There are different ways how we can rename columns in PySpark. The first (A) is withColumnRenamed. It requires a separate function call for each column we want to rename. The second is with a custom function and a dictionary. In this way, we can rename multiple columns at once.

After renaming the columns, we use the join function to merge the two datasets based on the row column.

# method A: rename individual columns
spark_weather_df_b_renamed = spark_weather_df_b.withColumnRenamed("_c1", "date") \
    .withColumnRenamed("_c2", "max_temp") \
    .withColumnRenamed("_c3", "min_temp") 
        
# method B: rename multiple columns at once   
def rename_multiple_columns(df, columns):
    if isinstance(columns, dict):
        return df.select(*[F.col(col_name).alias(columns.get(col_name, col_name)) for col_name in df.columns])
    else:
        raise ValueError("columns need to be in dict format {'existing_name_a':'new_name_a', 'existing_name_b':'new_name_b'}")

dict_columns = {"_c1": "date2", 
                "_c2": "avg_temp", 
                "_c3": "precip",
                "_c4": "snow",
                "_c5": "wind_dir",
                "_c6": "wind_speed",
                "_c7": "wind_power",
                "_c8": "air_pressure",
                "_c9": "sunny_hours",}
spark_weather_df_a_renamed = rename_multiple_columns(spark_weather_df_a , dict_columns)

# Join the dataframes
spark_weather_df = spark_weather_df_a_renamed.join(spark_weather_df_b_renamed, spark_weather_df_a_renamed.date2 == spark_weather_df_b_renamed.date, "inner")

Step #4 Gain a Quick Overview of the Data

When we perform transformations on a new dataset, we typically want to understand the outcome and print out several functions to gain an overview of the data. We can make our lives easier by writing a small function for this purpose, which we can call as needed. Running the custom function below gives us a quick overview of the data. It takes a DataFrame as an argument and performs the following steps:

Return the two first records
Counting NAN values for all columns that have double or float datatype
Print duplicate values based on the date column. To display the result, we can use the Limit function in combination with the toPandas function
Prints the schema

def quick_overview(df):
   # display the spark dataframe
   print("FIRST RECORDS")
   print(df.limit(2).sort(col("date"), ascending=True).toPandas())

   # count null values
   print("COUNT NULL VALUES")
   print(df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c, y in df.dtypes if y in ["double", "float"]]
      ).toPandas())

   # print("DESCRIBE STATISTICS")
   # print(df.describe().toPandas())
   # # Alternatively to get the max value, we could use max_value = df.agg({"precipitation": "max"}).collect()[0][0]

   # check for dublicates
   dublicates = df.groupby(spark_weather_df.date) \
    .count() \
    .where('count > 1') \
    .limit(5).toPandas()
   print(dublicates)

   # print schema
   print("PRINT SCHEMA")
   print(df.printSchema())

quick_overview(spark_weather_df)

FIRST RECORDS
        date2  avg_temp  precip  snow  wind_dir  wind_speed  wind_power  \
0  2020-01-01      -1.8     0.0   0.0      63.0         4.2        14.8   
1  2020-01-01      -1.8     0.0   0.0      63.0         4.2        14.8   

   air_pressure sunny_hours        date  max_temp  min_temp  
0        1034.6        None  2020-01-01      -0.9      -2.9  
1        1034.6        None  2020-01-01      -0.9      -2.9  
COUNT NULL VALUES
   avg_temp  precip  snow  wind_dir  wind_speed  wind_power  air_pressure  \
0        21       0   143     14577       14565       14565         14565   

   max_temp  min_temp  
0         9        15  
         date  count
0  1986-01-01      4
1  2011-01-01      4
2  2005-01-01      4
3  1985-01-01      4
4  2006-01-01      4
PRINT SCHEMA
root
 |-- date2: string (nullable = true)
 |-- avg_temp: double (nullable = true)
 |-- precip: double (nullable = true)
...
 |-- max_temp: double (nullable = true)
 |-- min_temp: double (nullable = true)

None

As we can see, our data has duplicates and missing values. But don’t worry; we will handle these data quality issues in the next step.

Step #5 Clean the Data using Chained Operations

Spark uses a design pattern that allows us to chain multiple operations. In the next step, we use this feature when we clean and filter the data. On execution, Spark will decide the order in which it performs the transformations as part of an optimized execution plan.

We begin cleaning the data by replacing Null values. Afterward, we will reduce the data size by selecting specific columns in a separate section.

5.1 Replacing NA Values with Means

As we saw in the previous step, there are several NULL values in the temperature and wind columns. Having complete datasets without gaps is preferable when performing historical data analytics. Therefore, we will replace the missing values with the average value. Running the code below, first, calculate the mean values. Subsequently, we use these mean values to replace the Null values. In the second step, we can utilize Spark’s method-chaining functionality and perform the replace action for several columns in a single line of code.

# calculate mean values
avg = spark_weather_df.filter(spark_weather_df.avg_temp.isNotNull())\
    .select(mean(col('min_temp')).alias('mean_min'), 
            mean(col('max_temp')).alias('mean_max'), 
            mean(col('wind_speed')).alias('mean_wind')).collect()

mean_min = avg[0]['mean_min']
mean_max = avg[0]['mean_max']
mean_wind = avg[0]['mean_wind']

# replace na values with mean values
spark_weather_df = spark_weather_df \
    .na.fill(value=mean_min, subset=["min_temp"]) \
    .na.fill(value=mean_max, subset=["max_temp"]) \
    .na.fill(value=mean_wind, subset=["wind_speed"]) \
    .na.fill(value=0, subset=["snow"])

Empty DataFrame Columns: [date, count] Index: []

5.2 Removing Duplicates

We still need to remove duplicate values now that we have a complete dataset. In addition, we will use the following DataFrame methods:

orderBy: To sort the data.
With column: Select columns and convert data types (to Date type). We can find an overview of the data types supported by PySpark here.
drop: Used to eliminate individual columns from a dataset.
dropDublicates: Used to eliminate duplicate records from the dataset.

Again, we will use PySpark’s method chaining feature and invoke all transformations in a single line of code.

# remove dublicates and drop column date2, convert date to datatype "date", Sort the Data by Date
# drop date2 column, convert date column from string to date type, order by date
spark_cleaned_df = spark_weather_df.dropDuplicates()\
    .drop(col("date2"))\ # only for demonstration, actually not necessary because we are using select at the end
    .withColumn("date", to_date(col("date"),"yyyy-MM-dd")) \
    .orderBy(col("date")) \
    .select(col("date"),col("avg_temp"),col("min_temp"),col("max_temp"),col("wind_speed"),col("snow"),col("precip")) # select columns

quick_overview(spark_cleaned_df)

FIRST RECORDS
         date  avg_temp  min_temp  max_temp  wind_speed  snow  precip
0  1979-01-01      -5.3     -12.2      -2.5    7.326082   0.0     1.0
1  1979-01-02     -10.0     -12.2      -4.2    7.326082  10.0     1.7
COUNT NULL VALUES
   avg_temp  min_temp  max_temp  wind_speed  snow  precip
0         0         0         0           0     0       0

Step #6 Transform the Data using User Defined Functions (UDFs)

In addition to the standard methods, PySpark offers the possibility to execute custom Python code in the Spark cluster. Such functions are called User Defined Functions (UDF). When we use UDFs, we should be aware that we won’t achieve the same performance level as standard PySpark functions.

We want to create several bucks containing several temperature classes in the following. We can use UDFs to create these buckets. However, sometimes we can only achieve a task by using UDFs. By running the code below, we first create a normal Python function called “binner” and then hand over this function to a UDF called udf_binner_temp.

# More Infos on User Defined Functions: https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/
# create bucket column for temperature
def binner(min_temp, max_temp):
        if (min_temp is None) or (max_temp is None):
            return "unknown"
        else:
            if min_temp < -10:
                return "freezing cold"
            elif min_temp < -5:
                return "very cold"
            elif min_temp < 0:
                return "cold"
            elif max_temp < 10:
                return "normal"
            elif max_temp < 20:
                return "warm"
            elif max_temp < 30:
                return "hot"
            elif max_temp >= 30:
                return "very hot"
        return "normal"


udf_binner_temp = udf(binner, StringType() )
spark_cleaned_df = spark_cleaned_df.withColumn("temp_buckets", udf_binner_temp(col("min_temp"), col("max_temp")))
spark_cleaned_df.limit(10).toPandas()

In the same way, we create buckets to cover different levels of precipitation.

# create new columns for bucket percipitation and for month and year
udf_binner_precip = udf(lambda x: "very rainy" if x > 50 else ("rainy" if x > 0 else "dry"), StringType())
spark_cleaned_df = spark_cleaned_df \
    .withColumn("precip_buckets", udf_binner_precip("precip")) \
    .withColumn("month", month(spark_cleaned_df.date)) \
    .withColumn("year", year(spark_cleaned_df.date)) 
spark_cleaned_df.limit(5).toPandas()

	date		avg_temp	min_temp	max_temp	wind_speed	snow		precip		temp_buckets
0	1979-01-01	-5.3		-12.2		-2.5		7.326082	0.0			1.0			freezing cold
1	1979-01-02	-10.0		-12.2		-4.2		7.326082	10.0		1.7			freezing cold
2	1979-01-03	-5.8		-7.9		-3.9		7.326082	110.0		0.0			very cold
3	1979-01-04	-8.4		-10.5		-4.4		7.326082	100.0		0.0			freezing cold
4	1979-01-05	-10.0		-11.2		-7.5		7.326082	70.0		0.0			freezing cold
5	1979-01-06	-6.4		-10.1		-2.8		7.326082	50.0		0.0			freezing cold
6	1979-01-07	-3.8		-5.1		-2.9		7.326082	40.0		0.0			very cold
7	1979-01-08	-2.4		-4.5		1.4			7.326082	40.0		5.1			cold
8	1979-01-09	1.4			-0.4		3.8			7.326082	20.0		3.2			cold
9	1979-01-10	0.4			-1.4		2.8			7.326082	10.0		5.4			cold

These steps complete the data preparation. Next, we will focus on analyzing the data.

Step #7 Exemplary Data Analysis using Seaborn

We can analyze the dataset now that we have a clean and consistent set of historical weather data. To aid our analysis, we will visualize the data. Since PySpark does not yet provide its functions for visualization, we will use the Python library Seaborn for visualization.

PySpark provides two different ways to analyze data. The first is standard PySpark methods executed on a DataFrame (sums, counts, average, groupby). The second is PySpark SQL, which allows us to query the data using basic SQL syntax.

The PySpark transformation functions and PySpark SQL are optimized for distributed, parallelized computations on large data sets. In the following, we will explore both ways. This section shows how we can now use the data to create different types of plots:

Temperature scatterplot with colored dots to highlight historical developments
Heatmap on mean max temperature
Scatterplots that illustrate three-dimensional relationships

7.1 Temperature Scatterplot with Colored Dots

In this section, we create a line plot colored by temperature buckets.

# set the dimensions for the scatterplot
fig, ax = plt.subplots(figsize=(28,8))
sns.scatterplot(hue="temp_buckets", y="avg_temp", x="date", data=spark_cleaned_df.toPandas(), palette="Spectral_r")

# plot formatting 
ax.tick_params(axis="x", rotation=30, labelsize=10, length=0)

# title formatting
mindate = str(spark_cleaned_df.agg({'date': 'min'}).collect()[0]['min(date)'])
maxdate = str(spark_cleaned_df.agg({'date': 'max'}).collect()[0]['max(date)'])
ax.set_title("average temperature in Zurich: " + mindate + " - " + maxdate)
plt.xlabel("Year")
plt.show()

	year  	 month   mean(avg_temp)  mean(max_temp)  mean(min_temp)
0  1979      5       19.180000       26.940000       12.900000
1  1979      6       19.200000       25.842857       14.385714
2  1979      7       20.828571       27.071429       15.357143
3  1979      8       20.200000       27.000000       14.216667
4  1979      9       19.000000       25.400000       13.700000

The outliers to the bottom and top are interesting. While the upward temperature peaks remain relatively stable, the lower temperatures have decreased in the past decades.

7.2 Heatmap on Mean Max Temperature

Next, we want to create a heatmap illustrating how maximum temperature develops over years and months. We first use PySpark to compute the mean of the maximum temperature over years and months. Then we use seaborn to create a heatmap.

spark_cleaned_df_agg = spark_cleaned_df.select(col("year"),col("month"),col("max_temp")) \
    .filter(spark_cleaned_df.year < 2021)\
    .groupby(col("year"),col("month"))\
    .agg(mean("max_temp").alias("mean_max_temp")) \
    .orderBy(col("year")).toPandas()

plt.figure(figsize=(24,6))
avg_temp_df = spark_cleaned_df_agg.pivot("month", "year", "mean_max_temp")
sns.heatmap(avg_temp_df, cmap="Spectral_r", linewidths=.5)

7.3 Scatterplots

Next, the goal is to understand the relationship between temperature and weather effects, such as wind or snow. In the following, we will create different scatterplots to display the relationship between other variables. We will show the relationship between min (x-axis) and max temperature (y-axis). In addition, we color the dots to visualize relationships with an additional variable. The code below creates three such scatter plots. The plots display the relationships between the following variables:

The first plots should illustrate the relationship between temperature and snowfall
We want to demonstrate how wind speed depends on the daily min and max temperature
We want to create a plot that visualizes how daily min and max temperature change over the month

fig, axes= plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(30, 10))
fig.subplots_adjust(hspace=0.5, wspace=0.2)

palette = sns.color_palette("ch:start=.2,rot=-.3", as_cmap=True)
sns.scatterplot(ax = axes[0], hue="snow", size="snow", y="max_temp", x="min_temp", data=spark_cleaned_df.toPandas(), alpha=1.0, palette=palette)
axes[0].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[0].set_title("min - max temperature seperated by snow")

sns.scatterplot(ax = axes[1], hue="wind_speed", size="wind_speed", y="max_temp", x="min_temp", data=spark_cleaned_df.toPandas(), alpha=1.0, palette='rocket_r')
axes[1].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[1].set_title("min - max temperature seperated by wind speed")

sns.scatterplot(ax = axes[2], hue="month", y="max_temp", x="min_temp", data=spark_cleaned_df.toPandas(), alpha=1.0, palette='Spectral_r', hue_norm=(1,12), legend="full")
axes[2].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[2].set_title("min - max temperature seperated by month")

There are various things we can learn from these plots:

It is not surprising that most of the snowfall happens at low temperatures. We can also see that most snowfall is around zero degrees Celsius. We also observe some outliers in deep temperature and higher temperature areas.
The highest wind speeds are located in the middle of the plot, where we can also observe the most significant daily difference between the minimum and the maximum temperatures.
The last chart reflects the different seasons. Through the colored areas, we can distinguish the temperature ranges of the individual months.

Step #8 PySpark SQL

PySpark SQL combines classic SQL syntax with distributed computations, which results in increased query performance. Spark SQL is similar to conventional SQL dialects such as MySQL or Microsoft Server. However, Spark provides a dedicated set of functions for everyday data operations such as string transformations, type conversions, etc. The Spark SQL reference guide offers more information on this topic.

This section aims to extend our analysis by using PySpark SQL. We will display how the daily temperature averages (for minimum, maximum, and mean) have changed.

First, we have to register our dataset as a temporary view. Once we have done that, we can run ad-hoc queries against the View using Spark SQL commands. As you can see, we can use traditional SQL syntax here. The SQL function returns the results in the form of a DataFrame. Running the code below will perform these actions and visualize the results in a line plot.

# Data Analysis using PySpark.SQL

# register the dataset as a temp view, so we can use sq
spark_cleaned_df.createOrReplaceTempView("waeather_data_temp_view")

# see how event numbers have evolved over years
events_over_years_df = spark.sql( \
        'SELECT year, month, mean(avg_temp), mean(max_temp), mean(min_temp) \
        FROM waeather_data_temp_view \
        WHERE max_temp > 25 \
        GROUP BY month, year \
        ORDER BY year, month')
print(events_over_years_df.limit(5).toPandas())

plt.figure(figsize=(16,6))
fig = sns.lineplot(y="mean(avg_temp)", x="year", data=events_over_years_df.toPandas(), color= "orange")
fig = sns.lineplot(y="mean(max_temp)", x="year", data=events_over_years_df.toPandas(), color= "red")
fig = sns.lineplot(y="mean(min_temp)", x="year", data=events_over_years_df.toPandas(), color= "blue")
plt.grid()
plt.show()

We can express the same Spark query with the DataFrame API. The result is the same.

# Using PySpark standard functions
events_over_years_df = spark_cleaned_df\
    .filter(spark_cleaned_df.year < 2021)\
    .groupby(col("year"))\
    .agg(mean("avg_temp").alias("mean_avg_temp"), 
         mean("min_temp").alias("mean_min_temp"), 
         mean("max_temp").alias("mean_max_temp"))\
    .orderBy(col("year")).toPandas()

plt.figure(figsize=(15,6))

Summary

Congratulation, you have made your first steps with distributed computing! This tutorial has demonstrated PySpark – a distributed computing framework for big data processing and analytics. After initializing the Spark session, we prepared a weather dataset and tested several PySpark functions to process and analyze it. Among the primary data operations were:

Ingestion of CSV files with the read method
Joining PySpark DataFrames
Cleaning the data by treating duplicates and Null values
Filtering the data and selecting specific columns
Performing custom computations with UDFs
Running analytics and creating plots

Although we have used a wide range of functions, we touched only upon the surface of PySpark’s capabilities. We can do many cool things with PySpark, including real-time data streaming and efficient training of big data machine learning models. We will cover these topics in separate tutorials.

Thanks for reading. I am always happy to receive feedback on my articles. If you have any questions, please let me know in the comments.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Articles

Getting Started with Big Data Analytics – Apache Spark Concepts and Architecture

The post Leveraging Distributed Computing for Weather Analytics with PySpark appeared first on relataly.com.

Color-Coded Cryptocurrency Price Charts in Python

Florian Follonier — Tue, 19 Jan 2021 21:03:16 +0000

Are you intrigued by the fascinating world of cryptocurrency and looking to visually decipher its price trends? Welcome aboard! In this comprehensive tutorial, we will explore creating color-coded line charts using Python and Matplotlib, a powerful tool for effective analysis of changes along a third dimension.

The past few years have witnessed a meteoric rise in the prices of cryptocurrencies, underscoring the need for accurate analysis and visualization of their price trends. An outstanding illustration of this is the color-coded Bitcoin stock-to-flow chart, a popular choice in the crypto space that uses color differentiation to denote time until the next Bitcoin halving event.

Drawing inspiration from this, our tutorial will guide you to create a similar dynamic color-coded line chart, tracing the price trends of two leading cryptocurrencies – Bitcoin and Ethereum. This visual aid will provide a deeper insight into their price trajectories over time, enabling you to make informed investment decisions.

As we dive in, we’ll break down the process into digestible chunks, making it easier for beginners to follow along. By the end of this tutorial, you’ll not only have a profound understanding of how to create and interpret such color-coded charts but also gain valuable insights into the world of cryptocurrency price trends.

Disclaimer: This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only illustrate machine learning use cases.

What are Color-coded Price Charts?

Color coding is beneficial for visualizing trading signals and statistical indicators in technical chart analysis. The idea of color-coding in chart analysis is to create visually comprehensible charts that let the user quickly interpret how price develops under certain conditions. A simple example is a candlestick chart, which uses color to signal whether the price moves up (green) or down (red). Candlestick charts visualize more as regular line charts, providing additional information on the opening and closing prices.

We can use color codings in line plots to visualize conditions of various types. We can derive them from the price itself and, for example, illustrate the price development independence of oscillation indicators or moving averages. Or they can be independent of the price and represent some other conditions, such as, for example, the spread of COVID-19 cases worldwide. These are just a few examples, and there are no limits to your creativity in choosing the conditions.

Also: Requesting Crypto Price Data from the Gate.io REST API in Python

Use Cases for Color-coded Price Charts

There are various use cases for color-coded line plots in the crypto space. For example, crypto enthusiasts employ them to visualize relationships between the price of bitcoin and statistical indicators, including momentum indicators such as the RSI. Color-coded line plots have also been used to show dependencies between price and specific events that develop parallel to the bitcoin price. For example, we can use color-coding to highlight the lag between the price and the bitcoin halving every four years.

The Stock to Flow Model is an example of a Color-coded price chart (Source: lookintobitcoin.com)

Implementing Color-coded Price Charts in Python

Are you ready to elevate your data visualization skills and create visually striking price charts with Python? In this tutorial, we’ll be walking you through the creation of two dynamic line charts that use color to reveal intriguing trends and patterns. The first chart will feature a color overlay on the price line to showcase how Bitcoin prices fluctuate based on RSI. The second chart will unveil the shifting correlation between Bitcoin and Ethereum over time. Buckle up, and let’s dive in!

We’ll start by using the Coinbase Pro API to download historical price data on BTC and ETH. We’ll then calculate two well-established indicators in financial analysis: the Relative Strength Index (RSI) and the Pearson Correlation between Bitcoin and Ethereum. Finally, we’ll use Matplotlib to create stunning color-coded line charts that highlight the changes in the indicators over extended periods.

Also: Geographic Heat Maps with GeoPandas: Visualizing COVID-19

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, you can follow this tutorial to set up the Anaconda environment. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

In addition, we will be using the Historic-Crypto Python Package, which lets us easily interact with the Coinbase Pro API.

You can install packages using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Load the Price Data via the Coinbase API

We begin by downloading the historical price data on Bitcoin (BTC-USD) and Ethereum (BTC-USD) from Coinbase Pro. Don’t worry; you don’t need to download the data manually. Instead, we will use the Historic_Crypto Python package to access the data via an API.

Accessing the data via the Coinbase Pro API requires us to specify several API parameters. We define a frequency of 21600 seconds so that we will obtain price points on a 6-hour basis. In addition, we define a from_date of “2017-01-01” and add “ETH-USD” and “BTC-USD” to a list of coins for which we want to obtain the historical price data.

We query the API separately for each of the two coins in our coin list. Depending on your internet connection, this can take several minutes. The response contains three different price values:

high: the daily price high
low: the daily price low
close: the daily closing price

Later in this article, we will require all three variables to calculate the indicator values. We will therefore add the variables as columns to a new dataframe.

# Tested with Python 3.8.8, Matplotlib 3.5, Seaborn 0.11.1, numpy 1.19.5

from Historic_Crypto import HistoricalData
import pandas as pd 
from scipy.stats import pearsonr
import matplotlib.pyplot as plt 
import matplotlib.colors as col 
import numpy as np 
import datetime

# the price frequency in seconds: 21600 = 6 hour price data, 86400 = daily price data
frequency = 21600

# The beginning of the period for which prices will be retrieved
from_date = '2017-01-01-00-00'
# The currency price pairs for which the data will be retrieved
coinlist = ['ETH-USD', 'BTC-USD']

# Query the data
for i in range(len(coinlist)):
    coinname = coinlist[i]
    pricedata = HistoricalData(coinname, frequency, from_date).retrieve_data()
    pricedf = pricedata[['close', 'low', 'high']]
    if i == 0:
        df = pd.DataFrame(pricedf.copy())
    else:
        df = pd.merge(left=df, right=pricedf, how='left', left_index=True, right_index=True)   
    df.rename(columns={"close": "close-" + coinname}, inplace=True)
    df.rename(columns={"low": "low-" + coinname}, inplace=True)
    df.rename(columns={"high": "high-" + coinname}, inplace=True)
df.head()

			time	close-ETH-USD	low-ETH-USD	high-ETH-USD	close-BTC-USD	low-BTC-USD	high-BTC-USD
2017-01-01 06:00:00	8.23			8.16		8.49			975.00			964.54		975.00
2017-01-01 12:00:00	8.33			8.20		8.44			994.42			974.01		994.97
2017-01-01 18:00:00	8.18			8.08		8.37			992.95			986.86		1000.00
2017-01-02 00:00:00	8.13			8.05		8.22			1003.64			990.52		1012.00
2017-01-02 06:00:00	8.10			8.09		8.20			1024.84			1002.92		102

Step #2 Visualizing the Time Series

At this point, we have created a dataframe that contains the price “close,” “low,” and “high” for BTC-USD and ETH-USD. Next, let’s take a quick look at what the data looks like:

# Create a Price Chart on BTC and ETH
x = df.index
fig, ax1 = plt.subplots(figsize=(16, 8), sharex=False)

# Price Chart for BTC-USD Close
color = 'tab:blue'
y = df['close-BTC-USD']
ax1.set_xlabel('time (s)')
ax1.set_ylabel('BTC-Close in $', color=color, fontsize=18)
ax1.plot(x, y, color=color)
ax1.tick_params(axis='y', labelcolor=color)
ax1.text(0.02, 0.95, 'BTC-USD',  transform=ax1.transAxes, color=color, fontsize=16)

# Price Chart for ETH-USD Close
color = 'tab:red'
y = df['close-ETH-USD']
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.set_ylabel('ETH-Close in $', color=color, fontsize=18)  # we already handled the x-label with ax1
ax2.plot(x, y, color=color)
ax2.tick_params(axis='y', labelcolor=color)
ax2.text(0.02, 0.9, 'ETH-USD',  transform=ax2.transAxes, color=color, fontsize=16)

Next, we add two indicator values to our dataframe that we can later use to color the price chart.

Step #3 Calculate Indicator Values

The color overlay of the price chart is typically used to illustrate the relation between price and another variable, such as a statistical indicator. To demonstrate how this works, we will calculate two indicators and add them to our dataframe:

3.1 The Relative Strength Index

The Relative Strength Index (RSI) is a momentum indicator that signals the strength of a price trend. Its value range from 0 to 100%. A value above 70% signals that an asset is likely overbought. An overbought level is an area where the market is highly bullish and might decline. A value below 30% is typically a sign of an oversold condition. An oversold level is where the market is extremely bearish, and the price tends to reverse to the upper side.

3.2 The Pearson Correlation Coefficient

Pearson Correlation Coefficient: This indicator measures the correlation between two sets of stochastic variables. Its values range from -1 to 1. A value of 1 would imply a perfect stochastic correlation. For example, if the price of BTC changes by X percentage in a given period, we can expect ETH to experience the exact price change. A value of -1 would imply a perfect inverse correlation. For example, if the price of BTC were to increase by Y percent, we would also expect the ETH price to decrease by Y percent. A value of 0 implies no correlation. To learn more about correlation, check out my article about correlation in Python.

We embed the logic for calculating the two indicators in a different method called “add_indicators.”

def add_indicators(df):
    # Calculate the 30 day Pearson Correlation 
    cor_period = 30 #this corresponds to a monthly correlation period
    columntobeadded = [0] * cor_period
    df = df.fillna(0) 
    for i in range(len(df)-cor_period):
        btc = df['close-BTC-USD'][i:i+cor_period]
        eth = df['close-ETH-USD'][i:i+cor_period]
        corr, _ = pearsonr(btc, eth)
        columntobeadded.append(corr)    
    # insert the colours into our original dataframe    
    df.insert(2, "P_Correlation", columntobeadded, True)

    # Calculate the RSI
    # Moving Averages on high, lows, and std - different periods
    df['MA200_low'] = df['low-BTC-USD'].rolling(window=200).min()
    df['MA14_low'] = df['low-BTC-USD'].rolling(window=14).min()
    df['MA200_high'] = df['high-BTC-USD'].rolling(window=200).max()
    df['MA14_high'] = df['high-BTC-USD'].rolling(window=14).max()

    # Relative Strength Index (RSI)
    df['K-ratio'] = 100*((df['close-BTC-USD'] - df['MA14_low']) / (df['MA14_high'] - df['MA14_low']) )
    df['RSI'] = df['K-ratio'].rolling(window=3).mean() 

    # Replace nas 
    #nareplace = df.at[df.index.max(), 'close-BTC-USD']    
    df.fillna(0, inplace=True)    
    return df
    
dfcr = add_indicators(df)

At this point, we have added the RSI and the Correlation Coefficient to our dataframe. Let’s quickly visualize the two indicators in a line chart.

# Visualize measures
fig, ax1 = plt.subplots(figsize=(22, 4), sharex=False)
plt.ylabel('ETH-BTC Price Correlation', color=color)  # we already handled the x-label with ax1
x = y = dfcr.index
ax1.plot(x, dfcr['P_Correlation'], color='black')
ax2 = ax1.twinx()
ax2.plot(x, dfcr['RSI'], color='blue')
plt.tick_params(axis='y', labelcolor=color)

plt.show()

You may have noticed that the indicators remain at 0 at the time series beginning. However, this is perfectly fine. Since both indicators are calculated retrospectively, no values are available initially.

Step #4 Converting Indicator Values to Color Codes

Before creating the price charts, we have to color code the indicator values. We normalize the values and then assign a color to each indicator value using a color scale. We attach the colors to our existing dataframe to quickly access them when creating the plots.

# function that converts a given set of indicator values to colors
def get_colors(ind, colormap):
    colorlist = []
    norm = col.Normalize(vmin=ind.min(), vmax=ind.max())
    for i in ind:
        colorlist.append(list(colormap(norm(i))))
    return colorlist

# convert the RSI                         
y = np.array(dfcr['RSI'])
colormap = plt.get_cmap('plasma')
dfcr['rsi_colors'] = get_colors(y, colormap)

# convert the Pearson Correlation
y = np.array(dfcr['P_Correlation'])
colormap = plt.get_cmap('plasma')
dfcr['cor_colors'] = get_colors(y, colormap)

In our dataframe, two additional columns contain the color values for the two indicators. Now that we have all the data in our dataframe, the next step is creating the price charts.

Step #5 Creating Color-Coded Price Charts

Next, we use the color values to create two different color-coded price charts.

5.1 Bitcoin Price Chart Colored by RSI

We color the chart with the strength of the correlation between Bitcoin and Ethereum. Light-colored fields signal phases of a strong correlation. Price points colored in dark blue indicate phases where the correlation between the price movements of the two cryptocurrencies was negative.

# Create a Price Chart
pd.plotting.register_matplotlib_converters()
fig, ax1 = plt.subplots(figsize=(18, 10), sharex=False)
x = dfcr.index
y = dfcr['close-BTC-USD']
z = dfcr['rsi_colors']

# draw points
for i in range(len(dfcr)):
    ax1.plot(x[i], np.array(y[i]), 'o',  color=z[i], alpha = 0.5, markersize=5)
ax1.set_ylabel('BTC-Close in $')
ax1.tick_params(axis='y', labelcolor='black')
ax1.set_xlabel('Date')
ax1.text(0.02, 0.95, 'BTC-USD - Colored by RSI',  transform=ax1.transAxes, fontsize=16)

# plot the color bar
pos_neg_clipped = ax2.imshow(list(z), cmap='plasma', vmin=0, vmax=100, interpolation='none')
cb = plt.colorbar(pos_neg_clipped)

From the color overlay in the chart, we can tell that the RSI is low mainly (dark blue) when the Bitcoin price has seen a substantial decline and high (yellow) when the price has risen.

5.2 Bitcoin Price Chart colored by BTC-ETH Correlation

In this section, we will create another price chart for Bitcoin. This time we color code the price trend with the RSI. High RSI values are yellow, and low values are dark blue. Running the code below will create the color-coded bitcoin chart.

# create a price chart
pd.plotting.register_matplotlib_converters()
fig, ax1 = plt.subplots(figsize=(18, 10), sharex=False)
x = dfcr.index # datetime index
y = dfcr['close-BTC-USD'] # the price variable
z = dfcr['cor_colors'] # the color coded indicator values

# draw points
for i in range(len(dfcr)):
    ax1.plot(x[i], np.array(y[i]), 'o',  color=z[i], alpha = 0.5, markersize=5)
ax1.set_ylabel('BTC-Close in $')
ax1.tick_params(axis='y', labelcolor='black')
ax1.set_xlabel('Date')
ax1.text(0.02, 0.95, 'BTC-USD - Colored by 50-day ETH-BTC Correlation',  transform=ax1.transAxes, fontsize=16)

# plot the color bar
pos_neg_clipped = ax2.imshow(list(z), cmap='Spectral', vmin=-1, vmax=1, interpolation='none')
cb = plt.colorbar(pos_neg_clipped)

The chart shows that the correlation between Bitcoin and Ethereum (yellow color) was strong when the price of bitcoin rose. So when Bitcoin is in a bull market, Ethereum tends to follow a similar price logic. In contrast, the correlation was weak when the Bitcoin price declined (dark blue).

Summary

In this article, we demonstrated how to use Python and Seaborn to create a price chart that incorporates color as a third dimension. We used the Bitcoin price as an example and created two color-coded charts: one that highlights the RSI, and another that highlights the Pearson Correlation between Bitcoin and Ethereum.

By using color as an overlay, it is possible to highlight many interesting relationships in time-series data. A well-known example from the cryptocurrency world is the Bitcoin Rainbow Chart. This technique can be used to bring attention to various trends and patterns in the data.

I hope this article has helped to bring you closer to charts in Python. I am always interested to receive feedback from my audience. So, let me know if you liked this content, and if you have any questions, please post them in the comments.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

And if you are interested in stock-market prediction, check out the following articles:

The post Color-Coded Cryptocurrency Price Charts in Python appeared first on relataly.com.

Geographic Heat Maps with GeoPandas: Visualizing COVID-19 Data in Python

Florian Follonier — Wed, 08 Apr 2020 22:03:00 +0000

The spreading of COVID-19 has led to an increased interest in displaying region and country-specific information on geographic heat maps. Geographic heat maps use color shadings to visualize data that includes a spatial component and refers, for example, to countries, cities, towns, mountains, etc. The color shades are defined in a color palette and determined by numerical values on a scale. In this way, geographic heat maps give the viewer a quick overview of what is happening in different regions. This tutorial shows how to create geographic heat maps in Python using the GeoPandas library. We will work with COVID-19 data and visualize it using various color-coded maps.

The rest of this article proceeds as follows: We begin by going through the steps to visualize COVID-19 data on a geographic heat map. We will be using the GeoPandas library to plot the maps. Geopandas is an open-source project for working with geospatial data in Python. Our heat map will use color shades to visualize growth rates and total cases of COVID-19 in different countries. In addition, we will zoom in on specific map regions.

Also: Predictive Policing: Preventing Crime in San Francisco using XGBoost

What are Geographic Heat Maps?

Geographic heat maps are visual representations of data that use color or other visual encodings to show the density or intensity of data points in a geographic region. They are commonly used to represent data that is associated with a geographic location, such as population data, economic data, or weather data.

Geographic heat maps are typically created by overlaying a grid or mesh on a map and then assigning a color or other visual encoding to each grid cell based on the density or intensity of data points in that cell. The resulting heat map shows the distribution or pattern of data points across the geographic region and can provide valuable insights and information about the data.

Geographic heat map showing COVID-19 growth rates in different countries of the world. In this Python tutorial we will create similar maps.

Also: Color-Coded Cryptocurrency Price Charts in Python

What are Geographic Heat Maps used for?

Geographic heat maps are used for a variety of purposes, such as:

Visualizing data: geographic heat maps can provide a clear and intuitive way to visualize data that is associated with a geographic location, allowing analysts and users to quickly and easily understand the data and identify patterns, trends, and relationships.
Identifying spatial patterns: geographic heat maps can help to identify spatial patterns or trends in the data, such as clusters, outliers, or trends over time. This can provide valuable insights and information about the data and can help to inform decision-making and analysis.
Analyzing and comparing data: geographic heat maps can be used to compare and contrast different datasets or to analyze the relationship between different variables or data sources. This can help to identify correlations, trends, or patterns that may not be immediately apparent from the raw data.

What are the Potential Pitfalls of Using Geographic Heat Maps?

While geographic heat maps are useful for all kinds of purposes, there are a few potential limitations and pitfalls to consider when using heat maps:

Choosing an appropriate color scale: It’s essential to choose a color scale that accurately reflects the data being represented and is easy for viewers to interpret. If the color scale is not well-suited to the data, it can be difficult for viewers to understand the patterns being shown accurately.
Overloading the map with too much data: It’s possible to add too much data to a heat map, which can make it difficult to interpret and potentially obscure important patterns. It’s important to balance the need for detail with the need for clarity when creating a heat map.
Visual distortion: When working with large or irregularly shaped regions, it can be challenging to depict the data using a heat map accurately. This can lead to visual distortion, where the map does not accurately reflect the actual distribution of the data.
Misinterpretation of the data: Heat maps are a visual representation of data, which can be subject to misinterpretation. It’s important to carefully consider how the data is represented and provide clear context and explanations for the presented patterns.

Let’s keep these potential pitfalls in mind during the following tutorial.

What is GeoPandas?

GeoPandas is a Python package that provides tools for working with geospatial data. It extends the popular pandas package, which provides data manipulation and analysis tools, to include support for geographic data. GeoPandas allows users to manipulate and analyze geospatial data in a familiar pandas DataFrame structure and includes functions for reading and writing spatial data in various formats, as well as tools for visualizing and mapping data.

GeoPandas is built on top of other popular packages, such as Shapely and Fiona, and is a popular choice for working with geospatial data in Python.

With GeoPandas, users can:

Read and write spatial data in various formats, such as Shapefile, GeoJSON, and GeoPackage.
Perform geometric operations on spatial data, such as buffering, intersection, and union.
Create maps and visualize spatial data using matplotlib, a popular Python plotting library.
Analyze and manipulate spatial data in a pandas DataFrame structure, allowing users to use the powerful data manipulation and analysis tools provided by pandas.

Creating Geographic Heat Maps with Python and GeoPandas

In this tutorial, we will learn how to create geographic heat maps using Python and the GeoPandas package. Geographic heat maps are visualizations that show the intensity of data at different locations on a map. They are commonly used to represent the distribution of a variable across a geographic area, and can be useful for identifying patterns, trends, and anomalies in the data. In this tutorial, we will learn how to create geographic heat maps using Python and the GeoPandas package. In the following, we will walk through the steps of loading, manipulating, and visualizing spatial data with GeoPandas, and demonstrate how to create geographic heat maps for covid-19.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow the steps in this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. We will be working with the following standard packages:

You can install packages using console commands:

pip install 
conda install  (if you are using the anaconda packet manager)

We will create geographic heat maps with the GeoPandas Python library. You can install GeoPandas via the console by using the following command:

conda install –channel conda-forge geopandas
pip install geopandas

Update (2020-09-23): With the release of Python 3.8, there is a new install procedure:

conda create -n geo_env
conda activate geo_env
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
conda install python=3 geopandas

Download the Geographic Map Data From Naturalearthdata

First, we will get the map with the geospatial data. Rendering maps with GeoPandas requires a shapefile. A shapefile is a DataFrame with some graphical data attached. For instance, some shapefiles show cities, countries, continents, or maps of the entire world. So in our case, the shapefile is a list of countries, whereby each country has its graphical representation in polygons. The example presented in this tutorial will use a world map.

Various sources on the web provide shapefiles for different geographical regions and in varying detail. For example, n aturalearthdata.com provides a map of the world. To download the map, go to the natualearthdata webpage, and with a click on the green button, you can download version 4.1.0.

Once the download is complete, unpack the files into the folder of your Python notebook or a subfolder in the folder of your Python notebook (e.g., data/shapefiles/worldmap/).

naturalearthdata.com

Step #1 Loading the COVID-19 Data

Next, we retrieve the COVID-19 data for all countries via the statworx API. If you want to learn more about using REST APIs, check out this tutorial on accessing data sources via REST APIs.

Also: Accessing Remote Data Sources via REST APIs in Python

# Setting up Packages
import json
import country_converter as coco
from datetime import datetime, timedelta
import requests
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# Getting the data
PAYLOAD = {'code': 'ALL'}
URL = 'https://api.statworx.com/covid'
RESPONSE = requests.post(url=URL, data=json.dumps(PAYLOAD))
# Convert the response to a data frame
covid_df = pd.DataFrame.from_dict(json.loads(RESPONSE.text))
covid_df.head(3)

      date		day	month	year	cases	deaths	country		code	population	continent	cases_cum	deaths_cum
0	2019-12-31	31	12		2019	0		0		Afghanistan	AF		38041757.0	Asia		0			0
1	2020-01-01	1	1		2020	0		0		Afghanistan	AF		38041757.0	Asia		0			0
2	2020-01-02	2	1		2020	0		0		Afghanistan	AF		38041757.0	Asia		0			0

We continue by preparing the COVID-19 data for visualizing them on a heat map.

Step #2 Specifying a Shapefile

Next, we use the Geopandas library to read in a shapefile at “data/shapefiles/worldmap/ne_10m_admin_0_countries.shp”. We then select the columns “ADMIN,” “ADM0_A3”, and “geometry” from the shapefile and store them in a GeoDataFrame called “geo_df.” Finally, we display the first three rows of the GeoDataFrame.

# Setting the path to the shapefile
SHAPEFILE = 'data/shapefiles/worldmap/ne_10m_admin_0_countries.shp'
# Read shapefile using Geopandas
geo_df = gpd.read_file(SHAPEFILE)[['ADMIN', 'ADM0_A3', 'geometry']]
# Rename columns.
geo_df.columns = ['country', 'country_code', 'geometry']
geo_df.head(3)

	country		country_code	geometry
0	Indonesia	IDN				MULTIPOLYGON (((117.70361 4.16341, 117.70361 4...
1	Malaysia	MYS				MULTIPOLYGON (((117.70361 4.16341, 117.69711 4...
2	Chile		CHL				MULTIPOLYGON (((-69.51009 -17.50659, -69.50611

We have created a dataframe with three columns, as you can see above. The column geometry contains the graphical representation of countries. Now that we have prepared the data, we can plot our first geographic map. We create the map by using the GeoPandas plot function.

# Drop row for 'Antarctica'. It takes a lot of space in the map and is not of much use
geo_df = geo_df.drop(geo_df.loc[geo_df['country'] == 'Antarctica'].index)
# Print the map
geo_df.plot(figsize=(20, 20), edgecolor='white', linewidth=1, color='lightblue')

If you get an error: “ImportError: The Descartes package is required for plotting polygons in GeoPandas.” you first have to install the Descartes package. You can do this by typing in your console: conda install descartes

Step #3 Bringing It All Together

Next, we need to ensure that our data matches the country codes. The dataframe with the geospatial data of the world map contains country codes that adhere to iso3. However, our COVID-19 data uses iso2_codes. Luckily there is a country_converter available that does this job for us:

# Next, we need to ensure that our data matches with the country codes. 
iso3_codes = geo_df['country'].to_list()
# Convert to iso3_codes
iso2_codes_list = coco.convert(names=iso3_codes, to='ISO2', not_found='NULL')
# Add the list with iso2 codes to the dataframe
geo_df['iso2_code'] = iso2_codes_list
# There are some countries for which the converter could not find a country code. 
# We will drop these countries.
geo_df = geo_df.drop(geo_df.loc[geo_df['iso2_code'] == 'NULL'].index)

We have a list with all nations’ names (country) and codes (country_code). An additional column includes the geographical representation of each country.

Step #4 Preprocessing

Our COVID-19 data so far contains historical Covid-19 cases. We want to drop these historical cases and only get the data from the last day. Then we merge the data frames.

Before we plot the heat map, we have to specify a variable that determines the color of the countries on the map. Our goal is to color the countries depending on the growth rate of COVID-19 cases per day. The formula for the growth rate is ‘new cases’ / total present cases.

# We want to drop the history and only get the data from the last day
d = datetime.today()-timedelta(days=1)
date_yesterday = d.strftime("%Y-%m-%d")
# Preparing the data
covid_df = covid_df[covid_df['date'] == date_yesterday]
# Merge the two dataframes
merged_df = pd.merge(left=geo_df, right=covid_df, how='left', left_on='iso2_code', right_on='code')
# Delete some columns that we won't use
df = merged_df.drop(['day', 'month', 'year', 'country_y', 'code'], axis=1)
#Create the indicator values
df['case_growth_rate'] = round(df['cases']/df['cases_cum'], 2)
df['case_growth_rate'].fillna(0, inplace=True) 
df.head(3)

	country_x	country_code		geometry												iso2_code	date		cases	deaths	population	continent	cases_cum	deaths_cum	case_growth_rate
0	Indonesia	IDN					MULTIPOLYGON 	(((117.70361 4.16341, 117.70361 4...	ID			2020-06-28	1385.0	37.0	270625567.0	Asia		52812.0		2720.0		0.03
1	Malaysia	MYS					MULTIPOLYGON 	(((117.70361 4.16341, 117.69711 4...	MY			2020-06-28	10.0	0.0		31949789.0	Asia		8616.0		121.0		0.00
2	Chile		CHL					MULTIPOLYGON 	(((-69.51009 -17.50659, -69.50611...	CL			2020-06-28	4406.0	279.0	18952035.0	America		267766.0	5347.0		0.02

Step #5 Creating a Geographic Heat Map

In the previous step, we set up the data for our map. Next, we create the geographical heat map for the world.

We set the path to the shapefile and use Geopandas to read it. We then rename these columns. Next, we set the range for the choropleth and create a figure and axes for Matplotlib. We remove the axis and plot the choropleth using the data from the ‘df’ dataframe and the ‘case_growth_rate’ column, setting the edgecolor, linewidth, and cmap. We also add a title to the map and an annotation for the data source. Additionally, we create a colorbar as a legend, using the ScalarMappable function, and add it to the figure with a specified position.

# Print the map
# Set the range for the choropleth
title = 'Daily COVID-19 Growth Rates'
col = 'case_growth_rate'
source = 'Source: relataly.com \nGrowth Rate = New cases / All previous cases'
vmin = df[col].min()
vmax = df[col].max()
cmap = 'viridis'
# Create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(20, 8))
# Remove the axis
ax.axis('off')
df.plot(column=col, ax=ax, edgecolor='0.8', linewidth=1, cmap=cmap)
# Add a title
ax.set_title(title, fontdict={'fontsize': '25', 'fontweight': '3'})
# Create an annotation for the data source
ax.annotate(source, xy=(0.1, .08), xycoords='figure fraction', horizontalalignment='left', 
            verticalalignment='bottom', fontsize=10)
            
# Create colorbar as a legend
sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap=cmap)
# Empty array for the data range
sm._A = []
# Add the colorbar to the figure
cbaxes = fig.add_axes([0.15, 0.25, 0.01, 0.4])
cbar = fig.colorbar(sm, cax=cbaxes)

Geographic heat map showing COVID-19 growth rates in different countries of the world

As shown in the map above, countries in Central Asia and Africa currently report the highest COVID-19 growth rates.

There are different color palettes. You can use them by altering the cmap variable. Below is a sample of ready-to-use color scales. You can find more color scales on the matblotlib page.

Colormaps

Step #6 Zooming in on Specific Regions

We have observed that many African countries are currently reporting rising case numbers, so we create a new dataframe based on a filter for African countries using the list of country codes.

In the following, we create a geographic map specifically for Africa. We can zoom in on a continent or a country by filtering our dataframe. The code below will filter the spatial-geo data to African countries and plot the heat map. We plot the map for Africa using this new dataframe, setting the title of the map to ‘COVID-19 Growth Rate per Day in Africa’ and adding a source annotation to the bottom left corner of the map.

# The map shows that many african countries are currently reporting increasing case numbers
# Next we create a new df based on a filter for african countries
africa_country_list = ['ZM', 'BF', 'TZ', 'EG', 'UG', 'TN', 'TG', 'SZ', 'SD', 
                       'EH', 'SS', 'ZW', 'ZA', 'SO', 'SL', 'SC', 'SN', 'ST', 
                       'SH', 'RW', 'RE', 'GW', 'NG', 'NE', 'NA', 'MZ', 'MA', 
                       'MU', 'MR', 'ML', 'MW', 'MG', 'LY', 'LR', 'LS', 'KE', 
                       'CI', 'GN', 'GH', 'GM', 'GA', 'DJ', 'ER', 'ET', 'GQ', 
                       'BJ', 'CD', 'CG', 'YT', 'KM', 'TD', 'CF', 'CV', 'CM', 
                       'BI', 'BW', 'AO', 'DZ']
africa_map_df = df[df['iso2_code'].isin(africa_country_list)]
# Plot the map for Africa
title = 'COVID-19 Growth Rate per Day in Africa'
col = 'case_growth_rate'
source = 'Source: relataly.com \nGrowth Rate = New cases / All previous cases'
vmin = df[col].min()
vmax = df[col].max()
fig, ax = plt.subplots(1, figsize=(20, 9))
ax.axis('off')
africa_map_df.plot(column=col, ax=ax, edgecolor='0.8', linewidth=1, cmap=cmap)
ax.set_title(title, fontdict={'fontsize': '25', 'fontweight': '3'})
ax.annotate(source, xy=(0.24, .08), xycoords='figure fraction',
            horizontalalignment='left',
            verticalalignment='bottom', fontsize=10)
sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap=cmap)
cbaxes = fig.add_axes([0.35, 0.25, 0.01, 0.5])
{"type":"block","srcIndex":53,"srcClientId":"2ddd9666-6def-46e0-803e-4bf7b0366a27","srcRootClientId":""}cbar = fig.colorbar(sm, cax=cbaxes)

Geographic heat map of Africa showing COVID-19 growth rates in different countries

In case you encounter an error with the mapclassify-package, you can try the following command to reinstall it: conda install -c conda-forge mapclassify

Voilá, now we only see the African continent. The map shows that the countries in Africa that currently report the highest total case numbers are South Africa, Algeria, Morocco, Kamerun, and Egypt.

Let’s take a look at the total cases per country in Africa:

# Insert cases per population
# Alternative: africa_map_df2['cases_population'] = round(africa_map_df['cases_cum'] / africa_map_df['population'] * 100)
africa_map_df2 = africa_map_df.copy()
# Remove NAs
africa_map_df2.loc[: , 'cases_cum'].fillna(0, inplace=True)
# Show the data
africa_map_df2.head()
# Plot the map
title = 'Total COVID-19 Cases on the African Continent'
col = 'cases_cum'
source = 'Source: relataly.com '
vmin = africa_map_df2[col].min()
vmax = africa_map_df2[col].max()
fig, ax = plt.subplots(1, figsize=(20, 9))
ax.axis('off')
africa_map_df2.plot(column=col, ax=ax, edgecolor='1', linewidth=1, cmap=cmap)
ax.set_title(title, fontdict={'fontsize': '25', 'fontweight' : '3'})
ax.annotate(
    source, xy=(0.24, .08), xycoords='figure fraction', horizontalalignment='left', 
    verticalalignment='bottom', fontsize=10)
sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap=cmap)
cbaxes = fig.add_axes([0.35, 0.25, 0.01, 0.5])
cbar = fig.colorbar(sm, cax=cbaxes)

The highest growth rate was reported by South Sudan, followed by Botswana and Niger.

Step #7 Saving a Geo-Heat Maps to PNG

If you want to save the map, you can do this with the following command.

# Safe the map to a png
fig.savefig('map_export.png', dpi=300)

Summary

This article showed how to create geographic heat maps using the Geopandas library in Python. It showed how to read in a shapefile and create a choropleth map using the data from a dataframe. Additionally, the article explained how to filter the data to display maps of specific regions, in this case Africa. We showed how to prepare spatial data and color-code the maps using COVID-19 data. In addition, we filtered the DataFrame to create maps for specific regions, zoom in on specific areas, and alter the color style using different color maps.

Geographic heat maps can provide valuable insights into the distribution of data and help to identify patterns and trends. The technique of creating heat maps with Geopandas is a powerful tool for data visualization and can be applied to a wide range of geographical data.

I hope this article was helpful. If you have any questions or remarks, please write them in the comments.

Looking for more exciting map visualizations? Consider this relataly tutorial on predicting and visualizing crimes on a map of San Francisco.

Sources and Further Reading

https://geopandas.org/en/stable/getting_started.html

The post Geographic Heat Maps with GeoPandas: Visualizing COVID-19 Data in Python appeared first on relataly.com.