Correlation Archives - relataly.com

On-Chain Analytics: Metrics for Analyzing Blockchains in Python

Florian Follonier — Sat, 12 Nov 2022 13:05:00 +0000

Cryptocurrencies like Bitcoin or Ethereum are built on public blockchains, meaning anyone can see the transactions and trades happening on these networks. This transparency makes on-chain data an excellent resource for data science and machine learning. By examining transaction activity and the holdings of Bitcoin addresses, analysts can better understand a cryptocurrency network’s health and growth. For instance, tracking the volume of transactions can give insight into network growth. On-chain analysis can be particularly helpful for investors and network participants because they often have difficulty accurately assessing the value of cryptocurrencies due to hype and speculation. In this article, we’ll show you how to use Python to analyze on-chain data. To make things easier, we’ll be accessing aggregated on-chain data from the CryptoCompare API instead of using raw blockchain data.

This article consists of two parts. The first part briefly discusses blockchain technology and how it relates to on-chain analysis. This is followed by a hands-on Python tutorial. In the tutorial, we will retrieve different types of blockchain data and analyze Bitcoin and Ethereum, exploring aspects such as price correlation, network growth and usage, and network health. Specifically, we will examine seven key metrics useful for analyzing blockchain data. We will be using the CryptoCompare API as our data source, which provides access to various on-chain and off-chain data.

Disclaimer: This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only illustrate machine learning use cases.

What is On-Chain Analysis?

Before discussing on-chain analysis, let’s briefly recap what a blockchain is. A blockchain is a decentralized, distributed ledger that records transactions across a network of computers. It is composed of blocks, each of which contains a record of multiple transactions. The blocks are linked to one another, forming a chain of blocks, hence the name “blockchain.” The links are secured with cryptography: each block added to the blockchain contains a cryptographic hash of the previous block, a timestamp, and transaction data, which makes the recorded history practically immutable. In the case of Bitcoin, the data stored in the blocks includes the transaction amount, the timestamp, and the unique addresses of the sender and the recipient.

Once a block has been added to the blockchain, changing its contents is extremely difficult, because any change alters the block’s hash and breaks the link to every block that follows. Moreover, unlike a conventional database, a blockchain does not store its information in one place. Instead, many participants in the network each keep a copy. This basic idea of decentralized exchange and storage of transactions has inspired a wave of new business models and financial services that were not possible before.
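The linking mechanism described above can be illustrated with a minimal sketch. The block layout, the function names, and the sample transactions are assumptions made for this example and do not reflect the format of any real blockchain. Timestamps are left out to keep the example deterministic.

```python
import hashlib
import json

def block_hash(block):
    """Return the SHA-256 hex digest of a block's contents."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_chain(transaction_batches):
    """Link blocks by storing the previous block's hash in each new block."""
    chain = []
    prev_hash = "0" * 64  # placeholder hash for the first block
    for height, transactions in enumerate(transaction_batches):
        block = {"height": height, "prev_hash": prev_hash, "transactions": transactions}
        chain.append(block)
        prev_hash = block_hash(block)
    return chain

def is_valid(chain):
    """Check that every block references the hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))

chain = build_chain([
    [{"from": "alice", "to": "bob", "amount": 1.5}],
    [{"from": "bob", "to": "carol", "amount": 0.7}],
    [{"from": "carol", "to": "alice", "amount": 0.2}],
])
print(is_valid(chain))  # the untouched chain passes the check

chain[0]["transactions"][0]["amount"] = 100  # tamper with an early block
print(is_valid(chain))  # the stored hash in the next block no longer matches
```

Changing a single value in the first block invalidates the reference stored in the second block, which is why rewriting history would require recomputing every later block as well.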

On-Chain Data

On-chain data refers to data that is stored on the blockchain. It includes information such as the transaction history of a particular cryptocurrency, the balances of cryptocurrency addresses, and the smart contract code and execution history on a blockchain network. This data is stored on the blockchain and is publicly accessible to anyone with an internet connection. We can broadly classify this data into three distinct categories:

Transaction data (e.g., sending and receiving address, transferred amount, remaining value for a certain address)
Block data (e.g., timestamps, miner fees, rewards)
Smart contract code (i.e., codified business logic on a blockchain)

On-chain data is an essential source of information for analysts and researchers because it provides a transparent and immutable record of activity on the blockchain. It can be used to study trends and patterns in cryptocurrency adoption and usage, as well as to track the growth and health of a blockchain network. In addition, analysts may combine on-chain data with data not stored on the blockchain. This so-called off-chain data includes, for example, price information and trading volumes.

The Role of Cryptographic Proof Systems

Since the original idea, blockchain technology has evolved, and new blockchains have emerged. The changes relate in particular to the security mechanism that determines how transactions are confirmed in the network. A cryptographic proof system is a method of verifying the authenticity and integrity of data by using cryptographic techniques. The specific data stored on a blockchain varies with the design of the blockchain and its cryptographic proof system. This means that, depending on the type of blockchain, different data will be available for our analysis.

The classic cryptographic proof system is based on mining. However, modern systems such as proof of stake are gaining traction as they use far less energy. Image generated using Midjourney

Proof-of-work vs Proof-of-stake

The two most common consensus algorithms are proof-of-work (PoW) and proof-of-stake (PoS). In the case of Bitcoin, security is guaranteed by the proof-of-work procedure. In this process, so-called miners continuously spend computing power to solve cryptographic puzzles in competition with each other. The winner gets to add the next block and receives a reward for their efforts. The complexity of the puzzles is called the mining difficulty. While the mining difficulty dynamically adapts to the network’s available computing power (hash rate) and has generally increased over time, the block reward is cut in half roughly every four years in a Bitcoin halving event. In a PoW system, the data that is stored on the blockchain typically includes the transaction history of a particular cryptocurrency, the balances of cryptocurrency addresses, and, on networks that support them, the smart contract code and execution history.
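To make the idea of a cryptographic puzzle more tangible, here is a toy sketch of a proof-of-work search. It is a simplification and not Bitcoin's actual procedure: the function name, the string-based block data, and the `difficulty_bits` parameter are assumptions for illustration, and real networks hash a structured block header against a far smaller target.

```python
import hashlib

def mine(block_data, difficulty_bits):
    """Search for a nonce such that SHA-256(block_data + nonce) is below the target."""
    target = 2 ** (256 - difficulty_bits)  # more bits means a smaller target
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode("utf-8")).hexdigest()
        if int(digest, 16) < target:
            return nonce, digest
        nonce += 1

# Each additional difficulty bit doubles the expected number of attempts
for bits in (8, 12, 16):
    nonce, digest = mine("example block", bits)
    print(f"difficulty_bits={bits:2d}  attempts={nonce + 1:7d}  hash={digest[:16]}...")
```

Finding a valid nonce takes many attempts, while checking one takes a single hash. This asymmetry is what lets the rest of the network verify a miner's work cheaply.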

Proof of stake is an alternative to proof of work. The algorithm is designed to be more energy efficient than proof of work, as it does not require miners to perform computationally intensive work in order to create new blocks. Instead, the creator of a new block (usually called a validator) is selected based on their stake in the cryptocurrency, typically through a pseudo-random process weighted by stake. This means that the more cryptocurrency a validator has staked, the more likely the algorithm will select them to create a new block. In a PoS system, the data stored on the blockchain may include similar information, such as the transaction history and balances of cryptocurrency addresses, as well as information about the stake that is being used to secure the network.
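The relationship between stake and the chance of creating a block can be sketched as weighted random sampling. This is only an illustration under simplifying assumptions: the validator names and stake values are made up, and real proof-of-stake protocols use considerably more elaborate selection and verification rules.

```python
import random
from collections import Counter

def pick_validator(stakes, rng):
    """Pick one validator with a probability proportional to its stake."""
    validators = list(stakes)
    weights = [stakes[v] for v in validators]
    return rng.choices(validators, weights=weights, k=1)[0]

# Hypothetical stakes (in coins) of three validators
stakes = {"validator_a": 60, "validator_b": 30, "validator_c": 10}
rng = random.Random(42)  # fixed seed so that the simulation is repeatable

picks = Counter(pick_validator(stakes, rng) for _ in range(10_000))
for name, count in picks.most_common():
    print(f"{name}: {count / 10_000:.1%} of blocks")
```

Over many rounds, each validator's share of created blocks approaches its share of the total stake.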

Other types of cryptographic proof systems, such as proof-of-authority (PoA) and proof-of-elapsed-time (PoET), may store similar but not identical types of data on the blockchain.

Analyzing Blockchain Data for Bitcoin and Ethereum with Python

In this tutorial, we will explore how we can use on-chain data to gain insights into the historical development and adoption of Bitcoin and Ethereum, the two most well-known cryptocurrencies. Our analysis will focus on the adoption of the Bitcoin and Ethereum blockchains, network security, and health. By analyzing a range of data types, we can uncover interesting insights about the growth and usage of these blockchain networks. On-chain analysts use a variety of metrics to try to improve their understanding of a network and predict future price movements. The specific metrics we will be examining are:

Metric #1 Correlation with Bitcoin Price
Metric #2 Distribution by Holder Amount
Metric #3 Difficulty vs. Hashrate
Metric #4 Difficulty vs. Price
Metric #5 Active Addresses compared to Bitcoin
Metric #6 Transaction Count compared to Bitcoin
Metric #7 Large Transactions compared to Bitcoin

As always, you can find the code of this tutorial on the GitHub repository.

Analyzing Blockchain Data for Bitcoin and Ethereum with Python. Image generated using DALL-E 2 by OpenAI.

Prerequisites

Before you proceed, ensure that you have set up your Python environment (3.8 or higher) and the required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment. In this tutorial, we will be working with the following packages, all of which are imported in the code below: pandas, NumPy, Matplotlib, Seaborn, requests, PyYAML, cryptocompare, and IPython.

You can install packages using console commands:

pip install PACKAGE_NAME
conda install PACKAGE_NAME (if you are using the Anaconda package manager)

Obtain a CryptoCompare API Key

Accessing the CryptoCompare API requires an API key. Fortunately, there is a free tier that offers generous limits and a wide range of available data. In addition, the API has excellent documentation and offers an interactive API request builder.

You can obtain your free API Key from the CryptoCompare website by clicking “Get Your Free Key” and following the registration steps. Once you have completed the registration, you must provide your API key in any request sent to the API endpoints.

It’s a best practice not to store the key directly in your code and instead to load it from a separate YAML file. Store your API key in a YAML file called “api_config_cryptocompare.yml” as follows:

api_key: "your cryptocompare api key"

Place the file into a folder from where you can import it into your Python project, e.g., “workspace/API Keys/”

Loading Packages and API Key

Let’s begin by loading the required packages and our CryptoCompare API key. The code below will load the API key from a YAML file. Should you prefer to set your key directly in the code, comment out the lines that read the YAML file, uncomment the alternative assignment below them, and replace “YOUR_API_KEY” with your actual API key. Make sure to keep your API key secret and secure, as it allows you to access data from the CryptoCompare API.

Note that the variables symbol_a and symbol_b define which cryptocurrencies are in the scope of the analysis. symbol_a needs to be Bitcoin because the code treats symbol_a as the Bitcoin baseline (for example, when it adds the Bitcoin halving dates to the charts). The following code sample will run the analysis for Ethereum and compare it against Bitcoin. If you want to run the analysis for another cryptocurrency, you can change symbol_b, provided that CryptoCompare has the respective data.

# A tutorial for this file will soon be available at www.relataly.com

# Tested with Python 3.9.13, Matplotlib 3.5.2, Seaborn 0.11.2, numpy 1.21.5, plotly 4.1.1, cryptocompare 0.7.6

import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
from datetime import date, timedelta, datetime
import seaborn as sns
sns.set_style('white', {'axes.spines.right': True, 'axes.spines.top': False})
import cryptocompare as cc
import requests
import IPython
import yaml
import json

# Set the API key from a YAML file
with open('API Keys/api_config_cryptocompare.yml', 'r') as yaml_file:
    p = yaml.safe_load(yaml_file)
api_key = p['api_key']
# Alternatively, if you have not stored your API key in a separate file:
# api_key = 'YOUR_API_KEY'

# Number of past days for which we retrieve data
data_limit = 2000

# Define coin symbols
symbol_a = 'BTC'
symbol_b = 'ETH'

We proceed by querying the CryptoCompare API to load the data for our analysis. Our data comes from three separate API endpoints:

Historical prices for Bitcoin and Ethereum
Onchain data for Bitcoin and Ethereum
Address balance distribution data for Bitcoin

Loading Price Data

First, we will load the price data from the cryptocompare histoday-API endpoint. This API provides us with a JSON response with a timestamp and daily prices and volume. The code below also converts the JSON response into a Pandas dataframe.

# Query price data

# Generic function for an API call to a given URL
def api_call(url):
  # Set API Key as Header
  headers = {'authorization': 'Apikey ' + api_key}
  session = requests.Session()
  session.headers.update(headers)

  # API call to cryptocompare; fail early on HTTP errors
  response = session.get(url)
  response.raise_for_status()

  # Conversion of the response to dataframe
  # (the records are nested under the 'Data' key of the 'Data' object)
  historic_blockdata_dict = response.json()
  df = pd.DataFrame.from_dict(historic_blockdata_dict.get('Data').get('Data'), orient='columns')
  return df

def prepare_pricedata(df):
  df['date'] = pd.to_datetime(df['time'], unit='s')
  df.drop(columns=['time', 'conversionType', 'conversionSymbol'], inplace=True)
  return df

# Load the price data
base_url = 'https://min-api.cryptocompare.com/data/v2/histoday?fsym='
df_a = api_call(f'{base_url}{symbol_a}&tsym=USD&limit={data_limit}')
coin_a_price_df = prepare_pricedata(df_a)
df_b = api_call(f'{base_url}{symbol_b}&tsym=USD&limit={data_limit}')
coin_b_price_df = prepare_pricedata(df_b)
coin_b_price_df.head(3)

     high     low    open  volumefrom      volumeto   close        date
0  322.28  285.89  315.86   829138.34  2.498194e+08  292.90  2017-06-29
1  305.30  270.43  292.90   715498.52  2.054092e+08  280.68  2017-06-30
2  281.81  253.18  280.68   812033.74  2.141271e+08  261.00  2017-07-01

Now that we have the price history for Bitcoin and Ethereum, we can display the data on a line chart. Because it’s such an important event, we will also add the relevant Bitcoin halving dates. The Bitcoin halving is a built-in feature of the Bitcoin protocol that occurs approximately every four years (210,000 blocks). The purpose of the halving is to control the supply of new Bitcoins and ensure that they are released at a predictable rate. The halving reduces the reward for mining new blocks by half, which means that miners receive fewer new Bitcoins for their efforts. This helps to keep the supply of new Bitcoins in check and maintain the value of existing Bitcoins.
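The halving schedule follows a simple rule that can be written down in a few lines. The sketch below assumes the well-known Bitcoin parameters (an initial reward of 50 BTC and a halving every 210,000 blocks); the function and constant names are chosen for this example. It uses floating-point numbers for readability, whereas the Bitcoin software calculates the reward in whole satoshis.

```python
HALVING_INTERVAL = 210_000   # blocks between two halvings
INITIAL_SUBSIDY_BTC = 50.0   # reward per block at launch

def block_subsidy(height):
    """Return the newly issued BTC per block at a given block height."""
    halvings = height // HALVING_INTERVAL
    return INITIAL_SUBSIDY_BTC / (2 ** halvings)

for height in (0, 210_000, 420_000, 630_000, 840_000):
    print(f"block {height:>7,}: {block_subsidy(height)} BTC")

# Summing the rewards over many halving periods approaches the 21 million cap
total_supply = sum(HALVING_INTERVAL * block_subsidy(k * HALVING_INTERVAL) for k in range(33))
print(f"approximate total supply: {total_supply:,.0f} BTC")
```

Because every period issues half as many coins as the one before, the total supply converges instead of growing without limit.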

# Query on-chain data

# Prepare the onchain dataframe
def prepare_onchain_data(df):
  # Replace the timestamp with a date and filter some faulty values
  df['date'] = pd.to_datetime(df['time'], unit='s')
  df.drop(columns='time', inplace=True)
  df = df[df['hashrate'] > 0.0]
  return df
  
base_url = 'https://min-api.cryptocompare.com/data/blockchain/histo/day?fsym='
onchain_symbol_a_df = api_call(f'{base_url}{symbol_a}&limit={data_limit}')
onchain_symbol_b_df = api_call(f'{base_url}{symbol_b}&limit={data_limit}')

# Filter some faulty values
onchain_symbol_a_df = onchain_symbol_a_df[onchain_symbol_a_df['hashrate'] > 0.0]
onchain_symbol_a_df.head(3)

	id		symbol			time		zero_balance_addresses_all_time	unique_addresses_all_time	new_addresses	active_addresses	transaction_count	transaction_count_all_time	large_transaction_count	average_transaction_value	block_height	hashrate		difficulty		block_time	block_size	current_supply
0	1182	BTC				1498694400	259466917						277866951					334750			624172				231054				235758173					10173					13.791733					473438			4.216942e+06	7.116972e+11	724.865546	966836		1.641798e+07
1	1182	BTC				1498780800	259827041						278238910					371959			727417				267360				236025533					13985					12.997582					473592			5.447359e+06	7.116972e+11	561.137255	956314		1.641990e+07
2	1182	BTC				1498867200	260153302						278544516					305606			647826				221856				236247389					10484					10.441163					473756			5.816675e+06	7.116972e+11	525.509202	882732		1.642195e+07

We can already see that the Ethereum price has kept pace with Bitcoin over the past years.

Now that we have the price data, let’s quickly visualize it to ensure that the price charts look as expected. We will also encapsulate some of the code in helper functions. We will reuse these functions several times throughout the rest of this tutorial. For example, we will add the Bitcoin halving dates and adjust the legend to account for the two assets in the plot.

# Lineplot Helper Functions

# Adding moving averages
rolling_window = 25
coin_a_price_df['close_avg'] = coin_a_price_df['close'].rolling(window=rolling_window).mean() 
coin_b_price_df['close_avg'] = coin_b_price_df['close'].rolling(window=rolling_window).mean() 

# This function adds bitcoin halving dates as vertical lines
def add_halving_dates(ax, df_x_dates, df_ax1_y):
    # Index 0 is the genesis block (not a halving); the 2028 date is an estimate
    halving_dates = ['2009-01-03', '2012-11-28', '2016-07-09', '2020-05-11', '2024-04-20', '2028-06-01']
    dates_list = [datetime.strptime(d, '%Y-%m-%d').date() for d in halving_dates]
    # Only draw dates that fall within one year of the plotted date range
    x_max = df_x_dates.max() + timedelta(days=365)
    x_min = df_x_dates.min() - timedelta(days=365)
    for i, datex in enumerate(dates_list):
        halving_ts = pd.Timestamp(datex)
        if (halving_ts < x_max) and (halving_ts > x_min):
            ax.axvline(x=datex, color='purple', linewidth=1, linestyle='dashed')
            ax.text(x=datex + timedelta(days=20), y=df_ax1_y.max()*0.99, s='BTC Halving ' + str(i) + '\n' + str(datex), color='purple')

# This function creates a nice legend for twinx plots
def add_twinx_legend(ax1, ax2, x_anchor=1.18, y_anchor=1.0):
    lines_1, labels_1 = ax1.get_legend_handles_labels()
    lines_2, labels_2 = ax2.get_legend_handles_labels()
    ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc=1, facecolor='white', framealpha=0, bbox_to_anchor=(x_anchor, y_anchor))
    ax2.get_legend().remove()

# Create the lineplot
fig, ax1 = plt.subplots(figsize=(16, 6))
sns.lineplot(data=coin_a_price_df, x='date', y='close', color='cornflowerblue', linewidth=0.5, label=f'{symbol_a} close price', ax=ax1)
sns.lineplot(data=coin_a_price_df, x='date', y='close_avg', color='blue', linestyle='dashed', linewidth=1.0, 
    label=f'{symbol_a} {rolling_window}-MA', ax=ax1)
ax1.set_ylabel(f'{symbol_a} Prices')
ax1.set(xlabel=None)
ax2 = ax1.twinx()
sns.lineplot(data=coin_b_price_df, x='date', y='close', color='lightcoral', linewidth=0.5, label=f'{symbol_b} close price', ax=ax2)
sns.lineplot(data=coin_b_price_df, x='date', y='close_avg', color='red', linestyle='dashed', linewidth=1.0, 
    label=f'{symbol_b} {rolling_window}-MA', ax=ax2)
ax2.set_ylabel(f'{symbol_b} Prices')
add_twinx_legend(ax1, ax2, 0.98, 0.2)
add_halving_dates(ax1, coin_a_price_df.date, coin_a_price_df.close)
#ax1.set_yscale('log'), ax2.set_yscale('log')
plt.title(f'Prices of {symbol_a} and {symbol_b}')
plt.show()

This looks nice and proves that we have brought the data into our project and that it has a useful shape.

Loading On-Chain Data

Next, let’s load the on-chain data. To understand how a blockchain network develops and thrives, we need to look beyond price. To assess network growth, it is important to determine whether the network is being used and can increase the number of its users. Therefore, we include transaction and address data in our analysis.

We make a first API call to “data/blockchain/histo/day” to retrieve a dataset with various blockchain data. The endpoint provides daily on-chain data that includes blockchain key indicators such as:

The number of addresses in the network
The number of daily transactions
Information about the blocks, incl. block size, block height, etc.
Mining-related information, such as the mining difficulty and the available hash rate

# Prepare the onchain dataframe
def prepare_onchain_data(df):
  # Replace the timestamp with a date and filter some faulty values
  df['date'] = pd.to_datetime(df['time'], unit='s')
  df.drop(columns='time', inplace=True)
  df = df[df['hashrate'] > 0.0]
  return df

# Load onchain data for Bitcoin
base_url = 'https://min-api.cryptocompare.com/data/blockchain/histo/day?fsym='
df_a = api_call(f'{base_url}{symbol_a}&limit={data_limit}')
onchain_symbol_a_df = prepare_onchain_data(df_a)

# Load onchain data for Ethereum
df_b = api_call(f'{base_url}{symbol_b}&limit={data_limit}')
onchain_symbol_b_df = prepare_onchain_data(df_b)
onchain_symbol_b_df.head(3)

	id		symbol	zero_balance_addresses_all_time	unique_addresses_all_time	new_addresses	active_addresses	transaction_count	transaction_count_all_time	large_transaction_count	average_transaction_value	block_height	hashrate	difficulty		block_time	block_size	current_supply	date
0	7605	ETH		20466340						22937123					48698			144688				259915				33294361					11528					44.835955					3950122			56.027705	962749040901496	17.183446	9460		9.289708e+07	2017-06-29
1	7605	ETH		20485843						22984680					47557			145469				249348				33543709					10791					42.018967					3955158			56.652799	972009000387636	17.157299	8800		9.292394e+07	2017-06-30
2	7605	ETH		20498357						23020671					35991			130617				235306				33779015					8715					43.389381					3960167			57.544809	992636469502805	17.249800	8105		9.295062e+07	2017-07-01

Another important indicator is how the coins of a cryptocurrency are distributed among its holders. Unfortunately, the data required for this is not yet included in our dataset. The following code retrieves the data from a separate API endpoint (data/blockchain/balancedistribution/histo/day).

# Prepare balance distribution dataframe
def prepare_balancedistribution_data(df):
  df['balance_distribution'] = df['balance_distribution'].apply(lambda x: [i for i in x])
  json_struct = json.loads(df[['time','balance_distribution']].to_json(orient="records"))    
  df_ = pd.json_normalize(json_struct)
  df_['date'] = pd.to_datetime(df_['time'], unit='s')
  df_flat = pd.concat([df_.explode('balance_distribution').drop(['balance_distribution'], axis=1),
           df_.explode('balance_distribution')['balance_distribution'].apply(pd.Series)], axis=1)
  df_flat.reset_index(drop=True, inplace=True)
  df_flat['range'] = ['' + str(float(df_flat['from'][x])) + '_to_' + str(float(df_flat['to'][x])) for x in range(df_flat.shape[0])]
  df_flat.drop(columns=['from','to', 'time'], inplace=True)

  # Data cleansing: drop the '100000.0_to_0.0' bucket and rename the smallest bucket
  df_flat = df_flat[~df_flat['range'].isin(['100000.0_to_0.0'])].copy()
  df_flat.loc[df_flat['range'] == '1e-08_to_0.001', 'range'] = '0.0_to_0.001'
  return df_flat

# Load the balance distribution data for Bitcoin
base_url = 'https://min-api.cryptocompare.com/data/blockchain/balancedistribution/histo/day?fsym='
df_raw = api_call(f'{base_url}{symbol_a}&limit={data_limit}')
df_distr = prepare_balancedistribution_data(df_raw)
df_distr.head(3)

         date   totalVolume  addressesCount          range
0  2017-06-29   2068.414842      10651502.0   0.0_to_0.001
1  2017-06-29  12083.780197       3172564.0  0.001_to_0.01
2  2017-06-29  85563.613579       2753955.0    0.01_to_0.1

Now, we have all the data that we need and can proceed with our key metrics.

Metric #1 Correlation with Bitcoin Price

The first metric we will be examining is the price correlation with Bitcoin. This is an important metric to consider, as Bitcoin has a dominant position in the cryptocurrency market, and other cryptocurrencies tend to follow its price, sometimes with larger fluctuations. During bull markets, when Bitcoin reaches new highs, other cryptocurrencies tend to see strong price performance. Conversely, during bear markets, when Bitcoin experiences prolonged price declines, most other cryptocurrencies tend to underperform. There are occasional deviations from this pattern, which are usually related to economic or technical changes on the respective networks. The rolling price correlation helps us to understand these types of developments better.

To illustrate how the correlation between the two cryptocurrencies has evolved, we calculate rolling correlations. This means we are applying a correlation between the two time series of Bitcoin and Ethereum as a rolling window calculation. We define 100 days as the window for each calculation.

# Calculate the Rolling Correlation Coefficient
rolling_window = 100 #days

# Generate a work dataframe that aligns both closing prices on the date
df_price_merged = coin_a_price_df[['date', 'close']].merge(
    coin_b_price_df[['date', 'close']], on='date', suffixes=(f'_{symbol_a}', f'_{symbol_b}')).set_index('date')
# Calculate the rolling correlation; rows without a complete window are NaN and get dropped
df_price_merged['cor'] = df_price_merged[f'close_{symbol_b}'].rolling(rolling_window).corr(df_price_merged[f'close_{symbol_a}'])
df_cor_dateindex = df_price_merged.dropna()

# Create the plot
fig, ax1 = plt.subplots(figsize=(16, 6))
label = f'{symbol_a}-{symbol_b} correlation (rolling window={rolling_window})'
sns.lineplot(data=df_cor_dateindex, x=df_cor_dateindex.index, y='cor', color='royalblue', linewidth=1.0, label=label, ax=ax1)
add_halving_dates(ax1, df_cor_dateindex.index, df_cor_dateindex['cor'])
plt.legend(framealpha=0)
plt.title(label)
plt.show()

The chart shows that the rolling correlation between Bitcoin and Ethereum has mostly stayed between -0.2 and 0.95. At the time of writing, both cryptocurrencies are heavily correlated.
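To see what a rolling window calculation does, it can help to run it on a tiny, hand-checkable example. The two series below are made-up numbers, not price data, and the window of three observations is chosen only to keep the example small.

```python
import pandas as pd

series_a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
series_b = pd.Series([2.0, 4.0, 6.0, 5.0, 4.0, 3.0])

# Each value is the Pearson correlation of the last 3 observations of both series
rolling_cor = series_a.rolling(window=3).corr(series_b)
print(rolling_cor.round(2))
```

The first two entries are empty because the window is not yet complete. After that, the correlation starts at 1 while both series rise together and drops to -1 once series_b turns around and falls while series_a keeps rising.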

Metric #2 Distribution by Holder Amount

Another important aspect to consider is the distribution of coin value among the players in the network. If the majority of coins are concentrated in the hands of a few players, this can pose a risk to the price. This is especially true for proof-of-stake networks like Ethereum, where the number of coins owned by players in the network affects their influence on the network. In addition, the distribution of coins can provide insight into price movements. For example, an increase in the number of addresses with a disproportionately large number of coins may be interpreted as a bullish sign, indicating that large players with significant market power are becoming optimistic. On the other hand, a decrease in the number of large addresses may be seen as a bearish sign.

The following code block will display the historical distribution of coins in the Bitcoin network. The data includes the number of addresses in the network that hold a specific amount of Bitcoins, and it distinguishes between different address sizes (e.g., “0.001 – 0.01 BTC”, “0.01 – 0.1 BTC”, and “0.1 – 1 BTC”). We will specifically look at the cumulative percentage growth in the different holding ranges, smoothed with a 50-day moving average. A rising line thus means that the number of addresses in this range is growing.

# Prepare address distribution data for plotting
df_distr_add = df_distr.copy()
for i in df_distr_add['range'].unique():
    mask = df_distr_add['range'] == i
    # Cumulative percentage change of the address count, smoothed with a 50-day moving average
    df_distr_add.loc[mask, 'addressesCount_pct_cum'] = df_distr_add.loc[mask, 'addressesCount'].pct_change().cumsum().rolling(window=50).mean()
df_distr_add.dropna(inplace=True)
# Lineplot: Address Count by Holder Amount
fig, ax1 = plt.subplots(figsize=(16, 6))
sns.lineplot(data=df_distr_add, x='date', hue='range', linewidth = 1.0, y='addressesCount_pct_cum', ax=ax1, palette='bright')
plt.ylabel('Percentage Growth')
ax1.tick_params(axis="x", rotation=90, labelsize=10, length=0)
ax1.set(xlabel=None)
plt.title(f'Percentage Growth in the Distribution of Total Address Count for {symbol_a} by Holder Amount')
plt.legend(framealpha=0)
plt.show()

There are a couple of things to note:

We can see that the number of large Bitcoin addresses (yellow line) declined for a while (negative growth) but is currently increasing again. This may be a sign that whales are accumulating Bitcoins again.
We can also see growth in the smaller address ranges (orange, green, red, purple lines), which means that the holdings are being spread across a wider network.

Metric #3 Difficulty vs. Hashrate

Hash rate and mining difficulty are important indicators of the security of a proof-of-work network. The hash rate measures the total computing power available on the network, and a higher hash rate makes it more difficult for attackers to launch a 51% attack, leading to increased network security. The mining difficulty determines how hard it is to mine the next block, and it reflects the expected number of hashes that must be generated to find a valid solution.

The difficulty is adjusted periodically to ensure that new blocks are added to the blockchain at a consistent rate. If the hash rate of the network increases significantly, the difficulty will also increase to compensate. This helps to ensure that the rate at which new blocks are added to the blockchain remains constant, regardless of changes in the hash rate.
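For Bitcoin, the adjustment rule can be sketched in a few lines. The sketch assumes the commonly documented Bitcoin parameters (a target of one block every 10 minutes, an adjustment every 2016 blocks, and a limit of a factor of 4 per adjustment) and the usual approximation that one block takes about difficulty × 2^32 hashes. The function names are chosen for this example, and the real implementation works on the target value rather than on the difficulty directly.

```python
TARGET_BLOCK_TIME_S = 600          # Bitcoin aims for one block every 10 minutes
RETARGET_INTERVAL = 2016           # blocks between two difficulty adjustments
EXPECTED_TIMESPAN_S = TARGET_BLOCK_TIME_S * RETARGET_INTERVAL

def retarget(old_difficulty, actual_timespan_s):
    """Scale the difficulty by expected / actual time for the last interval."""
    # A single adjustment is limited to a factor of 4 in either direction
    clamped = min(max(actual_timespan_s, EXPECTED_TIMESPAN_S / 4), EXPECTED_TIMESPAN_S * 4)
    return old_difficulty * EXPECTED_TIMESPAN_S / clamped

def implied_hashrate(difficulty, block_time_s):
    """Estimate hashes per second, assuming a block takes about difficulty * 2**32 hashes."""
    return difficulty * 2 ** 32 / block_time_s

# Blocks arrived 10% faster than intended, so the difficulty rises by about 11%
print(retarget(100.0, EXPECTED_TIMESPAN_S * 0.9))

# Difficulty and block_time taken from the first Bitcoin row of the on-chain sample data
print(implied_hashrate(7.116972e11, 724.865546) / 1e12, "TH/s")
```

The second value comes out at roughly 4.2 million TH/s, which is close to the hashrate reported in the same row of the sample data. This suggests that the hashrate column is expressed in terahashes per second, although that unit is an inference and not something the dataset states.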

# Lineplot: Difficulty vs Hashrate
fig, ax1 = plt.subplots(figsize=(16, 6))
sns.lineplot(data=onchain_symbol_a_df, x='date', y='difficulty', 
    linewidth=1.0, color='royalblue', ax=ax1, label=f'{symbol_a} mining difficulty')
ax2 = ax1.twinx()
sns.lineplot(data=onchain_symbol_a_df[::5], x='date', y='hashrate', 
    linewidth=1.0, color='red', ax=ax2, label=f'{symbol_a} network hashrate')
add_twinx_legend(ax1, ax2, 0.98, 0.2)
add_halving_dates(ax1, onchain_symbol_a_df.date, onchain_symbol_a_df.difficulty)
ax1.set(xlabel=None)
plt.title(f'{symbol_a} Mining Difficulty vs Hashrate')
plt.show()

Hash rate and mining difficulty are closely related, which results from the fact that mining difficulty is adjusted periodically. However, it is essential to note that the hash rate does not show the distribution of the computing power in the network. A high hash rate alone does not guarantee network security if it is provided by a small number of parties. To assess the security of a proof-of-work blockchain, we therefore must also look at how the hash rate is distributed.

Metric #4 Difficulty vs. Price

Next, we will compare the mining difficulty with the price. The price of Bitcoin is an essential indicator of the demand for the cryptocurrency. When the price is high, it can attract more miners to the network, as they are motivated by the potential to earn a high return on their investment. This can lead to an increase in the overall hash rate of the network, which makes it more secure against attacks. On the other hand, when the price is low, it may discourage miners from joining the network, leading to a decrease in the hash rate and potentially making the network more vulnerable to attacks.

# Add a moving average
rolling_window = 25
coin_a_price_df['close_avg'] = coin_a_price_df['close'].rolling(window=rolling_window).mean() 
# Creating a Lineplot: Mining Difficulty vs Price
fig, ax1 = plt.subplots(figsize=(16, 6))
sns.lineplot(data=onchain_symbol_a_df, x='date', y='difficulty', linewidth=1.0, color='orangered', ax=ax1, label='mining difficulty')
ax2 = ax1.twinx()
sns.lineplot(data=coin_a_price_df, x='date', y='close', linewidth=0.5, color='skyblue', ax=ax2, label='close price')
sns.lineplot(data=coin_a_price_df, x='date', y='close_avg', linewidth=1.0, linestyle='--', color='royalblue', ax=ax2, label=f'MA-{rolling_window}')
add_twinx_legend(ax1, ax2, 0.98, 0.2)
# ax1 shows the difficulty, so we pass the difficulty series (as in the previous chart)
add_halving_dates(ax1, onchain_symbol_a_df.date, onchain_symbol_a_df.difficulty)
ax1.set(xlabel=None)
ax1.set(ylabel='Mining Difficulty')
plt.title(f'{symbol_a} Mining Difficulty vs Price')
plt.show()

Currently, the price of Bitcoin is low while the mining difficulty is high. As a result, it may be less attractive for miners to join the network. This is because the low price means that the potential reward for mining a new block may not be as high as it could be if the price were higher. At the same time, the high difficulty means that it will be more challenging for miners to find a valid solution to the mathematical problems they are working on, which could lead to lower profits.

In this situation, the overall hash rate of the network may decrease, as some miners may choose to leave the network or scale back their mining operations. This could make the network more vulnerable to attacks, as a lower hash rate means that there is less computing power available to secure the network.
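
The interplay of price and difficulty can be made tangible with a rough revenue estimate. The sketch below uses the common approximation that finding one block takes difficulty × 2^32 hashes on average. The function name and all input values are hypothetical, and costs such as electricity and hardware as well as transaction fees are ignored.

```python
def expected_daily_revenue_usd(hashrate_hs, difficulty, block_reward, price_usd):
    """Rough expected mining revenue per day for a given hash rate (hashes per second)."""
    # on average, difficulty * 2**32 hashes are needed to find one block
    hashes_per_block = difficulty * 2**32
    blocks_per_day = hashrate_hs * 86_400 / hashes_per_block
    return blocks_per_day * block_reward * price_usd


# hypothetical miner and market values, for illustration only
base = expected_daily_revenue_usd(100e12, 3.5e13, 6.25, 17_000)
harder = expected_daily_revenue_usd(100e12, 7.0e13, 6.25, 17_000)
cheaper = expected_daily_revenue_usd(100e12, 3.5e13, 6.25, 8_500)

print(harder / base)   # twice the difficulty halves the expected revenue
print(cheaper / base)  # half the price halves the expected revenue
```

Revenue scales linearly with the price and inversely with the difficulty, which is why a low price combined with a high difficulty squeezes miners from both sides.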

Metric #5 Active Addresses compared to Bitcoin

Next, let’s compare the number of active addresses between Ethereum and Bitcoin. An active address in a blockchain is a unique address that has conducted a transaction within a certain time period. The number of active addresses on a blockchain can be a useful metric for analyzing the usage and adoption of the network. There are several reasons why active addresses are important:

Network usage: The number of active addresses can give an indication of how much the network is being used. A higher number of active addresses may suggest that more people are using the network to send and receive transactions.
Network growth: An increase in the number of active addresses over time may indicate that the network is growing and attracting more users. This could be a positive sign for the long-term health and sustainability of the network.
Network health: The number of active addresses may also provide insight into the overall health of the network. For example, a sudden drop in the number of active addresses could be a sign of trouble, such as a loss of user confidence or a technical issue.
Network security: The number of active addresses can also be used as a rough proxy for the level of decentralization on the network. A large and diverse set of active addresses may indicate that the network is decentralized and less vulnerable to attacks.

# Calculate active addresses moving average
rolling_window=25
y_a_add_ma = onchain_symbol_a_df['active_addresses'].rolling(window=rolling_window).mean() 
y_b_add_ma = onchain_symbol_b_df['active_addresses'].rolling(window=rolling_window).mean() 

# Lineplot: Active Addresses
fig, ax1 = plt.subplots(figsize=(16, 6))
sns.lineplot(data=onchain_symbol_a_df[-1*data_limit::10], x='date', y='active_addresses', 
    linewidth=0.5, color='skyblue', ax=ax1, label=f'{symbol_a} active addresses')
sns.lineplot(data=onchain_symbol_a_df[-1*data_limit::10], x='date', y=y_a_add_ma, 
    linewidth=1.0, color='royalblue', linestyle='--', ax=ax1, label=f'{symbol_a} active addresses {rolling_window}-Day MA')
sns.lineplot(data=onchain_symbol_b_df[-1*data_limit::10], x='date', y='active_addresses', 
    linewidth=0.5, color='lightcoral', ax=ax1, label=f'{symbol_b} active addresses')
sns.lineplot(data=onchain_symbol_b_df[-1*data_limit::10], x='date', y=y_b_add_ma, 
    linewidth=1.0, color='red', linestyle='--', ax=ax1, label=f'{symbol_b} active addresses {rolling_window}-Day MA')
ax1.set(xlabel=None)
ax1.set(ylabel='Active Addresses')
plt.title(f'Active Addresses: {symbol_b} vs {symbol_a}')
plt.legend(framealpha=0)
plt.show()

Metric #6 Transaction Count compared to Bitcoin

Transaction count is an important metric for analyzing the usage and adoption of a blockchain. It refers to the total number of transactions that have been processed on the network over a given time period.

There are several reasons why transaction count is important:

Network usage: The transaction count can give an indication of network usage. A higher transaction count may suggest that more people are using the network to send and receive transactions.
Network growth: An increase in the transaction count over time may indicate that the network is growing and attracting more users. This could be a positive sign for the long-term health and sustainability of the network.
Network health: The transaction count may also provide insight into the overall health of the network. For example, a sudden drop in the transaction count could be a sign of trouble, such as a loss of user confidence or a technical issue.
Network security: The transaction count can be used as a rough proxy for the level of decentralization on the network. A large and diverse set of transactions may indicate that the network is decentralized and less vulnerable to attacks.

# Calculate Transaction Count Moving Averages
rolling_window=25
y_a_trx_ma = onchain_symbol_a_df['transaction_count'].rolling(window=rolling_window).mean() 
y_b_trx_ma = onchain_symbol_b_df['transaction_count'].rolling(window=rolling_window).mean() 

# Lineplot: Transactions Count
fig, ax1 = plt.subplots(figsize=(16, 6))
sns.lineplot(data=onchain_symbol_a_df[-1*data_limit::10], x='date', y='transaction_count', 
    linewidth=0.5, color='skyblue', ax=ax1, label=f'{symbol_a} transactions')
sns.lineplot(data=onchain_symbol_a_df[-1*data_limit::10], x='date', y=y_a_trx_ma, 
    linewidth=1.0, color='royalblue', linestyle='--', ax=ax1, label=f'{symbol_a} transactions {rolling_window}-Day MA')
sns.lineplot(data=onchain_symbol_b_df[-1*data_limit::10], x='date', y='transaction_count', 
    linewidth=0.5, color='lightcoral', ax=ax1, label=f'{symbol_b} transactions')
sns.lineplot(data=onchain_symbol_b_df[-1*data_limit::10], x='date', y=y_b_trx_ma, 
    linewidth=1.0, color='red', linestyle='--', ax=ax1, label=f'{symbol_b} transactions {rolling_window}-Day MA')
ax1.set(xlabel=None)
ax1.set(ylabel='Transaction Count')
plt.legend(framealpha=0)
plt.title(f'Transactions: {symbol_b} vs {symbol_a}')
plt.show()

As the first blockchain, Bitcoin has always been the most crucial cryptocurrency in the crypto space. However, there are now blockchains based on more modern methods. Recently, the crypto community has been discussing whether Ethereum is about to overtake Bitcoin. But how is the situation in terms of transactions?

The chart shows that the use of blockchains has changed throughout the last few years. Ethereum has seen strong growth in the number of transactions, while the number of Bitcoin transactions has stagnated for some time.

Metric #7 Large Transactions compared to Bitcoin

Another metric to look at is the number of large transactions. Large transactions on a blockchain, also known as “whale transactions,” refer to transactions involving a significant amount of cryptocurrency. These transactions may be important to analyze for a number of reasons:

Market impact: Large transactions can have a significant impact on the market, as they involve a large amount of cryptocurrency being bought or sold. This can affect the supply and demand for the cryptocurrency and potentially impact its price.
Network security: Large transactions may also be of interest from a security standpoint, as they may be more likely to attract the attention of attackers. If a large transaction is successfully compromised, it could have a significant impact on the network.
Network health: Large transactions may provide insight into the overall health of the network. For example, a sudden increase in large transactions may indicate an increased demand for cryptocurrency, while a decrease in large transactions could be a sign of trouble.
Network usage: Large transactions can also be an indicator of how the network is being used. For example, a high number of large transactions may suggest that the network is being used for high-value transactions, while a low number may suggest that it is being used for smaller, everyday transactions.

# Calculate Large Transactions Moving Averages
rolling_window=25
y_a_ltrx_ma = onchain_symbol_a_df['large_transaction_count'].rolling(window=rolling_window).mean() 
y_b_ltrx_ma = onchain_symbol_b_df['large_transaction_count'].rolling(window=rolling_window).mean() 
# Lineplot: Large Transactions
fig, ax1 = plt.subplots(figsize=(16, 6))
sns.lineplot(data=onchain_symbol_a_df[-1*data_limit::10], x='date', y='large_transaction_count', 
    linewidth=0.5, color='skyblue', ax=ax1, label=f'{symbol_a} large transactions')
sns.lineplot(data=onchain_symbol_a_df[-1*data_limit::10], x='date', y=y_a_ltrx_ma, 
    linewidth=1.0, color='royalblue', linestyle='--', ax=ax1, label=f'{symbol_a} large transactions MA-{rolling_window}')
sns.lineplot(data=onchain_symbol_b_df[-1*data_limit::10], x='date', y='large_transaction_count', 
    linewidth=0.5, color='lightcoral', ax=ax1, label=f'{symbol_b} large transactions')
sns.lineplot(data=onchain_symbol_b_df[-1*data_limit::10], x='date', y=y_b_ltrx_ma, 
    linewidth=1.0, color='red', linestyle='--', ax=ax1, label=f'{symbol_b} large transactions MA-{rolling_window}')
ax1.set(ylabel='Large Transactions')
plt.title(f'Large Transactions > 100k: {symbol_b} vs {symbol_a}')
plt.legend(framealpha=0)
plt.show()

As we can see, both networks have recently experienced a decline in the number of large transactions.

Summary

Bitcoin and blockchain technology are transforming the financial sector and have seen increasing adoption during the past decade. As the need to understand complex blockchain networks grows, so does the importance of on-chain analytics. This article has demonstrated how we can analyze blockchain data with Python. We used the CryptoCompare API to query various on-chain and off-chain data for Bitcoin and Ethereum. By combining on-chain metrics with price data, we gained several insights into what has been happening in the crypto space over the past few years. Among other things, we have

…compared the historical evolution of mining difficulty and network hash rate.
…analyzed the usage of the Ethereum and Bitcoin blockchains.
…and highlighted how the distribution of Bitcoin holdings has evolved in the past years.

Our analysis in this article focused solely on Bitcoin and Ethereum. However, you can easily analyze other blockchains by replacing the symbols used in the API calls.

I hope you liked this post, and I would appreciate your feedback. Is OnChain analysis a topic that you want to see covered more often? Or do you want to see more articles on deep learning and machine learning? Let me know in the comments.

Sources and Further Reading

Antony Lewis (2018) Basics of Bitcoins and Blockchains
CryptoCompare API
glassnode.com
OpenAI ChatGPT was carefully used to revise certain parts of this article

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post On-Chain Analytics: Metrics for Analyzing Blockchains in Python appeared first on relataly.com.

Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python

Florian Follonier — Mon, 25 Jul 2022 11:29:00 +0000

Content-based recommender systems are a popular type of machine learning algorithm that recommends relevant items based on what a user has previously consumed or liked. This approach aims to understand what the customer likes and then identify other items that are similar to items the user has previously consumed or rated. The recommendations are based on the similarity of the items, represented by similarity scores in a vector matrix. The attributes used to describe an item are called “content.” For example, in the case of movie recommendations, content could be the genre, actors, director, year of release, etc. A well-designed content-based recommendation service will suggest movies with the same genre, actors, or keywords. This tutorial will implement a content-based recommendation service for movies using Python and Scikit-learn.

The rest of this tutorial proceeds as follows: After a brief introduction to content-based recommenders, we will work with a database that contains several thousand IMDB movie titles and create a feature model that uses actors, release year, and a short description for each movie. In this tutorial, you will also learn how to deal with some challenges of building a content-based recommender. For example, we will look at how we can engineer features from text for a content-based model and reduce the dimensionality of our model. Finally, we use our model to generate some sample predictions.

Note: Another popular type of recommender system that I have covered in a previous article is collaborative filtering.

Recommendation systems can ease decision-making. Image created with Midjourney.

What is Content-Based Filtering?

The idea behind content-based recommenders is to generate recommendations based on a user’s preferences and tastes. These preferences revolve around past user choices, for example, the number of times a user has watched a movie, purchased an item, or clicked on a link.

Content-based filtering uses domain-specific item features to measure the similarity between items. Given the user preferences, the algorithm will recommend items similar to what the user has consumed or liked before. For movie recommendations, this content can be the genre, actors, release year, director, film length, or keywords used to describe the movies. This approach works particularly well for domains with a lot of textual metadata, such as movies and videos, books, or products.

Content-based movie recommendations will suggest more of the same, for example, actors, genres, stories, and directors.

Basic Steps to Building a Content-based Recommender System

The approach to building a content-based recommender involves four essential steps:

The first step is to create a so-called ‘bag of words’ model from the input data, which is a list of words used to characterize the items. This step involves selecting useful content for describing and differentiating the items. The more precise the information, the better the recommendations will be.
The next step is to turn the bag (of words) into a feature vector. Different algorithms can be used for this step, for example, the TF-IDF vectorizer or the count vectorizer. The result is a vector matrix with items as records and features as columns. This step often also includes applying techniques for dimensionality reduction.
The idea of content-based recommendations is based on measuring item similarity. Similarity scores are assigned through pairwise comparison. Here again, we can choose between different measures, e.g., the dot product or cosine similarity.
Once you have the similarity scores, you can return the most similar items by sorting the data by similarity scores. Given user preferences (single or multiple items a user consumed or liked), the algorithm will then recommend the most similar items.
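
The four steps above can be sketched in a few lines. This is a toy illustration with three made-up movies and hand-written tags, not the model built later in this tutorial. The `recommend` helper is a hypothetical name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: a tiny bag of words per movie (made-up data for illustration)
titles = ['Space Quest', 'Galaxy Wars', 'Love in Paris']
tags = [
    'space alien spaceship adventure',
    'space alien war adventure',
    'romance paris love comedy',
]

# Step 2: turn the bags into a feature matrix (rows = movies, columns = words)
matrix = TfidfVectorizer().fit_transform(tags)

# Step 3: compute pairwise similarity scores
similarity = cosine_similarity(matrix)


# Step 4: sort by similarity and return the most similar items
def recommend(title, top_n=2):
    idx = titles.index(title)
    order = np.argsort(-similarity[idx])
    return [titles[i] for i in order if i != idx][:top_n]


print(recommend('Space Quest'))
```

For 'Space Quest', the first recommendation is 'Galaxy Wars', because the two share several tags, while 'Love in Paris' shares none.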

Approach to Building a Content-based Recommender System

Similarity Scoring

The quality of the content-based recommendations is significantly influenced by how well the algorithm succeeds in measuring the similarity of the items. There are different techniques to calculate similarity, including Cosine Similarity, Pearson Similarity, Dot Product, and Euclidean Distance. They have in common that they use numerical characteristics of the text to calculate the distance between text vectors in an n-dimensional vector space.

It is worth noting that these techniques can only measure word-level similarity. This means the algorithms compare the items word for word without considering the semantic meaning of the sentences. In some instances, this can lead to errors. For example, how similar are “now that they were sitting on a bank, he noticed she stole his heart, and he was in love” and “They are gangsters who love to steal from a large bank”? Judged by the words alone, the two sentences may appear similar because they share several words, even though their meanings differ.
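
The following sketch illustrates this limitation with the two example sentences. It computes a plain bag-of-words cosine similarity using only the standard library. The helper names are hypothetical, and no stop words are removed, so the common word “a” also counts as overlap.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its words."""
    return Counter(re.findall(r'[a-z]+', text.lower()))


def cosine_bow(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


s1 = "now that they were sitting on a bank, he noticed she stole his heart, and he was in love"
s2 = "They are gangsters who love to steal from a large bank"

shared = sorted(bag_of_words(s1).keys() & bag_of_words(s2).keys())
print(shared)
print(round(cosine_bow(s1, s2), 3))
```

The score is above zero purely because of shared words such as “bank” and “love”, although “bank” means something different in each sentence. Note also that “stole” and “steal” are treated as unrelated words.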

Pros and Cons of Content-based Filtering

Like most machine learning algorithms, content-based recommenders have their strengths and weaknesses.

Advantages

Content-based filtering is good at capturing a user’s specific interests and will recommend more of the same (for example, genre, actors, directors, etc.). It will also recommend niche items if they match the user preferences, even if these items draw little attention.
Another advantage is that the model can generate recommendations for a specific user without the knowledge of other users. This is particularly helpful if you want to generate predictions for many users.

Disadvantages

On the other hand, there are also a couple of downsides. The feature representation of the items has to be done manually to a certain extent, and the prediction quality strongly depends on whether items are described in detail. Therefore, content-based filtering requires a lot of expertise.
The recommendations are based on the user’s previous interests. As a result, they are unlikely to go beyond those interests and expand to areas (e.g., genres) that are still unknown to the user. Content-based models thus tend to develop some tunnel vision, so that the model recommends more and more of the same.

Implementing a Content-based Movie Recommender in Python

In the following, we will implement a content-based movie recommender using Python and Scikit-learn. We will carry out all steps necessary to create a content-based recommender. The data comes from an IMDB dataset containing more than 40k films between 1996 and 2018. Based on the data, we define the features we want to use for recommending the movies. These features include the genre, director, main actors, plot keywords, or other metadata associated with the movies. Then we preprocess the data to extract these features and create a feature matrix. The feature matrix becomes the foundation for a similarity matrix that measures the similarity between the items based on their feature vectors. Finally, we use the similarity matrix to generate recommendations for a given item.

By the end of this Python tutorial, you will have learned how to implement a content-based recommendation system for movies using Python and Scikit-learn. This knowledge can be applied to other types of recommendations, such as articles, products, or songs.

The code is available on the GitHub repository.

This Robot doesn’t know what to watch on tv. Let’s build a recommender system for him! Image generated using DALL-E 2 by OpenAI.

Prerequisites

Before you start with the coding part, ensure you have set up your Python 3 environment and required packages. If you don’t have an environment, consider the Anaconda Python environment. Follow this tutorial to set it up.

Also, make sure you install all required packages. In this tutorial, we will be working with the standard packages pandas, NumPy, Matplotlib, and scikit-learn.

In addition, we will be using Seaborn for visualization and the natural language processing library nltk.

You can install these packages by using one of the following commands:

pip install
conda install (if you are using the Anaconda package manager)

About the IMDB Movies Dataset

We will train our movie recommender on a popular Movies Dataset (you can download it from grouplens.org). The MovieLens recommendation service collected the Dataset from 610 users between 1996 and 2018. Unpack the data into the working folder of your project.

The full Dataset contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users. The Dataset contains the following files (Source of the data description: Kaggle.com):

movies_metadata.csv: The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens Dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a 5-star movie rating with half-star increments (0.5 – 5.0 stars).
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
credits.csv: Consists of Cast and Crew Information for all our films. Available in the form of a stringified JSON Object.

IMDB Movie Database

Several other files are included that we won’t use, incl. ratings_small, links_small, and links.

You can download it here or from Kaggle.

Step #1: Load the Data

Our goal is to create a content-based recommender system for movie recommendations. In this case, the content will be meta information on movies, such as the genre, the actors, and the description.

We begin by making imports and loading the data from three files:

movies_metadata.csv
credits.csv
keywords.csv

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords

# the IMDB movies data is available on Kaggle.com
# https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

# in case you have placed the files outside of your working directory, you need to specify the path
path = 'data/movie_recommendations/' 

# load the movie metadata
df_meta=pd.read_csv(path + 'movies_metadata.csv', low_memory=False, encoding='UTF-8') 

# some records have invalid ids, which is why we remove them
df_meta = df_meta.drop([19730, 29503, 35587])

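# convert the id to type int and set id as index
# (.str.replace removes commas inside the id strings; Series.replace would only match whole values)
df_meta = df_meta.set_index(df_meta['id'].str.strip().str.replace(',', '', regex=False).astype(int))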
pd.set_option('display.max_colwidth', 20)
df_meta.head(2)

		adult	belongs_to_collection			budget		genres				homepage			id		imdb_id		original_language	original_title	overview			...	release_date	revenue		runtime	spoken_languages	status		tagline	title	video	vote_average	vote_count
id																					
862		False	{'id': 10194, 'n...				30000000	[{'id': 16, 'nam...	http://toystory....	862		tt0114709	en					Toy Story		Led by Woody, An...	...	1995-10-30		373554033.0	81.0	[{'iso_639_1': '...	Released	NaN	Toy Story	False	7.7	5415.0
8844	False	NaN								65000000	[{'id': 12, 'nam...	NaN					8844	tt0113497	en				Jumanji				When siblings Ju...	...	1995-12-15		262797249.0	104.0	[{'iso_639_1': '...	Released	Roll the dice an...	Jumanji	False	6.9	2413.0

After we have loaded credits and keywords, we will combine the data into a single dataframe. Now we have various input fields available. However, we will only use keywords, cast, year of release, genres, and overview. If you like, you can enhance the data with additional inputs, for example, budget, running time, or film language.

Once we have gathered our data in a single dataframe, we print out the first rows to gain an overview of the data.

# load the movie credits
df_credits = pd.read_csv(path + 'credits.csv', encoding='UTF-8')
df_credits = df_credits.set_index('id')

# load the movie keywords
df_keywords=pd.read_csv(path + 'keywords.csv', low_memory=False, encoding='UTF-8') 
df_keywords = df_keywords.set_index('id')

# merge everything into a single dataframe 
df_k_c = df_keywords.merge(df_credits, left_index=True, right_on='id')
df = df_k_c.merge(df_meta[['release_date','genres','overview','title']], left_index=True, right_on='id')
df.head(3)

		keywords			cast				crew				release_date	genres				overview			title
id							
862		[{'id': 931, 'na...	[{'cast_id': 14,...	[{'credit_id': '...	1995-10-30		[{'id': 16, 'nam...	Led by Woody, An...	Toy Story
8844	[{'id': 10090, '...	[{'cast_id': 1, ...	[{'credit_id': '...	1995-12-15		[{'id': 12, 'nam...	When siblings Ju...	Jumanji
15602	[{'id': 1495, 'n...	[{'cast_id': 2, ...	[{'credit_id': '...	1995-12-22		[{'id': 10749, '...	A family wedding...	Grumpier Old Men

We can see cast, crew, and genres have a dictionary-like structure. To create a cosine similarity matrix, we need to extract the keywords from these columns and gather them in a single column. This is what we will do in the next step.
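
To illustrate what extracting from such a column involves, here is a minimal sketch that parses one stringified cell. The sample string is an abbreviated, made-up value shaped like the genres column, and `extract_names` is a hypothetical helper. It uses `ast.literal_eval`, which only accepts Python literals and is therefore a safer choice than `eval` for data read from a file.

```python
import ast

# abbreviated example of how a cell in the genres column is stored (made-up value)
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"


def extract_names(cell):
    """Parse a stringified list of dicts and return the 'name' values."""
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # malformed or missing cells yield an empty list instead of an error
        return []
    if not isinstance(items, list):
        return []
    return [item['name'] for item in items if isinstance(item, dict) and 'name' in item]


print(extract_names(raw))
```

Returning an empty list for malformed cells is a design choice that keeps a single bad record from stopping the whole preprocessing run.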

Step #2: Feature Engineering and Data Cleaning

A problem with modeling text is that machine learning algorithms have difficulty processing text directly. An essential step in creating content-based recommenders is bringing the text into a machine-readable form. This is what we call feature engineering.

2.1 Creating a Bag-of-Words Model

We begin with feature engineering and creating the bag of words. As mentioned, a bag of words is a list of words relevant to describe items in a dataset, such as films, and differentiate them. Creating a bag of words removes stopwords but preserves multiplicity so that words can occur multiple times in the concatenated text. Later, each word can be used as a feature in calculating cosine similarities.

The input for a bag of words does not necessarily come from a single input column. We will extract keywords, cast, genres, and the release year and merge them into a new single column that we call tags. Make sure to respect the nature of each text field: we keep first names and surnames together by removing the spaces inside each name, so that, for example, Tom Hanks becomes the single token TomHanks. The code also extracts the overview column, but in the version shown here it is not appended to the tags. The result of this process is our bag.

In addition, we add the movie title and a new index (id), which will later ease working with the similarity matrix. Finally, we print the first rows of our feature dataframe.

import ast

# create an empty DataFrame
df_movies = pd.DataFrame()

# helper: parse a stringified list of dicts and join the names into one string;
# spaces inside a name are removed so that multi-word names stay a single token
# (ast.literal_eval only accepts Python literals and is safer than eval)
def join_names(x):
    return ' '.join(i['name'].replace(' ', '') for i in ast.literal_eval(x))

# extract the keywords
df_movies['keywords'] = df['keywords'].apply(join_names)

# extract the overview
df_movies['overview'] = df['overview'].fillna('')

# extract the release year as a string (empty string if the date is missing or invalid)
release_dates = pd.to_datetime(df['release_date'], errors='coerce')
df_movies['release_date'] = release_dates.apply(lambda x: str(x.year) if pd.notna(x) else '')

# extract the actors
df_movies['cast'] = df['cast'].apply(join_names)

# extract genres
df_movies['genres'] = df['genres'].apply(join_names)

# add the title
df_movies['title'] = df['title']

# merge fields into a tag field (every field is separated by a space)
df_movies['tags'] = df_movies['keywords']+' '+df_movies['cast']+' '+df_movies['genres']+' '+df_movies['release_date']

# drop records with empty tags and duplicates
df_movies.drop(df_movies[df_movies['tags'].str.strip()==''].index, inplace=True)
df_movies.drop_duplicates(inplace=True)

# add a fresh index to the dataframe, which we will later use when referring to items in a vector matrix
df_movies['new_id'] = range(0, len(df_movies))

# Reduce the data to relevant columns
df_movies = df_movies[['new_id', 'title', 'tags']]

# display the data
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', False)
print(df_movies.shape)
df_movies.head(5)

		new_id	title							tags
id			
862		0		Toy Story						jealousy toy boy friendship friends rivalry boynextdoor newtoy toycomestolife TomHanks TimAllen DonRickles JimVarney WallaceShawn JohnRatzenberger AnniePotts JohnMorris ErikvonDetten LaurieMetcalf R.LeeErmey SarahFreeman PennJillette Animation Comedy Family 1995
8844	1		Jumanji							boardgame disappearance basedonchildren'sbook newhome recluse giantinsect RobinWilliams JonathanHyde KirstenDunst BradleyPierce BonnieHunt BebeNeuwirth DavidAlanGrier PatriciaClarkson AdamHann-Byrd LauraBellBundy JamesHandy GillianBarber BrandonObray CyrusThiedeke GaryJosephThorup LeonardZola LloydBerry MalcolmStewart AnnabelKershaw DarrylHenriques RobynDriscoll PeterBryant SarahGilson FloricaVlad JuneLion BrendaLockmuller Adventure Fantasy Family 1995
15602	2		Grumpier Old Men				fishing bestfriend duringcreditsstinger oldmen WalterMatthau JackLemmon Ann-Margret SophiaLoren DarylHannah BurgessMeredith KevinPollak Romance Comedy 1995
31357	3		Waiting to Exhale				basedonnovel interracialrelationship singlemother divorce chickflick WhitneyHouston AngelaBassett LorettaDevine LelaRochon GregoryHines DennisHaysbert MichaelBeach MykeltiWilliamson LamontJohnson WesleySnipes Comedy Drama Romance 1995
11862	4		Father of the Bride Part II		baby midlifecrisis confidence aging daughter motherdaughterrelationship pregnancy contraception gynecologist SteveMartin DianeKeaton MartinShort KimberlyWilliams-Paisley GeorgeNewbern KieranCulkin BDWong PeterMichaelGoetz KateMcGregor-Stewart JaneAdams EugeneLevy LoriAlan Comedy 1995

2.2 Visualizing Text Length

We can use a histogram to illustrate the length of each movie’s bag of words. This gives us an idea of how detailed the movie descriptions are. Items with short descriptions have, in principle, a lower probability of being recommended later. Recommenders produce better results if the length of the descriptions is somewhat balanced.

# add the tag length to the movies df
df_movies['tag_len'] = df_movies['tags'].apply(lambda x: len(x))

# illustrate the tag text length
sns.displot(data=df_movies.dropna(), bins=list(range(0, 2000, 25)), height=5, x='tag_len', aspect=3, kde=True)
plt.title('Distribution of tag text length')
plt.xlim([0, 2500])

Step #3: Vectorization using TfidfVectorizer

The next step is to create a vector matrix from the Bag of Words model. Each column from the matrix represents a word feature. This step is the basis for determining the similarity of the movies afterward. Before the vectorization, we will remove stop words from the text (e.g., and, it, that, or, why, where, etc.). In addition, I limited the number of features in the matrix to 5000 to reduce training time.

A simple vectorization approach is to determine the word frequency for each movie using a count vectorizer. However, a frequently mentioned disadvantage of this approach is that it does not consider how often a word occurs across the whole collection. For example, some words may appear in almost all items and therefore say little about any single one. On the other hand, some words may be prevalent in a few items but rare in general. So we can argue that observing rare words in an item is more informative than observing common words. Instead of a count vectorizer, we will therefore use the TfidfVectorizer from the scikit-learn package.

Tfidf stands for term frequency-inverse document frequency. Compared to a count vectorizer, the tf-idf vectorizer also considers how many items a word occurs in and weights the words accordingly when spanning the vectors. This way, tf-idf gives less weight to words that appear in almost all items and more weight to words that are characteristic of a few items. This medium article explains the math behind tf-idf vectorization in more detail.
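To make the weighting more tangible, here is a minimal sketch that computes tf-idf weights by hand for three made-up tag bags. The documents are hypothetical, and the formula (smoothed idf followed by L2 normalization) is an assumption that mirrors what I understand to be the default behavior of scikit-learn's TfidfVectorizer, not code taken from the library.

```python
import math
from collections import Counter

# Toy "tag bags" (hypothetical, not taken from the movie dataset)
docs = [
    "space alien comedy",
    "space robot drama",
    "space heist thriller",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf_weights(tokens):
    """Return L2-normalized tf-idf weights for one document."""
    term_freq = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {
        word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, tf in term_freq.items()
    }
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


weights = tfidf_weights(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {weight:.3f}")
```

The word "space" occurs in every document and therefore receives a lower weight than "alien" or "comedy", which occur in only one document each.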

# set a custom stop list from nltk
stop = list(stopwords.words('english'))

# create the tfidf vectorizer, alternatively you can also use CountVectorizer
# (the stop words are passed as a list, which is the documented type for stop_words)
tfidf = TfidfVectorizer(max_features=5000, analyzer='word', stop_words=stop)
vectorized_data = tfidf.fit_transform(df_movies['tags'])
count_matrix = pd.DataFrame(vectorized_data.toarray(), index=df_movies['tags'].index.tolist())
print(count_matrix)

			0     1     2     3     4     5     6     7     8     9     ...  4990  4991  4992  4993  4994  4995  4996  4997  4998  4999
862      	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
8844     	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
15602    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
31357    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
11862    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
...      	...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
439050   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
111109   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
67758    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
227506   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
461257   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

[45432 rows x 5000 columns]

The vectorization process results in a feature matrix in which each feature is a word from the text bag of words.

We can display features with the get_feature_names_out function from the tfidf vectorizer.

# print feature names
print(tfidf.get_feature_names_out()[940:990])

['climbing' 'clinteastwood' 'clinthoward' 'clive' 'cliveowen'
 'cliverevill' 'cliverussell' 'clone' 'clorisleachman' 'cloviscornillac'
 'clown' 'clugulager' 'clydekusatsu' 'co' 'coach' 'cobb' 'cocaine' 'code'
 'coffin' 'cohen' 'coldwar' 'cole' 'colehauser' 'coleman' 'colinfarrell'
 'colinfirth' 'colinhanks' 'colinkenny' 'colinsalmon' 'colleencamp'
 'college' 'colmfeore' 'colmmeaney' 'coma' 'combat' 'comedian' 'comedy'
 'comicbook' 'comingofage' 'comingout' 'common' 'communism' 'communist'
 'company' 'competition' 'composer' 'computer' 'con' 'concentrationcamp'
 'concert']

As you can see, the features are specific words.

Step #4: Dimensionality Reduction and Calculating Cosine Similarities

In the previous section, we created a vector matrix that contains movies and features. This matrix is the foundation for calculating similarity scores for all movies. Before we calculate these scores, we will apply dimensionality reduction.

4.1 Dimensionality Reduction using SVD

The matrix spans a high-dimensional vector space with 5000 feature columns. Do we need all of these features? Most likely not. There are likely a lot of words in the matrix that only occur once or twice. On the other hand, some words may occur in almost all movies. How can we deal with this issue?

By reducing this high-dimensional space to fewer, more essential features, we can save some time training our recommender model. We will use TruncatedSVD from the scikit-learn package, a popular algorithm for dimensionality reduction. The algorithm approximates the matrix in a lower-dimensional space, thereby reducing noise and model complexity.

This way, we will reduce the vector space from 5000 to 3000 features.

# reduce dimensionality for improved performance
svd = TruncatedSVD(n_components=3000)
reduced_data = svd.fit_transform(count_matrix)
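The idea behind the reduction can be sketched with NumPy on a tiny matrix. The matrix below is a hypothetical stand-in for the tf-idf matrix, and the sketch uses a full SVD that is truncated afterwards, which illustrates the principle but is not how TruncatedSVD works internally on large matrices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the tf-idf matrix: 6 items x 5 features,
# built so that the features are driven by only 2 latent factors plus noise
latent = rng.random((6, 2))
loadings = rng.random((2, 5))
matrix = latent @ loadings + 0.01 * rng.standard_normal((6, 5))

# Full SVD: matrix = U * diag(s) * Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(matrix, full_matrices=False)

# Keep only the k strongest components
k = 2
reduced = U[:, :k] * s[:k]      # shape (6, 2): the reduced item vectors
approx = reduced @ Vt[:k, :]    # rank-k approximation of the original matrix

# Share of the total squared singular values kept by the first k components
energy_kept = (s[:k] ** 2).sum() / (s ** 2).sum()
print(reduced.shape, round(float(energy_kept), 4))
```

Because the toy matrix was built from two latent factors, two components retain almost all of its structure. For real data, the share that the kept components retain is a useful check on whether the chosen number of components is reasonable.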

4.2 Calculate Text Similarity Scores for all Movies

Now that we have reduced the complexity of our vector matrix, we can calculate the similarity scores for all movies. In this process, we assign a similarity score to all item pairs that measure content closeness according to the position of the items in the vector space.

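We use the cosine function to calculate the similarity value of the movies. The cosine similarity is the cosine of the angle between two vectors, so it measures whether the vectors point in the same direction, regardless of their length. In our case, the vectors are the reduced feature vectors of the movie descriptions. The cosine similarity function uses these feature vectors to compare each movie to every other and assigns them a similarity value. Because tf-idf weights are never negative, most of our scores fall between 0 and 1, although the dimensionality reduction can introduce slightly negative values.

A similarity value of -1 means that two feature vectors point in opposite directions, and the movies are entirely different.
A value of 1 means that the feature vectors point in the same direction, and the two movies are described by the same features.
A value of 0 means that the feature vectors are orthogonal, and the movies have no features in common.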
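The three reference values can be reproduced with a few lines of NumPy. The vectors are made up for illustration, and the helper function cosine_sim is a hypothetical name, not part of scikit-learn.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


same = cosine_sim([1, 2, 0], [2, 4, 0])        # same direction, different length
orthogonal = cosine_sim([1, 0, 0], [0, 3, 0])  # no shared features
opposite = cosine_sim([1, 2, 0], [-1, -2, 0])  # opposite direction
print(same, orthogonal, opposite)
```

Note that the first pair reaches the maximum score although the second vector is twice as long. Cosine similarity ignores vector length, which is why it works well for descriptions of different sizes.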

The cosine similarity function will calculate pairwise similarities for all movies in our vector matrix. The number of unique movie pairs is k(k-1)/2, or roughly k²/2, whereby k is the number of items in the vector matrix. In our case, k is about 45,000 movies, so there are about 1 billion unique pairs. Because the function returns the full k × k matrix, the result holds about 2 billion scores, which takes roughly 16 GB of memory at 64-bit precision. So don’t worry if the process takes some time to complete.
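As a quick back-of-the-envelope check, the numbers can be computed for the 45,432 rows shown in the matrix output above. The memory estimate assumes a dense matrix of 64-bit floats, which is my understanding of what cosine_similarity returns by default.

```python
k = 45432  # number of movies (rows of the vectorized matrix)

unique_pairs = k * (k - 1) // 2   # each unordered pair counted once
full_matrix_entries = k * k       # entries of a dense k x k matrix
bytes_float64 = full_matrix_entries * 8

print(f"unique pairs:     {unique_pairs:,}")
print(f"matrix entries:   {full_matrix_entries:,}")
print(f"memory (float64): {bytes_float64 / 1e9:.1f} GB")
```

If this does not fit into memory, reducing the number of movies or computing the similarities in row batches are possible workarounds.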

# compute the cosine similarity matrix
similarity = cosine_similarity(reduced_data)
similarity

array([[ 1.00000000e+00,  9.75542082e-02,  6.00755620e-02, ...,
        -3.03965235e-04,  0.00000000e+00,  5.81243547e-05],
       [ 9.75542082e-02,  1.00000000e+00,  5.92929339e-02, ...,
        -2.97565163e-03,  0.00000000e+00,  4.57945869e-05],
       [ 6.00755620e-02,  5.92929339e-02,  1.00000000e+00, ...,
         9.40459504e-03,  0.00000000e+00, -2.22415551e-04],
       ...,
       [-3.03965235e-04, -2.97565163e-03,  9.40459504e-03, ...,
         1.00000000e+00,  0.00000000e+00, -2.60823346e-04],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 5.81243547e-05,  4.57945869e-05, -2.22415551e-04, ...,
        -2.60823346e-04,  0.00000000e+00,  1.00000000e+00]])

Step #5: Generate Content-based Movie Recommendations

Once you have created the similarity matrix, it’s time to generate some recommendations. We begin by generating recommendations based on a single movie. In the cosine similarity matrix, the most similar movies have the highest similarity scores. Once we have the films with the highest scores, we can visualize the results in a bar chart that shows the cosine similarity scores.

The example below displays the results of the movie “The Matrix.” Oh, how I love this movie 🙂

# create a function that takes in movie title as input and returns a list of the most similar movies
def get_recommendations(title, n, cosine_sim=similarity):
    
    # get the index of the movie that matches the title
    movie_index = df_movies[df_movies.title==title].new_id.values[0]
    print(movie_index, title)
    
    # get the pairwise similarity scores of all movies with that movie and sort the movies based on the similarity scores
    sim_scores_all = sorted(list(enumerate(cosine_sim[movie_index])), key=lambda x: x[1], reverse=True)
    
    # checks if recommendations are limited
    if n > 0:
        sim_scores_all = sim_scores_all[1:n+1]
        
    # get the movie indices of the top similar movies
    movie_indices = [i[0] for i in sim_scores_all]
    scores = [i[1] for i in sim_scores_all]
    
    # return the top n most similar movies from the movies df
    top_titles_df = pd.DataFrame(df_movies.iloc[movie_indices]['title'])
    top_titles_df['sim_scores'] = scores
    top_titles_df['ranking'] = range(1, len(top_titles_df) + 1)
    
    return top_titles_df, sim_scores_all

# generate a list of recommendations for a specific movie title
movie_name = 'The Matrix'
number_of_recommendations = 15
top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
 
# visualize the results
def show_results(movie_name, top_titles_df):
    fig, ax = plt.subplots(figsize=(11, 5))
    sns.barplot(data=top_titles_df, y='title', x='sim_scores', color='blue', ax=ax)
    ax.set_xlim((0, 1))
    # derive the number in the title from the data instead of hard-coding it
    ax.set_title(f'Top {len(top_titles_df)} recommendations for {movie_name}')
    # label each bar with its similarity score (two significant digits)
    score_labels = ['{:.2}'.format(elm) for elm in list(top_titles_df['sim_scores'])]
    ax.bar_label(container=ax.containers[0], labels=score_labels, size=12)

show_results(movie_name, top_titles_df)

Example for the movies “Spectre” and “The Lion King”

Step #6: Generate Content-based Recommendations for a User’s Watchlist

But what if you want to generate recommendations for specific users that have seen several movies? For this, we can aggregate the similarity scores for all films the user has seen. This way, we create a new dataframe that sums up similarity scores. To return the top-recommended movies, we can sort this dataframe by similarity scores and return the top elements.

# list of movies a user has seen
movie_list = ['The Lion King', 'Seven', 'RoboCop 3', 'Blade Runner', 'Quantum of Solace', 'Casino Royale', 'Skyfall']

# create a copy of the movie dataframe and add a column in which we aggregated the scores
user_scores = pd.DataFrame(df_movies['title'])
user_scores['sim_scores'] = 0.0

# top number of scores to be considered for each movie
number_of_recommendations = 10000
for movie_name in movie_list:
    top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
    # aggregate the scores
    user_scores = pd.concat([user_scores, top_titles_df[['title', 'sim_scores']]]).groupby('title', as_index=False)['sim_scores'].sum()
# sort and print the aggregated scores
user_scores.sort_values(by='sim_scores', ascending=False)[1:20]
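The same aggregation can be expressed directly on the similarity matrix. The sketch below uses a small hypothetical 5 x 5 similarity matrix and index positions instead of titles. Unlike the code above, it explicitly excludes the movies the user has already seen, which is a design choice of this sketch and not part of the original code.

```python
import numpy as np

# Hypothetical similarity matrix for movies 0..4 (symmetric, ones on the diagonal)
sim = np.array([
    [1.0, 0.8, 0.1, 0.3, 0.0],
    [0.8, 1.0, 0.2, 0.4, 0.1],
    [0.1, 0.2, 1.0, 0.1, 0.7],
    [0.3, 0.4, 0.1, 1.0, 0.2],
    [0.0, 0.1, 0.7, 0.2, 1.0],
])
watched = [0, 1]  # indices of the movies the user has seen

# Sum the similarity rows of all watched movies
user_scores = sim[watched].sum(axis=0)

# Exclude movies the user has already seen
user_scores[watched] = -np.inf

# Indices of the remaining movies, best match first
ranking = np.argsort(-user_scores)[: len(sim) - len(watched)]
print(ranking, user_scores[ranking])
```

Movie 3 ranks first because it is moderately similar to both watched movies, whereas movie 4 is similar to neither of them.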

Summary

In this tutorial, you have learned to implement a simple content-based recommender system for movie recommendations in Python. We have used several movie-specific details to calculate a similarity matrix for all movies in our dataset. Finally, we have used this model to generate recommendations for two cases:

Films that are similar to a specific movie
Films that are recommended based on the watchlist of a particular user.

A downside of content-based recommenders is that you cannot test their performance unless you know how users perceived the recommendations. This is because content-based recommenders can only determine which items in a dataset are similar. To understand how well the suggestions work, you must include additional data about actual user preferences.

More advanced recommenders will combine content-based recommendations with user-item interactions (e.g., collaborative filtering). Such models are called hybrid recommenders, but this is something for another article.

Image created with Midjourney

The post Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python appeared first on relataly.com.

Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python

Florian Follonier — Mon, 02 May 2022 18:34:02 +0000

Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos.

By analyzing historical price data, affinity propagation groups coins into clusters based on their past price fluctuations. Such a cluster analysis enables crypto investors to identify promising entry and exit points, ultimately helping them make smarter investment decisions.

To use this technique effectively, it’s important to understand essential concepts such as covariance, lasso regression, and affinity propagation. Once you understand these concepts, you can apply them to analyze price time series data and identify hidden patterns.

Finally, visualizing the results in two and three dimensions can help us better understand the relationships between coins and their respective clusters. The resulting crypto market map can be a powerful tool for investors to gain insight into the market’s structure and make informed investment decisions.

Disclaimer

This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.

What is Stock Market Clustering?

Clustering stock markets refers to grouping stocks based on their similarities or common characteristics. This can be done using various clustering algorithms, which analyze the data and assign each stock to a cluster based on its similarity to other stocks in the same cluster. In this article, we will run a cluster analysis on historical time series data. This approach involves grouping assets into clusters based on their historical performance over a certain period of time.

Clustering stock market data can be useful for a variety of purposes, such as identifying patterns or trends in the data, comparing the performance of different stocks or sectors, identifying potential risks or correlations between markets, or generating investment recommendations. However, keep in mind that clustering is just one tool among many for analyzing stock market data, and you should consider a range of factors when making investment decisions.

Also: Color-Coded Cryptocurrency Price Charts in Python

We can use a crypto market map to illustrate the price correlation between cryptocurrencies.

What’s the Problem with Prototype-based Clustering?

Clustering is an unsupervised learning technique that groups similar objects into clusters and separates them from different ones. One of the most popular clustering techniques is k-means. K-means belongs to the so-called prototype-based clustering techniques, which divide data points into a predefined number of groups (in the case of k-means, the groups are of equal variance).

The prototype-based clustering approach works great if the number of clusters in a dataset is known and the clusters have similar dispersion. However, when we deal with real-world problems, we often encounter more complex data for which the optimal number of clusters is unknown and difficult or even impossible to guess. In such a case, affinity propagation has a significant advantage because it can automatically estimate the number of clusters.

Affinity Propagation: What it is and How it Works

The idea of affinity propagation is to identify clusters by measuring the similarity of data points relative to one another. The algorithm chooses data points as cluster centers that best represent other data points near them.

We can imagine the process of identifying these representative data points as an election. Each data point (i) is both a voter who casts votes and a candidate (k) who can receive votes from other voters. Votes are a measure of the similarity of data points. A voter who gives many votes to a candidate expresses that this candidate is similar to it and is therefore suitable for representing it as a cluster center. The voting process continues until the algorithm reaches a consensus and selects a set of cluster centers (exemplars).

Affinity Propagation: Data points cast votes for candidates and receive votes from other data points

The clustering process involves many separate steps (This article provides a detailed description of the steps involved) and works with several matrices:

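The similarity matrix measures how similar each pair of data points is and thus how well suited a data point (candidate) is to act as the cluster center of another data point.
The responsibility matrix collects the support of the data points for the candidates (potential cluster centers), that is, how well suited a candidate is to represent a data point compared with the other candidates.
The availability matrix reflects how appropriate it is for a data point to choose a candidate, given the support that this candidate receives from other data points.
The criterion matrix sums up responsibilities and availabilities and defines the clusters. For each data point, the candidate with the highest criterion value becomes its cluster center, and data points that share the same cluster center are part of the same cluster.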
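The interplay of these matrices can be sketched in plain NumPy. This is a simplified illustration of the message passing on six made-up one-dimensional points, not the scikit-learn implementation: there is no convergence check, the function name affinity_propagation_sketch is hypothetical, and the preference value of -1.0 on the diagonal of the similarity matrix is an arbitrary assumption.

```python
import numpy as np


def affinity_propagation_sketch(S, damping=0.5, n_iter=200):
    """Message passing on a similarity matrix S whose diagonal holds the preferences."""
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))  # responsibilities
    A = np.zeros((n, n))  # availabilities
    for _ in range(n_iter):
        # Responsibility: r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        best_val = AS[rows, best]
        AS[rows, best] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - best_val[:, None]
        R_new[rows, best] = S[rows, best] - second_val
        R = damping * R + (1 - damping) * R_new

        # Availability: a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new[rows, rows].copy()  # a(k,k) is not clipped at zero
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = self_avail
        A = damping * A + (1 - damping) * A_new

    # Criterion matrix: the best-scoring candidate per row is that point's exemplar
    return (R + A).argmax(axis=1)


# Two well-separated groups of points; similarity = negative squared distance
points = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
S = -(points[:, None] - points[None, :]) ** 2
np.fill_diagonal(S, -1.0)  # preference: how willing each point is to be an exemplar

exemplars = affinity_propagation_sketch(S)
print(exemplars)
```

Each entry of the result is the index of the exemplar chosen by the corresponding point, so points with the same entry belong to the same cluster. A lower preference generally leads to fewer clusters, a higher preference to more.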

Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster

Time Series Clustering using Affinity Propagation – Visualizing Cryptocurrency Market Structures in Python

Ready to implement affinity propagation in Python to analyze the crypto market structure and create a visual representation of price similarity? Let’s dive in!

First, we define a portfolio of cryptocurrencies and download their historical price quotes from coinmarketcap. We then visualize the time series on separate line charts to ensure that the data has been loaded successfully. After preparing and cleaning the data, we can move on to clustering the cryptocurrencies into groups with similar price movements using Affinity Propagation.

Unlike other clustering algorithms, we don’t set the number of clusters in advance. Instead, we let affinity propagation determine the optimal number of clusters for our portfolio. Finally, we calculate the covariance matrix between clusters and arrange the cryptocurrencies on a 2D map into clusters. We create a network overlay based on covariance to better understand the relationships between different clusters.

With affinity propagation, we can identify hidden patterns in the crypto market and group coins into clusters based on their past price fluctuations. This process allows us to identify promising entry and exit points, ultimately helping us make smarter investment decisions. Plus, the 2D map and network overlay help us visualize the relationships between different clusters and coins.

We can use affinity propagation to cluster financial assets and visualize them on a map.

The Python code for this tutorial is available in the relataly repository on GitHub.

View on GitHub Relataly GitHub Repo

Prerequisites

Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. Consider Anaconda if you don’t have a Python environment set up yet. To set it up, you can follow the steps in this tutorial. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, Matplotlib, Seaborn, and scikit-learn.

Please also make sure you have the cryptocmd package installed, which provides the CmcScraper class. We will be using it to download past crypto prices from coinmarketcap.

You can install these packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Step #1: Load the Crypto Price Data

We start by loading historical crypto price data from Coinmarketcap. To download the data, we use Cmcscraper, a Python library that allows us to collect Coinmarketcap data without signing up for the official API.

The download returns a dataframe with daily price quotes (Close, Open, Avg) for cryptocurrencies between 2016 and today. You can use the dictionary (“symbol_dict”) to control which cryptos you want to include in the data. We limit the data we use in our cluster analysis to the most recent t days (14 in the code below). In this way, we prevent older price developments from influencing the correlation. But it’s up to you to specify a different period. In addition, instead of using absolute price values, we will use daily percentage fluctuations.

Also: Requesting Crypto Price Data from the Gate.io REST API in Python

Loading the data can take several minutes, depending on how many cryptocurrencies we include in the request. So it makes sense not to load the data every time you run the code. Therefore, the code below stores the historical prices in a CSV file.

When you run the code below with fetch_new_data set to False, the script checks whether the data already exists. If it does, it will use the data from the CSV file. Otherwise, it will load a fresh copy of the data from coinmarketcap.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5

from cryptocmd import CmcScraper
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn import cluster, covariance, manifold
import requests
import json


# get a dictionary of the top 50 coin symbols and names from an API
def get_symbol_dict():
    url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=50&sortBy=market_cap&sortType=desc&convert=USD&cryptoType=all&tagType=all&audited=false'
    response = requests.get(url)
    data = json.loads(response.text)
    df = pd.DataFrame(data['data']['cryptoCurrencyList'])

    # exclude stable coins
    df = df[~df['symbol'].isin(['USDT', 'USDC', 'BUSD', 'DAI', 'TUSD', 'PAX', 'GUSD', 'HUSD', 'USDK', 'USDS', 'USDP', 'USDN', 'USDSB', 'USDX', 'USD++', 'BIDR', 'IDRT', 'VAI', 'BGBP'])]
    df = df[['symbol', 'name']]
    df = df.set_index('symbol')
    df = df.to_dict()
    df = df['name']
    return df

symbol_dict = get_symbol_dict()


# Download historic crypto prices via CmcScraper
def load_fresh_data_and_save_to_disc(symbol_dict, save_path):
    # Extract symbols and names from the symbol_dict
    symbols, names = np.array(sorted(symbol_dict.items())).T
    
    # Initialize an empty DataFrame for storing the prices
    df_crypto = pd.DataFrame()

    # Download and process the price data for each symbol
    for symbol in symbols:
        print(f"Fetching prices for {symbol}...")
        
        # Download the price data using CmcScraper
        scraper = CmcScraper(symbol)
        df_coin_prices = scraper.get_dataframe()

        # Process the price data and add it to df_crypto
        df = pd.DataFrame({
            f"{symbol}_Open": df_coin_prices["Open"],
            f"{symbol}_Close": df_coin_prices["Close"],
            f"{symbol}_Avg": (df_coin_prices["Close"] + df_coin_prices["Open"]) / 2,
            f"{symbol}_p": (df_coin_prices["Open"] - df_coin_prices["Close"]) / df_coin_prices["Open"]
        })
        df_crypto = pd.concat([df_crypto, df], axis=1)

    # Save the price data to a CSV file
    X_df_filtered = df_crypto.filter(like="_p")
    X_df_filtered.to_csv(save_path + "historical_crypto_prices.csv")

    return names, symbols, X_df_filtered
        

# Set to True if you want to download a fresh copy of the data.
# If set to False, the data is loaded from the CSV file on disk (if it exists).
fetch_new_data = True 
save_path = '' # path where the price data will be stored in a csv file

# Fetch fresh data via the scraping package, or use data from the csv file on disk
if not fetch_new_data:
    try:
        print('loading from disk')
        X_df_filtered = pd.read_csv(save_path + 'historical_crypto_prices.csv')
        if 'Unnamed: 0' in X_df_filtered.columns: 
            X_df_filtered = X_df_filtered.drop(['Unnamed: 0'], axis=1)
        # restore symbols and names, which are needed in the later steps
        symbols, names = np.array(sorted(symbol_dict.items())).T
        print(list(X_df_filtered.columns))
    except FileNotFoundError:
        print('no existing price data found - loading fresh data from coinmarketcap and saving them to disk')
        names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
        print(list(symbols))
else:
    print('loading fresh data from coinmarketcap and saving them to disk')
    names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
    print(list(symbols))

# Limit the price data to the last t days
t= 14 # in days
X_df_filtered = X_df_filtered[:t]
X_df_filtered.head()

	ACM_p		ADA_p		ARK_p		ATM_p		ATOM_p		AVAX_p		BAT_p		BCH_p		BLZ_p		BNB_p		...	THETA_p		UNI_p		USDT_p		VET_p		WAVES_p		XLM_p		XMR_p		XRP_p		ZIL_p		ZRX_p
0	0.031987	-0.037645	-0.005702	0.030928	-0.005897	-0.012404	-0.012262	-0.022529	0.008072	-0.007111	...	-0.021994	-0.023758	-0.000103	-0.021024	-0.015416	-0.004096	-0.022988	-0.027397	-0.016659	-0.012255
1	0.028192	0.065034	0.122306	0.010310	0.093558	0.106811	0.082863	0.075567	0.062105	0.054733	...	0.067264	0.081040	0.000136	0.077203	0.092987	0.078562	0.111519	0.071696	0.076484	0.085094
2	0.040771	0.016097	-0.133345	0.018963	0.011304	-0.033328	-0.007616	0.011458	-0.019993	0.005134	...	-0.005104	-0.024190	0.000077	0.002218	0.008920	0.004139	-0.031822	-0.012107	-0.003906	-0.021170
3	-0.027698	0.005129	-0.031516	-0.002639	0.022235	-0.008117	0.003969	0.019119	0.015403	0.005920	...	0.007992	0.027203	0.000003	0.000701	0.010739	0.005324	-0.007914	0.007168	0.004556	-0.003786
4	-0.021129	-0.019053	0.003273	-0.008121	0.002883	-0.004927	0.002548	-0.000599	0.028492	-0.012181	...	0.000198	-0.025817	-0.000047	-0.002800	-0.051515	-0.004861	0.015134	-0.000596	-0.010343	0.004530

The data looks good, so let’s continue.

Step #2: Plotting Crypto Price Charts

Now that the data is available, we can visualize it in various line graphs. The visualization helps us better understand what kind of data we are dealing with and check if the download was successful.

# Create line charts of the daily price changes for all cryptocurrencies
list_length = X_df_filtered.shape[1]
ncols = 10
# round up so that every cryptocurrency gets its own subplot
nrows = int(np.ceil(list_length / ncols))
height = list_length/3 if list_length > 30 else 4
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True, figsize=(20, height))
for i, ax in enumerate(fig.axes):
    if i < list_length:
        sns.lineplot(data=X_df_filtered, x=X_df_filtered.index, y=X_df_filtered.iloc[:, i], ax=ax)
        ax.set_title(X_df_filtered.columns[i])
plt.show()

We can see the lineplots for all cryptocurrencies and everything looks as expected.

Step #3: Clustering Cryptocurrencies using Affinity Propagation

Next, we must prepare the data and run the affinity propagation algorithm. For some cryptocurrencies, we may encounter data that contains NaN values. Because the estimators used below cannot handle missing values, we drop the days that contain them. In addition, the Python code below converts the DataFrame into a NumPy array in which the days are the records (rows) and the crypto assets are the columns, and it standardizes each time series.

Running the code below returns a dictionary of clusters with the cryptocurrencies assigned to them by the affinity propagation algorithm.

# Drop days (rows) that contain NaN values
X_df = pd.DataFrame(np.array(X_df_filtered)).dropna()
# Standardize each time series; rows are days, columns are cryptocurrencies
X = X_df.copy()
X /= X.std(axis=0)
X = np.array(X)
# Define an edge model based on covariance
edge_model = covariance.GraphicalLassoCV()
# Fit the edge model to the standardized time series
edge_model.fit(X)
# Group cryptos to clusters using affinity propagation
# The number of clusters will be determined by the algorithm
cluster_centers_indices, labels = cluster.affinity_propagation(edge_model.covariance_, random_state=1)
cluster_dict = {}
# Labels start at 0, so the number of clusters is the highest label plus one
n_labels = labels.max()
print(f"{n_labels + 1} Clusters")
for i in range(n_labels + 1):
    clusters = ', '.join(names[labels == i])
    print('Cluster %i: %s' % ((i + 1), clusters))
    cluster_dict[i] = clusters

10 Clusters
Cluster 1: Binance Coin, Cake Defi
Cluster 2: Bitcoin Cash, Bitcoin, BitTorrent, Decred, EOS, Ethereum Classic, Ethereum, Ampleforth, Komodo, Solana, Sys Coin, DOT
Cluster 3: Celsius
Cluster 4: Doge Coin
Cluster 5: Cardano, ATOM, Avalance, Enjin, Internet Computer, Link, Loopring, Polygon, IOTA, NEO, Synthetix, Theta, Vechain
Cluster 6: Litecoin
Cluster 7: ACM Token, Atletico Madrid Token, Chilliz, Juventus Turin Token, PSG Token
Cluster 8: LRC
Cluster 9: Tether
Cluster 10: ARK, Battoken, BLZ, Digibyte, AS Rom Token, WAVES, Stellar Lumen, Monero, Ripple, Zilliqa, Zer0

We can see that the algorithm has identified 10 different clusters in the data, including a couple of clusters with only a single member. You will most likely encounter different results depending on when you run it.

Step #4: Create a 2D Positioning Model based on the Graph Structure

In addition to clusters, we want to show the covariance between cryptocurrencies in our Crypto Market map. We need a graph-like structure that contains the covariance and position data of the cryptocurrencies for each crypto pair.

In addition, we use a node position model that calculates their relative position on a 2D plane from the covariance of the cryptocurrencies. However, the positions are only relative, so the absolute axes have no meaning.

# Create a node_position_model that finds the best position of the cryptos on a 2D plane
# The number of components defines the dimensions in which the nodes will be positioned
node_position_model = manifold.LocallyLinearEmbedding(n_components=2, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The results are the x and y coordinates for all cryptocurrencies
pd.DataFrame(embedding)
# Create an edge_model that represents the partial correlations between the nodes
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
# Only consider partial correlations above a specific threshold (0.02)
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)
# Convert the Positioning Model into a DataFrame
data = pd.DataFrame.from_dict({"embedding_x":embedding[0],"embedding_y":embedding[1]})
# Add the labels to the 2D positioning model
data["labels"] = labels
print(data.shape)
data.head()

(48, 3)
	embedding_x	embedding_y	labels
0	0.400590	-0.136473	6
1	-0.081908	-0.086039	4
2	-0.033982	-0.038526	9
3	0.416745	0.076849	6
4	-0.041938	0.031966	4

The next step is to create a graph of the partial correlations.
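To make the normalization of the precision matrix used above more concrete, here is a small sketch with a hypothetical covariance matrix of three standardized return series. The numbers are made up so that the first and third series are correlated only through the second one.

```python
import numpy as np

# Hypothetical covariance matrix of three standardized return series
cov = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# The precision matrix is the inverse of the covariance matrix
precision = np.linalg.inv(cov)

# Scale to a unit diagonal, the same normalization as in the code above
d = 1 / np.sqrt(np.diag(precision))
scaled = precision * d * d[:, np.newaxis]

# Partial correlations are the negated off-diagonal entries of the scaled precision
partial_corr = -scaled
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 3))
```

Although the first and third series have a correlation of 0.3, their partial correlation is zero because the second series fully explains their relationship. This is why partial correlations are a good basis for the network overlay: they only connect assets that are directly related. The tutorial code skips the sign flip, which makes no difference for the map because it only uses absolute values.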

Step #5: Visualize the Crypto Market Structure

Our goal is to visualize differences in the strength of the relationship between crypto pairs by varying the connection strengths. We calculate the line strength by normalizing the absolute partial correlations of the crypto pairs, which we derived from the covariance model in the previous step. In addition, we visualize the distribution of these values.

# Create an array with the segments for connecting the data points
start_idx, end_idx = np.where(non_zero) 
segments = [[np.array([embedding[:, start], embedding[:, stop]]).T, start, stop] for start, stop in zip(start_idx, end_idx)]
# Create a normalized representation of the partial correlations between cryptocurrencies
# We later use it to visualize the strength of the connections
pc = np.abs(partial_correlations[non_zero])
normalized = (pc-min(pc))/(max(pc)-min(pc))
# Plot the distribution of the partial correlations between the cryptocurrencies
sns.histplot(pc)

The hist plot shows that the covariance between the crypto pairs is mostly below 0.005.

Finally, it is time to map cryptocurrencies on a 2D plane. To do this, we first define the cryptocurrencies using their relative position data with a scatterplot. We set the color of the points based on their clusters so that points in the same cluster are colored the same. Subsequently, we connect the points to the data from the edge model. The covariance between the crypto pairs determines the strength of their connections.

We also define the color of the connections as follows.

The map only shows connections whose absolute partial correlation exceeds 0.02, the threshold set in the previous step.
Connections with a normalized strength greater than 0.5 are colored red.
Otherwise, connections between points within a cluster are shown in the cluster’s color.
Connections between points of different clusters are drawn as dashed grey lines.
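
The coloring rules can be condensed into a small helper. This is only a sketch: it assumes the edge strength is the min-max normalized partial correlation used in the plotting code below, with its 0.5 cutoff for red edges. The names edge_style and cluster_color are hypothetical and not part of the tutorial code.

```python
def edge_style(norm_strength, label_a, label_b, cluster_color, threshold=0.5):
    """Return (color, linestyle) for an edge between two nodes."""
    # Strong connections are always highlighted in red
    if norm_strength > threshold:
        return "red", "solid"
    # Weaker connections inside a cluster take the cluster's color
    if label_a == label_b:
        return cluster_color, "solid"
    # Weaker connections across clusters are drawn in dashed grey
    return "grey", "dashed"


print(edge_style(0.8, 1, 2, "green"))  # ('red', 'solid')
print(edge_style(0.2, 1, 1, "green"))  # ('green', 'solid')
print(edge_style(0.2, 1, 2, "green"))  # ('grey', 'dashed')
```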

Last but not least, we add the labels of the cryptocurrencies.

# Visualization
plt.figure(1, facecolor='w', figsize=(20, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])

# Plot the nodes using the coordinates of our embedding
sc = sns.scatterplot(
    data=data,
    x="embedding_x",
    y="embedding_y",
    zorder=1,
    s=350 * d ** 2,
    c=labels,
    cmap=plt.cm.nipy_spectral,
    alpha=.9,
    #palette="muted",
)

# Plot the covariance edges between the nodes (scatter points)
line_strength = 3.2
    
for index, ((x, y), start, stop) in enumerate(segments):     
    norm_partial_correlation = normalized[index]
    if list(data.iloc[[start]]['labels'])[0] == list(data.iloc[[stop]]['labels'])[0]:
        if norm_partial_correlation > 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = plt.cm.nipy_spectral(list(data.iloc[[start]]['labels'])[0] / float(n_labels)); linestyle='solid'
    else:
        if norm_partial_correlation > 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = 'grey'; linestyle='dashed'
    # Plot the edges
    # Note: only edges whose start node has positive x and y coordinates are drawn
    if x[0] > 0 and y[0] > 0:
        plt.plot(x, y, alpha=.4, zorder=0, linewidth=normalized[index]*line_strength, color=color, linestyle=linestyle)

    
# Labels the nodes and position the labels to avoid overlap with other labels
for name, label, (x, y) in zip(names, labels, embedding.T):
    color = plt.cm.nipy_spectral(label / float(n_labels))
    ax.annotate(
        name,
        xy=(x, y),
        xytext=(5, 2),
        textcoords='offset points',
        ha='right',
        va='bottom',
        fontsize=10,
        color='black',
        bbox=dict(facecolor='w', edgecolor="w", alpha=.0),
     )

Note that you will likely see a different map when you run the code on your machine. Differences result from changes in market prices and covariance that lead to other graph structures.

Let’s see what the crypto market map tells us.

Interpreting the Cryptomarket Map

The 2D crypto market map tells us several things:

Most cryptos fall into the light green and dark green clusters corresponding to different types of crypto (Decentralized Finance Coins, NFT/Metaverse Coins).
There is a significant covariance between large-cap players in the crypto space, such as Cardano and Loopring, or Ethereum and Bitcoin, which is plausible considering recent price movements. Some results are surprising, for example, the partial correlation between NEO and Ethereum Classic.
Some clusters are isolated and contain only a single member, for example, Tether, Komodo, the AC Milan token, the Wave token, and Dogecoin. The reason is that the prices of these coins/tokens have developed independently of the market.
- Tether is a stablecoin that does not change in price. It therefore differs strongly from the other cryptocurrencies on our map.
- Komodo has been trading sideways without following the general market trend.
- The AC Milan token is a soccer token that has recently outperformed the market.
Soccer tokens are colored in dark blue. These tokens’ prices correlate with how the soccer clubs performed during the current season. It, therefore, makes perfect sense that these tokens are grouped into a cluster. An exception is the AC Milan token, which recently performed better than the other soccer tokens.

Step #6 Creating a 3D Representation

Instead of a 2D representation of the data points, we can also use a 3D node positioning model. For this purpose, the node positioning model distributes the affinity values over three dimensions.

# Find the best position of the cryptos in 3D space
node_position_model = manifold.LocallyLinearEmbedding(n_components=3, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are the x, y, and z coordinates for all cryptocurrencies
pd.DataFrame(embedding)
# Derive the partial correlations from the edge model
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)
# Use the third row of the embedding for the z axis
data = pd.DataFrame.from_dict({"embedding_x":embedding[0],"embedding_y":embedding[1],"embedding_z":embedding[2]})
data["labels"] = labels
data["names"] = names
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(projection='3d')
xs = data["embedding_x"]
ys = data["embedding_y"]
zs = data["embedding_z"]
sc = ax.scatter(xs, ys, zs, c=labels, s=100)
    
for i in range(len(data)):
    x = xs[i]
    y = ys[i]
    z = zs[i]
    label = data["names"][i]
    ax.text(x, y, z, label)
    
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 1), loc=2)
plt.show()

Summary

Affinity propagation is a powerful technique for clustering items when the optimal number of clusters is unknown. In this article, we’ve demonstrated how to apply affinity propagation to analyze the cryptocurrency market and identify groups of assets based on similar price fluctuations.

In our example, we identified 13 groups of cryptocurrencies without specifying the number of clusters in advance. We also visualized the market structure on a 2D and 3D map using a node distribution technique. This approach can be extended to analyze and cluster stock markets, highlighting complex price patterns among multiple financial assets.

Once you’ve identified clusters, you can dive deeper into individual groups. Sometimes, outliers that temporarily break out of their usual pattern indicate interesting investment opportunities. These outliers can eventually return to the price pattern of their group, or they may represent forerunners of their group, indicating broader market movements.

By using affinity propagation, we can visualize financial assets in a new and exciting way. If you have any questions or comments about this approach, please let me know.

Sources and Further Reading

This article modifies some of the code from Scikit-learn and adapts it from the stock market to cryptocurrencies.

Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python
Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
David Forsyth (2019) Applied Machine Learning Springer
Andriy Burkov (2020) Machine Learning Engineering
Images are created using Midjourney, an AI that creates images from text.

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python appeared first on relataly.com.

Cluster Analysis with k-Means in Python

Florian Follonier — Sun, 27 Jun 2021 18:21:01 +0000

Embark on a journey into the world of unsupervised machine learning with this beginner-friendly Python tutorial focusing on K-Means clustering, a powerful technique used to group similar data points into distinct clusters. This invaluable tool helps us make sense of complex datasets, finding hidden patterns and associations without the need for a predetermined target variable. This comes in handy, especially when dealing with data whose similarities and differences aren’t immediately apparent.

Divided into two insightful sections, this blog post first delves into the theoretical foundation of the K-Means clustering algorithm. We’ll explore its real-world applications, its strengths, and its weaknesses, providing a comprehensive overview of what makes K-Means an essential tool in any data scientist’s toolkit.

In the second part of the blog, we switch gears and dive into a hands-on Python tutorial. Here, we’ll walk you through a practical example of K-Means clustering, using it to uncover three distinct spherical clusters within a synthetic dataset. To round up, we’ll put our model to the test, using it to predict clusters within a test dataset, and then we’ll visualize the results for easy understanding.

Whether you’re a seasoned data scientist or a beginner looking to dip your toes into the field, this blog post offers a simple and accessible introduction to cluster analysis with K-Means in Python. Follow along, and uncover the hidden potential of your data today!

Patterns are everywhere, and machine learning can help to expose and understand them better. Image created with Midjourney.

Finding Clusters in Multivariate Data with k-Means

K-Means clustering is a prominent unsupervised machine learning algorithm utilized to partition datasets into a predefined number (‘k’) of distinct clusters, each brimming with similar data points. To begin, the algorithm identifies ‘k’ random data points which serve as the initial centroids – the central points representing each cluster. It then proceeds in an iterative fashion, continuously reassigning each data point to the centroid nearest to it, followed by updating each centroid to the average of the data points assigned to its cluster. This process repeats until the centroids no longer change positions or until a set maximum number of iterations is reached.

The underlying objective of the K-Means algorithm is to minimize the within-cluster sum of squares (WCSS), a metric that adds up the squared distances between each data point within a cluster and its corresponding centroid. By reducing the WCSS, K-Means aims to form the most compact and clearly separable clusters.

The K-Means algorithm has garnered substantial popularity in the field of data science for its speed and simplicity in implementation. However, it isn’t without its limitations. It requires the number of clusters to be specified beforehand and operates under the assumption that all clusters are spherical and uniformly sized. In the following sections, we’ll delve into the intricacies of this powerful yet straightforward algorithm and discuss its use in real-world applications.

Three clusters in a dataset that were separated by the k-Means algorithms

Understanding the Mechanics of K-Means Clustering Algorithm

At its core, K-Means clustering is a dynamic algorithm designed to simplify complex data patterns. This machine learning model is a champion of unsupervised learning, adept at grouping similar data points into a predetermined number (‘k’) of unique clusters.

Starting the process, the K-Means algorithm selects ‘k’ random data points, which act as the initial centroids or the central figures of the clusters. The algorithm then goes through a series of iterative steps, continually allocating each data point to its closest centroid, and recalibrating the centroid to be the mean of all data points that are assigned to its cluster. This cycle repeats until we achieve stable centroids, i.e., they stop moving, or until a pre-set maximum number of iterations is completed.

The fundamental goal of K-Means lies in minimizing the within-cluster sum of squares (WCSS). This term refers to the sum of the squared distances of each data point in a cluster to its centroid. By lowering WCSS, K-Means strives to construct compact clusters that are distinctly separate from each other.
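
In formula terms, WCSS adds up the squared Euclidean distance between every data point and the centroid of the cluster it is assigned to. The following sketch computes it with NumPy on four made-up points; the function name wcss and the toy data are assumptions for illustration.

```python
import numpy as np


def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# A sensible assignment: every point sits at distance 1 from its centroid
good_labels = np.array([0, 0, 1, 1])
print(wcss(points, good_labels, centroids))  # 4.0

# A poor assignment: two points are attached to the far centroid
bad_labels = np.array([0, 1, 0, 1])
print(wcss(points, bad_labels, centroids))  # 204.0
```

The poor assignment inflates the WCSS from 4 to 204, which is exactly the kind of configuration the algorithm moves away from.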

In the illustration below, we assume a dataset with three cluster centers. K-Means carries out several steps to partition the data.

Initially, the algorithm chooses k random starting positions for the centroids. Alternatively, we can also position the centroids manually.
Then the algorithm calculates the Euclidean distance between each data point and the three centroids. Each data point is assigned to its closest centroid, which is the assignment that increases the cluster variance the least.
The result is a set of linear decision boundaries that separate the clusters but are not yet optimal.
From then on, the algorithm optimizes the centroids’ positions to lower the variance of the resulting clusters by moving each centroid to the mean of its assigned data points. Then, the previous steps are repeated: calculating distances, assigning the data points to clusters, and shifting the centroids.

The process ends when the positions of the centroids do not change anymore.
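
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the iterative procedure, not scikit-learn's implementation; the function name kmeans_sketch, the toy blobs, and the seeds are assumptions.

```python
import numpy as np


def kmeans_sketch(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Three synthetic blobs as toy data
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

labels, centroids = kmeans_sketch(points, k=3)
print(centroids.round(2))
```

Because the starting centroids are random, a single run can end in a poor local optimum. This is the reason library implementations usually repeat the procedure with several initializations and keep the best result.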

k-Means iteratively optimizes the decision boundaries between clusters

How many Clusters?

A particular challenge is that k-means requires estimating the number of clusters k. When tackling new clustering problems, we usually don’t know the optimal number of clusters. If the data is not too complex, we can often estimate the number of centers by looking at one or more scatter plots. However, this approach only works when the data has a few dimensions. For complex data with many dimensions, it is common to experiment with different values of k to find an appropriate number for the problem. We can automate this process using hyperparameter tuning techniques such as grid search or random search. The idea is to try out different numbers of clusters and identify the one that best differentiates between clusters.
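
A simple way to try out different values of k is a loop that fits one model per candidate and scores the result. The sketch below uses the silhouette score as the criterion. That choice is an assumption on my part, since the article does not prescribe a specific metric, and the synthetic blobs only serve as a stand-in for real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Silhouette: close to 1 means compact, well-separated clusters
    scores[k] = silhouette_score(X, model.labels_)
    print(f"k={k}: inertia={model.inertia_:.1f}, silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette score:", best_k)
```

The inertia (the WCSS of the fitted model) keeps shrinking as k grows, so it cannot be maximized or minimized directly. A relative measure such as the silhouette score is easier to compare across values of k.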

Pros and Cons of k-Means Clustering

Although K-Means is esteemed for its swift operation and uncomplicated implementation, it comes with a set of constraints, so it pays to know the strengths and weaknesses of clustering techniques such as k-Means. In general, clustering can reveal structures and relationships in data that supervised machine learning methods like classification likely would not uncover. In particular, when we suspect different subgroups in the data that differ in their behavior, clustering can help discover what makes these groups unique.

K-Means, in particular, possesses a unique ability to detect and segregate spherical-shaped clusters efficiently. However, its performance may falter when faced with clusters embodying more intricate structures like half-moons or circles, often struggling to differentiate them accurately.
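
The difference between spherical and non-spherical clusters is easy to reproduce. The following sketch compares K-Means on two blobs and on two half-moons, using the adjusted Rand index (ARI) to measure agreement with the known groups. The datasets, parameters, and the helper name kmeans_ari are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score


def kmeans_ari(X, y_true, k):
    """Cluster X with K-Means and compare the result to the known groups."""
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(y_true, pred)


X_blobs, y_blobs = make_blobs(n_samples=400, centers=2, cluster_std=1.0, random_state=42)
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=42)

ari_blobs = kmeans_ari(X_blobs, y_blobs, k=2)
ari_moons = kmeans_ari(X_moons, y_moons, k=2)

print(f"ARI on blobs: {ari_blobs:.2f}")
print(f"ARI on half-moons: {ari_moons:.2f}")
```

An ARI of 1 means perfect agreement, while values near 0 correspond to random assignments. On the half-moons, K-Means splits the data with a straight line and therefore cuts both moons in half.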

Another potential limitation of K-Means lies in its prerequisite of specifying the number of clusters in advance. This requirement can prove challenging when the true number of clusters within a dataset is unknown or ambiguous. In such scenarios, alternative clustering techniques, such as affinity propagation or hierarchical clustering, might be more suitable, as they possess the ability to automatically determine the optimal number of clusters.

Applications of Clustering

The k-means algorithm is used in a variety of applications, including the following:

A typical use case for clustering is in marketing and market segmentation. Here clustering is used to identify meaningful segments of similar customers. The similarity can be based on demographic data (age, gender, etc.) or customer behavior (for example, the time and amount of a purchase).
Medical research uses clustering to divide patient groups into different subgroups, for example, to assess stroke risk. After clustering, the next step is to develop separate prediction models for the subgroups to estimate the risk more accurately.
An application in the financial sector is outlier detection in fraud detection. Banks and credit card companies use clustering to detect unusual transactions and flag them for verification.
Spam filtering: The input data are attributes of emails (text length, words contained, etc.) and help separate spam from non-spam emails.

Implementing a K-Means Clustering Model in Python

In the following, we run a cluster analysis on synthetic data using Python and scikit-learn. We aim to train a K-Means cluster model in Python that distinguishes three clusters in the data. Given that the data is synthetic, we’re already privy to which cluster each data point pertains to. This foreknowledge enables us to evaluate the performance of our model post-training, gauging how effectively it can differentiate between the three predefined clusters.

The code is available on the GitHub repository.

Prerequisites

Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, consider the Anaconda Python environment. To set it up, you can follow the steps in this tutorial.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, and Matplotlib.

This article uses the k-means clustering algorithm from the Python library Scikit-learn. We also use Seaborn for visualization.

You can install these libraries using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Step #1: Generate Synthetic Data

We start with the generation of synthetic data. For this purpose, we use the make_blobs function from the scikit-learn library. The function generates random clusters in two dimensions spherically arranged around a center. In addition, the data contains the respective cluster to which the data points belong. We use a scatterplot to visualize the data.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
features, labels = make_blobs(
    n_samples=400,
    centers=3,
    cluster_std=2.75,
    random_state=42
)

# Visualize the data in scatterplots
def scatter_plots(df, palette):
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    

    ax1 = plt.subplot(1,2,1)
    sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'labels', palette=palette)
    plt.title('Clusters')
    
    ax2 = plt.subplot(1,2,2)
    sns.scatterplot(ax = ax2, data=df, x='x', y='y')
    plt.title('How the model sees the data during training')

palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
df = pd.DataFrame(features, columns=['x', 'y'])
df['labels'] = labels
scatter_plots(df, palette)

Step #2: Preprocessing

There are some general things to keep in mind when preparing data for use with K-Means clustering:

Missing data and outliers: if we have missing entries in our data, we need to handle these, for example, by removing the records or filling the missing values with a mean or median. In addition, K-means is sensitive to outliers. Therefore, make sure that you eliminate outliers from the training data.
Encoding: K-Means can only deal with numeric values. So, either we map categorical variables to integer values or use one-hot encoding to create separate binary variables.
Dimensionality reduction: In general, having too many variables in a dataset can negatively affect the performance of clustering algorithms. A good practice is to keep the number of variables below 30, for example, by using techniques for dimensionality reduction such as Principal Component Analysis (PCA).
Scaling: K-Means relies on distances, so features with larger value ranges dominate the result. It is therefore important to scale the data before clustering.
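
To see why scaling matters for a distance-based method, consider two features on very different scales. The sketch below standardizes them by hand, which is the same computation StandardScaler performs (subtract the column mean, divide by the column standard deviation). The age and income values are made up for illustration.

```python
import numpy as np

# Two hypothetical features on very different scales: age (years) and income (USD)
X = np.array([
    [25, 30_000.0],
    [35, 32_000.0],
    [45, 90_000.0],
    [55, 95_000.0],
])

# Standardize each column to mean 0 and standard deviation 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Distance between the first two rows before and after scaling
raw_distance = np.linalg.norm(X[0] - X[1])
scaled_distance = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"Raw distance: {raw_distance:.2f}")
print(f"Scaled distance: {scaled_distance:.2f}")
print("Per-feature difference after scaling:", np.abs(X_scaled[0] - X_scaled[1]).round(3))
```

Before scaling, the distance between the two rows is driven almost entirely by the income difference of 2,000, and the ten-year age gap is practically invisible. After scaling, both features contribute on a comparable footing.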

Our synthetic dataset is free of outliers or missing values. Therefore, we only need to scale the data. In addition, we separate the class labels of the clusters from the training set and split the data into a train- and a test dataset.

# Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

X = scaled_features #Training data
y = labels #Prediction label

# Split the data into train and test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

Step #3: Train a k-Means Clustering Model

Once we have prepared the data, we can begin the cluster analysis by training a K-means model. Our model uses the k-means algorithm from the Python scikit-learn library. We have various options to configure the clustering process:

n_clusters: The number of clusters we expect in the data.
n_init: The number of times k-means will run with different initial centroids. The algorithm returns the best model.
max_iter: The maximum number of iterations for a single run.

We expect three clusters and configure the algorithm to perform ten runs with different initial centroids, each limited to 300 iterations.
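
The difference between n_init and max_iter is easiest to see by comparing several single-initialization runs with one multi-initialization run. This is a small sketch on synthetic blobs. The seeds are arbitrary, and on easy data like this most runs may well end at the same optimum.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=2.75, random_state=42)

# n_init=1: one run per seed, the final inertia depends on the starting centroids
single_runs = [
    KMeans(n_clusters=3, init="random", n_init=1, max_iter=300, random_state=seed).fit(X).inertia_
    for seed in range(10)
]
print(f"Single runs: best={min(single_runs):.1f}, worst={max(single_runs):.1f}")

# n_init=10: ten runs inside one fit call, only the best one is kept
multi = KMeans(n_clusters=3, init="random", n_init=10, max_iter=300, random_state=0).fit(X)
print(f"n_init=10: inertia={multi.inertia_:.1f}, iterations of best run={multi.n_iter_}")
```

max_iter only caps the number of update steps within a single run, while n_init controls how many independent runs compete for the lowest inertia.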

# Create a k-means model with n clusters
kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)

# fit the model
kmeans.fit(X_train)
print(f'Converged after {kmeans.n_iter_} iterations')

Converged after 4 iterations

Our model has already converged after four iterations. Next, we will look at the results.

Step #4: Make and Visualize Predictions

Next, we’ll be delving into the practical aspect of our Python tutorial by analyzing the performance of our trained K-Means model using synthetic data.

In the following Python code, we:

Extract the cluster centers from the trained K-Means model and unscale them.
Add the predicted and actual labels to our unscaled training data.
Define a function, scatter_plots(), that creates two scatter plots: one for the predicted labels and another for the actual labels.
Call this function, passing the training data, cluster centers, and a color palette as arguments.

Please note:

We use a dictionary as a color palette to differentiate between clusters.
The colors between the two plots may not match due to K-Means assigning numbers to clusters without prior knowledge of initial labels.

# Get the cluster centers from the trained K-means model
cluster_center = scaler.inverse_transform(kmeans.cluster_centers_)
df_cluster_centers = pd.DataFrame(cluster_center, columns=['x', 'y'])

# Unscale the predictions
X_train_unscaled = scaler.inverse_transform(X_train)
df_train = pd.DataFrame(X_train_unscaled, columns=['x', 'y'])
df_train['pred_label'] = kmeans.labels_
df_train['true_label'] = y_train

def scatter_plots(df, cc, palette):
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    
    # Print the predictions    
    ax2 = plt.subplot(1,2,1)
    sns.scatterplot(ax = ax2, data=df, x='x', y='y', hue='pred_label', palette=palette)
    sns.scatterplot(ax = ax2, data=cc, x='x', y='y', color='r', marker="X")
    plt.title('Predicted Labels')

    # Print the actual values
    ax1 = plt.subplot(1,2,2)
    sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'true_label', palette=palette)
    sns.scatterplot(ax = ax1, data=cc, x='x', y='y', color='r', marker="X")
    plt.title('Actual Labels')


# The colors between the two plots may not match.
# This is because K-means does not know the initial labels and assigns numbers to clusters 
palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
scatter_plots(df_train, df_cluster_centers, palette)

The scatterplot above shows that K-Means found the three clusters. As a side note, the colors between the two plots do not match because K-means does not know the initial labels and assigns numbers to clusters.

Step #5: Measuring Model Performance

Next, we measure the performance of our clustering model. K-means is an unsupervised machine learning algorithm, which means that it is used to cluster data points into groups based on their similarity, without the use of labeled training data. However, we can compare the cluster assignments to the labels attached to the data and see if the model can predict them correctly. In this way, we can use traditional classification metrics such as accuracy and f1_score to measure the performance of our clustering model.

First, we unify the cluster labels as A, B, and C.

# K-Means numbers its clusters arbitrarily: here, predicted cluster 0 corresponds to true cluster 1 and vice versa.
# We therefore map both label sets to the common names A, B, and C.
df_eval = df_train.copy()
df_eval['true_label'] = df_eval['true_label'].map({0:'A', 1:'B', 2:'C'})
df_eval['pred_label'] = df_eval['pred_label'].map({0:'B', 1:'A', 2:'C'})
df_eval.head(10)

	x	y	pred_label	true_label
0	-9.007547	-10.302910	C	C
1	1.009238	7.009681	A	B
2	-6.565501	-6.466780	C	C
3	2.389772	7.727235	B	B
4	-5.422666	-2.915796	C	C
5	-12.024305	-7.846772	C	C
6	-4.006250	9.319323	A	A
7	-6.297788	6.435267	A	A
8	2.169238	3.325947	B	B
9	-5.140506	-4.205585	C	C
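
The mapping between predicted and true labels was set by hand above, and it can change from run to run. A small helper can find the best mapping automatically. The sketch below simply tries all permutations, which is fine for a handful of clusters. The function name align_labels and the toy labels are assumptions, and the helper expects at most as many predicted clusters as true classes.

```python
from itertools import permutations


def align_labels(y_true, y_pred):
    """Find the relabeling of predicted cluster ids that best matches the true labels."""
    true_ids = sorted(set(y_true))
    pred_ids = sorted(set(y_pred))
    best_mapping, best_hits = None, -1
    # Try every assignment of predicted ids to true ids and keep the best one
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for p, t in zip(y_pred, y_true))
        if hits > best_hits:
            best_mapping, best_hits = mapping, hits
    return best_mapping


y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2, 0]

mapping = align_labels(y_true, y_pred)
print(mapping)  # {0: 1, 1: 0, 2: 2}
print([mapping[p] for p in y_pred])  # [0, 0, 1, 1, 2, 2, 1]
```

For larger numbers of clusters, trying all permutations becomes too slow. In that case, the Hungarian algorithm, available as scipy.optimize.linear_sum_assignment, applied to the confusion matrix is the usual alternative.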

It is a common practice to create scatterplots on the predictions to visually verify the quality of the clusters and their decision boundaries. The following scatter plot shows the correctly assigned values and where our model was wrong.

# Scatterplot on correctly and falsely classified values
df_eval.loc[df_eval['pred_label'] == df_eval['true_label'], 'Correctly classified?'] = 'True' 
df_eval.loc[df_eval['pred_label'] != df_eval['true_label'], 'Correctly classified?'] = 'False' 

plt.rcParams["figure.figsize"] = (10,8)
sns.scatterplot(data=df_eval, x='x', y='y', color='r', hue='Correctly classified?')

The K-Means model correctly assigned most data points (blue) to their actual cluster. The few misclassified points (orange) are located at the decision boundaries between two clusters.

# Create a confusion matrix
from sklearn.metrics import f1_score

def evaluate_results(model, y_true, y_pred, class_names):
    tick_marks = [0.5, 1.5, 2.5]
    
    # Compute the classification report, the confusion matrix, and the macro F1 score
    fig, ax = plt.subplots(figsize=(10, 6))
    results_log = classification_report(y_true, y_pred, target_names=class_names, output_dict=True)
    results_df_log = pd.DataFrame(results_log).transpose()
    matrix = confusion_matrix(y_true,  y_pred)
    model_score = f1_score(y_true, y_pred, average='macro')
    
    # Plot the confusion matrix as a heatmap
    sns.heatmap(pd.DataFrame(matrix), annot=True, fmt="d", linewidths=.5, cmap="YlGnBu")
    plt.xlabel('Predicted label'); plt.ylabel('Actual label')
    plt.title('Confusion Matrix on the Training Data')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
    print(results_df_log)
    print(f'Macro F1 score: {model_score:.3f}')


y_true = df_eval['true_label']
y_pred = df_eval['pred_label']
class_names = ['A', 'B', 'C']
evaluate_results(kmeans, y_true, y_pred, class_names)

As we can see, the model has done a good job of assigning the data points to their actual classes.

Summary

K-Means clustering is a versatile unsupervised learning algorithm, capable of parsing intricate datasets into discrete, non-overlapping clusters. With this guide, you should have a clear understanding of how this algorithm operates, as well as its unique strengths and potential limitations. Remember, k-Means excels when dealing with spherical clusters but may struggle with clusters of more complex shapes.

We walked through how to implement the k-Means algorithm in Python, using it to identify spherical clusters within synthetic data. Moreover, we explored different methods to evaluate and visualize a clustering model’s performance, providing you with valuable tools to effectively analyze and group your data.

Whether it’s anomaly detection in time-series data or customer segmentation, k-Means can be a powerful tool in your data science arsenal. If you’re interested in anomaly detection, feel free to explore my recent article on the subject, written with Python enthusiasts in mind. Armed with this knowledge, you’re ready to embark on your own data exploration journey using k-Means clustering.

If you have any questions, please ask them in the comments.

Did you know that customer segmentation is an area where real-world data can be prone to bias and unfairness? If you’re concerned that your models may reflect the same bias, check out our latest article on addressing fairness in machine learning with fairlearn.

Sources and Further Reading

Books on Clustering

“Data Clustering: Algorithms and Applications” by Charu C. Aggarwal: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.
“Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.

Color-Coded Cryptocurrency Price Charts in Python

Florian Follonier — Tue, 19 Jan 2021 21:03:16 +0000

Are you intrigued by the fascinating world of cryptocurrency and looking to visually decipher its price trends? Welcome aboard! In this comprehensive tutorial, we will explore creating color-coded line charts using Python and Matplotlib, a powerful tool for effective analysis of changes along a third dimension.

The past few years have witnessed a meteoric rise in the prices of cryptocurrencies, underscoring the need for accurate analysis and visualization of their price trends. An outstanding illustration of this is the color-coded Bitcoin stock-to-flow chart, a popular choice in the crypto space that uses color differentiation to denote time until the next Bitcoin halving event.

Drawing inspiration from this, our tutorial will guide you to create a similar dynamic color-coded line chart, tracing the price trends of two leading cryptocurrencies – Bitcoin and Ethereum. This visual aid will provide a deeper insight into their price trajectories over time, enabling you to make informed investment decisions.

As we dive in, we’ll break down the process into digestible chunks, making it easier for beginners to follow along. By the end of this tutorial, you’ll not only have a profound understanding of how to create and interpret such color-coded charts but also gain valuable insights into the world of cryptocurrency price trends.

What are Color-coded Price Charts?

Color coding is beneficial for visualizing trading signals and statistical indicators in technical chart analysis. The idea of color-coding in chart analysis is to create visually comprehensible charts that let the user quickly interpret how price develops under certain conditions. A simple example is a candlestick chart, which uses color to signal whether the price moves up (green) or down (red). Candlestick charts convey more than regular line charts because they provide additional information on the opening and closing prices.

We can use color coding in line plots to visualize conditions of various types. We can derive them from the price itself and, for example, illustrate how the price develops in dependence on oscillator indicators or moving averages. Or they can be independent of the price and represent some other condition, such as the spread of COVID-19 cases worldwide. These are just a few examples, and there are no limits to your creativity in choosing the conditions.
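
Technically, a color-coded line is a sequence of short line segments, each drawn in the color that corresponds to the condition at that point. One way to build this in Matplotlib is a LineCollection. The sketch below uses a synthetic price curve and a synthetic condition, both made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection

x = np.linspace(0, 10, 200)
price = 100 + 10 * np.sin(x) + x   # synthetic "price" curve
condition = np.cos(x)              # synthetic third dimension that drives the color

# Build one short segment per pair of neighbouring points: shape (n - 1, 2, 2)
points = np.column_stack([x, price]).reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)

lc = LineCollection(segments, cmap="plasma")
lc.set_array(condition[:-1])       # one color value per segment
lc.set_linewidth(2)

fig, ax = plt.subplots(figsize=(10, 4))
ax.add_collection(lc)
ax.autoscale()                     # collections do not always set the axis limits themselves
fig.colorbar(lc, ax=ax, label="condition")
```

The same pattern works for any condition that has one value per price point, for example an indicator value or the number of days until an event.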

Also: Requesting Crypto Price Data from the Gate.io REST API in Python

Use Cases for Color-coded Price Charts

There are various use cases for color-coded line plots in the crypto space. For example, crypto enthusiasts employ them to visualize relationships between the price of bitcoin and statistical indicators, including momentum indicators such as the RSI. Color-coded line plots have also been used to show dependencies between price and specific events that develop parallel to the bitcoin price. For example, we can use color-coding to highlight the lag between the price and the bitcoin halving every four years.

The Stock to Flow Model is an example of a Color-coded price chart (Source: lookintobitcoin.com)

Implementing Color-coded Price Charts in Python

Are you ready to elevate your data visualization skills and create visually striking price charts with Python? In this tutorial, we’ll be walking you through the creation of two dynamic line charts that use color to reveal intriguing trends and patterns. The first chart will feature a color overlay on the price line to showcase how Bitcoin prices fluctuate based on RSI. The second chart will unveil the shifting correlation between Bitcoin and Ethereum over time. Buckle up, and let’s dive in!

We’ll start by using the Coinbase Pro API to download historical price data on BTC and ETH. We’ll then calculate two well-established indicators in financial analysis: the Relative Strength Index (RSI) and the Pearson Correlation between Bitcoin and Ethereum. Finally, we’ll use Matplotlib to create stunning color-coded line charts that highlight the changes in the indicators over extended periods.

Also: Geographic Heat Maps with GeoPandas: Visualizing COVID-19

The code is available on the GitHub repository.


Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, you can follow this tutorial to set up the Anaconda environment. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, Matplotlib, and SciPy.

In addition, we will be using the Historic-Crypto Python Package, which lets us easily interact with the Coinbase Pro API.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Step #1 Load the Price Data via the Coinbase API

We begin by downloading the historical price data on Bitcoin (BTC-USD) and Ethereum (ETH-USD) from Coinbase Pro. Don’t worry; you don’t need to download the data manually. Instead, we will use the Historic_Crypto Python package to access the data via an API.

Accessing the data via the Coinbase Pro API requires us to specify several API parameters. We define a frequency of 21600 seconds so that we will obtain price points on a 6-hour basis. In addition, we define a from_date of “2017-01-01” and add “ETH-USD” and “BTC-USD” to a list of coins for which we want to obtain the historical price data.

We query the API separately for each of the two coins in our coin list. Depending on your internet connection, this can take several minutes. From the response, we use three price values per 6-hour interval:

high: the highest price in the interval
low: the lowest price in the interval
close: the closing price of the interval

Later in this article, we will require all three variables to calculate the indicator values. We will therefore add the variables as columns to a new dataframe.

# Tested with Python 3.8.8, Matplotlib 3.5, Seaborn 0.11.1, numpy 1.19.5

from Historic_Crypto import HistoricalData
import pandas as pd 
from scipy.stats import pearsonr
import matplotlib.pyplot as plt 
import matplotlib.colors as col 
import numpy as np 
import datetime

# the price frequency in seconds: 21600 = 6 hour price data, 86400 = daily price data
frequency = 21600

# The beginning of the period for which prices will be retrieved
from_date = '2017-01-01-00-00'
# The currency price pairs for which the data will be retrieved
coinlist = ['ETH-USD', 'BTC-USD']

# Query the data
for i in range(len(coinlist)):
    coinname = coinlist[i]
    pricedata = HistoricalData(coinname, frequency, from_date).retrieve_data()
    pricedf = pricedata[['close', 'low', 'high']]
    if i == 0:
        df = pd.DataFrame(pricedf.copy())
    else:
        df = pd.merge(left=df, right=pricedf, how='left', left_index=True, right_index=True)   
    df.rename(columns={"close": "close-" + coinname}, inplace=True)
    df.rename(columns={"low": "low-" + coinname}, inplace=True)
    df.rename(columns={"high": "high-" + coinname}, inplace=True)
df.head()

			time	close-ETH-USD	low-ETH-USD	high-ETH-USD	close-BTC-USD	low-BTC-USD	high-BTC-USD
2017-01-01 06:00:00	8.23			8.16		8.49			975.00			964.54		975.00
2017-01-01 12:00:00	8.33			8.20		8.44			994.42			974.01		994.97
2017-01-01 18:00:00	8.18			8.08		8.37			992.95			986.86		1000.00
2017-01-02 00:00:00	8.13			8.05		8.22			1003.64			990.52		1012.00
2017-01-02 06:00:00	8.10			8.09		8.20			1024.84			1002.92		102

Step #2 Visualizing the Time Series

At this point, we have a dataframe that contains the close, low, and high prices for BTC-USD and ETH-USD. Next, let’s take a quick look at the data:

# Create a Price Chart on BTC and ETH
x = df.index
fig, ax1 = plt.subplots(figsize=(16, 8), sharex=False)

# Price Chart for BTC-USD Close
color = 'tab:blue'
y = df['close-BTC-USD']
ax1.set_xlabel('Date')
ax1.set_ylabel('BTC-Close in $', color=color, fontsize=18)
ax1.plot(x, y, color=color)
ax1.tick_params(axis='y', labelcolor=color)
ax1.text(0.02, 0.95, 'BTC-USD',  transform=ax1.transAxes, color=color, fontsize=16)

# Price Chart for ETH-USD Close
color = 'tab:red'
y = df['close-ETH-USD']
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.set_ylabel('ETH-Close in $', color=color, fontsize=18)  # we already handled the x-label with ax1
ax2.plot(x, y, color=color)
ax2.tick_params(axis='y', labelcolor=color)
ax2.text(0.02, 0.9, 'ETH-USD',  transform=ax2.transAxes, color=color, fontsize=16)

Next, we add two indicator values to our dataframe that we can later use to color the price chart.

Step #3 Calculate Indicator Values

The color overlay of the price chart is typically used to illustrate the relation between price and another variable, such as a statistical indicator. To demonstrate how this works, we will calculate two indicators and add them to our dataframe:

3.1 The Relative Strength Index

The Relative Strength Index (RSI) is a momentum indicator that signals the strength of a price trend. Its values range from 0 to 100. A value above 70 is commonly read as a sign that an asset is overbought: the market is highly bullish, and the price might decline. A value below 30 is typically read as a sign of an oversold condition: the market is extremely bearish, and the price tends to reverse upward.
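For reference, the conventional RSI can be sketched in a few lines of pandas. This is a minimal sketch that assumes Wilder's smoothing (an exponential average with alpha = 1/period) and the common default of 14 periods. The function name `wilder_rsi` and the toy price series are made up for illustration and are not part of the tutorial code.

```python
import pandas as pd

def wilder_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Conventional RSI with Wilder's smoothing (hypothetical helper)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # positive price changes, otherwise 0
    loss = (-delta).clip(lower=0)    # size of negative price changes, otherwise 0
    avg_gain = gain.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss         # relative strength
    return 100 - 100 / (1 + rs)

# Toy series: one that only rises and one that only falls
rising = pd.Series(range(1, 41), dtype=float)
falling = rising[::-1].reset_index(drop=True)
rsi_rising = wilder_rsi(rising)
rsi_falling = wilder_rsi(falling)
```

On these toy inputs, the steadily rising series pushes the RSI to the top of its range and the steadily falling series pushes it to the bottom, which matches the overbought and oversold reading described above.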

3.2 The Pearson Correlation Coefficient

The Pearson correlation coefficient measures the strength of the linear relationship between two variables. Its values range from -1 to 1. A value of 1 implies a perfect positive linear relationship: whenever the price of BTC rises, the price of ETH rises by a proportional amount. A value of -1 implies a perfect inverse relationship: whenever the BTC price rises, the ETH price falls by a proportional amount. A value of 0 implies no linear correlation. To learn more about correlation, check out my article about correlation in Python.
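The two extreme cases can be reproduced with toy data. The sketch below uses pandas' built-in rolling correlation on a made-up random-walk series; the variable names and the 30-row window are assumptions for illustration, not the tutorial's own data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
btc = pd.Series(rng.normal(size=200)).cumsum() + 100   # toy random-walk "price"
eth_same = 0.05 * btc + 2        # exact linear function of btc
eth_inverse = -0.05 * btc + 20   # exact inverse linear function of btc

window = 30
corr_same = btc.rolling(window).corr(eth_same)        # rolling Pearson correlation
corr_inverse = btc.rolling(window).corr(eth_inverse)
```

Once the first window is complete, `corr_same` stays at 1 and `corr_inverse` stays at -1 (up to floating-point error). Real price pairs fall somewhere in between and change over time.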

We embed the logic for calculating the two indicators in a separate function called “add_indicators.”

def add_indicators(df):
    # Rolling Pearson correlation between the BTC and ETH closing prices.
    # The window is 30 rows; with 6-hour data, that corresponds to 7.5 days.
    cor_period = 30
    columntobeadded = [0] * cor_period  # no correlation for the first 30 rows
    df = df.fillna(0)
    for i in range(len(df) - cor_period):
        btc = df['close-BTC-USD'].iloc[i:i+cor_period]
        eth = df['close-ETH-USD'].iloc[i:i+cor_period]
        corr, _ = pearsonr(btc, eth)
        columntobeadded.append(corr)
    # insert the correlation values into the dataframe
    df.insert(2, "P_Correlation", columntobeadded, True)

    # Rolling lows and highs of the BTC price over different periods
    # (the column names say "MA", but these are rolling minima and maxima)
    df['MA200_low'] = df['low-BTC-USD'].rolling(window=200).min()
    df['MA14_low'] = df['low-BTC-USD'].rolling(window=14).min()
    df['MA200_high'] = df['high-BTC-USD'].rolling(window=200).max()
    df['MA14_high'] = df['high-BTC-USD'].rolling(window=14).max()

    # Momentum oscillator stored in the 'RSI' column.
    # Note: this formula is the stochastic oscillator (%K, smoothed over 3 rows),
    # not the classic RSI calculation; it also ranges from 0 to 100.
    df['K-ratio'] = 100*((df['close-BTC-USD'] - df['MA14_low']) / (df['MA14_high'] - df['MA14_low']) )
    df['RSI'] = df['K-ratio'].rolling(window=3).mean()

    # Replace the missing values at the start of the series with 0
    df.fillna(0, inplace=True)
    return df

dfcr = add_indicators(df)

At this point, we have added the RSI and the Correlation Coefficient to our dataframe. Let’s quickly visualize the two indicators in a line chart.

# Visualize both indicators
fig, ax1 = plt.subplots(figsize=(22, 4))
x = dfcr.index
ax1.plot(x, dfcr['P_Correlation'], color='black')
ax1.set_ylabel('ETH-BTC Price Correlation', color='black')
ax1.tick_params(axis='y', labelcolor='black')
ax2 = ax1.twinx()  # second y-axis for the RSI
ax2.plot(x, dfcr['RSI'], color='blue')
ax2.set_ylabel('RSI', color='blue')
ax2.tick_params(axis='y', labelcolor='blue')

plt.show()

You may have noticed that both indicators remain at 0 at the beginning of the time series. This is expected: both indicators are calculated over a trailing window, so no values are available until the first window is complete, and the code fills these gaps with 0.
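The warm-up effect is easy to see on a small example. The sketch below uses a made-up series of 20 prices and a 14-row window; the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

prices = pd.Series(np.arange(1.0, 21.0))        # 20 toy prices
window = 14
rolling_low = prices.rolling(window=window).min()

n_missing = int(rolling_low.isna().sum())       # rows before the first full window
first_valid = rolling_low.first_valid_index()   # position of the first real value
zero_filled = rolling_low.fillna(0)             # the zero-fill used in the tutorial code
```

A rolling window of length 14 leaves the first 13 rows empty. Filling them with 0 keeps the dataframe free of missing values, at the cost of showing artificial zeros at the start of the chart.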

Step #4 Converting Indicator Values to Color Codes

Before creating the price charts, we have to color code the indicator values. We normalize the values and then assign a color to each indicator value using a color scale. We attach the colors to our existing dataframe to quickly access them when creating the plots.
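As a small illustration of the two steps, the sketch below normalizes a handful of made-up indicator values and looks up their colors in one vectorized call. The values are invented; `Normalize` and `get_cmap` are the same Matplotlib tools used in the tutorial code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import matplotlib.colors as col

values = np.array([10.0, 25.0, 40.0, 70.0, 90.0])   # made-up indicator values
norm = col.Normalize(vmin=values.min(), vmax=values.max())
colormap = plt.get_cmap('plasma')

scaled = norm(values)      # linear rescaling to the range [0, 1]
rgba = colormap(scaled)    # one (R, G, B, A) row per indicator value
```

The smallest value ends up at the dark end of the colormap and the largest at the bright end. Passing the whole array at once gives the same colors as looping over the values one by one.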

# function that converts a given set of indicator values to colors
def get_colors(ind, colormap):
    colorlist = []
    norm = col.Normalize(vmin=ind.min(), vmax=ind.max())
    for i in ind:
        colorlist.append(list(colormap(norm(i))))
    return colorlist

# convert the RSI                         
y = np.array(dfcr['RSI'])
colormap = plt.get_cmap('plasma')
dfcr['rsi_colors'] = get_colors(y, colormap)

# convert the Pearson Correlation
y = np.array(dfcr['P_Correlation'])
colormap = plt.get_cmap('plasma')
dfcr['cor_colors'] = get_colors(y, colormap)

In our dataframe, two additional columns contain the color values for the two indicators. Now that we have all the data in our dataframe, the next step is creating the price charts.

Step #5 Creating Color-Coded Price Charts

Next, we use the color values to create two different color-coded price charts.

5.1 Bitcoin Price Chart Colored by RSI

We color the chart by the RSI. High RSI values are yellow, and low values are dark blue. Running the code below will create the color-coded Bitcoin chart.

# Create a Price Chart
pd.plotting.register_matplotlib_converters()
fig, ax1 = plt.subplots(figsize=(18, 10))
x = dfcr.index
y = dfcr['close-BTC-USD']
z = dfcr['rsi_colors']

# draw one point per price, colored by its RSI value
ax1.scatter(x, y, c=np.array(z.tolist()), alpha=0.5, s=25)
ax1.set_ylabel('BTC-Close in $')
ax1.tick_params(axis='y', labelcolor='black')
ax1.set_xlabel('Date')
ax1.text(0.02, 0.95, 'BTC-USD - Colored by RSI',  transform=ax1.transAxes, fontsize=16)

# plot the color bar with the same colormap and value range used in get_colors
rsi_norm = col.Normalize(vmin=dfcr['RSI'].min(), vmax=dfcr['RSI'].max())
cb = fig.colorbar(plt.cm.ScalarMappable(norm=rsi_norm, cmap='plasma'), ax=ax1)

From the color overlay in the chart, we can tell that the RSI is mostly low (dark blue) when the Bitcoin price has declined substantially and high (yellow) when the price has risen.

5.2 Bitcoin Price Chart colored by BTC-ETH Correlation

In this section, we will create another price chart for Bitcoin. This time, we color the price by the strength of the correlation between Bitcoin and Ethereum. Light-colored points signal phases of strong correlation. Points colored in dark blue indicate phases where the correlation between the price movements of the two cryptocurrencies was weak or negative.

# create a price chart
pd.plotting.register_matplotlib_converters()
fig, ax1 = plt.subplots(figsize=(18, 10))
x = dfcr.index # datetime index
y = dfcr['close-BTC-USD'] # the price variable
z = dfcr['cor_colors'] # the color coded indicator values

# draw one point per price, colored by the rolling correlation
ax1.scatter(x, y, c=np.array(z.tolist()), alpha=0.5, s=25)
ax1.set_ylabel('BTC-Close in $')
ax1.tick_params(axis='y', labelcolor='black')
ax1.set_xlabel('Date')
ax1.text(0.02, 0.95, 'BTC-USD - Colored by ETH-BTC Correlation',  transform=ax1.transAxes, fontsize=16)

# plot the color bar with the same colormap and value range used in get_colors
cor_norm = col.Normalize(vmin=dfcr['P_Correlation'].min(), vmax=dfcr['P_Correlation'].max())
cb = fig.colorbar(plt.cm.ScalarMappable(norm=cor_norm, cmap='plasma'), ax=ax1)

The chart shows that the correlation between Bitcoin and Ethereum was strong (yellow) when the price of Bitcoin rose. So when Bitcoin is in a bull market, Ethereum tends to move in the same direction. In contrast, the correlation was weak (dark blue) when the Bitcoin price declined.
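The charts in this step draw one colored marker per price point. If a continuous colored line is preferred, Matplotlib's `LineCollection` is one option. The following is a minimal sketch on made-up data, not the tutorial's method: the price curve, the indicator, and the numeric time axis are all invented for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

t = np.linspace(0, 10, 200)        # stand-in for the time axis
price = 100 + 10 * np.sin(t)       # made-up price curve
indicator = np.cos(t)              # made-up indicator in the range [-1, 1]

# Build one short line segment between each pair of neighboring points
points = np.column_stack([t, price]).reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)

fig, ax = plt.subplots(figsize=(10, 4))
lc = LineCollection(segments, cmap='plasma', norm=plt.Normalize(-1, 1))
lc.set_array(indicator[:-1])       # color each segment by the indicator at its start
ax.add_collection(lc)
ax.autoscale()                     # add_collection does not rescale the axes by itself
fig.colorbar(lc, ax=ax, label='indicator')
```

With a datetime index, the x values would first have to be converted to numbers, for example with `matplotlib.dates.date2num`.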

Summary

In this article, we demonstrated how to use Python and Matplotlib to create a price chart that incorporates color as a third dimension. We used the Bitcoin price as an example and created two color-coded charts: one that highlights the RSI, and another that highlights the Pearson Correlation between Bitcoin and Ethereum.

By using color as an overlay, it is possible to highlight many interesting relationships in time-series data. A well-known example from the cryptocurrency world is the Bitcoin Rainbow Chart. This technique can be used to bring attention to various trends and patterns in the data.

I hope this article has helped to bring you closer to charts in Python. I am always interested to receive feedback from my audience. So, let me know if you liked this content, and if you have any questions, please post them in the comments.


The post Color-Coded Cryptocurrency Price Charts in Python appeared first on relataly.com.

Correlation Matrix in Python: How Correlated are COVID-19 Cases and Different Financial Assets?

Florian Follonier — Sun, 05 Apr 2020 16:08:00 +0000

Correlation analysis is a powerful tool in financial market analysis, helping investors to better understand the interdependence of different assets. But what happens when an unprecedented global pandemic like COVID-19 shakes up the market? In this tutorial, we will show you how to create a correlation matrix in Python that will help you visualize the relationship between COVID-19 and various financial assets.

First, we will delve into the nitty-gritty of correlation coefficients and how to interpret them. We’ll focus specifically on the Pearson Correlation Coefficient, a popular measure used to quantify the strength of the relationship between two variables.

Next, we’ll dive right into the practical part of this tutorial and create a stock market correlation matrix in Python. Our matrix will measure the correlation between COVID-19 cases and various financial assets such as gold, Bitcoin, and other popular investments. With this matrix, investors can identify the extent to which COVID-19 has impacted different asset classes and make more informed investment decisions.

Whether you’re a seasoned investor or just starting out, this tutorial will equip you with the knowledge and tools you need to analyze the correlation between COVID-19 and financial markets. So, let’s dive in and start exploring the fascinating world of correlation analysis!

Also: Color-coded Cryptocurrency Price Charts with Python

A correlation matrix, as we will create it in this article.

Different Types of Correlation Coefficients

There are various types of correlation coefficients used to measure the strength and direction of the relationship between two variables. The most common is the Pearson correlation coefficient, which measures the linear relationship between two variables. This is the correlation coefficient on which we will focus in this article. However, if the relationship between two variables is more complex, other coefficients are a better choice for the analysis.

For example, in situations where the data is not normally distributed or when there are outliers, the Spearman correlation coefficient is used. This coefficient measures the relationship between two variables using ranks instead of the actual data. It is also known as the rank correlation coefficient. For ordinal data, the Kendall correlation coefficient is used. This coefficient measures the strength and direction of the relationship between two variables, taking into account the order of the data points. Finally, the Point-Biserial and Biserial correlation coefficients are used when one variable is dichotomous and the other is continuous. These coefficients measure the strength and direction of the relationship between these variables.
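To illustrate the difference between the Pearson and the rank-based Spearman coefficient, the sketch below uses a made-up relationship that is strictly increasing but far from linear. Spearman is computed here as the Pearson correlation of the ranks, which follows directly from its definition; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(0, 5, 50))
y = np.exp(x)                          # strictly increasing, but not linear

pearson_r = x.corr(y)                  # Pearson on the raw values
spearman_r = x.rank().corr(y.rank())   # Pearson on the ranks = Spearman
```

Because the ordering of the points is identical in both variables, the Spearman coefficient is 1, while the Pearson coefficient stays clearly below 1 because the relationship is curved.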

Let’s take a closer look at Pearson Correlation.

Pearson Correlation

The Pearson correlation coefficient r is a standard measure for quantifying a linear relationship between two variables. In other words, r is a measure of how strongly two continuous variables (for example, price or volume) tend to make similar changes. For the Pearson correlation coefficient to return a meaningful value, the following conditions must be met:

Both variables, x and y, are metrically scaled and continuous.
The relationship between the two variables is approximately linear.
The observation pairs (x, y) are independent of each other.

Correlation measures how strongly two variables are associated. The Pearson correlation is calculated by dividing the covariance of the two variables (x, y) by the product of their standard deviations.

\[r = \frac{s_{xy}}{s_x \, s_y} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}\]
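The formula can be checked numerically. The sketch below evaluates the right-hand side term by term on a few made-up values and compares the result with NumPy's `corrcoef`; the data and variable names are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])   # made-up values, roughly y = 2x
n = len(x)

# Numerator: sum of products minus n times the product of the means
numerator = np.sum(x * y) - n * x.mean() * y.mean()
# Denominator: product of the two square-root terms
denominator = (np.sqrt(np.sum(x**2) - n * x.mean()**2)
               * np.sqrt(np.sum(y**2) - n * y.mean()**2))
r_formula = numerator / denominator

r_numpy = np.corrcoef(x, y)[0, 1]   # reference value from NumPy
```

Both values agree up to floating-point error, and since the toy data is almost perfectly linear, r is close to 1.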

Interpreting the Pearson Correlation Coefficient

The value of r is restricted to the range between -1 and 1. Interpreting r requires us to differentiate the following cases:

The closer r is to 1 or -1, the stronger the relationship is, and the better the points (xi, yi) fit the regression line.
The closer r is to 0, the weaker the correlation is, and the more widely the points are spread around the regression line.
The extreme cases r = 1 and r = -1 result from a functional relationship that can be described exactly by a linear equation y = a + b*x. In this case, all points (xi, yi) are located on the regression line.

Graphical representation of different correlation coefficients

Be aware that the correlation coefficient is often misinterpreted. For example, an empirical correlation coefficient greater than 0 merely states that the sample shows a relationship. It does not explain why this relationship exists. In addition, r ≈ 0 does not mean that the two variables are independent. It only means that we cannot show a linear relationship.
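A small made-up example shows why a coefficient near 0 does not imply independence. In the sketch below, y is completely determined by x, yet the relationship is not linear.

```python
import numpy as np

x = np.linspace(-1, 1, 201)   # values symmetric around 0
y = x ** 2                    # y is completely determined by x

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation of x and y
```

The coefficient is practically 0 even though knowing x tells us y exactly. The Pearson coefficient only captures the linear part of a relationship.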

Implementing a Correlation Matrix in Python

In the following, we’ll dig deep into the data and analyze the spread of COVID-19 cases and casualties. To create this correlation matrix, we’ll utilize the Pandas library, a fantastic tool for data analysis that enables us to work with data in a variety of formats.

First, we’ll load our data into a Pandas DataFrame, allowing us to manipulate and calculate correlations with ease. We’ll then use the corr() method to compute the correlation coefficients between the different asset classes and COVID-19. This generates a matrix that provides a clear view of the correlations between our variables.

To make this information more visually appealing, we’ll create a heatmap using the Seaborn library. This heatmap will enable us to easily identify which asset classes are strongly correlated with COVID-19 and which are not.

By creating a correlation matrix in Python, we can gain invaluable insights into the relationship between COVID-19 and the financial market. This knowledge can help us make informed investment decisions by identifying patterns and trends. So let’s dive in and create a correlation matrix that reveals the connection between COVID-19 and the financial market!

The code is available on the GitHub repository.


Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, and Matplotlib.

In addition, we will be using the pandas-DataReader package and Seaborn for visualization. You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Step #1 Load Data

We begin by loading data about historic COVID-19 cases and price information on different financial assets.

1.1 Load Historic COVID-19 Data

We begin by downloading the COVID-19 data. For this purpose, we will use the Statworx API. It provides historical time series data on the number of COVID-19 cases in different countries. In addition, the data contains the number of casualties. If you are not yet familiar with APIs, consider my recent tutorial on working with APIs in Python.

# A tutorial for this file is available at www.relataly.com

# Imports
import pandas as pd
import pandas_datareader as web
import numpy as np
from datetime import datetime
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import requests
import json
from pandas.plotting import register_matplotlib_converters

# Load the dataset with COVID-19 cases
payload = {"code": "ALL"}
URL = "https://api.statworx.com/covid"
response = requests.post(url=URL, data=json.dumps(payload))
df_covid = pd.DataFrame.from_dict(json.loads(response.text))
# df_covid = df_covid[df_covid['code'] == 'US']

# Add the date column as variable
df_covid["Date"] = pd.to_datetime(df_covid["date"])

# Delete some columns that we won't use
df_covid.drop(
    ["day", "month", "year", "country", "code", "population", "date"],
    axis=1,
    inplace=True,
)

# Summarize cases over all countries
df_covid = df_covid.groupby(["Date"]).sum()
df_covid.head()

			cases	deaths	cases_cum	deaths_cum
Date				
2019-12-31	27		0		27			0
2020-01-01	0		0		27			0
2020-01-02	0		0		27			0
2020-01-03	17		0		44			0
2020-01-04	0		0		44			0

1.2 Loading Data on Selected Financial Assets

We continue by downloading historical price data on different financial assets. For this purpose, we use the Yahoo Finance API. We limit the period to the time after the first documented COVID-19 cases. When you execute the code of this tutorial as it is, you will receive price information for the following financial assets:

Stock Market Indexes

S&P500
DAX
Nikkei 225 (N225)
S&P500 Futures

Stocks: Online Services

Amazon
Netflix
Apple
Google
Microsoft

Stocks: Airlines

Lufthansa Stock
American Airlines

Resource Futures

Crude Oil Price
Gold
Soybean Price

Treasury Bonds Futures

US Treasury Bonds

Exchange Rates

EUR-USD
CHF-EUR
GBP-USD
GBP-EUR

Crypto Currencies

BTC-USD
ETH-USD

Be aware that stock symbols can change from time to time. If the API does not find a specific stock symbol, you have to look up the current symbol on Yahoo Finance.

df_covid_new = df_covid.copy()

# Read the data for different assets
today_date = datetime.today().strftime("%Y-%m-%d")
start_date = "2020-01-01"
asset_dict = {
    "^GSPC": "SP500",
    "DAX": "DAX",
    "^N225": "N225",
    "ES=F": "SP500FutJune20",
    "LHA.DE": "Lufthansa",
    "AAL": "AmericanAirlines",
    "NFLX": "Netflix",
    "AMZN": "Amazon",
    "AAPL": "Apple",
    "MSFT": "Microsoft",
    "GOOG": "Google",
    "BTC-USD": "BTCUSD",
    "ETH-USD": "ETHUSD",
    "CL=F": "Oil",
    "GC=F": "Gold",
    #"SM=F": "Soybean",
    "ZB=F": "UsTreasuryBond",
    "GBPEUR=X": "GBPEUR",
    "EURUSD=X": "EURUSD",
    "CHFEUR=X": "CHFEUR",
    "GBPUSD=X": "GBPUSD"}

col_list = []
# Join the dataframes
for key, value in asset_dict.items():
    print(key, value)    
    try:
        df_temp = web.DataReader(
            key, start=start_date, end=today_date, data_source="yahoo")
    except ValueError: 
        print(f' {key} symbol not found')
        continue  # skip this asset instead of reusing the previous df_temp
    # convert index to Date Format
    df_temp.index = pd.to_datetime(df_temp.index) 
    df_temp.rename(columns={"Close": value}, inplace=True) # Rename Close Column       
    df_covid_new = pd.merge(
        left=df_covid_new,
        right=df_temp[value],
        how="inner",
        left_index=True, right_index=True)     

df_covid_new.head()

	cases	deaths	cases_cum	deaths_cum	SP500	DAX			N225		SP500FutJune20	Lufthansa		AmericanAirlines	...	Google		BTCUSD		ETHUSD		Oil	Gold	UsTreasuryBond	GBPEUR	EURUSD	CHFEUR	GBPUSD
Date																					
2020-01-06	0		0			59			0		3246.280029	28.004999	23204.859375	3243.50	15.340	27.320000			...	1394.209961	7769.219238	144.304153	63.270000	1566.199951		157.84375	1.17169	1.116196	0.922110	1.308010
2020-01-07	0		0			59			0		3237.179932	27.955000	23575.720703	3235.25	15.365	27.219999			...	1393.339966	8163.692383	143.543991	62.700001	1571.800049		157.40625	1.17635	1.119799	0.922212	1.317003
2020-01-08	0		0			59			0		3253.050049	28.260000	23204.759766	3260.25	15.540	27.840000			...	1404.319946	8079.862793	141.258133	59.610001	1557.400024		156.37500	1.17551	1.115474	0.925181	1.311372
2020-01-09	0		0			59			0		3274.699951	28.450001	23739.869141	3276.00	16.160	27.950001			...	1419.829956	7879.071289	138.979202	59.560001	1551.699951		156.81250	1.17912	1.111321	0.924505	1.310513
2020-01-10	0		0			59			0		3265.350098	28.500000	23850.570312	3264.75	15.815	27.320000			...	1429.729980	8166.554199	143.963776	59.040001	1557.500000		157.62500	1.17620	1.111111	0.924796	1.307019
5 rows × 24 columns

You can add assets of your choice to the asset list if you want. You can find the respective symbols on finance.yahoo.com.
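When adding assets, keep in mind that the merge above uses an inner join on the date index, so only dates that are present for every asset survive. The sketch below illustrates this with made-up data for one asset quoted every day and one quoted on weekdays only; the column names and values are invented.

```python
import pandas as pd

# Toy data: a crypto asset quoted every day, a stock quoted on weekdays only
all_days = pd.date_range("2020-01-06", periods=14, freq="D")
weekdays = all_days[all_days.dayofweek < 5]

crypto = pd.DataFrame({"BTCUSD": range(14)}, index=all_days)
stock = pd.DataFrame({"SP500": range(len(weekdays))}, index=weekdays)

# Inner join: keeps only dates that exist in both frames
inner = pd.merge(crypto, stock, how="inner", left_index=True, right_index=True)
# Left join: keeps all crypto dates and leaves gaps where the stock has no quote
left = pd.merge(crypto, stock, how="left", left_index=True, right_index=True)
```

The inner join drops the weekend rows, while the left join keeps them with missing stock values. For a correlation matrix, the inner join is the simpler choice because every row then has a value for every asset.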

Step #2 Exploring the Data

Next, we will visualize the historical data using line charts.

# Create lineplots
list_length = df_covid_new.shape[1]
ncols = 6
nrows = int(np.ceil(list_length / ncols))  # round up so that every column gets a subplot
height = list_length/3 if list_length > 30 else 16

fig, axs = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(20, height))

for i, ax in enumerate(fig.axes):
        if i < list_length:
            sns.lineplot(data=df_covid_new, x=df_covid_new.index, y=df_covid_new.iloc[:, i], ax=ax)
            ax.set_title(df_covid_new.columns[i])
            ax.tick_params(labelrotation=45)

plt.show()

We can easily spot pairs that seem to have experienced similar price developments. This does not mean, however, that these pairs are correlated.

Step #3 Correlation Matrix

Next, we will calculate the correlation matrix. Various Python libraries make this an easy task that only requires a few lines of code. We will use the corr() method of the Pandas DataFrame for this purpose.

# Plotting a diagonal correlation matrix
sns.set(style="white")

# Compute the pairwise Pearson correlation matrix
corr = df_covid_new.corr()
corr

					cases	deaths	cases_cum	deaths_cum	SP500	DAX	N225	SP500FutJune20	Lufthansa	AmericanAirlines	...	Google	BTCUSD	ETHUSD	Oil	Gold	UsTreasuryBond	GBPEUR	EURUSD	CHFEUR	GBPUSD
cases				1.000000	0.853512	0.972691	0.966481	0.663638	0.519676	0.660547	0.659832	-0.451801	-0.413463	...	0.796671	0.898456	0.899876	0.073393	0.719520	0.147347	-0.566227	0.843788	-0.538949	0.513913
deaths				0.853512	1.000000	0.778833	0.804270	0.399756	0.259080	0.400126	0.395697	-0.590251	-0.589090	...	0.567708	0.705201	0.718329	-0.228573	0.664476	0.399694	-0.574079	0.628463	-0.291254	0.245614
cases_cum			0.972691	0.778833	1.000000	0.974553	0.714616	0.571317	0.711905	0.711552	-0.379420	-0.325739	...	0.812816	0.922179	0.932026	0.142586	0.682001	0.059693	-0.516654	0.865846	-0.584541	0.584691
deaths_cum			0.966481	0.804270	0.974553	1.000000	0.712595	0.587606	0.681964	0.709312	-0.498761	-0.441631	...	0.808086	0.875724	0.925765	0.097746	0.805602	0.193165	-0.626253	0.902159	-0.603867	0.529622
SP500				0.663638	0.399756	0.714616	0.712595	1.000000	0.960100	0.956142	0.999766	0.140961	0.205127	...	0.944084	0.806056	0.801970	0.623960	0.553991	-0.359058	-0.043646	0.738902	-0.791377	0.853893
DAX					0.519676	0.259080	0.571317	0.587606	0.960100	1.000000	0.934535	0.960816	0.246646	0.304234	...	0.860881	0.678125	0.688038	0.715992	0.500840	-0.387279	-0.002362	0.685518	-0.844509	0.826270
N225				0.660547	0.400126	0.711905	0.681964	0.956142	0.934535	1.000000	0.956710	0.240638	0.281306	...	0.922091	0.829050	0.761729	0.655562	0.425364	-0.436453	-0.005655	0.673853	-0.790071	0.810057
SP500FutJune20		0.659832	0.395697	0.711552	0.709312	0.999766	0.960816	0.956710	1.000000	0.147155	0.211133	...	0.943475	0.804529	0.799886	0.627447	0.549565	-0.363198	-0.039701	0.736997	-0.792258	0.855152
Lufthansa			-0.451801	-0.590251	-0.379420	-0.498761	0.140961	0.246646	0.240638	0.147155	1.000000	0.964624	...	-0.006089	-0.135931	-0.296115	0.629831	-0.665533	-0.853762	0.815127	-0.388975	-0.107357	0.262015
AmericanAirlines	-0.413463	-0.589090	-0.325739	-0.441631	0.205127	0.304234	0.281306	0.211133	0.964624	1.000000	...	0.026610	-0.115151	-0.245080	0.658176	-0.603162	-0.877327	0.790366	-0.312451	-0.143469	0.330665
Netflix				0.750950	0.701806	0.721492	0.840104	0.601819	0.523924	0.493603	0.596449	-0.637187	-0.578967	...	0.672056	0.614683	0.749042	-0.027917	0.914606	0.438247	-0.652950	0.766065	-0.460004	0.338608
Amazon				0.801935	0.710040	0.776041	0.887487	0.669833	0.597223	0.564001	0.665996	-0.591990	-0.528531	...	0.732905	0.672651	0.809639	0.049571	0.936733	0.365580	-0.664869	0.848771	-0.562987	0.428907
Apple				0.840178	0.631516	0.862322	0.917166	0.843786	0.765495	0.750124	0.841533	-0.357089	-0.275023	...	0.860493	0.800416	0.906042	0.295665	0.851025	0.081060	-0.499164	0.927081	-0.719334	0.673724
Microsoft			0.772067	0.647593	0.751721	0.849898	0.792196	0.723458	0.689305	0.788468	-0.416892	-0.354098	...	0.833358	0.723236	0.819949	0.206249	0.871319	0.209342	-0.496853	0.807662	-0.598330	0.529434
Google				0.796671	0.567708	0.812816	0.808086	0.944084	0.860881	0.922091	0.943475	-0.006089	0.026610	...	1.000000	0.902355	0.866670	0.492750	0.593879	-0.219884	-0.174525	0.765421	-0.713271	0.770065
BTCUSD				0.898456	0.705201	0.922179	0.875724	0.806056	0.678125	0.829050	0.804529	-0.135931	-0.115151	...	0.902355	1.000000	0.942019	0.315591	0.568836	-0.099474	-0.285379	0.777073	-0.620303	0.685506
ETHUSD				0.899876	0.718329	0.932026	0.925765	0.801970	0.688038	0.761729	0.799886	-0.296115	-0.245080	...	0.866670	0.942019	1.000000	0.242502	0.740186	0.068097	-0.419289	0.886153	-0.644605	0.696074
Oil					0.073393	-0.228573	0.142586	0.097746	0.623960	0.715992	0.655562	0.627447	0.629831	0.658176	...	0.492750	0.315591	0.242502	1.000000	-0.035808	-0.685471	0.344647	0.261168	-0.615400	0.626496
Gold				0.719520	0.664476	0.682001	0.805602	0.553991	0.500840	0.425364	0.549565	-0.665533	-0.603162	...	0.593879	0.568836	0.740186	-0.035808	1.000000	0.485554	-0.672429	0.815864	-0.489188	0.381673
UsTreasuryBond		0.147347	0.399694	0.059693	0.193165	-0.359058	-0.387279	-0.436453	-0.363198	-0.853762	-0.877327	...	-0.219884	-0.099474	0.068097	-0.685471	0.485554	1.000000	-0.667468	0.154001	0.278546	-0.412731
GBPEUR				-0.566227	-0.574079	-0.516654	-0.626253	-0.043646	-0.002362	-0.005655	-0.039701	0.815127	0.790366	...	-0.174525	-0.285379	-0.419289	0.344647	-0.672429	-0.667468	1.000000	-0.586152	0.230223	0.187170
EURUSD				0.843788	0.628463	0.865846	0.902159	0.738902	0.685518	0.673853	0.736997	-0.388975	-0.312451	...	0.765421	0.777073	0.886153	0.261168	0.815864	0.154001	-0.586152	1.000000	-0.756216	0.686032
CHFEUR				-0.538949	-0.291254	-0.584541	-0.603867	-0.791377	-0.844509	-0.790071	-0.792258	-0.107357	-0.143469	...	-0.713271	-0.620303	-0.644605	-0.615400	-0.489188	0.278546	0.230223	-0.756216	1.000000	-0.711504
GBPUSD				0.513913	0.245614	0.584691	0.529622	0.853893	0.826270	0.810057	0.855152	0.262015	0.330665	...	0.770065	0.685506	0.696074	0.626496	0.381673	-0.412731	0.187170	0.686032	-0.711504	1.000000
24 rows × 24 columns

The matrix shows the Pearson correlation coefficients of all the pairs (X, Y) in our dataset.
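As a reminder of what each cell in the matrix contains, the following minimal sketch computes a Pearson coefficient by hand and compares it with the value pandas returns. The two short series and the helper name pearson are made up for illustration; they are not part of the tutorial's dataset or code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series (not taken from the tutorial's dataset)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.1, 5.9, 8.2, 9.8],
})

def pearson(a, b):
    """Sum of co-deviations divided by the product of the deviation norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # deviations of a from its mean
    b_c = b - b.mean()  # deviations of b from its mean
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

manual_r = pearson(toy["x"], toy["y"])
pandas_r = toy.corr(method="pearson").loc["x", "y"]
print(manual_r, pandas_r)
```

Both numbers should agree up to floating-point error, which is a quick sanity check that the matrix really holds pairwise Pearson coefficients.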

Step #4 Visualizing the Correlation Matrix in a Heatmap

Heatmaps are an excellent choice for visualizing a correlation matrix. The heatmap applies a color palette to represent numeric values on a scale in different colors. This makes it easier to capture differences and similarities among the correlation coefficients. In Python, we can create heatmaps using the Seaborn package.

# Generate a mask for the upper triangle (it only repeats the lower one)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Use a built-in diverging colormap (red = negative, blue = positive)
cmap = "RdBu"

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(
    corr,
    mask=mask,
    cmap=cmap,
    center=0,
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.5},
)

Visualization of the Correlation Matrix in the form of a Heatmap

The correlation matrix is symmetric. This is because the correlation between a pair of variables X and Y is the same as between Y and X.
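Because of this symmetry, the upper triangle only repeats the lower one, which is why the heatmap code masks it. The following sketch checks both points on synthetic data; the random columns a, b, and c are placeholders standing in for the tutorial's corr DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tutorial's dataset (assumption for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
demo_corr = demo.corr()

# corr(X, Y) == corr(Y, X), so the matrix equals its transpose
is_symmetric = np.allclose(demo_corr.to_numpy(), demo_corr.to_numpy().T)

# True marks cells to hide: the diagonal and everything above it
demo_mask = np.triu(np.ones_like(demo_corr.to_numpy(), dtype=bool))
visible_cells = int((~demo_mask).sum())
print(is_symmetric, visible_cells)
```

For n variables, the mask leaves n * (n - 1) / 2 cells visible, one per unordered pair.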

Step #5 Interpretation

The heatmap uses a color palette that ranges from blue (positive correlation) through white (no correlation) to red (negative correlation). The intensity of the shade indicates the strength of the correlation. We can therefore distinguish between positively correlated, uncorrelated, and negatively correlated pairs. In the following, we compare the different asset classes step by step.
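To back up a visual reading of the colors with numbers, one can list the pairs with the largest absolute coefficients. The sketch below is one possible way to do this. The helper name top_pairs and the small synthetic DataFrame are assumptions for illustration; in the tutorial, you would pass the corr matrix from the previous steps instead.

```python
import numpy as np
import pandas as pd

def top_pairs(corr_matrix, n=3):
    """Return the n pairs with the largest absolute correlation, each pair once."""
    # Keep only the strict lower triangle so (X, Y) and (Y, X) are not both listed
    lower = np.tril(np.ones(corr_matrix.shape, dtype=bool), k=-1)
    pairs = corr_matrix.where(lower).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic data: "up" and "down" move in opposite directions, "noise" is unrelated
rng = np.random.default_rng(42)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "up": base + rng.normal(scale=0.1, size=200),
    "down": -base + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

ranked = top_pairs(demo.corr(), n=3)
print(ranked)
```

Sorting by absolute value puts strongly negative pairs next to strongly positive ones, which mirrors how the darkest red and darkest blue cells stand out in the heatmap.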

5.1 Stock Market Indices / COVID-19

Let us start with the pairs of stock market indices and COVID-19 data. The heatmap signals a negative correlation between the indices (DAX, S&P500, NIKI) and COVID-19. In other words, when the number of cases rises, stock market indices tend to fall in value. On closer inspection, the number of new cases seems more strongly correlated with the indices than the cumulative number of cases (cases_cum) or deaths (deaths_cum). In addition, the stock market indices are correlated with one another.

5.2 Stock Market Indices / Online Service Provider Stocks

The situation is heterogeneous when we compare the stock markets with the shares of online service providers. There is a positive correlation between the shares of Microsoft and Google and the overall development of the markets. On the other hand, the shares of Netflix, Amazon, and Apple are hardly correlated with market development.

5.3 Stock Market Indices / Airline Stocks

Airlines are heavily affected by the pandemic. Thus it is plausible that we observe a strong positive correlation between airline stocks and the general stock market indices.

5.4 Stock Market Indices / Crypto-Currencies

Next, we compare cryptocurrencies with the stock market indices. The results are surprising: BTC-USD shows a strong positive correlation with the general development of the stock markets. For ETH-USD, however, the correlation with the markets is only slightly positive.

5.5 COVID-19 / Currency Exchange Rates

The correlation between exchange rates and COVID-19 cases is relatively weak. Only GBP/EUR, EUR/USD, and GBP/USD show a slightly negative correlation. An exception is CHF/EUR, which is positively correlated with the number of COVID-19 cases.

5.6 Treasury Bonds / Resources

Looking at the coefficients of resources and US Treasury Bonds, we can observe a strong negative correlation between COVID-19 cases and the oil price, and a strong positive correlation between COVID-19 cases and the gold price.

5.7 Crypto-Currencies / Resources

Finally, let us consider the coefficients of resources and cryptocurrencies. It is noticeable that BTC-USD correlates with the oil price. Based on the absence of a correlation with gold, one might conclude that BTC-USD is not a crisis currency comparable to gold. However, the correlation between market indices and cryptocurrencies such as ETH-USD is relatively low. Thus, they were less affected by the recent market slump.

Also: Stock Market Prediction using Multivariate Data

Summary

Congratulations, you have reached the end of this tutorial! In this article, we have loaded data on COVID-19 and financial assets via an API. We have created a correlation matrix in Python that shows the linear correlation between financial assets and COVID-19 cases. Finally, we have visualized the matrix in a heatmap and drawn conclusions about the correlation of different asset pairs. However, we must remember that the Pearson coefficient only captures linear relationships, so we may still be unaware of potential non-linear relationships.
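The caveat about non-linear relationships can be illustrated with a minimal, synthetic sketch: a variable that depends perfectly on another can still show a Pearson coefficient of roughly zero if the dependence is not linear. The values below are constructed for illustration and are unrelated to the tutorial's dataset.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # values symmetric around zero
y = x ** 2                       # y is fully determined by x, but not linearly

# Pearson coefficient for the quadratic relationship
r_nonlinear = float(np.corrcoef(x, y)[0, 1])

# For comparison: a purely linear relationship
r_linear = float(np.corrcoef(x, 2 * x + 1)[0, 1])

print(round(r_nonlinear, 6), round(r_linear, 6))
```

A coefficient near zero in the matrix therefore means "no linear relationship", not necessarily "no relationship".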

Please show your appreciation by leaving a like or comment if you found this article helpful.

If you are interested in learning more about an advanced use case for correlation analysis, please take a look at this article on clustering cryptocurrencies.

Sources and Further Reading

Images created with Midjourney.

The post Correlation Matrix in Python: How Correlated are COVID-19 Cases and Different Financial Assets? appeared first on relataly.com.
