# Cluster Analysis with k-Means in Python

Cluster analysis is an unsupervised machine learning technique that groups similar objects into clusters and separates them from different ones. In unsupervised learning, the model recognizes patterns and associations from the data without requiring a target variable. This technique is beneficial for distinguishing objects whose similarities and differences are unknown. This Python tutorial shows how this technique works and explains the steps to train a k-Means clustering algorithm.

The remainder of this article proceeds in two parts: The first part is conceptual and briefly introduces the k-Means clustering algorithm. We will also discuss application areas and the advantages and drawbacks of the k-Means algorithm. The second part of this article is a hands-on Python tutorial in which we perform cluster analysis with k-means on a set of synthetic data. Our goal is to distinguish three clusters that have a spherical shape. Finally, we let the model predict a test dataset and visualize the results.

## Findings Clusters in Multivariate Data with k-Means

K-means is an unsupervised machine learning algorithm that is used for clustering. It divides a dataset into a specified number (k) of clusters, with each cluster containing similar data points. The algorithm works by selecting k random points as the initial centroids (representatives) of the clusters, and then iteratively reassigning data points to the closest centroid and updating the centroid to the mean of the points in the cluster. This process continues until the centroids stop moving or a maximum number of iterations is reached.

The goal of k-means is to minimize the within-cluster sum of squares (WCSS), which is a measure of the distance between the data points in a cluster and the centroid of the cluster. By minimizing the WCSS, the algorithm aims to find the most compact and distinct clusters possible. K-means is a popular algorithm for clustering because it is fast and easy to implement, but it has some limitations, such as the need to specify the number of clusters in advance and the assumption that the clusters are spherical and equally sized. Let’s discuss this process in more detail.

### How k-Means Works

In the illustration below, we assume a cluster with three cluster centers. K-Means carries out several steps to partition the data.

1. Initially, the algorithm k chooses random starting positions for the centroids. Alternatively, we can also position the centroids manually.
2. Then the algorithm calculates the distance between the data points and the three centroids. The data points are assigned to the closest centroid/cluster where the cluster variance increases the least.
3. Next, the algorithm calculates the Euclidean distance between the centroids and their assigned data points. The result is linear decision boundaries that separate the clusters but are not yet optimal.
4. From then on, the algorithm optimizes the centroids’ positions to lower the resulting clusters’ variance. Then, the previous steps are repeated: averaging, assigning the data points to clusters, and shifting the centroids.

The process ends when the positions of the centroids do not change anymore.

### How many Clusters?

A particular challenge is that k-means requires estimating the number of clusters k. When tackling new clustering problems, we usually don’t know the optimal number of clusters. Unless the data is not too complex, we can often estimate the number of centers by looking at one or more scatter plots. However, this approach only works when the data has a few dimensions. For complex data with many dimensions, it is common to experiment with varying numbers of k to find an appropriate size for the problem. We can automate this process using hyperparameter tuning techniques such as grid search or random search. The idea is to try out different cluster sizes and identify the size that best differentiates between clusters.

### Pros and Cons of k-Means Clustering

We need to know the strengths and weaknesses of clustering techniques such as k-Means. In general, clustering can reveal structures and relationships in data supervised machine learning methods like classification likely would not uncover. In particular, when we suspect different subgroups in the data that differ in their behavior, clustering can help discover what makes these groups unique.

A specific strength of k-Means is that the algorithm is very good at detecting and separating clusters that have a spherical shape. However, when the clusters have more complex structures, such as half-moons or circles, the algorithm often struggles to distinguish them.

Another disadvantage of k-Means is that we need to determine the number of clusters in advance, even if we have no idea how many clusters exist. In cases where the number of clusters in a dataset is unknown, you might prefer other clustering algorithms, such as affinity propagation or hierarchical clustering, which can automatically determine the number of clusters.

### Applications of Clustering

The k-means algorithm is used in a variety of applications, including the following:

• A typical use case for clustering is in marketing and market segmentation. Here clustering is used to identify meaningful segments of similar customers. The similarity can be based on demographic data (age, gender, etc.) or customer behavior (for example, the time and amount of a purchase).
• Medical research uses clustering to divide patient groups into different subgroups, for example, to assess stroke risk. After clustering, the next step is to develop separate prediction models for the subgroups to estimate the risk more accurately.
• An application in the financial sector is outlier detection in fraud detection. Banks and credit card companies use clustering to detect unusual transactions and flag them for verification.
• Spam filtering: The input data are attributes of emails (text length, words contained, etc.) and help separate spam from non-spam emails.

## Implementing a K-Means Clustering Model in Python

In the following, we run a cluster analysis on synthetic data using Python and scikit-learn. We aim to train a K-Means cluster model in Python that distinguishes three clusters in the data. Since the data is artificial, we know which cluster each data point belongs to in advance. So after training, we can validate how well our model distinguishes the three clusters.

The code is available on the GitHub repository.

### Prerequisites

Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, consider the Anaconda Python environment. To set it up, you can follow the steps in this tutorial.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

This article uses the k-means clustering algorithm from the Python library Scikit-learn. We also use Seaborn for visualization.

You can install these libraries using console commands:

• pip install <package name>
• conda install <package name> (if you are using the anaconda packet manager)

### Step #1: Generate Synthetic Data

We start with the generation of synthetic data. For this purpose, we use the make_blobs function from the scikit-learn library. The function generates random clusters in two dimensions spherically arranged around a center. In addition, the data contains the respective cluster to which the data points belong. We use a scatterplot to visualize the data.

```import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split as train_test_split
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
features, labels = make_blobs(
n_samples=400,
centers=3,
cluster_std=2.75,
random_state=42
)

# Visualize the data in scatterplots
def scatter_plots(df, palette):
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))

ax1 = plt.subplot(1,2,1)
sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'labels', palette=palette)
plt.title('Clusters')

ax2 = plt.subplot(1,2,2)
sns.scatterplot(ax = ax2, data=df, x='x', y='y')
plt.title('How the model sees the data during training')

palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
df = pd.DataFrame(features, columns=['x', 'y'])
df['labels'] = labels
scatter_plots(df, palette)```

### Step #2: Preprocessing

There are some general things to keep in mind when preparing data for use with K-Means clustering:

• Missing data and outliers: if we have missing entries in our data, we need to handle these, for example, by removing the records or filling the missing values with a mean or median. In addition, K-means is sensitive to outliers. Therefore, make sure that you eliminate outliers from the training data.
• Normalization: K-Means can only deal with integer values. So, either we map the categorical variables to integer values or use one-hot-encoding to create separate binary variables.
• Dimensionality reduction: In general, having too many variables in a dataset can negatively affect the performance of clustering algorithms. A good practice is to keep the number of variables below 30, for example, by using techniques for dimensionality reduction such as Principal-Component-Analysis.
• Scaling: Important to note is also that K-means require scaling the data.

Our synthetic dataset is free of outliers or missing values. Therefore, we only need to scale the data. In addition, we separate the class labels of the clusters from the training set and split the data into a train- and a test dataset.

```# Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

X = scaled_features #Training data
y = labels #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)```

### Step #3: Train a k-Means Clustering Model

Once we have prepared the data, we can begin the cluster analysis by training a K-means model. Our model uses the k-means algorithm from Python scikit-learn library. We have various options to configure the clustering process:

• n_clusters: The number of clusters we expect in the data.
• n_init: The number of iterations k-means will run with different initial centroids. The algorithm returns the best model.
• max_iter: The max number of iterations for a single run

We expect three clusters and configure the algorithm to run ten iterations.

```# Create a k-means model with n clusters
kmeans = KMeans(
init="random",
n_clusters=3,
n_init=10,
max_iter=300,
random_state=42
)

# fit the model
kmeans.fit(X_train)
print(f'Converged after {kmeans.n_iter_} iterations')```
`Converged after 4 iterations`

Our model has already converged after four iterations. Next, we will look at the results.

### Step #4: Make and Visualize Predictions

Next, we use our model to make predictions on the test set. It is generally a good idea to visualize the predictions. We can use a single scatterplot because our data has only two dimensions.

```# Get the cluster centers from the trained K-means model
cluster_center = scaler.inverse_transform(kmeans.cluster_centers_)
df_cluster_centers = pd.DataFrame(cluster_center, columns=['x', 'y'])

# Unscale the predictions
X_train_unscaled = scaler.inverse_transform(X_train)
df_train = pd.DataFrame(X_train_unscaled, columns=['x', 'y'])
df_train['pred_label'] = kmeans.labels_
df_train['true_label'] = y_train

def scatter_plots(df, cc, palette):
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))

# Print the predictions
ax2 = plt.subplot(1,2,1)
sns.scatterplot(ax = ax2, data=df, x='x', y='y', hue='pred_label', palette=palette)
sns.scatterplot(ax = ax2, data=cc, x='x', y='y', color='r', marker="X")
plt.title('Predicted Labels')

# Print the actual values
ax1 = plt.subplot(1,2,2)
sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'true_label', palette=palette)
sns.scatterplot(ax = ax1, data=cc, x='x', y='y', color='r', marker="X")
plt.title('Actual Labels')

# The colors between the two plots may not match.
# This is because K-means does not know the initial labels and assigns numbers to clusters
palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
scatter_plots(df_train, df_cluster_centers, palette)```

The scatterplot above shows that K-Means found the three clusters. As a side note, the colors between the two plots do not match because K-means does not know the initial labels and assigns numbers to clusters.

### Step #5: Measuring Model Performance

Next, we measure the performance of our clustering model. K-means is an unsupervised machine learning algorithm, which means that it is used to cluster data points into groups based on their similarity, without the use of labeled training data. However, we can compare the cluster assignments to the labels attached to the data and see if the model can predict them correctly. In this way, we can use traditional classification metrics such as accuracy and f1_score to measure the performance of our clustering model.

First, we unify the cluster labels as A, B, and C.

```# The predictive model has the labels 0 and 1 reversed. We will correct that first.
#df_train['pred_test'] = df_train['pred_labels'].map({2:2, 1:3, 0:1})
df_eval = df_train.copy()
df_eval['true_label'] = df_eval['true_label'].map({0:'A', 1:'B', 2:'C'})
df_eval['pred_label'] = df_eval['pred_label'].map({0:'B', 1:'A', 2:'C'})
```	x			y			pred_label	true_label
0	-9.007547	-10.302910	C			C
1	1.009238	7.009681	A			B
2	-6.565501	-6.466780	C			C
3	2.389772	7.727235	B			B
4	-5.422666	-2.915796	C			C
5	-12.024305	-7.846772	C			C
6	-4.006250	9.319323	A			A
7	-6.297788	6.435267	A			A
8	2.169238	3.325947	B			B
9	-5.140506	-4.205585	C			C```

It is a common practice to create scatterplots on the predictions to visually verify the quality of the clusters and their decision boundaries. The following scatter plot shows the correctly assigned values and where our model was wrong.

```# Scatterplot on correctly and falsely classified values
df_eval.loc[df_eval['pred_label'] == df_eval['true_label'], 'Correctly classified?'] = 'True'
df_eval.loc[df_eval['pred_label'] != df_eval['true_label'], 'Correctly classified?'] = 'False'

plt.rcParams["figure.figsize"] = (10,8)
sns.scatterplot(data=df_eval, x='x', y='y', color='r', hue='Correctly classified?')```

The K-Means model correctly assigned most data points (blue) to their actual cluster. The few misclassified points are located at a decision boundary between two clusters (marked in orange).

```# Create a confusion matrix
def evaluate_results(model, y_true, y_pred, class_names):
tick_marks = [0.5, 1.5, 2.5]

# Print the Confusion Matrix
fig, ax = plt.subplots(figsize=(10, 6))
results_log = classification_report(y_true, y_pred, target_names=class_names, output_dict=True)
results_df_log = pd.DataFrame(results_log).transpose()
matrix = confusion_matrix(y_true,  y_pred)
model_score = score(y_pred, y_true, average='macro')

sns.heatmap(pd.DataFrame(matrix), annot=True, fmt="d", linewidths=.5, cmap="YlGnBu")
plt.xlabel('Predicted label'); plt.ylabel('Actual label')
plt.title('Confusion Matrix on the Test Data')
plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)

print(results_df_log)

y_true = df_eval['true_label']
y_pred = df_eval['pred_label']
class_names = ['A', 'B', 'C']
evaluate_results(kmeans, y_true, y_pred, class_names)```

As we can see, the model has done a good job of grouping the labels into the attached classes.

## Summary

This article has presented k-Means, one of the most common density-based algorithms used for cluster analysis. You have learned how k-Means groups data points into non-overlapping clusters. We have also looked at the strengths and weaknesses of the algorithm. Remember that the k-Means algorithm works well on spherical clusters but may struggle to identify groups whose shape is more complex. Knowing these assumptions will help you make informed decisions using the k-Means algorithm. In the second part, implemented the k-Means clustering algorithm in Python. We generated a set of synthetic data and ran cluster analysis in Python to distinguish spherical clusters in the data. Finally, this article has presented methods for visualizing and evaluating the performance of clustering models.

We can use k-means to detect anomalies in time-series data. If you are interested in this topic, check out my recent article on anomaly detection with Python.

1. “Data Clustering: Algorithms and Applications” by Charu C. Aggarwal: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.
2. “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank: This book is a comprehensive introduction to data mining and machine learning, including a chapter on k-means.

Other relataly articles on clustering

1. Hierarchical clustering for customer segmentation in Python: We present an alternative approach to k-means that can determine the number of the clusters themselves and works well if the distributions of the groups in the data is more complex.
2. Clustering crypto markets using affinity propagation in Python: This article applies cluster analysis to crypto markets and creates a market map for various cryptocurrencies.

## Author

• Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.