Cluster Analysis with k-Means and Python

Cluster analysis is an unsupervised machine learning technique that groups similar objects into clusters and separates them from dissimilar ones. In unsupervised learning, the model recognizes patterns and associations in the data without requiring a target variable. This makes the technique useful when we do not know in advance how the objects in a dataset are similar or different. This tutorial shows how this technique works and explains the steps to train a k-Means clustering algorithm in Python.

The remainder of this article proceeds in two parts: The first part is conceptual and briefly introduces the k-Means clustering algorithm. We will also discuss application areas and the advantages and drawbacks of the k-Means algorithm. The second part of this article is a hands-on Python tutorial in which we perform cluster analysis with k-means on a set of synthetic data. Our goal is to distinguish three clusters that have a spherical shape. Finally, we let the model predict a test dataset and visualize the results.

Three clusters in a dataset separated by the k-Means algorithm

Cluster Analysis with k-Means

K-Means divides a set of N observations into k partitions. The underlying approach minimizes the sum of the squared deviations from the centers of the clusters (centroids). We do this by shifting the centroids from an initial position to the cluster centers until we cannot achieve any further improvements. In this way, the algorithm creates spherical clusters in densely populated regions. Let’s discuss this process in more detail.

How k-Means Works

In the illustration below, we assume a cluster with three cluster centers. K-Means carries out several steps to partition the data.

  1. Initially, the algorithm chooses random starting positions for the k centroids. Alternatively, we can also position the centroids manually.
  2. Then the algorithm calculates the Euclidean distance between each data point and the three centroids. Each data point is assigned to the closest centroid, that is, to the cluster whose variance increases the least.
  3. The result of this assignment is a set of linear decision boundaries that separate the clusters but are not yet optimal.
  4. From then on, the algorithm optimizes the positions of the centroids to lower the variance of the resulting clusters by shifting each centroid to the mean of its assigned data points. Then the previous steps are repeated: averaging, assigning the data points to clusters, and shifting the centroids.

The process ends when the positions of the centroids no longer change. The minimal NumPy sketch after the illustration below walks through these steps in code.

k-Means partitions a dataset by iteratively optimizing the centroids and the decision boundaries that result from them.
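For readers who want to follow the steps in code, here is a minimal NumPy sketch of the k-Means loop. The function name simple_kmeans and its parameters are our own and purely illustrative, and the sketch skips edge cases such as empty clusters; the scikit-learn implementation we use later adds smarter initialization and multiple restarts.

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Steps 3/4: shift each centroid to the mean of its assigned points
        # (note: empty clusters are not handled in this sketch)
        new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # Stop when the centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments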

How many Clusters?

A particular challenge is that k-Means requires us to specify the number of clusters k. We usually don't know the optimal number of clusters when tackling a new clustering problem. If the data is not too complex, we can often estimate the number of centers by looking at one or more scatter plots. However, this approach only works when the data has few dimensions. For complex data with many dimensions, it is common to experiment with different values of k to find an appropriate number for the problem at hand. We can automate this process using hyperparameter tuning techniques such as grid search: the idea is to try out different cluster counts and identify the one that best differentiates between the clusters. A common heuristic for this is sketched below.
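One widely used heuristic for choosing k is the so-called elbow method: fit k-Means for a range of cluster counts and plot the inertia (the within-cluster sum of squared distances) of each model. The point where the curve flattens, the "elbow", is a reasonable choice for k. The following is a small sketch of this idea; the data-generation call mirrors the one we use later in the tutorial.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

# Synthetic example data (same helper we use later in the tutorial)
features, _ = make_blobs(n_samples=400, centers=3, cluster_std=2.75, random_state=42)

# Fit k-Means for k = 1..10 and record each model's inertia
inertias = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    inertias.append(model.fit(features).inertia_)

# The "elbow" where the curve flattens suggests a reasonable number of clusters
plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()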

Pros and Cons of k-Means Clustering

We need to be aware of the strengths and weaknesses of clustering techniques such as k-Means. In general, clustering can reveal structures and relationships in data that supervised machine learning methods like classification most likely would not uncover. In particular, when we suspect different subgroups in the data that differ in their behavior, clustering can help discover what makes these groups unique.

A specific strength of k-Means is that the algorithm is very good at detecting and separating clusters that have a spherical shape. However, when the clusters have more complex structures, such as half-moons or circles, the algorithm often struggles to distinguish them.
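To see this limitation for yourself, a quick side experiment (not part of the main tutorial) is to run k-Means on scikit-learn's make_moons dataset, whose two half-moon-shaped clusters cannot be separated by linear boundaries:

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moon clusters
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=42)
pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

# An adjusted Rand score of 1.0 would mean a perfect match with the true clusters;
# for half-moons, k-Means typically scores much lower because its boundaries are linear.
print(adjusted_rand_score(y_moons, pred))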

Another disadvantage of k-Means is that we need to determine the number of clusters in advance, even if we have no idea how many clusters exist. In cases where the number of clusters in a dataset is unknown, you might prefer other clustering algorithms, such as affinity propagation, which can determine the number of clusters automatically.

Applications of Clustering

The k-means algorithm is used in a variety of applications, including the following:

  • A typical use case for clustering is in marketing and market segmentation. Here clustering is used to identify meaningful segments of similar customers. The similarity can be based on demographic data (age, gender, etc.) or customer behavior (for example, the time and amount of a purchase).
  • Medical research uses clustering to divide patient groups into different subgroups, for example, to assess stroke risk. After clustering, the next step is to develop separate prediction models for the subgroups to estimate the risk more accurately.
  • An application in the financial sector is outlier detection, for example in fraud detection. Banks and credit card companies use clustering to detect unusual transactions and flag them for verification.
  • Spam filtering: The input data are attributes of emails (text length, words contained, etc.) and help separate spam from non-spam emails.

Implementing a K-Means Clustering Model in Python

In the following, we run a cluster analysis on a set of synthetic data using Python and scikit-learn. Our goal is to train a k-Means clustering model in Python that distinguishes three clusters in the data. Since the data is synthetic, we know in advance which cluster each data point belongs to. So after training, we can validate how well our model distinguishes the three clusters.

The code is available on the GitHub repository.

Prerequisites

Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, consider the Anaconda Python environment. To set it up, you can follow the steps in this tutorial.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: NumPy, pandas, Matplotlib, Seaborn, and scikit-learn.

This article uses the k-Means clustering algorithm from the Python library scikit-learn. We also use Seaborn for visualization.

You can install these libraries using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)

Step #1: Generate Synthetic Data

We start by generating synthetic data. For this purpose, we use the make_blobs function from the scikit-learn library. The function generates random clusters in two dimensions, spherically arranged around their centers. In addition, it returns the cluster label each data point belongs to. We use scatterplots to visualize the data.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
features, labels = make_blobs(
    n_samples=400,
    centers=3,
    cluster_std=2.75,
    random_state=42
)

# Visualize the data in scatterplots
def scatter_plots(df, palette):
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)

    # Left plot: the data with its true cluster labels
    sns.scatterplot(ax=ax1, data=df, x='x', y='y', hue='labels', palette=palette)
    ax1.set_title('Clusters')

    # Right plot: the same data without labels
    sns.scatterplot(ax=ax2, data=df, x='x', y='y')
    ax2.set_title('How the model sees the data during training')

palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
df = pd.DataFrame(features, columns=['x', 'y'])
df['labels'] = labels
scatter_plots(df, palette)
Clusters vs how the model sees the data during training

Step #2: Preprocessing

There are some general things to keep in mind when preparing data for use with K-Means clustering:

  • Missing data and outliers: if we have missing entries in our data, we need to handle them, for example, by removing the records or filling the missing values with the mean or median. In addition, k-Means is sensitive to outliers, so make sure that you eliminate outliers from the training data.
  • Encoding categorical variables: k-Means can only deal with numeric values. So we either map categorical variables to integer values or use one-hot encoding to create separate binary variables.
  • Dimensionality reduction: In general, having too many variables in a dataset can negatively affect the performance of clustering algorithms. A good practice is to keep the number of variables below 30, for example, by using techniques for dimensionality reduction such as Principal Component Analysis (PCA), as shown in the sketch after this list.
  • Scaling: It is also important to note that k-Means requires the data to be scaled, since the algorithm is based on distances.
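The following is a small sketch of how a PCA step could be chained with k-Means for a hypothetical high-dimensional dataset (the variable X_high_dim and all parameter values are purely illustrative); our two-dimensional synthetic data does not need this step.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical high-dimensional data: 500 samples with 100 features
rng = np.random.default_rng(0)
X_high_dim = rng.normal(size=(500, 100))

# Scale the data, reduce it to 10 principal components, then cluster it
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=0),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
cluster_labels = pipeline.fit_predict(X_high_dim)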

Our synthetic dataset is free of outliers and missing values. Therefore, we only need to scale the data. In addition, we separate the class labels of the clusters from the training set and split the data into a training and a test dataset.

# Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

X = scaled_features #Training data
y = labels #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

Step #3: Train a k-Means Clustering Model

Once we have prepared the data, we can begin the cluster analysis by training a k-Means model. Our model uses the k-means implementation from the Python scikit-learn library. We have various options to configure the clustering process:

  • n_clusters: The number of clusters we expect in the data.
  • n_init: The number of times k-Means is run with different initial centroids. The algorithm returns the best of these runs.
  • max_iter: The maximum number of iterations for a single run.

We expect three clusters and configure the algorithm to perform ten runs with different centroid initializations.

# Create a k-means model with n clusters
kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)

# fit the model
kmeans.fit(X_train)
print(f'Converged after {kmeans.n_iter_} iterations')
Output: Converged after 4 iterations

Our model has already converged after four iterations. Next, we will look at the results.

Step #4: Make and Visualize Predictions

Next, we examine the cluster assignments our model learned during training; a short sketch for predicting the held-out test set follows at the end of this step. It is generally a good idea to visualize the results, and we can use simple scatterplots because there are only two dimensions in our data.

# Get the cluster centers from the trained K-means model
cluster_center = scaler.inverse_transform(kmeans.cluster_centers_)
df_cluster_centers = pd.DataFrame(cluster_center, columns=['x', 'y'])

# Unscale the predictions
X_train_unscaled = scaler.inverse_transform(X_train)
df_train = pd.DataFrame(X_train_unscaled, columns=['x', 'y'])
df_train['pred_label'] = kmeans.labels_
df_train['true_label'] = y_train

def scatter_plots(df, cc, palette):
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)

    # Left plot: the predicted cluster labels and the cluster centers (red X)
    sns.scatterplot(ax=ax1, data=df, x='x', y='y', hue='pred_label', palette=palette)
    sns.scatterplot(ax=ax1, data=cc, x='x', y='y', color='r', marker="X")
    ax1.set_title('Predicted Labels')

    # Right plot: the actual cluster labels
    sns.scatterplot(ax=ax2, data=df, x='x', y='y', hue='true_label', palette=palette)
    sns.scatterplot(ax=ax2, data=cc, x='x', y='y', color='r', marker="X")
    ax2.set_title('Actual Labels')


# The colors between the two plots may not match.
# This is because K-means does not know the initial labels and assigns numbers to clusters 
palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
scatter_plots(df_train, df_cluster_centers, palette)
actual clusters vs predicted cluster of the K-means model

The scatterplot above shows that K-Means did a good job finding the three clusters. As a side note, the colors between the two plots do not match because K-means does not know the initial labels and assigns numbers to clusters.
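Because scikit-learn's KMeans model exposes a predict method, we can also assign the held-out test data to the learned clusters. Here is a short sketch that reuses the variables from the previous steps:

# Assign the held-out test points to the learned clusters
test_pred = kmeans.predict(X_test)

# Unscale the test data and plot it with the predicted cluster labels
df_test = pd.DataFrame(scaler.inverse_transform(X_test), columns=['x', 'y'])
df_test['pred_label'] = test_pred
df_test['true_label'] = y_test

plt.rcParams["figure.figsize"] = (10, 8)
sns.scatterplot(data=df_test, x='x', y='y', hue='pred_label', palette=palette)
plt.title('Predicted labels on the test set')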

Step #5: Measuring Model Performance

Next, we measure the performance of our clustering model. However, first, we unify the cluster labels as A, B, and C.

# K-Means assigned the labels 0 and 1 to the opposite clusters. We correct this
# while mapping both label columns to the cluster names A, B, and C.
df_eval = df_train.copy()
df_eval['true_label'] = df_eval['true_label'].map({0:'A', 1:'B', 2:'C'})
df_eval['pred_label'] = df_eval['pred_label'].map({0:'B', 1:'A', 2:'C'})
df_eval.head(10)

We can use the variance of the clusters to assess how well our model distinguishes the different clusters. In addition, it is common practice to create scatterplots of the predictions to visually verify the quality of the clusters and their decision boundaries. The following scatter plot shows the correctly assigned values and where our model was wrong.

# Scatterplot on correctly and falsely classified values
df_eval.loc[df_eval['pred_label'] == df_eval['true_label'], 'Correctly classified?'] = 'True' 
df_eval.loc[df_eval['pred_label'] != df_eval['true_label'], 'Correctly classified?'] = 'False' 

plt.rcParams["figure.figsize"] = (10,8)
sns.scatterplot(data=df_eval, x='x', y='y', color='r', hue='Correctly classified?')
A 2-d scatterplot that shows the correctly assigned values and where our K-means model was wrong.

The K-Means model correctly assigned most data points (blue) to their actual cluster. The few misclassified points are located at a decision boundary between two clusters (marked in orange).
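When the true labels are unknown, which is the usual situation in real clustering problems, we can still judge the result with unsupervised metrics. The following short sketch computes two of them for our trained model: the inertia stored on the fitted KMeans object and the silhouette score from scikit-learn.

from sklearn.metrics import silhouette_score

# Within-cluster sum of squared distances (lower means tighter clusters)
print(f'Inertia: {kmeans.inertia_:.2f}')

# Silhouette score ranges from -1 to 1; values close to 1 indicate
# well-separated clusters, values near 0 indicate overlapping clusters
print(f'Silhouette score: {silhouette_score(X_train, kmeans.labels_):.2f}')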

Because we train our model on synthetic data, we know the actual clusters for each data point. Therefore, we can use traditional classification metrics such as accuracy and f1_score to measure the performance of our clustering model.

# Create a confusion matrix
def evaluate_results(model, y_true, y_pred, class_names):
    tick_marks = [0.5, 1.5, 2.5]
    
    # Print the Confusion Matrix
    fig, ax = plt.subplots(figsize=(10, 6))
    results_log = classification_report(y_true, y_pred, target_names=class_names, output_dict=True)
    results_df_log = pd.DataFrame(results_log).transpose()
    matrix = confusion_matrix(y_true, y_pred)
    
    sns.heatmap(pd.DataFrame(matrix), annot=True, fmt="d", linewidths=.5, cmap="YlGnBu")
    plt.xlabel('Predicted label'); plt.ylabel('Actual label')
    plt.title('Confusion Matrix on the Training Data')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
    print(results_df_log)


y_true = df_eval['true_label']
y_pred = df_eval['pred_label']
class_names = ['A', 'B', 'C']
evaluate_results(kmeans, y_true, y_pred, class_names)
confusion matrix on the predictions generated by our k-means model

Summary

This article has presented k-Means, one of the most widely used centroid-based algorithms for cluster analysis. You have learned how k-Means groups data points into non-overlapping clusters. We have also looked at the strengths and weaknesses of the algorithm. Keep in mind that k-Means works well on spherical clusters but may struggle to identify groups whose shape is more complex. Knowing these assumptions will help you make informed decisions when using the k-Means algorithm. In the second part, this tutorial has shown how to implement the k-Means clustering algorithm in Python. We generated a set of synthetic data and ran a cluster analysis in Python to distinguish the spherical clusters in the data. Finally, this article has presented methods for visualizing and evaluating the performance of clustering models.

If you have any questions, please ask them in the comments.


Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since completing my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal of sharing my experiences and creating a place where you can find key concepts of machine learning and materials that will help you kick-start your own Python projects.
