Simple Cluster Analysis using K-Means and Python

This article shows how to run a simple cluster analysis on synthetic data using the K-Means clustering algorithm in Python. Cluster analysis is an unsupervised machine learning technique that groups similar objects into clusters and separates them from dissimilar ones. In unsupervised learning, the model recognizes patterns and associations purely from the data, without requiring a target variable. Clustering is therefore often used to distinguish objects whose similarities and differences are not yet known.

The remainder of this article proceeds as follows. We begin with a brief introduction to the K-Means clustering algorithm to better understand how it works. We also briefly discuss application areas as well as the advantages and drawbacks of the K-Means algorithm. Then we turn to the hands-on Python part and run a cluster analysis with K-Means on a set of synthetic data. Our goal is to distinguish three clusters that have a spherical shape. Finally, we let the model assign the data points to clusters, visualize the results, and measure the model's performance.

Three clusters in a dataset separated by the K-Means algorithm

Cluster Analysis with K-Means

K-Means divides a set of N observations into k partitions. The underlying approach minimizes the sum of the squared deviations from the centers of the clusters (centroids). This is done by shifting the centroids from an initial position to the cluster centers until no further improvements can be achieved. In this way, the algorithm creates spherical clusters in regions that are densely populated. Let’s look at this process more closely.
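Expressed as a formula, K-Means minimizes the within-cluster sum of squared distances. With clusters C_1, …, C_k and centroids μ_1, …, μ_k, this objective can be written as:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

Shifting the centroids toward the cluster means is exactly the step that reduces this sum.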

How K-Means Works

Given three cluster centers (k=3), the algorithm carries out the following steps to partition the data.

K-Means partitions the data by iteratively optimizing the centroids and the decision boundaries that result from them
  1. Initially, the algorithm chooses k random starting positions for the centroids. Alternatively, we can also position the centroids manually.
  2. Then the algorithm calculates the Euclidean distance between each data point and the three centroids and assigns every point to its closest centroid, i.e., the cluster whose variance increases the least.
  3. This assignment results in linear decision boundaries that separate the clusters, but not yet optimally.
  4. From then on, the algorithm optimizes the positions of the centroids to lower the variance of the resulting clusters. It does so by iteratively repeating the previous steps: shifting each centroid to the mean of its assigned data points and reassigning the data points to the closest centroid.

The process ends when the positions of the centroids do not change anymore.
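To make these steps more concrete, the following is a minimal NumPy sketch of the loop described above. It is for illustration only and assumes that no cluster ever becomes empty; later in this article, we use the production-ready implementation from scikit-learn.

import numpy as np

def simple_kmeans(X, k=3, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Step 3: shift each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[assignments == i].mean(axis=0) for i in range(k)])
        # Step 4: stop as soon as the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments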

How many Clusters?

K-Means clustering requires us to decide on the number of clusters k in advance. However, when dealing with a new clustering problem, the optimal number of clusters is usually unknown. As long as the data is not too complex, one way of estimating the number of centers is to look at one or more scatter plots. For complex data with many dimensions, however, it is common practice to experiment with varying values of k to find an appropriate number for the problem at hand. The idea is to try out different cluster sizes and identify the one that best differentiates between the clusters, for example with the elbow method shown below. We can automate this process using hyperparameter tuning techniques such as grid search.
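The following is a minimal sketch of the elbow method on synthetic example data (not the data used later in this article): we fit K-Means for several values of k and plot the model's inertia_, the sum of squared distances of the points to their closest centroid. The kink (the "elbow") in the resulting curve is a reasonable candidate for k.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

# Example data for the sketch
X_demo, _ = make_blobs(n_samples=400, centers=3, random_state=42)

# Fit K-Means for k = 1..9 and record the inertia of each model
inertias = []
k_values = range(1, 10)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_demo)
    inertias.append(model.inertia_)

# The kink ("elbow") of this curve is a common heuristic for the number of clusters
plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow method')
plt.show()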

Pros and Cons of K-Means Clustering

We need to be aware of the strengths and weaknesses of clustering techniques such as K-Means. In general, clustering can reveal structures and relationships in data that supervised machine learning methods like classification most likely would not uncover. This is especially helpful when we suspect different subgroups in the data that differ in their behavior. In such cases, we can use clustering to uncover what makes these groups unique.

A specific strength of K-Means is that the algorithm is very good at detecting and separating clusters that have a spherical shape. However, when the clusters have more complex structures, such as half-moons or circles, the algorithm often struggles to distinguish them. Another disadvantage of K-Means is that we need to determine the number of clusters in advance, even if we have no idea how many clusters exist. Also, K-Means requires careful preprocessing of the data, which we will look at in more detail later in the article.
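A small illustration of this limitation (separate from the main example below): K-Means assumes roughly spherical clusters, so when we let it separate the two half-moons generated by scikit-learn's make_moons function, it typically cuts straight through them instead of following their shape.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

# Two half-moon shaped clusters
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means draws linear decision boundaries and splits the moons incorrectly
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

plt.scatter(X_moons[:, 0], X_moons[:, 1], c=moon_labels)
plt.title('K-Means on half-moon shaped clusters')
plt.show()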

Applications of Clustering

The K-means algorithm is used in a variety of applications, including the following:

  • A common use case for clustering is in marketing and market segmentation. Here clustering is used to identify meaningful segments of similar customers. The similarity can be based on demographic data (age, gender, etc.) or customer behavior (for example, the time and amount of a purchase).
  • In medical research, clustering is used to divide patient groups into different subgroups, for example, to assess stroke risk. After clustering, the next step is to develop separate prediction models for the subgroups, which can estimate the risk more accurately.
  • An application in the financial sector is outlier detection in the sense of fraud detection. Banks and credit card companies use clustering to detect unusual transactions and flag them for verification.
  • Another application of clustering is spam filtering. The input data are attributes of e-mails (text length, words contained, and so on), which are used to separate spam from non-spam emails.

Implementing a K-Means Clustering Model in Python

In the following, we run a cluster analysis on a set of synthetic data using Python and scikit-learn. Our goal is to train a K-Means cluster model in Python that distinguishes three clusters in the data. Since the data is synthetic, we know in advance to which cluster each data point belongs. So after training, we can validate how well our model distinguishes the three clusters. As always, you can find the code for this example on the corresponding GitHub repository.

Prerequisites

Before we begin with the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, consider the Anaconda Python environment. To set it up, you can follow the steps in this tutorial.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: scikit-learn for the K-Means clustering algorithm, pandas and NumPy for data handling, and Seaborn and Matplotlib for visualization.

You can install these libraries using console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)

Step #1: Generate Synthetic Data

We start with the generation of synthetic data. For this purpose, we use the make_blobs function from the scikit-learn library. The function generates random clusters in two dimensions whose points are spherically arranged around a center. In addition, it returns the cluster label to which each data point belongs. We can use a scatterplot to visualize the data.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
features, labels = make_blobs(
    n_samples=400,
    centers=3,
    cluster_std=2.75,
    random_state=42
)

# Visualize the data in scatterplots
def scatter_plots(df, palette):
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)

    # Left: the data points colored by their true cluster labels
    sns.scatterplot(ax=ax1, data=df, x='x', y='y', hue='labels', palette=palette)
    ax1.set_title('Clusters')

    # Right: the same data without labels, as the model sees it during training
    sns.scatterplot(ax=ax2, data=df, x='x', y='y')
    ax2.set_title('How the model sees the data during training')
    plt.show()

palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
df = pd.DataFrame(features, columns=['x', 'y'])
df['labels'] = labels
scatter_plots(df, palette)

Step #2: Preprocessing

There are some general things to keep in mind when preparing data for use with K-Means clustering:

  • Missing data and outliers: if we have missing entries in our data, we need to handle these, for example, by removing the records or filling the missing values with a mean or median. In addition, K-means is sensitive to outliers. Therefore, make sure that you eliminate outliers from the training data.
  • Encoding categorical variables: K-Means can only deal with numeric values. So we either map categorical variables to integer values or use one-hot encoding to create separate binary variables (see the sketch after this list).
  • Dimensionality reduction: In general, having too many variables in a dataset can negatively affect the performance of clustering algorithms. A good practice is to keep the number of variables below 30, for example by using dimensionality reduction techniques such as Principal Component Analysis (PCA).
  • Scaling: K-Means is distance-based, so it requires scaling the data to bring all features to a comparable range, for example with a StandardScaler.
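The following is a minimal sketch that illustrates the encoding and dimensionality-reduction points from the list above. The DataFrame and its column names are hypothetical and only serve as an illustration.

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical raw data with a categorical column
df_raw = pd.DataFrame({
    'age': [25, 34, 52, 41],
    'income': [30000, 52000, 61000, 48000],
    'segment': ['web', 'store', 'web', 'phone']
})

# One-hot encoding turns the categorical column into separate binary variables
df_encoded = pd.get_dummies(df_raw, columns=['segment'])

# PCA reduces the encoded features to a smaller number of components
pca = PCA(n_components=2)
features_reduced = pca.fit_transform(df_encoded)
print(features_reduced.shape)  # (4, 2)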

The synthetic data used in this article is free of outliers or missing values. Therefore, we only need to scale the data. In addition, we separate the class labels of the clusters from the training set and split the data into separate sets for train and test.

# Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

X = scaled_features #Training data
y = labels #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

Step #3: Train a K-Means Clustering Model

Once we have prepared the data, we can begin with the cluster analysis by training a K-Means model. Our model uses the K-Means implementation from the Python scikit-learn library. We have various options to configure the clustering process:

  • n_clusters: The number of clusters we expect in the data. In our case, we expect three clusters.
  • n_init: The number of times the K-Means algorithm is run with different initial centroids. The algorithm keeps the best of these runs.
  • max_iter: The maximum number of iterations for a single run.

# Create a k-means model with n clusters
kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)

# fit the model
kmeans.fit(X_train)
print(f'Converged after {kmeans.n_iter_} iterations')
Converged after 4 iterations

Our model has already converged after four iterations. Next, we will look at the results.

Step #4: Make and Visualize Predictions

Next, we look at the cluster assignments our model has made for the training data. It is generally a good idea to visualize the results. Because there are only two dimensions in our data, we can compare the predicted and the actual clusters in simple scatterplots.

# Get the cluster centers from the trained K-means model
cluster_center = scaler.inverse_transform(kmeans.cluster_centers_)
df_cluster_centers = pd.DataFrame(cluster_center, columns=['x', 'y'])

# Transform the scaled training data back to its original scale
X_train_unscaled = scaler.inverse_transform(X_train)
df_train = pd.DataFrame(X_train_unscaled, columns=['x', 'y'])
df_train['pred_label'] = kmeans.labels_
df_train['true_label'] = y_train

def scatter_plots(df, cc, palette):
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)

    # Left: the predicted cluster labels with the cluster centers (red crosses)
    sns.scatterplot(ax=ax1, data=df, x='x', y='y', hue='pred_label', palette=palette)
    sns.scatterplot(ax=ax1, data=cc, x='x', y='y', color='r', marker="X")
    ax1.set_title('Predicted Labels')

    # Right: the actual cluster labels with the cluster centers (red crosses)
    sns.scatterplot(ax=ax2, data=df, x='x', y='y', hue='true_label', palette=palette)
    sns.scatterplot(ax=ax2, data=cc, x='x', y='y', color='r', marker="X")
    ax2.set_title('Actual Labels')
    plt.show()


# The colors between the two plots may not match.
# This is because K-means does not know the initial labels and assigns numbers to clusters 
palette = {1:"tab:cyan",0:"tab:orange", 2:"tab:purple"}
scatter_plots(df_train, df_cluster_centers, palette)
Actual clusters vs. the clusters predicted by the K-Means model

The scatterplot above shows that K-Means did a good job finding the three clusters. As a side note, the colors between the two plots do not match because K-means does not know the initial labels and assigns numbers to clusters.

Step #5: Measuring Performance

Next, we measure the performance of our clustering model. However, first, we unify the cluster labels as A, B, and C.

# In the predictions, the cluster numbers 0 and 1 are swapped compared to the true labels,
# because K-Means assigns arbitrary numbers to the clusters. The mapping below accounts for this.
df_eval = df_train.copy()
df_eval['true_label'] = df_eval['true_label'].map({0:'A', 1:'B', 2:'C'})
df_eval['pred_label'] = df_eval['pred_label'].map({0:'B', 1:'A', 2:'C'})
df_eval.head(10)

We can use the variance of the clusters to assess how well our model distinguishes the different clusters; a small sketch of such a check follows below. In addition, it is common practice to create scatterplots of the predictions to visually verify the quality of the clusters and their decision boundaries.
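Here is one possible sketch of this check, reusing the kmeans model and the scaled training data X_train from the previous steps. The silhouette score is an addition to the article's variance-based check and not part of the original walkthrough.

from sklearn.metrics import silhouette_score

# inertia_: the within-cluster sum of squared distances (lower is better)
print(f'Inertia: {kmeans.inertia_:.2f}')

# Silhouette score: how well each point fits its own cluster compared
# to the nearest other cluster (values close to 1 are better)
print(f'Silhouette score: {silhouette_score(X_train, kmeans.labels_):.2f}')

The following scatter plot then shows the correctly assigned values and where our model was wrong.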

# Scatterplot on correctly and falsely classified values
df_eval.loc[df_eval['pred_label'] == df_eval['true_label'], 'Correctly classified?'] = 'True' 
df_eval.loc[df_eval['pred_label'] != df_eval['true_label'], 'Correctly classified?'] = 'False' 

plt.rcParams["figure.figsize"] = (10,8)
sns.scatterplot(data=df_eval, x='x', y='y', color='r', hue='Correctly classified?')
A 2-D scatterplot that shows the correctly assigned values and where our K-Means model was wrong

As shown above, the K-Means model correctly assigned most data points (blue) to their actual cluster. The few points that were misclassified are located at a decision boundary between two clusters.

Because we train our model on synthetic data, we know the actual clusters for each data point. Therefore, we can use traditional classification metrics such as accuracy and f1_score to measure the performance of our clustering model.

# Create a confusion matrix
def evaluate_results(model, y_true, y_pred, class_names):
    tick_marks = [0.5, 1.5, 2.5]
    
    # Print the Confusion Matrix
    fig, ax = plt.subplots(figsize=(10, 6))
    results_log = classification_report(y_true, y_pred, target_names=class_names, output_dict=True)
    results_df_log = pd.DataFrame(results_log).transpose()
    matrix = confusion_matrix(y_true, y_pred)
    
    sns.heatmap(pd.DataFrame(matrix), annot=True, fmt="d", linewidths=.5, cmap="YlGnBu")
    plt.xlabel('Predicted label'); plt.ylabel('Actual label')
    plt.title('Confusion Matrix on the Training Data')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
    print(results_df_log)


y_true = df_eval['true_label']
y_pred = df_eval['pred_label']
class_names = ['A', 'B', 'C']
evaluate_results(kmeans, y_true, y_pred, class_names)
confusion matrix on the predictions generated by our k-means model

Summary

This article has presented K-Means, one of the most widely used centroid-based algorithms for cluster analysis. You have learned how K-Means groups data points into non-overlapping clusters. We have also looked at the strengths and weaknesses of the algorithm. Keep in mind that K-Means works well on spherical clusters but may struggle to identify clusters whose shape is more complex. Knowing these assumptions behind the K-Means algorithm will help you make informed decisions about when to use it. In the second part, this article has shown how to implement the K-Means algorithm in Python. We ran a cluster analysis in Python on a set of synthetic data, trained a K-Means model to distinguish the clusters, and evaluated the results. Finally, this article has presented methods for visualizing and evaluating the performance of clustering models.

If you have any questions, feel free to ask them in the comments.

Sources and Further Reading

  1. David Forsyth (2019), Applied Machine Learning, Springer
  2. An Introduction to Clustering and Different Methods of Clustering

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.
