Find the clusters

using K-means and DBSCAN

Cluster analysis is a method for looking into data to see if individual datapoints within a dataset can be categorised as belonging to one of a number of groups of similar datapoints. The category a datapoint belongs in, or indeed the number of categories to use, is generally not defined beforehand, and instead the data are used to suggest these. As such, cluster analysis is an unsupervised process.

In low dimensional data, for example the position of a point on the \( (x,y) \) plane, such groupings are reasonably easy to find visually. However, if a data point is not characterised by two values but instead by, say 4 or 9 or 42 etc., then it is not easy to plot these in such a way that they can be visually inspected. It may be possible to reduce the dimensionality of the problem (this is done for example when a 3 dimensional object is projected onto a flat 2 dimensional screen), and there is a well known method called Principal Component Analysis (PCA) that can help with this. However, if it's not possible to reduce the dimension to allow for a visual method of finding clusters, or you just want an automated method based on the data, another approach must be used.

An easy method of determining clusters is the so-called K-Means method. This algorithm works by first assigning data to random categories and then finding the centroid (mean position) of all data points classified as being the same. It then looks through all the points again and reassigns their catergories by choosing them to be in the category that matches the centroid closest to them. A new centroid is calculated based on the new assignments and the process is repeated until no further updates can be made. At this point all the data points have been classified. The Python library sklearn has a function for K-Means (and many other clustering algorithms). An example usage is shown below.

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

# data = [[x1,y1],[x2,y2],...] a list of coordinate pairs

kmeans = KMeans(n_clusters=3)

kmeans.fit(data)

y_kmeans = kmeans.predict(data)

x=[]

y=[]

for X in data:

\(\quad \quad\)	x.append(X[0])

\(\quad \quad\)	y.append(X[1])

plt.scatter(x, y, c=y_kmeans, alpha=0.5)

plt.show()

The interactive element below shows the K-Means process for a set of data points that are chracterised by their position on the \( (x,y) \) plane. To the human eye it is easy to see that there are three clusters. Pressing the "Start" button will show how K-Means finds the three clusters.