Find the clusters
using K-means and DBSCAN
Cluster analysis is a method for examining data to see whether the individual datapoints within a dataset
can be categorised as belonging to one of a number of groups of similar datapoints. The category a
datapoint belongs in, and indeed the number of categories to use, are generally not defined beforehand;
instead, the data themselves are used to suggest these. As such, cluster analysis is an unsupervised learning method.
In low dimensional data, for example the position of a point on the \( (x,y) \) plane, such
groupings are reasonably easy to find visually. However, if a data point is characterised not
by two values but by, say, 4 or 9 or 42, then it is not easy to plot these in such
a way that they can be visually inspected. It may be possible to reduce the dimensionality of
the problem (this is done for example when a 3 dimensional object is projected onto a flat
2 dimensional screen), and there is a well known method called Principal Component Analysis
(PCA) that can help with this. However, if it's not possible to reduce the dimension to allow
for a visual method of finding clusters, or you just want an automated method based on the
data, another approach must be used.
An easy method of determining clusters is the so-called K-Means method. This algorithm works
by first assigning data to random categories and then finding the centroid (mean position) of
all data points classified as being the same. It then looks through all the points again and
reassigns their categories, choosing for each point the category whose centroid is closest
to it. New centroids are calculated based on the new assignments and the process is
repeated until no further updates can be made. At this point all the data points have been
classified. The Python library sklearn
has a function for K-Means (and many other clustering
algorithms). An example usage is shown below.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# data = [[x1,y1],[x2,y2],...] a list of coordinate pairs
kmeans = KMeans(n_clusters=3)
y_kmeans = kmeans.fit_predict(data)  # fit the model, then label each point
x = [point[0] for point in data]
y = [point[1] for point in data]
plt.scatter(x, y, c=y_kmeans, alpha=0.5)
The interactive element below shows the K-Means process for a set of data points that are
characterised by their position on the \( (x,y) \) plane. To the human eye it is easy to
see that there are three clusters. Pressing the "Start" button will show how K-Means finds
the three clusters.
The above produces three distinct groups in three iterations. This is fine if it is known that
there are three groups; in K-Means the number of clusters is an input parameter, but what if
you don't know how many clusters there should be (which is one of the purposes of cluster analysis!)?
One method is to try several different
numbers of clusters and watch how a score of performance changes with the number of clusters.
The score is usually measured by the sum of squared errors
(SSE), which measures how
far away the points assigned to a cluster are from the centroid of that cluster. Usually, as the
number of clusters increases the SSE will drop quickly. However, beyond a certain number of
clusters the amount by which it drops falls off, see below.
The point at which the drop-off changes is at three clusters in the above and this should
be tried as the minimum number with which to analyse the data. This method of determining the
number of clusters is known as the "elbow method", as it uses the "elbow" in the shape of the SSE curve.
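As a sketch of the elbow method, the loop below records the SSE (exposed by sklearn's KMeans as the inertia_ attribute) for a range of cluster counts; the three-cluster data from make_blobs is a synthetic stand-in chosen purely for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three clusters (chosen purely for illustration)
data, _ = make_blobs(n_samples=300, centers=3, random_state=0)

sse = []
ks = range(1, 9)
for k in ks:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sse.append(kmeans.inertia_)  # inertia_ is the SSE for this fit

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("SSE")
```

For data like this the plotted curve drops steeply up to three clusters and flattens beyond, which is the "elbow" being looked for.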
The method of K-Means separates data points linearly. This works well for the example above
but, when applied to a different set of data, the shortcoming of a linear separation is revealed.
There are ways around this! It is possible to map each data point into a higher-dimensional
space, which, if the mapping is chosen carefully, will allow the desired separation. This
method is called the Kernel K-Means method, and requires a "Kernel" function to be defined
that represents not the data mapping but an inner product of the mapped data. Unfortunately, the best function to use depends on the data.
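sklearn does not provide Kernel K-Means directly, but the idea can be sketched in a few lines of NumPy. The functions below (rbf_kernel and kernel_kmeans are names defined here, not library calls) are a minimal illustration, not a production implementation: the RBF kernel with gamma=10 is simply one guess at an inner product suited to two concentric rings, and nothing guarantees it is the best choice for other data.

```python
import numpy as np
from sklearn.datasets import make_circles

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2): the inner
    # product of the implicitly mapped points, never the mapping itself.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    # K-Means where distances to centroids are computed entirely from
    # the kernel matrix K, so the mapped space is never materialised.
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            mask = labels == c
            nc = mask.sum()
            if nc == 0:
                dist[:, c] = np.inf  # empty cluster: never the closest
                continue
            # ||phi(x_i) - centroid_c||^2, dropping the constant K_ii term
            dist[:, c] = (-2 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc**2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two concentric rings: not linearly separable in the original plane
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
labels = kernel_kmeans(rbf_kernel(X, gamma=10.0), n_clusters=2)
```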
An alternative method is to look at the density of points and use this as a way of determining
how they should be clustered. One density-based approach is known as DBSCAN (also available
as a function in sklearn
). This method looks
at all the points and determines which are surrounded by a given minimum number of other points
within a given radius. The points that are within this region are all classified as being the same
and the method is then recursively applied to those points. Eventually, there will be no more
points left to add to the cluster, so the method starts a new class and moves on to the next unvisited point.
At the end of the process most of the points will be assigned to a cluster. Some, at the
edges of clusters, remain unassigned but can be allocated to the closest cluster available.
Others will be neither classified nor boundary points, and these are discarded as outliers. An
example classification using DBSCAN is shown below, and it can be seen that the data are correctly
separated, and, with the radius and minimum number of points chosen, two classes are defined.
The DBSCAN method has the advantage that it automatically determines how many clusters there are,
but the catch is that the minimum number of points and the radius used for neighbour counting do
have to be specified. Again, the choice is determined by the data.
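A brief sketch of the usage is given below, with two interleaved half-moons from make_moons standing in for the kind of data a linear separation cannot split; the eps and min_samples values are simply guesses that happen to suit this synthetic set.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape a linear separation cannot split
data, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighbourhood radius and min_samples the minimum number of
# neighbours; both must be supplied, and these values are just a guess
# suited to this synthetic data.
db = DBSCAN(eps=0.2, min_samples=5).fit(data)
labels = db.labels_  # -1 marks points discarded as outliers

# Number of clusters found, excluding the outlier label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Note that, unlike KMeans, no cluster count is passed in; it falls out of the density parameters.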
Note, these are not the only methods available!