# Find the clusters

using K-means and DBSCAN

Cluster analysis is a method for looking into data to see if individual datapoints within a dataset can be categorised as belonging to one of a number of groups of similar datapoints. The category a datapoint belongs in, or indeed the number of categories to use, is generally not defined beforehand, and instead the data are used to suggest these. As such, cluster analysis is an unsupervised process.

In low dimensional data, for example the position of a point on the $$(x,y)$$ plane, such groupings are reasonably easy to find visually. However, if a data point is not characterised by two values but instead by, say 4 or 9 or 42 etc., then it is not easy to plot these in such a way that they can be visually inspected. It may be possible to reduce the dimensionality of the problem (this is done for example when a 3 dimensional object is projected onto a flat 2 dimensional screen), and there is a well known method called Principal Component Analysis (PCA) that can help with this. However, if it's not possible to reduce the dimension to allow for a visual method of finding clusters, or you just want an automated method based on the data, another approach must be used.

An easy method of determining clusters is the so-called K-Means method. This algorithm works by first assigning data to random categories and then finding the centroid (mean position) of all data points classified as being the same. It then looks through all the points again and reassigns their catergories by choosing them to be in the category that matches the centroid closest to them. A new centroid is calculated based on the new assignments and the process is repeated until no further updates can be made. At this point all the data points have been classified. The Python library sklearn has a function for K-Means (and many other clustering algorithms). An example usage is shown below.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# data = [[x1,y1],[x2,y2],...] a list of coordinate pairs

kmeans = KMeans(n_clusters=3)
kmeans.fit(data)
y_kmeans = kmeans.predict(data)

x=[]
y=[]
for X in data:
$$\quad \quad$$ x.append(X)
$$\quad \quad$$ y.append(X)
plt.scatter(x, y, c=y_kmeans, alpha=0.5)
plt.show()

The interactive element below shows the K-Means process for a set of data points that are chracterised by their position on the $$(x,y)$$ plane. To the human eye it is easy to see that there are three clusters. Pressing the "Start" button will show how K-Means finds the three clusters.

The above produces three distinct groups in three iterations. This is fine if it is known that there are three groups; in K-Means the number of clusters is an input parameter, but what if you don't know how many clusters there should be (which is one of the purposes of cluster analysis!)? One method is to try several different numbers of clusters and watch how a score of performance changes with the number of clusters. The score is usually measured by the sum of squared errors (SSE), which measures how far away points assigned to a cluster are from the centroid of the cluster. Usually, as the mumber of clusters increases the SSE will drop quickly. However, as the number gets beyond a certain value there will be a fall off in the amount by which it drops, see below. The point at which the drop-off changes is at three clusters in the above and this should be tried as the minimum number with which to analyse the data. This method of determining the number of clusters is known as the "elbow method" as it uses the "elbow" in the shape of the data.

The method of K-Means separates data points linearly. This works well for the example above but when applied to a different set of data, the shortcoming of a linear separation is apparent. There are ways around this! It is possible to map each data point into a higher-dimensional space, which, if the mapping is chosen carefully, will allow the desired separation. This method is called the Kernel K-Means method, and requires a "Kernel" function to be defined that represents not the data mapping but an inner product of the mapped data. Unfortunately, the best function to use depends on the data.

An alternative method is to look at the density of points and use this as a way of determining how they should be clustered. One density-based approach is known as DBSCAN (also available as a function in sklearn). This method looks at all the points and determines which are surrounded by a given minimum number of other points within a given radius. The points that are within this region are all classified as being the same and the method is then recursively applied to those points. Eventually, there will be no more points left within the region and the method starts a new class and moves on to the next unclassified point. At the end of the process most of the points will be assigned to a cluster, with some that are at the edges of clusters, which can be assigned to the closest cluster available. Others will be neither classified nor boundary points, and which will be discarded as being outliers. An example classification using DBSCAN is shown below, and it can be seen that the data are correctly separated, and, with the radius and minimum number of points chosen, two classes are defined. The DBSCAN method has the advantage that it automatically determines how many clusters there are but the catch is the minimum number of points and the radius used for neighbour counting do have to be specified. Again the choice is determined by the data.

Note, these are not the only methods available!