
Cluster analysis using K-means and DBSCAN

Cluster analysis is a method for examining data to see whether the individual datapoints in a dataset can be categorised as belonging to one of a number of groups of similar datapoints. The category a datapoint belongs to, and indeed the number of categories to use, is generally not defined beforehand; instead, the data themselves are used to suggest these. As such, cluster analysis is an unsupervised process.

In low-dimensional data, for example the position of a point on the \( (x,y) \) plane, such groupings are reasonably easy to find visually. However, if a data point is characterised not by two values but by, say, 4 or 9 or 42 values, then it is not easy to plot the data in a way that allows visual inspection. It may be possible to reduce the dimensionality of the problem (this is done, for example, when a 3-dimensional object is projected onto a flat 2-dimensional screen), and there is a well-known method called Principal Component Analysis (PCA) that can help with this. However, if it is not possible to reduce the dimension enough to allow a visual method of finding clusters, or you simply want an automated method based on the data, another approach must be used.
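As a brief illustration of the PCA idea mentioned above, scikit-learn's PCA class can project higher-dimensional data down to two components for plotting. This is only a sketch: the random 4-dimensional data here are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 100 points, each described by 4 values.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))

# Project onto the 2 directions of greatest variance so the
# points can be plotted and inspected visually.
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print(reduced.shape)  # (100, 2)
```

The two reduced coordinates can then be passed to a scatter plot in place of the original four values.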

An easy method of determining clusters is the so-called K-Means method. This algorithm works by first assigning data points to random categories and then finding the centroid (mean position) of all data points given the same category. It then looks through all the points again and reassigns their categories, placing each point in the category whose centroid is closest to it. New centroids are calculated from the new assignments and the process is repeated until no further updates are made. At this point all the data points have been classified. The Python library scikit-learn provides an implementation of this algorithm:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# data = [[x1,y1],[x2,y2],...] a list of coordinate pairs
kmeans = KMeans(n_clusters=3)
kmeans.fit(data)
y_kmeans = kmeans.predict(data)

x = []
y = []
for X in data:
    x.append(X[0])
    y.append(X[1])

plt.scatter(x, y, c=y_kmeans, alpha=0.5)
plt.show()


The interactive element below shows the K-Means process for a set of data points that are characterised by their position on the \( (x,y) \) plane. To the human eye it is easy to see that there are three clusters. Pressing the "Start" button will show how K-Means finds the three clusters.

The above produces three distinct groups in three iterations. This is fine if it is known that there are three groups; in K-Means the number of clusters is an input parameter. But what if you don't know how many clusters there should be (which is one of the purposes of cluster analysis!)? One method is to try several different numbers of clusters and watch how a score of performance changes with the number of clusters. The score is usually measured by the sum of the squared distances from each point to its assigned centroid (the "inertia"); the point where adding further clusters stops improving this score markedly (the "elbow" of the curve) suggests a suitable number of clusters.
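This "elbow" approach can be sketched with scikit-learn's inertia_ attribute. The three-blob data here are made up for the example; with real data you would substitute your own coordinate pairs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three tight blobs around different centres.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(loc=c, scale=0.3, size=(50, 2))
                       for c in ([0, 0], [3, 0], [0, 3])])

# Fit K-Means for several cluster counts and record the score
# (inertia: sum of squared distances to the nearest centroid).
scores = {}
for k in range(1, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    scores[k] = kmeans.inertia_

for k, s in scores.items():
    print(k, round(s, 1))
```

For data like this, the inertia drops sharply up to three clusters and then levels off, which is the "elbow" that suggests three as the right number.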

The method of K-Means separates data points linearly. This works well for the example above, but when it is applied to other sets of data the shortcomings of a linear separation become apparent. There are ways around this! It is possible to map each data point into a higher-dimensional space which, if the mapping is chosen carefully, will allow the desired separation. This method is called the Kernel K-Means method, and it requires a "kernel" function to be defined that represents not the data mapping itself but an inner product of the mapped data. Unfortunately, the best function to use depends on the data.
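A minimal sketch of the kernel idea is given below. This is not a scikit-learn class but a hand-rolled implementation written for illustration: the squared distance from a point to a cluster centroid in the mapped space is computed entirely from kernel (inner product) values, so the mapping itself is never needed. The RBF kernel and the gamma value are example choices, and whether the two half-moons are recovered depends on them and on the random initialisation.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Kernel K-Means on a precomputed kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        # Squared distance in the mapped space from each point to each
        # cluster centroid, using only inner products:
        # ||phi(x) - mu_c||^2 = K(x,x) - (2/|C|) sum_j K(x,j)
        #                       + (1/|C|^2) sum_ij K(i,j)
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue  # empty cluster: leave its column at infinity
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m**2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two interleaved half-moons are not linearly separable.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
K = rbf_kernel(X, gamma=15)  # kernel = inner product of the mapped data
labels = kernel_kmeans(K, n_clusters=2)
```

Because only the kernel matrix is used, swapping in a different kernel function requires no change to the clustering loop itself.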

An alternative method is to look at the density of points and use this as a way of determining how they should be clustered. One density-based approach is known as DBSCAN (also available as a function in scikit-learn). DBSCAN treats points with enough close neighbours as part of a dense region, grows clusters outwards from them, and labels points that lie alone in low-density regions as noise.

Note, these are not the only methods available!