Data Clustering with K-Means Using Python

I wrote an article titled “Data Clustering with K-Means Using Python” in the March 2018 issue of Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2018/03/27/clustering-with-k-means-using-python.aspx.

The idea of clustering is pretty simple: take a dataset, then group items together so that similar items are in the same group/cluster (and therefore dissimilar items are in different groups/clusters). After clustering, you can examine the results to see if any interesting patterns emerge, or you can identify outliers, which is a form of anomaly detection. But as always, the details are quite tricky.
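To make the outlier idea concrete, here is a minimal sketch of one simple approach: given data that has already been clustered, find the item that lies farthest from the mean of its own cluster. The function name `farthest_from_mean` and the toy data are my own inventions for illustration, and using Euclidean distance to the cluster mean is just one assumption among many possible ways to score outliers.

```python
import math

def farthest_from_mean(data, clustering):
    """Return (index, distance) of the item farthest from its own cluster mean.

    data       : list of equal-length numeric tuples
    clustering : list of cluster IDs, one per data item (assumed already computed)
    """
    # Group the items by cluster ID.
    groups = {}
    for item, cid in zip(data, clustering):
        groups.setdefault(cid, []).append(item)

    # Compute the mean (centroid) of each cluster, column by column.
    means = {
        cid: [sum(col) / len(members) for col in zip(*members)]
        for cid, members in groups.items()
    }

    # Euclidean distance from each item to the mean of its own cluster.
    dists = [math.dist(item, means[cid]) for item, cid in zip(data, clustering)]

    idx = max(range(len(data)), key=dists.__getitem__)
    return idx, dists[idx]

# Hypothetical toy data: item 3 was placed in cluster 0 but sits far from it.
toy_data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (4.0, 4.0),
            (8.0, 8.0), (8.1, 7.9), (7.9, 8.2)]
toy_clustering = [0, 0, 0, 0, 1, 1, 1]
outlier_index, outlier_distance = farthest_from_mean(toy_data, toy_clustering)
```

An item flagged this way is only a candidate for a closer look. Whether it is a true anomaly depends on the problem.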

There are several clustering algorithms. In my article, I explained how to implement one of the most common, called the k-means technique (sometimes referred to as Lloyd’s algorithm). However, k-means is really more of a heuristic than a detailed algorithm, meaning that there are many different specific approaches you can use.

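Many of the different k-means approaches involve the initialization phase. As it turns out, getting the k-means algorithm started well is very important. This is because finding an optimal clustering is an NP-hard problem, which means that it’s not practical to guarantee an optimal clustering (because you’d essentially have to try every possible clustering). Instead, k-means starts from an initial guess and improves it until it settles on a clustering that is good but not necessarily the best, so the result depends on where you start. In fact, one variant of k-means is called k-means++ and it uses a pretty complicated initialization routine.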
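For a feel of what a smarter initialization looks like, here is a minimal sketch of the k-means++ seeding idea as it is commonly described: pick the first starting mean at random, then pick each additional one with probability proportional to its squared distance from the nearest mean chosen so far. The names `squared_distance` and `kmeans_pp_init` are hypothetical, and the sketch assumes the data contains at least k distinct items. It is not necessarily the routine used in the article.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(data, k, seed=0):
    """Choose k starting means from data using k-means++ style seeding.

    Assumes data is a list of numeric tuples with at least k distinct items.
    """
    rng = random.Random(seed)

    # The first mean is a data item picked uniformly at random.
    means = [list(rng.choice(data))]

    while len(means) < k:
        # Squared distance from each item to its closest already-chosen mean.
        d2 = [min(squared_distance(item, m) for m in means) for item in data]

        # Pick the next mean with probability proportional to that distance,
        # so far-away items are favored and already-chosen items (weight 0)
        # are never picked again.
        next_item = rng.choices(data, weights=d2, k=1)[0]
        means.append(list(next_item))

    return means

# Hypothetical toy data with two obvious groups.
toy_data = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
            (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
starting_means = kmeans_pp_init(toy_data, k=2, seed=1)
```

The design choice here is to spread the starting means out, which tends to reduce the chance of a poor final clustering compared with picking all k starting points uniformly at random.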

Anyway, I show exactly how to implement one possible variation of k-means clustering, using the Python language. The idea of a custom implementation is that it gives you total control over the many different options you can apply.
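To show the general shape of such an implementation, here is a minimal sketch of basic k-means in plain Python. It is not the code from the article. The function names, the choice of random data items as starting means, the rule for empty clusters, and the stopping rule are all assumptions I made for illustration, and each is exactly the kind of option a custom implementation lets you change.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, max_iter=100, seed=0):
    """Cluster data (a list of numeric tuples) into k groups.

    Returns (clustering, means), where clustering[i] is the cluster ID of data[i].
    """
    rng = random.Random(seed)

    # Initialization (one option of many): use k distinct data items as means.
    means = [list(item) for item in rng.sample(data, k)]
    clustering = [-1] * len(data)

    for _ in range(max_iter):
        # Assignment step: put each item in the cluster with the closest mean.
        new_clustering = [
            min(range(k), key=lambda c: squared_distance(item, means[c]))
            for item in data
        ]

        # Stop when no item changes cluster.
        if new_clustering == clustering:
            break
        clustering = new_clustering

        # Update step: recompute each mean from its current members.
        for c in range(k):
            members = [item for item, cid in zip(data, clustering) if cid == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = [sum(col) / len(members) for col in zip(*members)]

    return clustering, means

# Hypothetical toy data with two obvious groups.
toy_data = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
            (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
toy_clustering, toy_means = kmeans(toy_data, k=2)
```

On data this cleanly separated, the first three items end up in one cluster and the last three in the other, although which cluster gets ID 0 depends on the random start.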

The biggest downside to k-means clustering is that the technique can be used only with data that is all numeric, because it relies on computing means and distances. There are techniques for clustering non-numeric or mixed numeric and non-numeric data, but they are considerably more difficult.

“Liverpool from Wapping” (1875), John Grimshaw. Ships, people, buildings in three different clusters.