Understanding k-NN Classification using C#

I wrote an article titled “Understanding k-NN Classification using C#” in the December 2017 issue of Microsoft MSDN Magazine. See https://msdn.microsoft.com/en-us/magazine/mt814421.

The goal of k-NN (“k nearest neighbors”) classification is to predict the class of an item based on two or more predictor variables. For example, you might want to predict the political leaning (conservative, moderate, liberal) of a person based on their age, income, years of education, and number of children.

The technique is very simple. You obtain a set of training data with known predictor values and known class labels. Then, for an item with an unknown class, you find the k training data points nearest to it and predict the most common class among those neighbors.

In the image below, there are three classes, indicated by the red, green, and yellow data points. Each item has two predictor variables. The blue dot is the unknown. If you set k = 4, the four closest points to the blue dot are the red at (5,3), the yellow at (4,2), the yellow at (4,1), and the green at (6,1). The most common class is yellow, so you predict yellow.
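The idea in the example above can be sketched in a few lines of C#. This is a minimal illustration, not the code from the article: the blue dot's coordinates and the two distant points are assumed for demonstration, and distance is ordinary Euclidean distance.

```csharp
using System;
using System.Linq;

class KnnDemo
{
  // Euclidean distance between two points.
  static double Dist(double[] a, double[] b) =>
    Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

  // Find the k nearest training points to the unknown item,
  // then return the most common class label among them.
  static string Classify(double[][] data, string[] labels,
    double[] unknown, int k)
  {
    return data
      .Select((pt, i) => new { Label = labels[i], D = Dist(pt, unknown) })
      .OrderBy(p => p.D)
      .Take(k)
      .GroupBy(p => p.Label)
      .OrderByDescending(g => g.Count())
      .First().Key;
  }

  static void Main()
  {
    // The four neighbors named above, plus two assumed distant points
    // that are ignored when k = 4.
    double[][] data = {
      new double[] { 5, 3 }, new double[] { 4, 2 },
      new double[] { 4, 1 }, new double[] { 6, 1 },
      new double[] { 1, 8 }, new double[] { 9, 9 }
    };
    string[] labels = { "red", "yellow", "yellow", "green", "red", "green" };

    double[] unknown = { 4.5, 2.0 };  // the blue dot (coordinates assumed)

    Console.WriteLine(Classify(data, labels, unknown, 4));  // prints "yellow"
  }
}
```

With k = 4, yellow gets two votes against one each for red and green, so the sketch reproduces the yellow prediction described above.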

Compared to other classification algorithms, the advantages of k-NN classification include: it is easy to implement, it can be easily modified for specialized scenarios, it works well with complex data patterns, and its results are somewhat interpretable.

Disadvantages of k-NN classification include: the result can be sensitive to the choice of the value of k, the technique works well only when all predictor variables are strictly numeric, a tie result prediction is possible, and the technique doesn’t scale well to huge training data sets.
