Weighted k-Nearest Neighbors Classification Example Using Python

One of the most fundamental machine learning classification techniques is k-nearest neighbors (k-NN). If you have a set of labeled data points and you want to classify a new unlabeled data item, you compute the distance from the unlabeled item to all of the labeled points, examine the k closest labeled points, and take the most common label.

Sort of.

The details are a bit tricky and are best explained by a concrete example. I set up 30 dummy data points that look like:

[ 0, 0.32, 0.43, 0]
[ 1, 0.26, 0.54, 0]
. . . 
[12, 0.39, 0.43, 1]
. . .
[29, 0.71, 0.22, 2]

Each value represents a person. The first value is an ID, which is synonymous with the zero-based index. The second value is the person’s income (normalized). The third value is the person’s education level (normalized). The fourth value is the class label to predict, which can be 0, 1, or 2. The class labels are arbitrary but you can imagine that they are a measure of the person’s happiness.
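As an aside, the income and education values are already scaled to [0, 1]. The post doesn't say which scheme was used, but min-max normalization is a common choice — a minimal sketch using hypothetical raw income values:

```python
import numpy as np

raw = np.array([26000.0, 32000.0, 71000.0])  # hypothetical raw incomes
norm = (raw - raw.min()) / (raw.max() - raw.min())  # min-max scale to [0, 1]
print(norm)  # smallest value maps to 0.0, largest to 1.0
```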



The item to predict is set as [0.62, 0.35]. If you look at the graph of the data points (colored red, blue, green) and the item-to-predict (colored yellow), and you set k = 3, the 3 closest points to the yellow item are red, red, green, so if you use a majority voting rule, you’d predict the unlabeled item is class red = 0.
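That plain majority vote can be sketched in a few lines of NumPy, using the three nearest neighbors from the demo data (class label in the last column):

```python
import numpy as np

# three nearest neighbors to (0.62, 0.35); last column is the class label
near3 = np.array([[0.61, 0.42, 0],
                  [0.55, 0.32, 0],
                  [0.55, 0.41, 1]])
classes = near3[:, 2].astype(int)
counts = np.bincount(classes, minlength=3)  # tally one vote per neighbor
print(counts.argmax())  # class 0 wins 2-to-1
```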

I set k = 6 in my demo code. The problem with a simple majority voting rule is that it doesn’t take into account how close the k points are. A better approach is to give more weight to points that are close to the data item that is being predicted. For the demo data, the k = 6 closest data points and their distances are:

(0.61 0.42) class =  0  dist = 0.0707   1/dist =  14.1421
(0.55 0.32) class =  0  dist = 0.0762   1/dist =  13.1306
(0.55 0.41) class =  1  dist = 0.0922   1/dist =  10.8465
(0.64 0.24) class =  2  dist = 0.1118   1/dist =   8.9443
(0.49 0.32) class =  0  dist = 0.1334   1/dist =   7.4953
(0.71 0.22) class =  2  dist = 0.1581   1/dist =   6.3246

The inverse distances are used so that a smaller distance gives a larger voting weight. The sum of the inverse distances is 14.1421 + . . . + 6.3246 = 60.8834. The voting weights are the inverse distances divided by their sum:

class =  0  weight =  14.1421 / 60.8834 =  0.2323
class =  0  weight =  13.1306 / 60.8834 =  0.2157
class =  1  weight =  10.8465 / 60.8834 =  0.1782
class =  2  weight =   8.9443 / 60.8834 =  0.1469
class =  0  weight =   7.4953 / 60.8834 =  0.1231
class =  2  weight =   6.3246 / 60.8834 =  0.1039
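The distance and weight calculations above can be reproduced directly with NumPy — a compact sketch using the coordinates of the six nearest neighbors:

```python
import numpy as np

item = np.array([0.62, 0.35])
# the six nearest neighbors: income, education, class
nbrs = np.array([[0.61, 0.42, 0], [0.55, 0.32, 0], [0.55, 0.41, 1],
                 [0.64, 0.24, 2], [0.49, 0.32, 0], [0.71, 0.22, 2]])

dists = np.sqrt(((nbrs[:, :2] - item) ** 2).sum(axis=1))  # Euclidean
inv = 1.0 / dists      # smaller distance -> bigger vote
wts = inv / inv.sum()  # normalize so the weights sum to 1.0
print(np.round(wts, 4))  # [0.2323 0.2157 0.1782 0.1469 0.1231 0.1039]
```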

The voting for each class is:

class 0 : 0.2323 + 0.2157 + 0.1231 =  0.5711
class 1 :                          =  0.1782
class 2 : 0.1469 + 0.1039          =  0.2508

Because class 0 has the most voting weight, the prediction is that the unlabeled data item is class 0.
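The per-class accumulation above is exactly what NumPy's bincount() does when given a weights argument — a small sketch using the class labels and voting weights from the table:

```python
import numpy as np

classes = np.array([0, 0, 1, 2, 0, 2])  # labels of the six neighbors
wts = np.array([0.2323, 0.2157, 0.1782, 0.1469, 0.1231, 0.1039])
votes = np.bincount(classes, weights=wts, minlength=3)  # weighted tally
print(np.round(votes, 4))  # [0.5711 0.1782 0.2508]
print(votes.argmax())      # predicted class: 0
```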

The main advantage of k-NN classification is simplicity. The disadvantages include: the technique works only for strictly numeric data (which must be normalized), there's no good way to select k, and different values of k can give very different predictions.
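The sensitivity to k is easy to demonstrate. Here's a contrived one-dimensional example (made-up points, not the demo data) where k = 1 and k = 3 give different predictions:

```python
import numpy as np

pts = np.array([0.30, 0.50, 0.70, 0.75])  # made-up 1-D data points
lbls = np.array([0, 1, 0, 0])             # made-up class labels
item = 0.52                               # item to classify

order = np.abs(pts - item).argsort()  # indices sorted by distance
for k in (1, 3):
    votes = np.bincount(lbls[order[:k]], minlength=2)
    print("k = %d  predicted class = %d" % (k, votes.argmax()))
# k = 1 predicts class 1; k = 3 predicts class 0
```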



Where is planet Earth’s nearest inhabited neighbor? Several scientists have pointed out that broadcasting messages to outer space has slight but serious risks. When high-tech civilizations meet lower-tech civilizations, the results are usually not good for the lower-tech civilization — that would be Earth.

Left: In “Kronos” (1957), aliens send a giant energy harvesting machine to Earth. Center: In “Invaders from Mars” (1953), a Martian intelligence comes to Earth with plans of conquest. Right: In “The Crawling Eye” (1958), aliens land in the Swiss Alps and begin terra-forming Earth to suit their needs as a prelude to conquest.


Demo code:

# weighted_knn.py
# k-nearest neighbors demo
# Anaconda3-2020.02  (Python 3.7.6)

import numpy as np

def dist_func(item, data_point):
  sq_sum = 0.0  # avoid shadowing built-in sum()
  for i in range(2):
    diff = item[i] - data_point[i+1]  # skip the ID column
    sq_sum += diff * diff
  return np.sqrt(sq_sum)  # Euclidean distance

def make_weights(k, distances):
  result = np.zeros(k, dtype=np.float32)
  total = 0.0  # avoid shadowing built-in sum()
  for i in range(k):
    result[i] = 1.0 / distances[i]  # inverse distance
    total += result[i]
  result /= total  # normalize so weights sum to 1.0
  return result

def show(v):
  print("idx = %3d (%3.2f %3.2f) class = %2d  " \
    % (v[0], v[1], v[2], v[3]), end="")

def main():
  print("\nBegin weighted k-NN demo \n")
  print("Normalized income-education data looks like: ")
  print("[id =  0, 0.32, 0.43, class = 0]")
  print(" . . . ")
  print("[id = 29, 0.71, 0.22, class = 2]")

  # ID/idx  income  education  class (arbitrary)
  data = np.array([
    [0, 0.32, 0.43, 0], [1, 0.26, 0.54, 0],
    [2, 0.27, 0.60, 0], [3, 0.37, 0.36, 0],
    [4, 0.37, 0.68, 0], [5, 0.49, 0.32, 0],
    [6, 0.46, 0.70, 0], [7, 0.55, 0.32, 0],
    [8, 0.57, 0.71, 0], [9, 0.61, 0.42, 0],
    [10, 0.63, 0.51, 0], [11, 0.62, 0.63, 0],
    [12, 0.39, 0.43, 1], [13, 0.35, 0.51, 1],
    [14, 0.39, 0.63, 1], [15, 0.47, 0.40, 1],
    [16, 0.48, 0.50, 1], [17, 0.45, 0.61, 1],
    [18, 0.55, 0.41, 1], [19, 0.57, 0.53, 1],
    [20, 0.56, 0.62, 1], [21, 0.28, 0.12, 1],
    [22, 0.31, 0.24, 1], [23, 0.22, 0.30, 1],
    [24, 0.38, 0.14, 1], [25, 0.58, 0.13, 2],
    [26, 0.57, 0.19, 2], [27, 0.66, 0.14, 2],
    [28, 0.64, 0.24, 2], [29, 0.71, 0.22, 2]],
    dtype=np.float32)

  item = np.array([0.62, 0.35], dtype=np.float32)
  print("\nNearest neighbors (k=6) to (0.62, 0.35): \n")

  # 1. compute all distances to item
  N = len(data)
  k = 6  # number of nearest neighbors to examine
  c = 3  # number of classes
  distances = np.zeros(N)
  for i in range(N):
    distances[i] = dist_func(item, data[i])

  # 2. get ordering of distances
  ordering = distances.argsort()

  # 3. get and show info for k nearest
  k_near_dists = np.zeros(k, dtype=np.float32)
  sum_inv_dists = 0.0
  for i in range(k):
    idx = ordering[i]
    show(data[idx])  # pretty formatting
    print("distance = %0.4f " % distances[idx], end="")
    k_near_dists[i] = distances[idx]  # save dists
    print("  1/dist = %8.4f " % (1.0/distances[idx]))
    sum_inv_dists += 1.0/distances[idx]
  print("\nsum inv distances = %9.4f " % sum_inv_dists)

  # 4. vote
  votes = np.zeros(c, dtype=np.float32)
  wts = make_weights(k, k_near_dists)
  print("\nVoting weights (inverse distance technique): ")
  for i in range(len(wts)):
    print("%7.4f" % wts[i], end="")
  
  print("\n\nPredicted class: ")
  for i in range(k):
    idx = ordering[i]
    pred_class = np.int64(data[idx][3])
    votes[pred_class] += wts[i]
  for i in range(c):
    print("[%d]  %0.4f" % (i, votes[i]))

  print("\nEnd weighted k-NN demo ")

if __name__ == "__main__":
  main()