Exploring the scikit Library SpectralClustering Affinity Matrix

I spend 30 minutes each morning before work looking at technical stuff. One morning I decided to dissect the scikit library SpectralCustering module. Spectral clustering is much more complicated than the two most common techniques, k-means clustering and DBSCAN clustering.

The first step in spectral clustering is to compute an affinity matrix — the similarity of each data item with every other data item. There are many ways to do this, including RBF (radial basis function) similarity, nearest neighbors connectivity, epsilon radius neighbors, heat kernel applied to distances, and so on. An affinity matrix is similar to, but technically a bit different than, a similarity matrix.

One of the things that I don’t like about spectral clustering is that there are dozens of ways to implement it, as opposed to a technique like k-means which has fewer design options.

The scikit default is RBF with gamma = 1.0. To explore, I located the scikit source code on my machine and decorated the code by adding print() statements to display the internal affinity matrix. My installation of the scikit code was located at C:\Users\(user)\Anaconda3\Lib\site-packages\sklearn\cluster in file _spectral.py. I added two print statements:

. . .
  else:
      params = self.kernel_params
      if params is None:
          params = {}
      if not callable(self.affinity):
          params["gamma"] = self.gamma
          params["degree"] = self.degree
          params["coef0"] = self.coef0
      self.affinity_matrix_ = pairwise_kernels(
          X, metric=self.affinity, filter_params=True, **params
      )
# === added:
      print("\nAffinity matrix computed inside library: ") 
      print(self.affinity_matrix_); input()  # is an array
# ===
  random_state = check_random_state(self.random_state)
. . .

To verify my understanding, I wrote a from-scratch function to compute the same affinity matrix. I got the same result as the internal affinity matrix, as expected.

I ran a few experiments with small datasets, and the RBF affinity approach did not seem to work quite as well as nearest neighbors affinity, and a custom epsilon radius neighbors. Curious. I need to investigate. Update: After more experiments, RBF affinity with a better choice of gamma works just as well as nearest neighbors affinity.

The values on the diagonal of the affinity matrix are 1.0 because each data item has maximum RBF similarity with itself. For data item [0], the closest point is item [7] which has the second largest RBF value (0.9512).

At that point, it was time to go to work, but I was satisfied I gained a new bit of knowledge.

My next steps will be to explore how the affinity matrix is converted to a Laplacian matrix (a complex problem), which is then decomposed using eigenvalue decomposition (a complex problem), and then those results are used (somehow) to cluster the source data. Update: I did this. See https://jamesmccaffreyblog.com/2023/11/27/spectral-data-clustering-from-scratch-using-csharp/.

Three old movies that have high similarity. Left: In “The Beast From 20,000 Fathoms” (1953), a prehistoric creature rampages through New York City. Center: In “Godzilla” (1954), a prehistoric creature rampages through Tokyo. In “The Giant Behemoth” (1959), a prehistoric creature rampages through London. I like all three of these movies.

Demo code:

# spectral_demo_2.py

import numpy as np
from sklearn.cluster import SpectralClustering

# SpectralClustering(n_clusters=8, *, eigen_solver=None,
# n_components=None, random_state=None, n_init=10,
# gamma=1.0, affinity='rbf', n_neighbors=10,
# eigen_tol='auto', assign_labels='kmeans', degree=3,
# coef0=1, kernel_params=None, n_jobs=None,
# verbose=False)

print("\nBegin spectral clustering demo ")

np.set_printoptions(precision=4, suppress=True,
  floatmode='fixed', linewidth=100)

X = np.array([[0.20, 0.20],
[0.20, 0.70],
[0.30, 0.90],
[0.50, 0.50],
[0.50, 0.60],
[0.60, 0.50],
[0.70, 0.90],
[0.40, 0.10],
[0.90, 0.70],
[0.80, 0.20]])

print("\nData: ")
print(X)

print("\nApplying SpectralClustering() ")
clustering = SpectralClustering(n_clusters=2,
  # affinity='nearest_neighbors', n_neighbors=3,
  affinity='rbf', gamma=1.0,
  assign_labels='kmeans', random_state=0).fit(X)

print("\nResult: ")
print(clustering.labels_)

def my_rbf(x1, x2, gamma):
  # documentation: "np.exp(-gamma * d(X,X) ** 2)"
  dim = len(x1)
  sum = 0.0
  for i in range(dim):
    sum += (x1[i] - x2[i]) * (x1[i] - x2[i])
  result = np.exp(-gamma * sum)
  return result

print("\naffinity computed from scratch: ")
my_affinity = np.zeros((10,10), dtype=np.float64)
for i in range(10):
  for j in range(10):
    my_affinity[i][j] = my_rbf(X[i], X[j], 1.0)
print(my_affinity)

print("\nEnd demo ")