I don’t use kernel ridge regression very often but I figured I’d implement KRR from scratch using Python. After a few hours of work, I was quite surprised when my scratch implementation produced results identical to those of the scikit-learn KernelRidge module, even though I didn’t look at the scikit source code.
Note: After I wrote this blog post, I realized that “from scratch” can have several meanings. I used the numpy.linalg.inv() function to compute a matrix inverse. I make no apologies — matrix inversion is one of the most difficult problems in numerical programming. That said, I could have used a completely from-scratch Python matrix inverse function I wrote: https://jamesmccaffreyblog.com/2022/01/14/matrix-inverse-from-scratch-using-python/.
Implementation of KRR from scratch is relatively easy if you know the underlying math. The key equations are:
w * K(X', X') = Y'
w * K(X', X') * inv(K(X', X')) = Y' * inv(K(X', X'))
w = Y' * inv(K(X', X'))
The first equation means, “The assumption is that you multiply the model weights vector w times the kernel matrix K between all possible combinations of x training data to get all training values Y’.” So, next, the goal is to compute the model weights. The kernel matrix K(X’, X’) is easily computed directly from the X predictor values. The second equation multiplies both sides of the first equation by the inverse of the kernel matrix. After cancellation where A * inv(A) = Identity, the third equation shows how to compute the model weights: multiply the training target values Y’ vector times the inverse of K(X’, X’) matrix.
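To make the weight computation concrete, here is a minimal NumPy sketch. The 2-by-2 kernel matrix and target values are made up for illustration, not taken from the demo data:

```python
import numpy as np

# hypothetical 2x2 kernel matrix and target values
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
y = np.array([0.3, 0.9])

w = np.matmul(y, np.linalg.inv(K))  # w = Y' * inv(K)

# sanity check: w times K recovers the targets
print(np.allclose(np.matmul(w, K), y))  # True
```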
Therefore, to compute a predicted y from an x vector:
y_pred = w * K(x, X')
In words, “To compute a predicted y value from an x vector, matrix multiply the weights vector (computed via the equation above) times the kernel matrix of x against all training X’ vectors.”
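The prediction step can be sketched like this, with the kernel passed in as a function argument. The names predict and kernel are just illustrative, and the dot-product kernel in the toy check is a stand-in, not the RBF kernel used later:

```python
import numpy as np

def predict(w, x, X_train, kernel):
  # y_pred = w * K(x, X') = sum of w[i] * kernel(x, X_train[i])
  return sum(w[i] * kernel(x, X_train[i])
    for i in range(len(X_train)))

# toy check with a simple dot-product kernel
k = lambda a, b: float(np.dot(a, b))
w = np.array([1.0, 2.0])
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([3.0, 4.0])
print(predict(w, x, X_train, k))  # 1*3 + 2*4 = 11.0
```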
It turns out that it’s a good idea to add regularization. The “ridge” in KRR means use L2 regularization. To do this, a small value, usually called alpha or noise or lambda, is added to the diagonal elements of the K matrix before computing its inverse.
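The diagonal adjustment is a one-liner in NumPy. The 3x3 matrix here is made up for illustration:

```python
import numpy as np

K = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
alpha = 0.001
K_reg = K + alpha * np.eye(len(K))  # adds alpha to diagonal only
print(K_reg[0][0])  # 1.001
```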
My implementation uses hard-coded radial basis function (RBF) as the kernel function, with a gamma parameter = 1.0, where rbf(x1, x2) = exp( -gamma * ||x1 – x2||^2 ). The ||x1 – x2||^2 is squared Euclidean distance.
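A direct transcription of that definition, with a quick sanity check that identical vectors give a kernel value of 1:

```python
import numpy as np

def rbf(x1, x2, gamma=1.0):
  # rbf = exp( -gamma * ||x1 - x2||^2 )
  return np.exp(-gamma * np.sum((x1 - x2)**2))

a = np.array([0.1, 0.5, 0.2])
print(rbf(a, a))  # 1.0
```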
For my demo, I set up a tiny set of four predictor values where each predictor is a vector with 3 values. The target y values are single numbers:
  X                       y
x0 = [0.1, 0.5, 0.2]    y0 = 0.3
x1 = [0.4, 0.3, 0.0]    y1 = 0.9
x2 = [0.6, 0.1, 0.8]    y2 = 0.4
x3 = [0.0, 0.2, 0.7]    y3 = 0.8
A KRR model that uses RBF needs two hyperparameters: a gamma value for the RBF function, and an alpha value for L2 regularization. After some trial and error, I used gamma = 1.0 and alpha = 0.001. My scratch demo produced predicted y values of (0.3037, 0.8967, 0.4012, 0.7975), very close to the actual y values of (0.3, 0.9, 0.4, 0.8).
I used the trained model to predict y for a new, previously unseen x = (0.5, 0.4, 0.6) and got y = 0.4097.
I punched my training data into a scikit KernelRidge model and was mildly surprised when I got identical results.
A fun experiment!

I don’t mind refactoring machine learning computer programs using different algorithms and different programming languages. I’ve looked at kernel ridge regression using scratch Python (matrix inversion), scratch Python (pseudo stochastic gradient descent), scikit KernelRidge (library code), and scratch C#. Every time I refactor a program, I learn something new.
When I was growing up, I loved the Tintin series of books. The series has been refactored several times as animated films, all of them interesting.
Top Left: “The Crab with the Golden Claws” (1947) was produced in Belgium by Wilfried Bouchery using stop motion. But it was shown only once when the producers went bankrupt and the film was confiscated. I found a bootleg copy on the Internet. Not bad, but mostly for historical interest.
Top Right: “Herge’s Adventures of Tintin” (1957-1964) was produced by Belgium’s Belvision Studios. There were 103 five-minute episodes that covered most of the Tintin books. I like this series a lot.
Bottom Left: “The Adventures of Tintin” (1991-1992) was a French-Canadian co-production. There were 39 thirty-minute episodes that followed the books closely. I love this series.
Bottom Right: “The Adventures of Tintin” (2011) was directed by Steven Spielberg. The 107-minute feature film used computer animation. I thought the movie was only so-so. There was too much emphasis on over-the-top action sequences for my taste, and the animation was kind of creepy to my eye.
Demo code.
# krr_scratch_matrix.py
# lightweight kernel ridge regression from scratch
# closed-form matrix inverse training
# Anaconda3-2022.10 Python 3.9.13
# Windows 10/11
import numpy as np
# -----------------------------------------------------------
def rbf(x1, x2, gamma):
  # x1, x2 are 1-D arrays/vectors
  # rbf = exp( -gamma * ||x1 - x2||^2 )
  dist = np.linalg.norm(x1 - x2)  # Euclidean distance
  return np.exp( -gamma * (dist**2) )

  # dim = len(x1)  # less efficient but more clear
  # sum = 0.0
  # for i in range(dim):
  #   sum += (x1[i] - x2[i]) * (x1[i] - x2[i])
  # return np.exp( -gamma * sum )
# -----------------------------------------------------------
def compute_output(w, x, X, gamma):
  # x is a 1D array / vector of predictors
  # w is an array of weights
  # X is a 2D matrix of all training data
  N = len(X)  # number of train items
  sum = 0.0
  for i in range(N):
    xx = X[i]  # train item as 1D vector
    k = rbf(x, xx, gamma)  # kernel value
    sum += w[i] * k
  return sum
# -----------------------------------------------------------
print("\nLightweight kernel ridge regression from scratch ")
np.random.seed(1) # not used -- no randomness
np.set_printoptions(precision=4, suppress=True)
# 0. set up training data
X = np.array([[0.1, 0.5, 0.2],
              [0.4, 0.3, 0.0],
              [0.6, 0.1, 0.8],
              [0.0, 0.2, 0.7]], dtype=np.float64)
y = np.array([0.3, 0.9, 0.4, 0.8], dtype=np.float64)
print("\nX values: ")
print(X)
print("\nTarget y values: ")
print(y)
# w * K = y
# w * K * inv(K) = y * inv(K)
# w = y * inv(K)
# 1. make kernel matrix K
N = len(X)
gamma = 1.0
alpha = 0.001 # regularization
print("\nSetting alpha = %0.4f gamma = %0.2f " \
  % (alpha, gamma))
K = np.zeros((N,N))
for i in range(N):
  for j in range(N):
    K[i][j] = rbf(X[i], X[j], gamma)
print("\nK = " )
print(K)
# 2. add regularization term on diagonal
for i in range(N):
  K[i][i] += alpha
# 3. compute inverse of modified K matrix
Kinv = np.linalg.inv(K)
print("\nKinv = ")
print(Kinv)
# 4. compute model weights using K inverse
wts = np.matmul(y, Kinv)
print("\nwts = ")
print(wts)
# 5. use trained model to make predictions
print("\nActual y values: ")
print(y)
print("\nPredicted y values: ")
for i in range(N):
  x = X[i]
  y_pred = compute_output(wts, x, X, gamma=gamma)
  print("%0.4f" % y_pred)
# 6. predict for previously unseen x
x = np.array([0.5, 0.4, 0.6], dtype=np.float64)
print("\nPredicting for [0.5, 0.4, 0.6] ")
y_pred = compute_output(wts, x, X, gamma=gamma)
print("\nPredicted y = %0.4f " % y_pred)
# 7. compare with scikit
print("\nUsing scikit KernelRidge ")
from sklearn.kernel_ridge import KernelRidge
model = KernelRidge(kernel='rbf', gamma=1.0, alpha=0.001)
model.fit(X, y)
print("\nModel weights from scikit KernelRidge ")
print(model.dual_coef_)
y_preds = model.predict(X)
print("\nPredicted y values using scikit KernelRidge: ")
print(y_preds)
print("\nEnd KRR from scratch demo ")
