Comparing Support Vector Regression versus Post-Training-Trimmed Kernel Ridge Regression Using Scikit

Bottom line: I did a short experiment to compare support vector regression (SVR) with post-training-trimmed kernel ridge regression (KRR). I used the scikit library SVR and KernelRidge modules. Because of the nearly infinite number of hyperparameter combinations involved, it wasn’t possible to draw a definitive conclusion, but my results strongly suggest the two techniques are essentially the same, or at least very similar, from a practical point of view.

Bear with me. The goal of a machine learning regression problem is to predict a single numeric value. For example, a bank might want to predict a maximum loan amount based on applicant age, sex, annual income, debt, and so on.

Common regression techniques include linear regression, quadratic regression, nearest neighbors regression, kernel ridge regression, Gaussian process regression, kernel support vector regression, neural network regression, random forest regression, and gradient boosting regression. Each technique has dozens of variations, and each technique has pros and cons.

Two closely related machine learning regression techniques are kernel ridge regression (KRR) and support vector regression (SVR). Both techniques use a kernel function (usually the radial basis function, RBF) to compare two data items for similarity. Both techniques must store training data in order to make predictions. KRR must store all training items, while SVR eliminates some of the items during training, leaving just the “support vectors” that need to be stored. So, SVR uses less memory than KRR.
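To make the storage point concrete, here is a minimal NumPy sketch of RBF kernel ridge regression. It is not scikit code: the function names (rbf_kernel, krr_fit, krr_predict), the synthetic data, and the hyperparameter values are illustrative assumptions. The sketch shows that the fitted model is one weight per training item, and that a prediction needs every stored training item.

```python
# Minimal sketch of RBF kernel ridge regression in plain NumPy.
# All function names here are hypothetical helpers, not scikit APIs.
import numpy as np

def rbf_kernel(A, B, gamma):
  # k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows
  sq_dists = (np.sum(A**2, axis=1)[:, None]
    + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
  return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def krr_fit(X, y, gamma, alpha):
  # one weight per training item: solve (K + alpha*I) w = y
  K = rbf_kernel(X, X, gamma)
  return np.linalg.solve(K + alpha * np.eye(len(X)), y)

def krr_predict(X_new, X_train, weights, gamma):
  # prediction = weighted sum of similarities to stored items
  return rbf_kernel(X_new, X_train, gamma) @ weights

# made-up data: 20 items, 5 features, linear target
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 3.0, -1.0])

weights = krr_fit(X_train, y_train, gamma=0.1, alpha=0.001)
preds = krr_predict(X_train, X_train, weights, gamma=0.1)
print("Number of stored weights:", len(weights))
print("Largest train residual: %0.6f" % \
  np.max(np.abs(y_train - preds)))
```

In this closed-form version, each training residual equals alpha times that item's weight, so the items that are predicted most closely are also the items with the smallest weights.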

On the other hand, the epsilon-insensitive loss function used by SVR is not differentiable everywhere, so kernel SVR isn’t normally trained using plain stochastic gradient descent. It is usually trained with a quadratic programming solver instead, and therefore SVR does not scale well to very large datasets. So, KRR is easier to train than SVR.
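As a rough illustration of the difference, the sketch below compares squared error with the epsilon-insensitive loss that epsilon-SVR uses. The function names and the residual values are made up for the example. The epsilon-insensitive loss is exactly zero inside the tube and has a kink at each edge of the tube, which is where the derivative is undefined.

```python
# Squared loss vs. epsilon-insensitive loss on a few residuals.
# Function names and residual values are illustrative only.
import numpy as np

def squared_loss(residual):
  # smooth everywhere; this is what KRR penalizes
  return residual ** 2

def eps_insensitive_loss(residual, epsilon):
  # zero inside the tube, grows linearly outside it
  return np.maximum(0.0, np.abs(residual) - epsilon)

residuals = np.array([-12.0, -8.0, -3.0, 0.0, 3.0, 8.0, 12.0])
sq = squared_loss(residuals)
ei = eps_insensitive_loss(residuals, epsilon=8.0)
print(sq.tolist())  # [144.0, 64.0, 9.0, 0.0, 9.0, 64.0, 144.0]
print(ei.tolist())  # [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]
```

Items with zero epsilon-insensitive loss contribute nothing to the SVR fit, which is why they can be dropped from the stored model.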

Note: In addition to the kernel support vector regression technique discussed in this blog post, there is a linear support vector regression technique which is essentially useless in practice.

The idea of combining KRR and SVR goes like this. I train a KRR model as normal using all data. Then I identify training data items that are predicted “too well” (the absolute error is no larger than a trim-epsilon threshold) and remove them, leaving just pseudo support vectors. The somewhat unobvious SVR idea is that items that are predicted too well don’t help model accuracy very much, and can lead to an overfitted model. After removing some training items, I train a new KRR model using only the reduced training data. This gives a trimmed/sparse KRR model that approximates an SVR model.

In my mind, this technique gives the advantages of KRR (ability to handle very large datasets via SGD training) and the advantages of SVR (fewer stored model items and weights than KRR).

I first implemented the idea using the scikit KernelRidge module and the idea seemed to work well. But I wanted to take a second look, using a very small dataset, as a sanity check to make sure there weren’t any unexpected bad surprises. The output of one of the experiments is:

Begin SVR vs. trimmed KRR demo

Creating 20-item, 5-feature train data

X:
[[ 0.0458 -0.1872  1.4694 -1.4544  1.5328]
 [ 0.9008  0.4657  1.4883 -1.1651 -1.5362]
 [ 0.6536  0.8644  2.2698 -2.5530 -0.7422]
 [ 0.4002  0.9787  1.8676  1.7641  2.2409]
 [ 0.3869 -0.5108 -0.0282 -0.8955 -1.1806]
 [ 1.2303  1.2024 -0.3023  0.1563 -0.3873]
 [ 1.4543  0.7610  0.4439  0.1440  0.1217]
 [ 1.4941 -0.2052 -0.8541  0.3337  0.3131]
 [-0.3596 -0.8131  0.1774 -0.6725 -1.7263]
 [ 1.1788 -0.1799  1.0545  1.8959 -1.0708]
 [ 0.1290  1.1394  0.4023  0.7291 -1.2348]
 [ 0.9501 -0.1514  0.4106 -0.9773 -0.1032]
 [ 1.2224  0.2083  0.3564 -0.4032  0.9766]
 [-1.6302  0.4628  0.0519 -0.4018 -0.9073]
 [ 0.3782 -0.8878 -0.3479  0.1549 -1.9808]
 [ 0.0665  0.3025 -0.3627  0.4283 -0.6343]
 [ 0.0105  1.7859  0.4020  0.7066  0.1269]
 [-1.4200 -1.7063 -0.5097 -1.0486  1.9508]
 [-1.2528  0.7775 -0.2127 -0.4381 -1.6139]
 [-0.8708 -0.5788  0.0562 -0.6848 -0.3116]]

Creating and training SVR model
gamma = 0.1000
C = 1000.0000
epsilon = 8.0000
Done

Number support vectors: 10

Support vector indices:
[ 0  1  3  9 10 14 16 17 18 19]

Accuracy (within 0.10) train = 0.6000
MSE train = 64.0016
R2 train = 0.996314

===============================

Creating and training preliminary KRR RBF model
Setting gamma = 0.1000
Setting alpha = 0.0010
Done

Removing non-pseudo-support vectors
Using KRR trim epsilon = 0.1000

Number of pseudo support vectors = 10

Support vector indices:
[ 1  3  9 10 11 15 16 17 18 19]

Re-training trimmed KRR model
Done

Accuracy (within 0.10) train = 1.0000
MSE train = 0.0225
R2 train = 0.999999

End demo

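The demo program begins by using the scikit make_regression() function to create a tiny synthetic dataset with just 20 rows/items, each with 5 predictor values. An SVR model ended up with 10 support vectors and an R2 score (the coefficient of determination, essentially a scaled accuracy metric where 1.0 is a perfect fit) of 0.9963. Note that the demo computes the SVR metrics on the 10 support vectors rather than on all 20 training items.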
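For reference, the R2 value returned by the scikit score() method of a regressor is the coefficient of determination. A minimal sketch of the computation, using a hypothetical helper name and made-up target values, looks like this:

```python
# R2 = 1 - (sum of squared residuals / total sum of squares).
# r2_manual is a hypothetical helper; the data is made up.
import numpy as np

def r2_manual(y_true, y_pred):
  ss_res = np.sum((y_true - y_pred) ** 2)
  ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
  return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])
r2_perfect = r2_manual(y_true, y_true)
r2_mean = r2_manual(y_true, np.full(4, 25.0))
print(r2_perfect)  # perfect predictions give 1.0
print(r2_mean)     # always predicting the mean gives 0.0
```

Because R2 is scaled by the variance of the target values, a model can have a large MSE and still have an R2 close to 1.0 when the targets themselves are large.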

I tuned a trimmed KRR model to one with 10 pseudo-support vectors. The two techniques had 8 out of 10 support vectors in common (items 1, 3, 9, 10, 16, 17, 18, and 19), which is a good sign.
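The overlap can be checked directly from the two index arrays shown in the demo output. This is just a small NumPy sketch; the variable names are illustrative.

```python
# Compare the SVR support vector indices with the trimmed KRR
# pseudo support vector indices from the demo output.
import numpy as np

svr_sv = np.array([0, 1, 3, 9, 10, 14, 16, 17, 18, 19])
krr_sv = np.array([1, 3, 9, 10, 11, 15, 16, 17, 18, 19])

common = np.intersect1d(svr_sv, krr_sv)
only_svr = np.setdiff1d(svr_sv, krr_sv)
only_krr = np.setdiff1d(krr_sv, svr_sv)

print(common.tolist())    # [1, 3, 9, 10, 16, 17, 18, 19]
print(only_svr.tolist())  # [0, 14]
print(only_krr.tolist())  # [11, 15]
```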

The trimmed KRR model had an R2 score of 0.999999 on its reduced training data. The trimmed KRR model had better MSE than the SVR model (0.0225 vs. 64.0016), but that was expected because KRR minimizes MSE and SVR does not. The SVR MSE is close to epsilon squared (8.0 * 8.0 = 64.0), which is consistent with the support vectors sitting on the edge of the epsilon tube. The much better accuracy of the trimmed KRR model (100% for KRR vs. 60% for SVR) isn’t significant because the dataset is so small, and I didn’t use a test dataset to measure possible overfitting.

I noticed that both models were ultra-sensitive to hyperparameter values: gamma, C, and epsilon for SVR, and gamma, alpha, and trim-epsilon for trimmed KRR. This is a major weakness of both techniques.
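One quick way to see the sensitivity to trim-epsilon is to sweep it and count how many items survive the trim. The sketch below reuses the demo's dataset settings, gamma, and alpha; the list of trim-epsilon values is an arbitrary assumption, and the counts you get may depend on your scikit version.

```python
# Sweep the trim epsilon and count retained training items.
# The trim_eps values are arbitrary choices for illustration.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge

X, y = make_regression(n_samples=20, n_features=5,
  n_informative=5, noise=0, random_state=0)

# preliminary KRR model trained on all 20 items
model = KernelRidge(kernel='rbf', gamma=0.1, alpha=0.001)
model.fit(X, y)
residuals = np.abs(y - model.predict(X))

counts = []
for trim_eps in [0.01, 0.05, 0.10, 0.20, 0.50]:
  # keep only items outside the epsilon tube
  n_kept = int(np.sum(residuals > trim_eps))
  counts.append(n_kept)
  print("trim epsilon = %0.2f  items kept = %d" % \
    (trim_eps, n_kept))
```

Because the residuals come from a single preliminary model, the number of retained items can only stay the same or shrink as trim-epsilon grows.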

Anyway, the sanity check was successful. It is possible to approximate an SVR model using a trimmed KRR model.

It’s common for me to refactor the software systems I create many times. It’s rarely possible to get a non-trivial system completely correct on the first effort, and so a system is usually a collection of software sequels, so to speak, where each sequel is a bit better than its predecessor.

I’m a big fan of science fiction movies. There have been dozens of sci fi sequels: “Star Wars” (1977) and “Star Wars: The Empire Strikes Back” (1980), “Alien” (1979) and “Aliens” (1986), and so on. Most movie sequels are worse than their predecessor, but there are exceptions where the sequel is as good as, or even better than, the original.

Left: In “The Quatermass Xperiment” aka “The Creeping Unknown” (1955), Dr. Bernard Quatermass leads a British effort to put men into space. Three astronauts are sent up, but only one man returns. He has been exposed to cosmic radiation and morphs into a deadly blob-like creature. The creature is eventually electrocuted. My personal grade = B+.

Right: In “Quatermass 2” aka “Enemy From Space” (1957), Quatermass discovers an alien invasion. The aliens use parasites to control the people in a small village and use the villagers to construct a plant to create alien food to support the invasion. Quatermass and the UK military blow up the plant and defeat the aliens’ plans. My personal grade = A-.

Demo program. If “lt” or “gt” appears in the code in place of a comparison operator, replace it with the less-than or greater-than symbol. (My blog editor chokes on symbols.)

# svr_vs_krr_trimmed_scikit.py
# train using KRR, remove some, retrain using KRR

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.datasets import make_regression

# KernelRidge(alpha=1, *, kernel='linear', gamma=None,
# degree=3, coef0=1, kernel_params=None)

# SVR(*, kernel='rbf', degree=3, gamma='scale',
# coef0=0.0, tol=0.001, C=1.0, epsilon=0.1,
# shrinking=True, cache_size=200, verbose=False,
# max_iter=-1)

# make_regression(n_samples=100, n_features=100,
# *, n_informative=10, n_targets=1, bias=0.0,
# effective_rank=None, tail_strength=0.5, noise=0.0,
# shuffle=True, coef=False, random_state=None)

# -----------------------------------------------------------

np.set_printoptions(precision=4, suppress=True,
  floatmode='fixed', linewidth=60)

# -----------------------------------------------------------

def accuracy(model, data_X, data_y, pct_close):
  # fraction of items predicted within pct_close of target
  n = len(data_X)
  n_correct = 0; n_wrong = 0
  for i in range(n):
    x = data_X[i].reshape(1,-1)
    y = data_y[i]
    y_pred = model.predict(x)[0]

    if np.abs(y - y_pred) < np.abs(y * pct_close):
      n_correct += 1
    else: 
      n_wrong += 1
  return n_correct / (n_correct + n_wrong)

def mse(model, data_X, data_y):
  # mean squared error, computed one item at a time
  n = len(data_X)
  total = 0.0  # avoid shadowing the built-in sum()
  for i in range(n):
    actual_y = data_y[i]
    pred_y = model.predict(data_X[i].reshape(1, -1))[0]
    diff = actual_y - pred_y
    total += diff * diff
  return total / n

# -----------------------------------------------------------
# -----------------------------------------------------------

print("\nBegin SVR vs. trimmed KRR demo ")

np.set_printoptions(precision=4, suppress=True,
    floatmode='fixed')

print("\nCreating 20-item, 5-feature train data ")
X, y = make_regression(n_samples=20, n_features=5,
  n_informative=5, noise=0, random_state=0)

print("\nX: ")
print(X)

print("\nCreating and training SVR model ")
svr_gamma = 0.10
svr_C = 1000.0
svr_epsilon = 8.0
print("gamma = %0.4f " % svr_gamma)
print("C = %0.4f " % svr_C)
print("epsilon = %0.4f " % svr_epsilon)
svr_model = SVR(kernel='rbf', gamma=svr_gamma, C=svr_C,
  epsilon = svr_epsilon)
svr_model.fit(X, y)
print("Done ")

print("\nNumber support vectors: ", end="")
print(len(svr_model.support_))
print("\nSupport vector indices: ")
print(svr_model.support_)

# metrics below use only the support vectors, not all of X
X_sv = X[svr_model.support_,:]
y_sv = y[svr_model.support_]

acc_train = accuracy(svr_model, X_sv, y_sv, 0.10)
print("\nAccuracy (within 0.10) train = %0.4f " % \
  acc_train)
mse_train = mse(svr_model, X_sv, y_sv)
print("MSE train = %0.4f " % mse_train)
svr_r2 = svr_model.score(X_sv, y_sv)
print("R2 train = %0.6f " % svr_r2)

print("\n=============================== ")

print("\nCreating and training preliminary KRR RBF model ")
krr_gamma = 0.1000
krr_alpha = 0.0010
print("Setting gamma = %0.4f " % krr_gamma)
print("Setting alpha = %0.4f " % krr_alpha)
model = KernelRidge(kernel='rbf', gamma=krr_gamma,
  alpha=krr_alpha)
model.fit(X, y)
print("Done ")

print("\nRemoving non-pseudo-support vectors ")
krr_trim_epsilon = 0.10
print("Using KRR trim epsilon = %0.4f " % krr_trim_epsilon)
# smaller epsilon = more support vectors
# larger epsilon = fewer support vectors

predictions = model.predict(X)
residuals = np.abs(y - predictions)
# keep only pts outside epsilon-tube (pseudo support vecs)
sv_indices = np.where(residuals > krr_trim_epsilon)[0]
X_sv = X[sv_indices]
y_sv = y[sv_indices]
print("\nNumber of pseudo support vectors = " + \
  str(len(X_sv)))
print("\nSupport vector indices: ")
print(sv_indices)

# retrain
print("\nRe-training trimmed KRR model ")
krr_model = KernelRidge(kernel='rbf', 
  gamma=krr_gamma, alpha=krr_alpha)
krr_model.fit(X_sv, y_sv)
print("Done ")

acc_train = accuracy(krr_model, X_sv, y_sv, 0.10)
print("\nAccuracy (within 0.10) train = %0.4f " % \
  acc_train)
mse_train = mse(krr_model, X_sv, y_sv)
print("MSE train = %0.4f " % mse_train)
krr_r2 = krr_model.score(X_sv, y_sv)
print("R2 train = %0.6f " % krr_r2)

print("\nEnd demo ")