Bottom line: I did a short experiment to compare support vector regression (SVR) with post-training-trimmed kernel ridge regression (KRR). I used the scikit-learn SVR and KernelRidge modules. Because of the nearly infinite number of hyperparameter values involved, I can't draw a definitive conclusion, but my results suggest the two techniques are essentially the same, or at least very similar, from a practical point of view.
Bear with me. The goal of a machine learning regression problem is to predict a single numeric value. For example, a bank might want to predict a maximum loan amount based on applicant age, sex, annual income, debt, and so on.
Common regression techniques include linear regression, quadratic regression, nearest neighbors regression, kernel ridge regression, Gaussian process regression, kernel support vector regression, neural network regression, random forest regression, and gradient boost regression. Each technique has dozens of variations, and each technique has pros and cons.
Two closely related machine learning regression techniques are kernel ridge regression (KRR) and support vector regression (SVR). Both techniques use a kernel function (usually the radial basis function, RBF) to compare two data items for similarity. Both techniques must store training data in order to make predictions. KRR must store all training items, while SVR eliminates some of the items during training, leaving just the “support vectors” that need to be stored. So, SVR uses less memory than KRR.
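To make the kernel idea concrete, here is a minimal sketch of the RBF similarity computation. It assumes the form used by scikit-learn, k(x, z) = exp(-gamma * ||x - z||^2). The helper name rbf_kernel_pair is made up for illustration, and the two vectors are simply the first two rows of the demo data shown later in this post.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_kernel_pair(x, z, gamma):
    # similarity = exp(-gamma * squared Euclidean distance)
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

x1 = np.array([0.0458, -0.1872, 1.4694, -1.4544, 1.5328])
x2 = np.array([0.9008, 0.4657, 1.4883, -1.1651, -1.5362])
gamma = 0.10

sim_same = rbf_kernel_pair(x1, x1, gamma)  # identical items give 1.0
sim_diff = rbf_kernel_pair(x1, x2, gamma)  # less similar items move toward 0.0

# cross-check against the scikit-learn pairwise implementation
sim_sklearn = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1),
                         gamma=gamma)[0, 0]
print(sim_same, sim_diff, sim_sklearn)
```

Larger gamma values make the similarity fall off faster with distance, which is one reason both techniques are so sensitive to gamma.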
On the other hand, the epsilon-insensitive loss function used by SVR is not differentiable everywhere, and kernel SVR is usually trained with a specialized quadratic programming solver rather than plain stochastic gradient descent, so SVR does not scale well to very large datasets. So, KRR is easier to train than SVR.
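The difference between the two loss functions can be seen in a few lines. This is a sketch only: the function names are illustrative, and the epsilon-insensitive loss is written in its standard textbook form, max(0, |y - y_pred| - epsilon).

```python
import numpy as np

def squared_loss(y, y_pred):
    # data-fit term minimized by ridge-style regression
    return (y - y_pred) ** 2

def eps_insensitive_loss(y, y_pred, epsilon):
    # SVR-style loss: zero inside the epsilon-tube, linear outside
    return np.maximum(0.0, np.abs(y - y_pred) - epsilon)

y_true = np.zeros(5)
y_pred = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

sq = squared_loss(y_true, y_pred)
ei = eps_insensitive_loss(y_true, y_pred, epsilon=1.0)
print(sq.tolist())  # [4.0, 0.25, 0.0, 0.25, 4.0]
print(ei.tolist())  # [1.0, 0.0, 0.0, 0.0, 1.0]
```

The squared loss is smooth everywhere, while the epsilon-insensitive loss has a kink where the absolute error equals epsilon. Items inside the tube contribute zero loss, which is why they do not become support vectors.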
Note: In addition to the kernel support vector regression technique discussed in this blog post, there is a linear support vector regression technique which is essentially useless in practice.
The idea of combining KRR and SVR goes like this: Briefly, I train a KRR model as normal using all data. Then I identify training data items that are predicted “too well” and remove them, leaving just pseudo support vectors. The somewhat unobvious SVR idea is that items that are predicted too well don’t help model accuracy very much, and can lead to an overfitted model. After removing some training items, I retrain a new KRR model using only the reduced training data. This gives a trimmed/sparse KRR model that approximates an SVR model.
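The steps above can be condensed into one helper function. This is a minimal sketch, not a tuned implementation: the function name fit_trimmed_krr and its parameter names are illustrative, and the settings mirror those in the demo program at the end of this post.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.datasets import make_regression

def fit_trimmed_krr(X, y, gamma, alpha, trim_epsilon):
    # 1. train a preliminary KRR model on all items
    prelim = KernelRidge(kernel='rbf', gamma=gamma, alpha=alpha)
    prelim.fit(X, y)
    # 2. items predicted "too well" have small absolute residuals;
    #    keep only the items outside the epsilon-tube
    residuals = np.abs(y - prelim.predict(X))
    keep = np.where(residuals > trim_epsilon)[0]
    if len(keep) == 0:
        raise ValueError("trim_epsilon removed every training item")
    # 3. retrain on just the retained pseudo support vectors
    trimmed = KernelRidge(kernel='rbf', gamma=gamma, alpha=alpha)
    trimmed.fit(X[keep], y[keep])
    return trimmed, keep

X, y = make_regression(n_samples=20, n_features=5,
                       n_informative=5, noise=0, random_state=0)
model, keep = fit_trimmed_krr(X, y, gamma=0.10, alpha=0.001,
                              trim_epsilon=0.10)
print(len(keep), "of", len(X), "items retained")
```

The returned model stores only the retained items, so prediction cost and memory scale with the number of pseudo support vectors rather than with the full training set.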
In my mind, this technique gives the advantages of KRR (ability to handle very large datasets via SGD training) and the advantages of SVR (fewer stored model items and weights than KRR).
I first implemented the idea using the scikit KernelRidge module and the idea seemed to work well. But I wanted to take a second look, using a very small dataset, as a sanity check to make sure there weren’t any unexpected bad surprises. The output of one of the experiments is:
Begin SVR vs. trimmed KRR demo

Creating 20-item, 5-feature train data

X:
[[ 0.0458 -0.1872  1.4694 -1.4544  1.5328]
 [ 0.9008  0.4657  1.4883 -1.1651 -1.5362]
 [ 0.6536  0.8644  2.2698 -2.5530 -0.7422]
 [ 0.4002  0.9787  1.8676  1.7641  2.2409]
 [ 0.3869 -0.5108 -0.0282 -0.8955 -1.1806]
 [ 1.2303  1.2024 -0.3023  0.1563 -0.3873]
 [ 1.4543  0.7610  0.4439  0.1440  0.1217]
 [ 1.4941 -0.2052 -0.8541  0.3337  0.3131]
 [-0.3596 -0.8131  0.1774 -0.6725 -1.7263]
 [ 1.1788 -0.1799  1.0545  1.8959 -1.0708]
 [ 0.1290  1.1394  0.4023  0.7291 -1.2348]
 [ 0.9501 -0.1514  0.4106 -0.9773 -0.1032]
 [ 1.2224  0.2083  0.3564 -0.4032  0.9766]
 [-1.6302  0.4628  0.0519 -0.4018 -0.9073]
 [ 0.3782 -0.8878 -0.3479  0.1549 -1.9808]
 [ 0.0665  0.3025 -0.3627  0.4283 -0.6343]
 [ 0.0105  1.7859  0.4020  0.7066  0.1269]
 [-1.4200 -1.7063 -0.5097 -1.0486  1.9508]
 [-1.2528  0.7775 -0.2127 -0.4381 -1.6139]
 [-0.8708 -0.5788  0.0562 -0.6848 -0.3116]]

Creating and training SVR model
gamma = 0.1000
C = 1000.0000
epsilon = 8.0000
Done

Number support vectors: 10

Support vector indices:
[ 0  1  3  9 10 14 16 17 18 19]

Accuracy (within 0.10) train = 0.6000
MSE train = 64.0016
R2 train = 0.996314

===============================

Creating and training preliminary KRR RBF model
Setting gamma = 0.1000
Setting alpha = 0.0010
Done

Removing non-pseudo-support vectors
Using KRR trim epsilon = 0.1000

Number of pseudo support vectors = 10

Support vector indices:
[ 1  3  9 10 11 15 16 17 18 19]

Re-training trimmed KRR model
Done

Accuracy (within 0.10) train = 1.0000
MSE train = 0.0225
R2 train = 0.999999

End demo
The demo program begins by using the scikit make_regression() function to create a tiny synthetic dataset with just 20 rows/items, each with 5 predictor values. An SVR model ended up with 10 support vectors and an R2 score (the coefficient of determination, where 1.0 is a perfect fit) of 0.9963. Note that the demo computes accuracy, MSE, and R2 for each model on that model's support vectors, not on all 20 training items.
I tuned a trimmed KRR model to one with 10 pseudo-support vectors. The two techniques had 8 out of 10 support vectors in common — that’s good.
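The overlap is easy to check directly from the two index lists printed in the demo output. A small sketch using NumPy set operations (the variable names are illustrative):

```python
import numpy as np

# index lists copied from the demo output
svr_idx = np.array([0, 1, 3, 9, 10, 14, 16, 17, 18, 19])
krr_idx = np.array([1, 3, 9, 10, 11, 15, 16, 17, 18, 19])

common = np.intersect1d(svr_idx, krr_idx)    # in both models
only_svr = np.setdiff1d(svr_idx, krr_idx)    # SVR only
only_krr = np.setdiff1d(krr_idx, svr_idx)    # trimmed KRR only

print(common.tolist())    # [1, 3, 9, 10, 16, 17, 18, 19]
print(only_svr.tolist())  # [0, 14]
print(only_krr.tolist())  # [11, 15]
```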
The trimmed KRR model had an R2 score of 0.9999 on the training data. The trimmed KRR model had better MSE than the SVR model, but that was expected because KRR minimizes MSE and SVR does not. The much better accuracy of the trimmed KRR model (100% for KRR vs. 60% for SVR) isn’t significant because the dataset is so small, and I didn’t use a test dataset to measure possible overfitting.
I noticed that both models were ultra-sensitive to hyperparameter values — gamma, C, and epsilon for SVR, and gamma, alpha, and trim-epsilon for trimmed KRR. This is a major weakness of both techniques.
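One way to get a feel for the sensitivity is to sweep a single hyperparameter and watch the number of retained items change. The sketch below sweeps only the KRR trim epsilon, with gamma and alpha fixed at the demo values. The grid values are arbitrary choices for illustration, and a fuller study would sweep gamma and alpha in the same way.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=20, n_features=5,
                       n_informative=5, noise=0, random_state=0)

# preliminary KRR model, same settings as the demo
prelim = KernelRidge(kernel='rbf', gamma=0.10, alpha=0.001).fit(X, y)
residuals = np.abs(y - prelim.predict(X))

counts = []
for trim_eps in [0.01, 0.05, 0.10, 0.20, 0.50]:
    # items outside the tube would be kept as pseudo support vectors
    n_kept = int(np.sum(residuals > trim_eps))
    counts.append(n_kept)
    print("trim epsilon = %0.2f  pseudo support vectors = %d"
          % (trim_eps, n_kept))
```

The count can only stay the same or shrink as the trim epsilon grows, but how quickly it shrinks depends on the data and on gamma and alpha.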
Anyway, the sanity check was successful. It is possible to approximate an SVR model using a trimmed KRR model.

It’s common for me to refactor the software systems I create many times. It’s rarely possible to get a non-trivial system completely correct on the first effort, so a system is usually a collection of software sequels, so to speak, where each sequel is a bit better than its predecessor.
I’m a big fan of science fiction movies. There have been dozens of sci-fi sequels: “Star Wars” (1977) and “Star Wars: The Empire Strikes Back” (1980), “Alien” (1979) and “Aliens” (1986), and so on. Most movie sequels are worse than their predecessors, but there are exceptions where the sequel is as good as, or even better than, the original.
Left: In “The Quatermass Xperiment” aka “The Creeping Unknown” (1955), Dr. Bernard Quatermass leads a British effort to put men into space. Three astronauts are sent up, but only one man returns. He has been exposed to cosmic radiation and morphs into a deadly blob-like creature. The creature is eventually electrocuted. My personal grade = B+.
Right: In “Quatermass 2” aka “Enemy From Space” (1957), Quatermass discovers an alien invasion. The aliens use parasites to control the people in a small village and use the villagers to construct a plant to create alien food to support the invasion. Quatermass and the UK military blow up the plant and defeat the aliens’ plans. My personal grade = A-.
Demo program. My blog editor chokes on the less-than and greater-than symbols, so if you see “lt” or “gt” in quotes in the code, replace it with the corresponding comparison operator.
# svr_vs_krr_trimmed_scikit.py
# train using KRR, remove some, retrain using KRR
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.datasets import make_regression
# KernelRidge(alpha=1, *, kernel='linear', gamma=None,
# degree=3, coef0=1, kernel_params=None)
# SVR(*, kernel='rbf', degree=3, gamma='scale',
# coef0=0.0, tol=0.001, C=1.0, epsilon=0.1,
# shrinking=True, cache_size=200, verbose=False,
# max_iter=-1)
# make_regression(n_samples=100, n_features=100,
# *, n_informative=10, n_targets=1, bias=0.0,
# effective_rank=None, tail_strength=0.5, noise=0.0,
# shuffle=True, coef=False, random_state=None)
# -----------------------------------------------------------
np.set_printoptions(precision=4, suppress=True,
    floatmode='fixed', linewidth=60)
# -----------------------------------------------------------
def accuracy(model, data_X, data_y, pct_close):
    # fraction of items predicted within pct_close of target
    n = len(data_X)
    n_correct = 0; n_wrong = 0
    for i in range(n):
        x = data_X[i].reshape(1,-1)
        y = data_y[i]
        y_pred = model.predict(x)[0]
        if np.abs(y - y_pred) < np.abs(y * pct_close):
            n_correct += 1
        else:
            n_wrong += 1
    return n_correct / (n_correct + n_wrong)
def mse(model, data_X, data_y):
    # mean squared error, computed one item at a time
    n = len(data_X)
    total = 0.0
    for i in range(n):
        actual_y = data_y[i]
        pred_y = model.predict(data_X[i].reshape(1, -1))[0]
        diff = actual_y - pred_y
        total += diff * diff
    return total / n
# -----------------------------------------------------------
# -----------------------------------------------------------
print("\nBegin SVR vs. trimmed KRR demo ")
np.set_printoptions(precision=4, suppress=True,
    floatmode='fixed')
print("\nCreating 20-item, 5-feature train data ")
X, y = make_regression(n_samples=20, n_features=5,
    n_informative=5, noise=0, random_state=0)
print("\nX: ")
print(X)
print("\nCreating and training SVR model ")
svr_gamma = 0.10
svr_C = 1000.0
svr_epsilon = 8.0
print("gamma = %0.4f " % svr_gamma)
print("C = %0.4f " % svr_C)
print("epsilon = %0.4f " % svr_epsilon)
svr_model = SVR(kernel='rbf', gamma=svr_gamma, C=svr_C,
    epsilon=svr_epsilon)
svr_model.fit(X, y)
print("Done ")
print("\nNumber support vectors: ", end="")
print(len(svr_model.support_))
print("\nSupport vector indices: ")
print(svr_model.support_)
# note: the metrics below are computed on the support
# vectors only, not on all 20 training items
X_sv = X[svr_model.support_,:]
y_sv = y[svr_model.support_]
acc_train = accuracy(svr_model, X_sv, y_sv, 0.10)
print("\nAccuracy (within 0.10) train = %0.4f " % \
    acc_train)
mse_train = mse(svr_model, X_sv, y_sv)
print("MSE train = %0.4f " % mse_train)
svr_r2 = svr_model.score(X_sv, y_sv)
print("R2 train = %0.6f " % svr_r2)
print("\n=============================== ")
print("\nCreating and training preliminary KRR model ")
krr_gamma = 0.1000
krr_alpha = 0.0010
print("Setting gamma = %0.4f " % krr_gamma)
print("Setting alpha = %0.4f " % krr_alpha)
model = KernelRidge(kernel='rbf', gamma=krr_gamma,
    alpha=krr_alpha)
model.fit(X, y)
print("Done ")
print("\nRemoving non-pseudo-support vectors ")
krr_trim_epsilon = 0.10
print("Using KRR trim epsilon = %0.4f " % krr_trim_epsilon)
# smaller epsilon = more support vectors
# larger epsilon = fewer support vectors
predictions = model.predict(X)
residuals = np.abs(y - predictions)
# keep only pts outside epsilon-tube (pseudo support vecs)
sv_indices = np.where(residuals > krr_trim_epsilon)[0]
X_sv = X[sv_indices]
y_sv = y[sv_indices]
print("\nNumber of pseudo support vectors = " + \
    str(len(X_sv)))
print("\nSupport vector indices: ")
print(sv_indices)
# retrain
print("\nRe-training trimmed KRR model ")
krr_model = KernelRidge(kernel='rbf',
    gamma=krr_gamma, alpha=krr_alpha)
krr_model.fit(X_sv, y_sv)
print("Done ")
acc_train = accuracy(krr_model, X_sv, y_sv, 0.10)
print("\nAccuracy (within 0.10) train = %0.4f " % \
    acc_train)
mse_train = mse(krr_model, X_sv, y_sv)
print("MSE train = %0.4f " % mse_train)
krr_r2 = krr_model.score(X_sv, y_sv)
print("R2 train = %0.6f " % krr_r2)
print("\nEnd demo ")
