Example of Gaussian Process Model Regression

The goal of a regression problem is to predict a single numeric value. An example is predicting the annual income of a person based on their age, years of education, and height.

A relatively uncommon regression technique is the Gaussian Process Model (GPM). The technique is based on classical statistics and is mathematically quite complicated.

I decided to refresh my memory of GPM regression by coding up a quick demo using the scikit-learn code library.

For my demo, the goal is to predict a single value by creating a model based on just six source data points. For simplicity, and so that I could graph my demo, I used just one predictor variable. The source data is based on f(x) = x * sin(x), which is a standard function for regression demos.

The graph of the demo results shows that the GPM regression model predicted the underlying generating function extremely well within the limits of the source data, so well that you have to look closely to see any difference. But the model does not extrapolate well at all. (Note: I included (0,0) as a source data point in the graph, for visualization, but that point wasn’t used when creating the GPM regression model.)
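
The gap between interpolation and extrapolation can be checked numerically. The following is a minimal sketch, not part of the original demo: it refits the same model and compares absolute errors at a few probe points that I picked arbitrarily inside and outside the source range of 1 to 8. The probe points and the random_state argument are my own assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as CK

def f(x):
    return x * np.sin(x)  # same generating function as the demo

# same six source points as the demo
X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]])
y = f(X).ravel()

krnl = CK(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(alpha=1e-12, kernel=krnl,
                              n_restarts_optimizer=9, random_state=1)
gp.fit(X, y)

# hypothetical probe points (assumption): inside and outside [1, 8]
x_inside = np.array([[2.0], [4.0], [6.5]])
x_outside = np.array([[12.0], [15.0], [20.0]])

pred_in = gp.predict(x_inside)
pred_out = gp.predict(x_outside)

# absolute error against the known generating function
err_inside = np.abs(pred_in - f(x_inside).ravel())
err_outside = np.abs(pred_out - f(x_outside).ravel())

print("max abs error inside source range :", err_inside.max())
print("max abs error outside source range:", err_outside.max())
```

Far from the source data an RBF-based model falls back toward its prior mean of zero, while x * sin(x) keeps growing in amplitude, so the outside errors should be much larger than the inside ones.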

One of the reasons the GPM predictions are so close to the underlying generating function is that I didn’t include any noise/error such as the kind you’d get with real-life data.
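
To get a feel for the effect of noise, the source y values can be perturbed before fitting. This is a rough sketch under my own assumptions: the noise standard deviation of 0.75 is arbitrary, and I set the alpha parameter to the noise variance, which scikit-learn interprets as the variance of Gaussian noise on the training targets.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as CK

def f(x):
    return x * np.sin(x)  # same generating function as the demo

X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]])

rng = np.random.RandomState(1)
noise_std = 0.75  # assumed noise level, chosen only for illustration
y_noisy = f(X).ravel() + rng.normal(0.0, noise_std, size=X.shape[0])

krnl = CK(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
# alpha is added to the kernel matrix diagonal; here it is the noise variance
gp_noisy = GaussianProcessRegressor(alpha=noise_std**2, kernel=krnl,
                                    n_restarts_optimizer=9, random_state=1)
gp_noisy.fit(X, y_noisy)

pred_at_train = gp_noisy.predict(X)
print("noisy targets      :", y_noisy)
print("predictions at X   :", pred_at_train)
print("true function at X :", f(X).ravel())
```

With a non-trivial alpha the model no longer passes exactly through the source points, so the predictions at X differ from the noisy targets.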

The strengths of GPM regression are: 1.) it works well with very few data points, 2.) you can feed the model a priori information if you have it, 3.) the predicted values come with confidence levels (which I don’t use in the demo).
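
The confidence information in point 3 comes from the standard deviation that predict() returns when return_std=True. Here is a minimal sketch of turning it into an approximate 95% interval. The 1.96 multiplier assumes a Gaussian predictive distribution, and the names lower and upper are mine, not part of the demo.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as CK

def f(x):
    return x * np.sin(x)  # same generating function as the demo

X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]])
y = f(X).ravel()

krnl = CK(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(alpha=1e-12, kernel=krnl,
                              n_restarts_optimizer=9, random_state=1)
gp.fit(X, y)

x = np.atleast_2d(np.linspace(0, 10, 21)).T  # 0.0, 0.5, ..., 10.0
y_pred, sigma = gp.predict(x, return_std=True)

# approximate 95% interval, assuming a Gaussian predictive distribution
lower = y_pred - 1.96 * sigma
upper = y_pred + 1.96 * sigma

for xi, lo, mid, hi in zip(x.ravel(), lower, y_pred, upper):
    print(f"x = {xi:5.2f}   [{lo:8.3f}, {hi:8.3f}]   mean = {mid:8.3f}")
```

The interval should be very narrow at the source points (for example x = 5.0) and widen as x moves away from them.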

The weaknesses of GPM regression are: 1.) the technique requires many hyperparameters such as the kernel function, and the kernel function chosen has many hyperparameters too, 2.) you must make several model assumptions, 3.) it usually doesn’t work well for extrapolation.
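
Some of the kernel hyperparameters are tuned automatically during fit(), within the bounds you supply. A short sketch, assuming the same kernel as the demo, of how to inspect what the optimizer settled on:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as CK

def f(x):
    return x * np.sin(x)  # same generating function as the demo

X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]])
y = f(X).ravel()

# initial values plus (lower, upper) search bounds for each hyperparameter
krnl = CK(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(alpha=1e-12, kernel=krnl,
                              n_restarts_optimizer=9, random_state=1)
gp.fit(X, y)

# kernel_ holds the kernel with the optimized hyperparameters
fitted = gp.kernel_
const_value = fitted.k1.constant_value   # ConstantKernel factor
length_scale = fitted.k2.length_scale    # RBF factor

print("initial kernel:", krnl)
print("fitted kernel :", fitted)
print("log marginal likelihood:",
      gp.log_marginal_likelihood(fitted.theta))
```

The choice of kernel family, the initial values, the bounds, alpha, and the number of optimizer restarts still have to be picked by hand.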

An alternative to GPM regression is neural network regression. Neural networks are conceptually simpler and easier to implement. However, neural networks usually do not work well with small source (training) datasets.

Good fun. Here’s the source code of the demo. I didn’t create the demo code from scratch; I pieced it together from several examples I found on the Internet, mostly the scikit-learn documentation at scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html.

I scraped the results from my command shell and dropped them into Excel to make my graph, rather than using the matplotlib library.

# gpr_demo.py

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF,
  ConstantKernel as CK)

def f(x):
  return x * np.sin(x)  # underlying function is x*sin(x)

print("\nBegin Gaussian Process Regression demo")

np.random.seed(1)  # reproducibility

X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]],
  dtype=np.float32)  # source data x
print("\nSource data points x values:")
print(X)

y = f(X).ravel()  # source data y values
print("\nSource data points y values:")
print(y)

# data to predict x values
x = np.atleast_2d(np.linspace(0, 10, 21)).T
print("\nData to predict x values:")
print(x)

krnl = CK(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(alpha=1e-12,
 kernel=krnl, n_restarts_optimizer=9)

gp.fit(X, y)  # create Gaussian Process Regression function

(y_pred, sigma) = gp.predict(x, return_std=True)

print("\nPredicted y values:")
print(y_pred)
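
Rather than scraping the shell output, the predictions could be emitted as CSV text that pastes directly into a spreadsheet. This is a sketch of one way to do it, not what the demo above does; the column names are my own choice.

```python
import csv
import io
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as CK

def f(x):
    return x * np.sin(x)  # same generating function as the demo

X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]])
y = f(X).ravel()

krnl = CK(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(alpha=1e-12, kernel=krnl,
                              n_restarts_optimizer=9, random_state=1)
gp.fit(X, y)

x = np.atleast_2d(np.linspace(0, 10, 21)).T
y_pred, sigma = gp.predict(x, return_std=True)

# build CSV text: one header row plus one row per prediction point
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["x", "true_y", "pred_y", "sigma"])  # assumed column names
for xi, ti, pi, si in zip(x.ravel(), f(x).ravel(), y_pred, sigma):
    writer.writerow([f"{xi:.2f}", f"{ti:.4f}", f"{pi:.4f}", f"{si:.4f}"])

csv_text = buf.getvalue()
print(csv_text)
```

Including the true function value next to each prediction makes it easy to graph both curves in the same chart.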

An Internet search for “complicated model” gave me more images of fashion models than machine learning models. Left: Always carry your clothes hangers with you. Center: Built-in social distancing. Right: You can never have too many cuffs.