I experimented with a machine learning technique I haven’t seen used before. I used principal component analysis (PCA) to reduce the number of predictor variables in the training data, and then I used the reduced data to train a Gaussian process regression (GPR) model. The idea is to reduce GPR model overfitting.
Bottom line: The technique works, but there are two built-in ways to add noise to a GPR model, and those built-in ways are easier to use.
A regression problem is one where the goal is to predict a single numeric value. For my experiments, I used the Boston Area House Price dataset. The dataset has 506 items. Each represents a town near Boston. The goal is to predict the median price of a house in each town. There are 13 predictor variables such as tax rate in the town, the crime rate, the density of Black residents, and so on. I split the data into a 400-item training set and a 106-item test set.
I created a baseline GPR prediction model on the source training data. The model scored 100% accuracy on the training data (typical for GPR) but only 73.58% accuracy on the test data, so the model is moderately overfitted. For the baseline, I didn't use the built-in GPR alpha parameter or the WhiteKernel() noise component to add noise to the model to try to reduce overfitting.
Note: Using PCA attempts to reduce model overfitting by adjusting the data. The two built-in ways to add noise to a GPR model to reduce overfitting, alpha and WhiteKernel(), work by adding noise to the model's kernel covariance matrix. So, using PCA isn't directly comparable to the built-in noise techniques.
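To make the two built-in noise mechanisms concrete, here is a minimal sketch using scikit. The synthetic data and the specific kernel and noise values are illustrative assumptions, not the settings from the demo below.

```python
# sketch of the two built-in ways to add noise to a scikit GPR model
# the data and noise values here are illustrative, not the demo's
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.RandomState(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))     # synthetic predictors
y = np.sin(X).ravel() + 0.1 * rng.randn(40)  # synthetic targets

# way 1: alpha adds a fixed noise value to the kernel matrix diagonal
gpr_alpha = GaussianProcessRegressor(kernel=RBF(1.0),
  alpha=0.10, normalize_y=True, random_state=0)
gpr_alpha.fit(X, y)

# way 2: a WhiteKernel component learns the noise level during training
gpr_white = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(0.10),
  alpha=0.0, normalize_y=True, random_state=0)
gpr_white.fit(X, y)
```

With alpha, the noise level is a value you pick; with WhiteKernel(), the noise level is a hyperparameter that is tuned during fit().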
I ran the source training predictors through a scikit library PCA model to compute the 13 principal components. The fractions of variability explained by the components, from high to low, are: [0.4963, 0.1488, 0.1213, 0.0701, 0.0555, 0.0400, 0.0324, 0.0108, 0.0071, 0.0053, 0.0046, 0.0044, 0.0034]. The first two components explain 0.6451 of the variability. The first 12 components explain 0.9966 of the variability. I used the principal components to transform the training data.
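The variance fractions above come from scikit PCA's explained_variance_ratio_ attribute. Here is a minimal sketch of the compute-then-transform-then-reduce pattern on synthetic data (the shapes and values are illustrative, not the Boston data):

```python
# sketch: compute principal components, transform, keep a subset
# the random data here stands in for real predictors
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = rng.randn(100, 5)       # 100 items, 5 predictor variables

pca = PCA()                 # compute all 5 components
pca.fit(X)
ratios = pca.explained_variance_ratio_  # sorted high to low, sums to 1
X_trans = pca.transform(X)              # rotate data onto components
X_reduced = X_trans[:, 0:4]             # keep the first 4 components
```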
Next I trained a Gaussian process regression model using 12 of the 13 transformed training data variables. That model scored 100% accuracy on the training data but only 68.87% accuracy on the test data — quite a bit worse than the baseline GPR model.
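The transform-then-train sequence can also be chained with a scikit Pipeline, which applies the same fitted PCA transform to test data automatically. A hedged sketch on synthetic data; the kernel and the small alpha value here are illustrative assumptions, not the demo's settings:

```python
# sketch: PCA followed by GPR chained as a scikit Pipeline
# synthetic data; kernel and alpha are illustrative choices
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.randn(60, 13)             # 60 items, 13 predictors (synthetic)
y = X[:, 0] + 0.1 * rng.randn(60)

# PCA keeps the first 12 of 13 components, then GPR trains on them
pipe = Pipeline([
  ("pca", PCA(n_components=12)),
  ("gpr", GaussianProcessRegressor(kernel=RBF(1.0),
    normalize_y=True, alpha=1.0e-6, random_state=0)),
])
pipe.fit(X, y)                 # fits PCA, transforms, fits GPR
preds = pipe.predict(X[0:4])   # test data is transformed the same way
```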
I ran a few other experiments and was able to reduce model overfitting, but the PCA technique is more difficult to use than alpha and WhiteKernel().
Conclusion: Using principal component analysis to reduce training data in an effort to control Gaussian process regression overfitting sort of works, but using the built-in GPR alpha parameter and the WhiteKernel() kernel is easier.

An Internet search for “overfitted” turned up some interesting results. Left: Yes, I’d say this wedding dress is a bit overfitted. Right: This is some sort of device that is worn under a big wedding dress and allows the dress to be gathered up when the bride needs to move. Thank you once again, Internet, for expanding my knowledge baseline.
Demo code. The training and test data can be found at https://jamesmccaffreyblog.com/2023/05/12/gaussian-process-regression-on-the-boston-housing-dataset-using-the-scikit-library/.
# boston_pca_gauss_process.py
# use PCA followed by Gaussian process regression
# Anaconda3-2022.10 Python 3.9.13
# scikit 1.0.2
# Windows 10/11
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import DotProduct
from sklearn.gaussian_process.kernels import WhiteKernel
from sklearn.gaussian_process.kernels import ConstantKernel
# -----------------------------------------------------------
def accuracy(model, data_X, data_y, pct_close):
  # correct within pct_close of true median house price
  n_correct = 0; n_wrong = 0
  for i in range(len(data_X)):
    X = data_X[i].reshape(1, -1)  # one-item batch
    y = data_y[i]
    pred = model.predict(X)  # predicted median price
    if np.abs(pred - y) < np.abs(pct_close * y):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc
# -----------------------------------------------------------
def main():
  # 0. prepare
  print("\nBegin scikit PCA-GPR regression ")
  print("Predict Boston area house median price ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # ---------------------------------------------------------
  # 1. load data
  print("\nLoading train and test data ")
  train_file = ".\\Data\\boston_train.txt"
  train_X = np.loadtxt(train_file, delimiter="\t",
    usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12],
    comments="#", dtype=np.float32)
  train_y = np.loadtxt(train_file, delimiter="\t",
    usecols=13, comments="#", dtype=np.float32)

  test_file = ".\\Data\\boston_test.txt"
  test_X = np.loadtxt(test_file, delimiter="\t",
    usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12],
    comments="#", dtype=np.float32)
  test_y = np.loadtxt(test_file, delimiter="\t",
    usecols=13, comments="#", dtype=np.float32)
  print("Done ")

  print("\nData: ")
  print(train_X[0:4][:])
  print(". . .")
  print("\nActual prices: ")
  print(train_y[0:4])
  print(". . .")

  # 1a. create baseline model
  print("\nComputing baseline GPR model ")
  krnl = ConstantKernel(1.0, (1e-1, 1e3)) + \
    RBF(1.0, (1e-3, 1e3))
    # + WhiteKernel()  # WhiteKernel adds noise
  base_model = GaussianProcessRegressor(kernel=krnl,
    normalize_y=True, alpha=0.0,
    random_state=0)  # alpha adds noise
  base_model.fit(train_X, train_y)  # train on source data
  base_acc_train = accuracy(base_model, train_X, train_y, 0.15)
  print("\nBase accuracy on train data = %0.4f " % base_acc_train)
  base_acc_test = accuracy(base_model, test_X, test_y, 0.15)
  print("Base accuracy on test data = %0.4f " % base_acc_test)

  # 2. compute PCA and transform train data predictors
  print("\nComputing principal components ")
  pca = PCA()  # all 13 components
  pca.fit(train_X)  # compute them
  print("Done. Variance explained: ")
  print(pca.explained_variance_ratio_)

  # 3. transform the source X train data
  print("\nComputing transformed X data ")
  transformed_X = pca.transform(train_X)
  print(transformed_X)

  # 4. create GPR model on reduced data
  n = 12
  print("\nTraining model on first " + str(n) + " components ")
  subset_trans_X = transformed_X[:,0:n]
  model = GaussianProcessRegressor(kernel=krnl, normalize_y=True,
    alpha=0.0, random_state=0)
  model.fit(subset_trans_X, train_y)  # train

  # 5. compute model accuracy
  print("\nComputing model accuracy within 15% of true target ")
  acc_train = accuracy(model, subset_trans_X, train_y, 0.15)
  print("Accuracy on train data = %0.4f " % acc_train)
  transformed_test_X = pca.transform(test_X)
  subset_trans_test_X = transformed_test_X[:,0:n]
  acc_test = accuracy(model, subset_trans_test_X, test_y, 0.15)
  print("Accuracy on test data = %0.4f " % acc_test)

  print("\nEnd PCA-GPR demo ")

if __name__ == "__main__":
  main()
