Principal Component Analysis (PCA) From Scratch vs. Scikit

A few days ago I coded up a demo of anomaly detection using principal component analysis (PCA) reconstruction error. I implemented the PCA functionality — computation of the transformed data, the principal components, and the variance explained by each component — from semi-scratch, meaning I used the NumPy linalg (linear algebra) library eig() function to compute eigenvalues and eigenvectors.

And it was good.

But in the back of my mind, I was thinking that I should have verified my semi-from-scratch implementation of PCA because PCA is very, very complex and I could have made a mistake.

The from-scratch version (left) and the scikit version (right) are identical except that some of the transformed vectors and principal components differ by a factor of -1. This doesn’t affect anything.

So I took my original from-scratch PCA anomaly detection program and swapped out the PCA implementation from the scikit sklearn.decomposition library. And as expected, the results of the scikit-based PCA program were identical to the results of the from-scratch PCA program. Almost.

My from-scratch code looks like:

import numpy as np

def my_pca(X):
  # returns transformed X, prin components, var explained
  dim = len(X[0])  # n_cols
  means = np.mean(X, axis=0)
  z = X - means  # avoid changing X
  square_m = np.dot(z.T, z)
  (evals, evecs) = np.linalg.eig(square_m) 
  trans_x = np.dot(z, evecs[:,0:dim]) 
  prin_comp = evecs.T  
  v = np.var(trans_x, axis=0, ddof=1) 
  sv = np.sum(v)
  ve = v / sv
  # order everything based on variance explained
  ordering = np.argsort(ve)[::-1]  # sort order high to low
  trans_x = trans_x[:,ordering]
  prin_comp = prin_comp[ordering,:]
  ve = ve[ordering]
  return (trans_x, prin_comp, ve)

X = (load data from somewhere)
(trans_x, p_comp, ve) = my_pca(X)

The scikit-based code looks like:

import numpy as np
import sklearn.decomposition

X = (load data from somewhere)
pca = sklearn.decomposition.PCA().fit(X)
trans_x = pca.transform(X)
p_comp = pca.components_
ve = pca.explained_variance_ratio_

All the results were identical except that the internal transformed X values and the principal components, sometimes differed by a factor of -1. As it turns out this is OK because PCA computes variances and the sign doesn’t affect variance.

The advantage of using scikit PCA is simplicity. The advantages of using PCA from scratch are 1.) you get fine-tuned control, 2.) you remove an external dependency, 3.) you aren’t using a mysterious black box.

PCA is interesting and sometimes useful, but for tasks like dimensionality reduction and reconstruction, deep neural techniques have largely replaced PCA.

PCA was developed in 1901 by famous statistician Karl Pearson. I wonder if statisticians of that era imagined today’s deep neural technologies. Three images from the movie “Things to Come” (1936) based on the novel of the same name by author H. G. Wells.

Demo code:

# pca_recon_skikit.py
# exactly replicates iris_pca_recon.py scratch version

import numpy as np
import sklearn.decomposition

def reconstructed(X, n_comp, trans_x, p_comp):
  means = np.mean(X, axis=0)
  result = np.dot(trans_x[:,0:n_comp], p_comp[0:n_comp,:])
  result += means
  return result

def recon_error(X, XX):
  diff = X - XX
  diff_sq = diff * diff
  errs = np.sum(diff_sq, axis=1)
  return errs

def main():
  print("\nBegin Iris PCA reconstruction using scikit ")
  np.set_printoptions(formatter={'float': '{: 0.1f}'.format})

  X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],

    [6.4, 3.2, 4.5, 1.5],
    [5.7, 2.8, 4.5, 1.3],

    [7.2, 3.6, 6.1, 2.5],
    [6.9, 3.2, 5.7, 2.3]])

  print("\nSource X: ")
  print(X)

  print("\nPerforming PCA computations ")
  pca = sklearn.decomposition.PCA().fit(X)
  trans_x = pca.transform(X)
  p_comp = pca.components_
  ve = pca.explained_variance_ratio_
  print("Done ")

  print("\nTransformed X: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(trans_x)

  print("\nPrincipal components: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(p_comp)

  print("\nVariance explained: ")
  np.set_printoptions(formatter={'float': '{: 0.5f}'.format}) 
  print(ve)

  XX = reconstructed(X, 4, trans_x, p_comp)
  print("\nReconstructed X using all components: ")
  np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
  print(XX)

  XX = reconstructed(X, 1, trans_x, p_comp)
  print("\nReconstructed X using one component: ")
  np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
  print(XX)

  re = recon_error(X, XX)
  print("\nReconstruction errors using one component: ")
  np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
  print(re)

  print("\nEnd PCA scikit ")

if __name__ == "__main__":
  main()