I was reviewing some research papers that had been submitted to an internal conference at the tech company I work for. I was the Area Chair for the Unsupervised and Semi-Supervised Learning track of the conference. One of the research papers used an old technique for anomaly detection — principal component analysis (PCA) reconstruction error.
PCA is a classical statistics technique. PCA accepts some numeric data, decomposes the data, and computes some matrices and vectors that are a reduced representation of the source data. If you decompose the source data, and then reconstruct the data using only part of the internal representation, you get reconstructed data that is close to, but not exactly the same as, the original source data. Then if you compare the reconstructed data items with the original source items, the item that differ the most (reconstruction error) are the ones that are most likely to be anomalies.
Anyway, it had been quite some time since I implemented a system for anomaly detection using PCA, so I decided to refresh my memory by coding up a small demo.
I set up six source items that were selected from the well-known Iris Dataset — two “setosa”, two “versicolor”, and two “virginica” items:
[[ 5.1 3.5 1.4 0.2] [ 5.4 3.9 1.7 0.4] [ 6.4 3.2 4.5 1.5] [ 5.7 2.8 4.5 1.3] [ 7.2 3.6 6.1 2.5] [ 6.9 3.2 5.7 2.3]]
The Iris Dataset is convenient because the data is already somewhat normalized. When using PCA it’s important to normalize the source data so that items with large values don’t swamp items with small values.
When using PCA for reconstruction, if you use all of the computed “principal components”, you will recreate the source data exactly, which isn’t very useful. The Iris data has four columns so PCA computes four principal components. For my demo, the first principal component accounted for 96% of the total variance so I used just that first component to reconstruct the data. The PCA reconstructed data is:
[[ 5.08 3.56 1.43 0.16] [ 5.22 3.54 1.78 0.33] [ 6.32 3.33 4.49 1.61] [ 6.22 3.35 4.24 1.49] [ 7.01 3.20 6.19 2.41] [ 6.84 3.23 5.77 2.21]]
The squared errors between each of the six source items and their reconstructions are:
[ 0.01 0.17 0.03 0.67 0.22 0.02]
The largest error is 0.67 for the fourth item so it is the most anomalous.
PCA anomaly detection can be useful but the technique is relatively crude compared to techniques such as deep neural autoencoder reconstruction error.

Tiki restaurants and bars are a reconstruction of an idealized Polynesian environment. Left: The Tonga Room restaurant, located inside the Fairmont Hotel in San Francisco. Right: The Golden Tiki bar, in Las Vegas.
Code:
# iris_pca_recon.py
# Iris data PCA reconstruction error
import numpy as np
def my_pca(X):
# returns transformed X, prin components, var explained
dim = len(X[0]) # n_cols
means = np.mean(X, axis=0)
z = X - means # avoid changing X
square_m = np.dot(z.T, z)
(evals, evecs) = np.linalg.eig(square_m) # 'right-hand'
trans_x = np.dot(z, evecs[:,0:dim])
prin_comp = evecs.T
v = np.var(trans_x, axis=0, ddof=1) # sample var
sv = np.sum(v)
ve = v / sv
# order everything based on variance explained
ordering = np.argsort(ve)[::-1] # sort order high to low
trans_x = trans_x[:,ordering]
prin_comp = prin_comp[ordering,:]
ve = ve[ordering]
return (trans_x, prin_comp, ve)
def reconstructed(X, n_comp, trans_x, p_comp):
means = np.mean(X, axis=0)
result = np.dot(trans_x[:,0:n_comp], p_comp[0:n_comp,:])
result += means
return result
def recon_error(X, XX):
diff = X - XX
diff_sq = diff * diff
errs = np.sum(diff_sq, axis=1)
return errs
def main():
print("\nBegin Iris PCA reconstruction error ")
np.set_printoptions(precision=1, suppress=True, sign=" ")
X = np.array([
[5.1, 3.5, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[6.4, 3.2, 4.5, 1.5],
[5.7, 2.8, 4.5, 1.3],
[7.2, 3.6, 6.1, 2.5],
[6.9, 3.2, 5.7, 2.3]])
print("\nSource X: ")
print(X)
print("\nPerforming PCA computations ")
(trans_x, p_comp, ve) = my_pca(X)
print("Done ")
print("\nTransformed X: ")
np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
print(trans_x)
print("\nPrincipal components: ")
np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
print(p_comp)
print("\nVariance explained: ")
np.set_printoptions(formatter={'float': '{: 0.5f}'.format})
print(ve)
XX = reconstructed(X, 4, trans_x, p_comp)
print("\nReconstructed X using all components: ")
np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
print(XX)
XX = reconstructed(X, 1, trans_x, p_comp)
print("\nReconstructed X using one component: ")
np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
print(XX)
re = recon_error(X, XX)
print("\nReconstruction errors using one component: ")
np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
print(re)
print("\nEnd PCA reconstruction error ")
if __name__ == "__main__":
main()

.NET Test Automation Recipes
Software Testing
SciPy Programming Succinctly
Keras Succinctly
R Programming
2026 Visual Studio Live
2025 Summer MLADS Conference
2026 DevIntersection Conference
2025 Machine Learning Week
2025 Ai4 Conference
2026 G2E Conference
2026 iSC West Conference
You must be logged in to post a comment.