One evening, while I was walking my two dogs, I thought about the possibility of looking for data anomalies by analyzing principal component analysis (PCA) reconstruction error. Bottom line: the technique works, but it just doesn’t feel right to me.
The ideas here are complex and are best explained with a concrete example. I implemented the idea using raw C#. Using Python and the scikit-learn library would have been much, much easier. I started with a small, 12-item subset of the Penguin dataset:
[ 0]  39.5  17.4  186.0  3800.0
[ 1]  40.3  18.0  195.0  3250.0
[ 2]  36.7  19.3  193.0  3450.0
[ 3]  38.9  17.8  181.0  3625.0
[ 4]  46.5  17.9  192.0  3500.0
[ 5]  45.4  18.7  188.0  3525.0
[ 6]  45.2  17.8  198.0  3950.0
[ 7]  46.1  18.2  178.0  3250.0
[ 8]  46.1  13.2  211.0  4500.0
[ 9]  48.7  14.1  210.0  4450.0
[10]  46.5  13.5  210.0  4550.0
[11]  45.4  14.6  211.0  4800.0
Each item is one of three species of penguin. The fields are bill length, bill width, flipper length, body mass.
I performed z-score standardization on the source data — this is effectively required for PCA when the variables are on very different scales. Then I computed the eigenvalues and eigenvectors of the standardized data — one of the most complex operations in numerical programming. To compute the eigens, I used the singular value decomposition (SVD) technique (I could have used the classical covariance matrix technique instead).
Because the source data has 12 rows and 4 columns, there are 4 eigenvalues and 4 eigenvectors, each with 4 values. The fractions of variance explained by the 4 eigenvectors are 0.7801, 0.1578, 0.0409, 0.0211, and so the variance explained by just the first 2 eigenvectors is 0.7801 + 0.1578 = 0.9379.
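My demo is C#, but the same computation can be sketched in a few lines of NumPy. This is my own approximation of the pipeline described above (not the post's code): standardize the 12 items, run SVD, and derive the eigenvalues and explained-variance fractions.

```python
import numpy as np

# The 12-item Penguin subset from the post:
# bill length, bill width, flipper length, body mass
X = np.array([
    [39.5, 17.4, 186.0, 3800.0],
    [40.3, 18.0, 195.0, 3250.0],
    [36.7, 19.3, 193.0, 3450.0],
    [38.9, 17.8, 181.0, 3625.0],
    [46.5, 17.9, 192.0, 3500.0],
    [45.4, 18.7, 188.0, 3525.0],
    [45.2, 17.8, 198.0, 3950.0],
    [46.1, 18.2, 178.0, 3250.0],
    [46.1, 13.2, 211.0, 4500.0],
    [48.7, 14.1, 210.0, 4450.0],
    [46.5, 13.5, 210.0, 4550.0],
    [45.4, 14.6, 211.0, 4800.0],
])

# z-score standardize each column (population std, divide by n)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardized data: the rows of Vt are the eigenvectors,
# and the squared singular values divided by n are the eigenvalues
# of the covariance matrix
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
eigenvalues = (s ** 2) / len(X)

# fraction of variance explained by each component
explained = eigenvalues / eigenvalues.sum()
print(np.round(explained, 4))
```

Note that the explained-variance fractions are the same whether you divide by n or n-1 during standardization, since that choice scales all eigenvalues equally.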
I used the first 2 eigenvectors to reconstruct the source data. Then I computed the Euclidean distance between each source item and its reconstruction as a measure of error:
[ 0]  39.6  17.8  191.4  3662.2 | recon err = 137.8913
[ 1]  40.3  18.2  189.1  3556.6 | recon err = 306.6433
[ 2]  36.5  18.6  188.2  3511.9 | recon err =  62.1100
[ 3]  39.1  18.5  187.7  3489.6 | recon err = 135.5649
[ 4]  46.4  17.7  189.2  3569.5 | recon err =  69.5266
[ 5]  45.3  18.2  186.8  3455.4 | recon err =  69.6141
[ 6]  45.0  16.8  195.1  3844.0 | recon err = 106.0892
[ 7]  46.3  18.9  182.0  3231.4 | recon err =  19.0492
[ 8]  46.3  13.9  211.7  4619.6 | recon err = 119.6264
[ 9]  48.7  14.1  209.0  4497.1 | recon err =  47.1339
[10]  46.6  13.9  211.1  4594.7 | recon err =  44.7474
[11]  45.2  13.9  211.7  4618.0 | recon err = 182.0142
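The reconstruction step can be sketched the same way (again, my NumPy approximation, not the post's C# code): project the standardized data onto the top 2 eigenvectors, map back, then destandardize to the original units before measuring the distance.

```python
import numpy as np

# 12-item Penguin subset: bill length, bill width, flipper length, body mass
X = np.array([
    [39.5, 17.4, 186.0, 3800.0],
    [40.3, 18.0, 195.0, 3250.0],
    [36.7, 19.3, 193.0, 3450.0],
    [38.9, 17.8, 181.0, 3625.0],
    [46.5, 17.9, 192.0, 3500.0],
    [45.4, 18.7, 188.0, 3525.0],
    [45.2, 17.8, 198.0, 3950.0],
    [46.1, 18.2, 178.0, 3250.0],
    [46.1, 13.2, 211.0, 4500.0],
    [48.7, 14.1, 210.0, 4450.0],
    [46.5, 13.5, 210.0, 4550.0],
    [45.4, 14.6, 211.0, 4800.0],
])

means, stds = X.mean(axis=0), X.std(axis=0)
Z = (X - means) / stds

# top k=2 eigenvectors from the SVD of the standardized data
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
W = Vt[:2].T                      # 4x2 projection matrix

# project onto the 2 components, map back, then destandardize
X_recon = ((Z @ W) @ W.T) * stds + means

# Euclidean distance between each item and its reconstruction
errors = np.linalg.norm(X - X_recon, axis=1)
worst = int(np.argmax(errors))
print(worst, round(float(errors[worst]), 4))
```

Running this flags item [1] with the largest error, matching the 306.6433 value in the table above.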
Based on this analysis, the largest reconstruction error is 306.6433, which is associated with data item [1], so item [1] is “the most anomalous” in some sense. The source item [1] and its reconstructed item are:
original:       40.3  18.0  195.0  3250.0
reconstructed:  40.3  18.2  189.1  3556.6
The reconstruction error is dominated by the body mass term. Ugh. This makes sense because the magnitudes of the body mass values are much greater than those of the other variables. This means you’d probably have to normalize the source data first (to get all variable values into the same range) and then z-score standardize the data (to accommodate PCA).
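A related variation — my own tweak, not what the post does — is to measure the reconstruction error on the z-scores themselves rather than in the original units. Since the standardized variables are all on a comparable scale, no single column can dominate the error just because of its magnitude:

```python
import numpy as np

# same 12-item subset: bill length, bill width, flipper length, body mass
X = np.array([
    [39.5, 17.4, 186.0, 3800.0],
    [40.3, 18.0, 195.0, 3250.0],
    [36.7, 19.3, 193.0, 3450.0],
    [38.9, 17.8, 181.0, 3625.0],
    [46.5, 17.9, 192.0, 3500.0],
    [45.4, 18.7, 188.0, 3525.0],
    [45.2, 17.8, 198.0, 3950.0],
    [46.1, 18.2, 178.0, 3250.0],
    [46.1, 13.2, 211.0, 4500.0],
    [48.7, 14.1, 210.0, 4450.0],
    [46.5, 13.5, 210.0, 4550.0],
    [45.4, 14.6, 211.0, 4800.0],
])

Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
W = Vt[:2].T

# reconstruction error measured in standardized space: every variable
# contributes on the same scale, so body mass cannot dominate
errors = np.linalg.norm(Z - (Z @ W) @ W.T, axis=1)
print(np.round(errors, 4))
```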
So, technically, the PCA reconstruction error technique works. But the technique just doesn’t feel right to me, based on many years of experience.
The technique is very, very, very complex. And complex is almost always bad. Because PCA operates on z-score standardized data, the technique only works with strictly numeric data, not mixed numeric and categorical data. And because reconstruction must use a discrete number of eigenvectors, the technique is not very granular.
There may be some scenarios where anomaly detection based on PCA reconstruction error is useful, but I suspect other techniques are better choices in almost all situations. But it was an interesting exploration anyway.
I usually post my demo code, but I’m not going to do so for this topic. The code is very long and very ugly.

Most of the experienced engineers I know have a good intuitive sense of when a software system design is overly complex. I’m no expert on bicycles, but my intuition tells me that these two examples might be a bit too complex.
