Anomaly Detection Using Principal Component Analysis (PCA) in Visual Studio Magazine

I wrote an article titled “Anomaly Detection Using Principal Component Analysis (PCA)” in the October 2021 edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/10/20/anomaly-detection-pca.aspx.

Principal component analysis (PCA) is a classical statistics technique that breaks down a data matrix into vectors called principal components. One way to use the principal components is to find anomalous data items using reconstruction error. Briefly, the idea is to break the source data matrix down into its principal components, then reconstruct the original data using just the first few principal components. The reconstructed data will be similar to, but not exactly the same as, the original data. The data items whose reconstructions differ the most from the corresponding original items are the anomalous items.
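The reconstruction error idea can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code from the article: the function name reconstruction_errors, the use of a covariance matrix, the sum of squared differences as the error measure, and the made-up demo data are all my assumptions.

```python
import numpy as np

def reconstruction_errors(X, n_components):
    """Hypothetical helper: one reconstruction error per row of X."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean                               # center each column
    cov = np.cov(Xc, rowvar=False)              # feature covariance matrix
    evals, evecs = np.linalg.eig(cov)
    evals, evecs = np.real(evals), np.real(evecs)
    order = np.argsort(evals)[::-1]             # largest eigenvalue first
    W = evecs[:, order[:n_components]]          # keep the first few components
    Z = Xc @ W                                  # project onto the components
    X_hat = Z @ W.T + mean                      # reconstruct from the projection
    return np.sum((X - X_hat) ** 2, axis=1)     # squared error per data item

# Made-up demo data: 50 items near the line y = 2x, plus one off-pattern item.
rng = np.random.default_rng(0)
t = rng.normal(size=50)
X = np.column_stack([t, 2.0 * t + rng.normal(scale=0.05, size=50)])
X = np.vstack([X, [3.0, -6.0]])

errors = reconstruction_errors(X, n_components=1)
most_anomalous = int(np.argmax(errors))         # index of the worst-reconstructed item
print(most_anomalous, errors[most_anomalous])
```

Items that follow the dominant pattern in the data are reconstructed well from one component, so the off-pattern item stands out with the largest error. Using all components would reconstruct every item almost exactly, which is why only the first few are kept.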

PCA is based on decomposition. Suppose that you want to decompose the integer value 64 into three components. There are many possible decompositions. One decomposition is (8, 4, 2) because 8 * 4 * 2 = 64. The first component, 8, accounts for most of the original value, the 4 accounts for less, and the 2 accounts for the least. If you use all three components to reconstruct the source integer, you will replicate the source exactly. But if you use just the first two components, you will get only an approximation of the source: 8 * 4 = 32.
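The integer analogy is easy to check directly. The snippet below is just the arithmetic from the paragraph above; the variable names are mine.

```python
from math import prod

components = [8, 4, 2]            # one possible decomposition of 64

full = prod(components)           # uses all three components
partial = prod(components[:2])    # uses only the first two components

print(full)      # 64
print(partial)   # 32
```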

Principal component analysis is a very complex decomposition that works on data matrices instead of single integer values.

The heart of the PCA function is a call to the NumPy linalg.eig() function (“linear algebra, eigen”). The function returns an eigenvalues array and an eigenvectors matrix. The eig() function is very complex, and implementing it from scratch is possible but usually not practical. The scikit-learn library has an implementation of PCA, but I don’t like taking on external dependencies when I can avoid it.
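Here is a small sketch of what the eig() call returns. The 3-by-3 symmetric matrix is made up and stands in for a covariance matrix; the sorting step is my assumption about how the results would typically be prepared, because eig() does not promise any particular order for the eigenvalues.

```python
import numpy as np

# Made-up symmetric matrix standing in for a covariance matrix.
cov = np.array([[4.0, 2.0, 0.6],
                [2.0, 3.0, 0.4],
                [0.6, 0.4, 1.0]])

evals, evecs = np.linalg.eig(cov)   # eigenvalues array, eigenvectors matrix

# Each column of evecs is the eigenvector for the matching entry of evals.
# Sort so the component that explains the most variance comes first.
order = np.argsort(evals)[::-1]
evals = evals[order]
evecs = evecs[:, order]

print(evals)         # eigenvalues, largest first
print(evecs[:, 0])   # first principal component
```

Each eigenvector is a column of the returned matrix, not a row, which is an easy detail to get wrong.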

Anomaly detection using principal component analysis reconstruction is one of the oldest unsupervised anomaly detection techniques, dating from the early 1900s. The main advantage of using PCA is simplicity, assuming you have access to a function that computes eigenvalues and eigenvectors. The two main disadvantages of using PCA are 1) the technique works only with strictly numeric data, and 2) because PCA uses matrices in memory, the technique does not scale to very large datasets.

In humans, anomalous physical features are sometimes good, sometimes bad, and sometimes neutral. Left: a girl with anomalously long arms. Center: an actress with an anomalously long neck. Right: a model with anomalously long legs.