"Principal Component Analysis from Scratch Using Singular Value Decomposition with C#" in Visual Studio Magazine

I wrote an article titled “Principal Component Analysis from Scratch Using Singular Value Decomposition with C#” in the February 2024 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/Articles/2024/02/16/pca-using-svd-for-ml.aspx.

Principal component analysis (PCA) is a classical machine learning technique. The goal of PCA is to transform a dataset into one with fewer columns. This is called dimensionality reduction. The transformed data can be used for visualization or as the basis for prediction using machine learning techniques that can’t deal with a large number of predictor columns.

There are two main techniques to implement PCA. The first technique, sometimes called classical, computes eigenvalues and eigenvectors from a covariance matrix derived from the source data. The second PCA technique sidesteps the covariance matrix and computes a singular value decomposition (SVD) of the (standardized) source data. My article presents a from-scratch C# implementation of the second technique: using SVD to compute eigenvalues and eigenvectors from the standardized source data.
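The link between the two techniques can be written out in a few lines. The notation is an assumption for illustration and is not taken from the article: X is the standardized n-by-d data matrix, X = U Σ V^T is its thin SVD, and the covariance matrix C is scaled by 1/n (some implementations use 1/(n-1) instead).

```latex
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
C = \tfrac{1}{n} X^{\top} X
  = \tfrac{1}{n} V \Sigma U^{\top} U \Sigma V^{\top}
  = V \left( \tfrac{1}{n} \Sigma^{2} \right) V^{\top}
```

Under these assumptions, the columns of V are the eigenvectors of C (the principal components), the eigenvalues are the squared singular values divided by n, and the transformed data is X V = U Σ. This is why the SVD technique never needs to form the covariance matrix explicitly.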

For simplicity, the demo program uses just nine data items:

[0]  39.1  18.7  181.0  3750.0
[1]  39.5  17.4  186.0  3800.0
[2]  40.3  18.0  195.0  3250.0
[3]  46.5  17.9  192.0  3500.0
[4]  50.0  19.5  196.0  3900.0
[5]  51.3  19.2  193.0  3650.0
[6]  46.1  13.2  211.0  4500.0
[7]  50.0  16.3  230.0  5700.0
[8]  48.7  14.1  210.0  4450.0

The nine data items are a small subset of the 333-item Penguin Dataset. Each line represents a penguin. The four columns are bill length, bill depth, flipper length and body mass. Because there are four columns, the data is said to have dimension = 4.
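The four columns have very different scales (body mass is in the thousands, bill depth is in the teens), which is why the SVD technique works on standardized data. The following is a minimal sketch of z-score standardization, written in Python for brevity rather than the article's C#. The function name standardize is hypothetical, and dividing by n rather than n - 1 is an assumption, although it is consistent with the transformed values in this post, whose squared entries sum to about 36 = 9 x 4.

```python
import math

# Nine-item subset from the post: bill length, bill depth, flipper length, body mass.
data = [
    [39.1, 18.7, 181.0, 3750.0],
    [39.5, 17.4, 186.0, 3800.0],
    [40.3, 18.0, 195.0, 3250.0],
    [46.5, 17.9, 192.0, 3500.0],
    [50.0, 19.5, 196.0, 3900.0],
    [51.3, 19.2, 193.0, 3650.0],
    [46.1, 13.2, 211.0, 4500.0],
    [50.0, 16.3, 230.0, 5700.0],
    [48.7, 14.1, 210.0, 4450.0],
]

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column std dev.

    Assumption: population standard deviation (divide by n, not n - 1).
    """
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dim)]
    return [[(r[j] - means[j]) / stds[j] for j in range(dim)] for r in rows]

z = standardize(data)
for i, row in enumerate(z):
    print(f"[{i}] " + "  ".join(f"{v:8.4f}" for v in row))
```

After standardization every column has mean 0 and standard deviation 1, so no single column dominates the decomposition just because of its units.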

After applying PCA, the transformed data is:

[0]   1.8846   0.6205   0.6592  -0.3440
[1]   1.3345   0.9543   0.3440  -0.2363
[2]   1.4610   0.5893   0.0603   0.7158
[3]   0.8447  -0.3894  -0.4378   0.0696
[4]   0.3924  -1.4465   0.0569  -0.0469
[5]   0.5486  -1.5813  -0.4221  -0.0953
[6]  -1.6896   1.1743  -0.6539  -0.0640
[7]  -3.1366  -0.3829   1.1156   0.1179
[8]  -1.6394   0.4616  -0.7222  -0.1168
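Values like those above can be produced by standardizing the data and computing an SVD. The sketch below uses a one-sided Jacobi iteration, which is one common from-scratch way to compute an SVD and is not necessarily the algorithm used in the article. It is written in Python for brevity, the function names are hypothetical, and standardization by the population standard deviation is an assumption. The sign of each column is arbitrary, so signs may differ from the output shown above.

```python
import math

data = [
    [39.1, 18.7, 181.0, 3750.0], [39.5, 17.4, 186.0, 3800.0],
    [40.3, 18.0, 195.0, 3250.0], [46.5, 17.9, 192.0, 3500.0],
    [50.0, 19.5, 196.0, 3900.0], [51.3, 19.2, 193.0, 3650.0],
    [46.1, 13.2, 211.0, 4500.0], [50.0, 16.3, 230.0, 5700.0],
    [48.7, 14.1, 210.0, 4450.0],
]

def standardize(rows):
    # Assumption: z-score with population std dev (divide by n).
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dim)]
    return [[(r[j] - means[j]) / stds[j] for j in range(dim)] for r in rows]

def jacobi_svd_scores(x, max_sweeps=50, tol=1e-12):
    """One-sided Jacobi SVD sketch.

    Rotates pairs of columns of a copy of x until all columns are mutually
    orthogonal. The rotated matrix equals x times V (that is, U times S),
    and V accumulates the rotations. Returns (scores, components) with
    columns sorted by decreasing singular value.
    """
    dim = len(x[0])
    u = [row[:] for row in x]
    v = [[float(i == j) for j in range(dim)] for i in range(dim)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(dim - 1):
            for q in range(p + 1, dim):
                alpha = sum(r[p] * r[p] for r in u)
                beta = sum(r[q] * r[q] for r in u)
                gamma = sum(r[p] * r[q] for r in u)
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q are already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to columns p, q of u and of v.
                for m in (u, v):
                    for r in m:
                        rp, rq = r[p], r[q]
                        r[p] = c * rp - s * rq
                        r[q] = s * rp + c * rq
        if not rotated:
            break
    # Column norms of the rotated matrix are the singular values.
    norms = [math.sqrt(sum(r[j] ** 2 for r in u)) for j in range(dim)]
    order = sorted(range(dim), key=lambda j: -norms[j])
    scores = [[r[j] for j in order] for r in u]
    comps = [[r[j] for j in order] for r in v]
    return scores, comps

scores, comps = jacobi_svd_scores(standardize(data))
for i, row in enumerate(scores):
    print(f"[{i}] " + "  ".join(f"{v:8.4f}" for v in row))
```

The columns of comps play the role of the eigenvectors (principal components), and scores is the standardized data projected onto them.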

The first two columns capture most of the information contained in the source data: their squared values account for roughly 89 percent of the total variance across the four transformed columns. Therefore the first two columns of the transformed data can be used as a surrogate for the source data. Typical uses are to visualize the data in a two-dimensional graph, or as input to machine learning techniques that can’t handle a large number of columns.
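How much information each column carries can be checked directly from the transformed data listed above. The sketch below, in Python for brevity, treats the sum of squares of each transformed column as its share of the total variance. This works because each transformed column has mean zero; the variable names are hypothetical.

```python
# Transformed data copied from the post (nine items, four principal components).
transformed = [
    [ 1.8846,  0.6205,  0.6592, -0.3440],
    [ 1.3345,  0.9543,  0.3440, -0.2363],
    [ 1.4610,  0.5893,  0.0603,  0.7158],
    [ 0.8447, -0.3894, -0.4378,  0.0696],
    [ 0.3924, -1.4465,  0.0569, -0.0469],
    [ 0.5486, -1.5813, -0.4221, -0.0953],
    [-1.6896,  1.1743, -0.6539, -0.0640],
    [-3.1366, -0.3829,  1.1156,  0.1179],
    [-1.6394,  0.4616, -0.7222, -0.1168],
]

dim = len(transformed[0])

# Sum of squares per column; proportional to the variance of that column.
col_ss = [sum(row[j] ** 2 for row in transformed) for j in range(dim)]
total_ss = sum(col_ss)

# Fraction of total variance explained by each component, and the running total.
var_explained = [ss / total_ss for ss in col_ss]
cumulative = [sum(var_explained[: j + 1]) for j in range(dim)]

for j in range(dim):
    print(f"component {j}: {var_explained[j]:.4f}  cumulative: {cumulative[j]:.4f}")
```

For this data the four components account for about 67, 23, 9 and 2 percent of the variance, so the first two together cover roughly 89 percent.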

Another use of PCA is to reconstruct an approximation of the original source data from the first few columns of the transformed data. Because the remaining columns are discarded, the reconstructed data will not exactly match the source data. Items whose reconstruction is farthest from the original version are candidates for anomalies.
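Reconstruction error can be illustrated without rebuilding the data. Assuming the principal components are orthonormal (true for PCA computed via SVD), the distance between a standardized item and its reconstruction from the first k columns equals the length of that item's discarded columns. The sketch below, in Python for brevity, uses that shortcut on the transformed data from this post with k = 2. The function name is hypothetical, and the error is measured in standardized units, which may differ from how the article measures it.

```python
import math

# Transformed data copied from the post (nine items, four principal components).
transformed = [
    [ 1.8846,  0.6205,  0.6592, -0.3440],
    [ 1.3345,  0.9543,  0.3440, -0.2363],
    [ 1.4610,  0.5893,  0.0603,  0.7158],
    [ 0.8447, -0.3894, -0.4378,  0.0696],
    [ 0.3924, -1.4465,  0.0569, -0.0469],
    [ 0.5486, -1.5813, -0.4221, -0.0953],
    [-1.6896,  1.1743, -0.6539, -0.0640],
    [-3.1366, -0.3829,  1.1156,  0.1179],
    [-1.6394,  0.4616, -0.7222, -0.1168],
]

def reconstruction_errors(scores, k):
    """Euclidean error per item when only the first k components are kept.

    Relies on the components being orthonormal, so the error is the norm
    of the discarded score columns.
    """
    return [math.sqrt(sum(v * v for v in row[k:])) for row in scores]

errors = reconstruction_errors(transformed, k=2)
worst = max(range(len(errors)), key=lambda i: errors[i])

for i, e in enumerate(errors):
    print(f"[{i}] error = {e:.4f}")
print(f"largest reconstruction error: item [{worst}]")
```

With k = 2, item [7], the heaviest penguin with the longest flipper, has the largest error (about 1.12 standardized units), which makes it the most anomalous item by this measure.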

Principal component analysis is the best known classical statistics technique for data dimensionality reduction. There are many science fiction novels that feature human dimensionality reduction.

Left: “Tarzan and the Ant-Men” (1924) by Edgar Rice Burroughs. Cover art by Richard Powers. The tenth of 24 Tarzan novels. Tarzan meets a peaceful race of 18-inch tall people but gets shrunk to their size by one of their evil scientists. Widely considered one of the best Tarzan novels (but I prefer the Mars series by Burroughs). My grade = B-.

Center: “Fantastic Voyage” (1966) by Isaac Asimov. Cover art by Tom Chantrell. A group of scientists are shrunk to microscopic size and then injected into a comatose man to operate on a brain clot. The book is a novelization of the movie (also 1966), where the screenplay was written by Harry Kleiner, which was based on a story by David Duncan, which in turn was based on an unpublished short story written by Jerome Bixby and Otto Klement. Quite a collaboration. My grade = B+.

Right: “The Micronauts” (1977) by Gordon Williams. Cover art by Gerald Grace. The world is running out of food so scientists devise a way to shrink people so food goes further. Part of a trilogy. Well-liked by many reviewers but I found the story too slow for my tastes. My grade = C.

1 Response to “Principal Component Analysis from Scratch Using Singular Value Decomposition with C#” in Visual Studio Magazine

Thorsten Kleppe says:

February 29, 2024 at 3:49 am

Not sure how to do it, but pretty cool what PCA can do:
https://twitter.com/svpino/status/1756004150280650968

The demo presented in the article works the same way as your classic PCA implementation from:
jamesmccaffrey.wordpress.com/2023/11/07/principal-component-analysis-pca-from-scratch-using-csharp/

Very cool, thank you James.
