Why Linear Regression Pseudo-Inverse Training Implemented from Scratch Fails With One-Hot Encoded Predictors

The bottom line is: If I’m implementing linear regression from scratch and the data has categorical predictors, how the data is encoded depends on whether training will be done using stochastic gradient descent (SGD), or will be done using a matrix inverse technique, typically relaxed Moore-Penrose pseudo-inverse, or left pseudo-inverse via normal equations.

For SGD training, categorical data can be one-hot encoded. For inverse-based training techniques, data should be drop-first encoded (also called dummy encoding).

As is often the case, explaining what a complex problem is, is far more difficult than describing the answer. So bear with me.

Suppose you have some raw data that looks like:

F,short,24,arkansas,29500,liberal
M,tall,39,delaware,51200,moderate
F,short,63,colorado,75800,conservative
M,medium,36,illinois,44500,moderate
F,short,27,colorado,28600,liberal
. . .

Each line represents a person. The fields are sex, height, age, State (only Arkansas, Colorado, Delaware, Illinois), income, and political leaning. The goal is to predict income from the other variables, and you need to use linear regression.

The categorical variables, sex, height, politics, need to be encoded. The numeric predictor and the target y value should be normalized. The most common way to encode the categorical height, State, and politics variables is to use one-hot encoding. The most common way to encode the Boolean sex variable is to use 0-1 encoding. If you use divide-by-constant normalization (100 for age, 100,000 for income), the resulting normalized and 0-1, one-hot encoded data looks like:

1,   1, 0, 0,   0.24,   1, 0, 0, 0,   0.29500,   0, 0, 1
0,   0, 0, 1,   0.39,   0, 0, 1, 0,   0.51200,   0, 1, 0
1,   1, 0, 0,   0.63,   0, 1, 0, 0,   0.75800,   1, 0, 0
0,   0, 1, 0,   0.36,   0, 0, 0, 1,   0.44500,   0, 1, 0
1,   1, 0, 0,   0.27,   0, 1, 0, 0,   0.28600,   0, 0, 1
. . .

OK, so far so good. But as it turns out, this normalization and encoding scheme works for SGD training but often fails for any training technique that involves a matrix inverse.

By “fails”, I mean the training will either blow up entirely, or succeed but give model weights and bias values with extremely large magnitudes, not close to the the values obtained by other training techniques.

SGD training is iterative. You specify a learning rate and a number of training epochs, and the model weights and the bias get closer and closer to optimal values. The downside here is the learning rate and max epochs values must be determined by trial and error. But SGD works with the one-hot encoding scheme.

The pseudo-inverse via normal equations technique, expressed in an equation is:

w = inv(Xt * X) * Xt * y

The w is a vector of the bias (in cell [0]) and weights (the remaining cells) that you want. X is a design matrix of predictor values, with a leading column of 1.0 values added to account for the model bias. Xt is the transpose of the design matrix. The inv() is a standard matrix inverse, but because Xt * X is mathematically positive-definite if you add a small regularization constant such as 1.0e-8 to the diagonal, you can use the Cholesky decomposition matrix inverse, which is much less complicated than alternatives. The first two * are matrix multiplication. The last * is matrix-to-vector multiplication.

It turns out that a one-hot encoded matrix of predictor values, X, is almost always ill-conditioned meaning you can’t reliably compute the inverse of X. (This last sentence is an enormous topic that would take pages to explain thoroughly). But pseudo-inverse via normal equations technique sometimes works with one-hot encoded predictors mostly Xt * X is square, and when you add a regularization constant to the diagonal, you also condition the matrix so that Cholesky inv() works.

The standard pseudo-inverse technique does not work. Expressed in an equation:

w = pinv(X) * y

The pinv() is relaxed Moore-Penrose pseudo-inverse. There are over a dozen variations of MP pinv(), all of them blisteringly complicated. Examples include SVD via Golub-Reinsch, SVD via Jacobi, SVD via QR Householder, QR via modified Gram-Schmidt, QR via Givens, QR via Householder. The standard SVD implementation in LAPACK is over 10,000 lines of Fortran code.

If you implement a MP pinv() function from-scratch, like I often do and use anything other than the insanely complicated SVD, the problem is that the design matrix X is not square and there is no way to condition the matrix, and pinv() will usually fail for any algorithm other than a most robust SVD algorithm which requires a minimum of 1,000 lines of very complicated code.

To recap to this point, for linear regression implemented from scratch, if you one-hot encode categorical predictors, SGD training works, pseudo-inverse via normal equations training sometimes works, but standard Moore-Penrose pseudo-inverse usually fails.

The most common way to deal with an ill-conditioned matrix that had one-hot encoded values, is to simply drop the first one of each encoding. For example, for variable Height, basic one-hot encoding is short = 1 0 0, medium = 0 1 0, tall = 0 0 1. If you drop the first categorical value, short, you get medium = 1 0, tall = 0 1, and short is the reference value and is assigned short = 0 0.

Dropping the first of each one-hot encoded categorical value removes the collinearity and allows matrix inverse and pseudo-inverse to work. So, if the raw data is:

F,short,24,arkansas,29500,liberal
M,tall,39,delaware,51200,moderate
F,short,63,colorado,75800,conservative
M,medium,36,illinois,44500,moderate
F,short,27,colorado,28600,liberal
. . .

Then if encoded using one-hot for SGD training:

1,   1, 0, 0,   0.24,   1, 0, 0, 0,   0.29500,   0, 0, 1
0,   0, 0, 1,   0.39,   0, 0, 1, 0,   0.51200,   0, 1, 0
1,   1, 0, 0,   0.63,   0, 1, 0, 0,   0.75800,   1, 0, 0
0,   0, 1, 0,   0.36,   0, 0, 0, 1,   0.44500,   0, 1, 0
1,   1, 0, 0,   0.27,   0, 1, 0, 0,   0.28600,   0, 0, 1
. . .

But if encoded using drop-first for relaxed MP pseudo-inverse or pseudo-inverse via normal equations:

1,   0, 0,   0.24,   0, 0, 0,   0.29500,   0, 1
0,   0, 1,   0.39,   0, 1, 0,   0.51200,   1, 0
1,   0, 0,   0.63,   1, 0, 0,   0.75800,   0, 0
0,   1, 0,   0.36,   0, 0, 1,   0.44500,   1, 0
1,   0, 0,   0.27,   1, 0, 0,   0.28600,   0, 1
. . .

Interesting stuff.

I know a lot about linear regression. I know next to nothing about Art. But I can get a gut instinct about images that I think are nice in some subjective way. I like black and white shadow photography when it’s done well. But based on many of the images I see on the Internet, it’s more difficult than I expected to get an attractive image. Here are three images I like that have a kind of linear theme.