Bottom line: If you have categorical data (or mixed numeric and categorical) and you are going to create a linear ridge regression model, you can use one-hot encoding on categorical variables that have three or more possible values. For binary categorical variables you can use either minus-one-plus-one encoding, or zero-one encoding, or one-hot encoding (1 0, 0 1) — all three encoding techniques for binary predictors give the same results (accuracy, root mean squared error) but with slightly different model coefficients.
Linear ridge regression (LRR) predicts a single numeric value. LRR is a slightly enhanced version of basic linear regression. LRR uses the ridge technique to mitigate model overfitting. A nice side-effect of the ridge technique is that it “conditions” the underlying matrix so that the matrix inversion doesn’t fail.
Note: Linear ridge regression is a primitive technique in the sense that it can’t handle non-linear data well. I use LRR mostly to establish a baseline and then I use more powerful techniques: k-nearest neighbors regression, kernel ridge regression, Gaussian process regression, decision tree regression, neural network regression.
There are many Internet resources about LRR. But there is conflicting information about whether LRR can be used with categorical predictor data, and if it can, how the categorical data should be encoded.
I ran some experiments. I started with raw data that looks like:
F  24  michigan  29500.00  lib
M  39  oklahoma  51200.00  mod
F  63  nebraska  75800.00  con
. . .
The goal is to predict income from sex, age, State, political leaning. There are 200 training items and 40 test items.
Dataset 1 used minus-one-plus-one encoding on sex, and one-hot encoding on State and politics. I divided age values by 100 and income values by 100,000, giving:
 1  0.24  1 0 0  0.2950  0 0 1
-1  0.39  0 0 1  0.5120  0 1 0
 1  0.63  0 1 0  0.7580  1 0 0
. . .
Dataset 2 used zero-one encoding on sex:
1  0.24  1 0 0  0.2950  0 0 1
0  0.39  0 0 1  0.5120  0 1 0
1  0.63  0 1 0  0.7580  1 0 0
. . .
Dataset 3 used one-hot encoding on sex:
0 1  0.24  1 0 0  0.2950  0 0 1
1 0  0.39  0 0 1  0.5120  0 1 0
0 1  0.63  0 1 0  0.7580  1 0 0
. . .
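As a concrete illustration, here is a short Python sketch of the Dataset 1 encoding scheme. The category orderings for State and politics are my assumptions for illustration; the original post does not specify them.

```python
# Sketch: encode one raw record into the Dataset 1 format.
# The category orderings below are assumptions, not from the original data file.

STATES = ["michigan", "nebraska", "oklahoma"]   # assumed ordering
POLITICS = ["con", "mod", "lib"]                # assumed ordering

def one_hot(value, categories):
    """Return a one-hot list, e.g. one_hot('mod', POLITICS) -> [0, 1, 0]."""
    return [1 if c == value else 0 for c in categories]

def encode(sex, age, state, income, politics):
    row = []
    row.append(1 if sex == "F" else -1)      # minus-one-plus-one for sex
    row.append(age / 100.0)                  # divide age by 100
    row.extend(one_hot(state, STATES))       # one-hot for State
    row.append(income / 100000.0)            # divide income by 100,000
    row.extend(one_hot(politics, POLITICS))  # one-hot for politics
    return row

print(encode("F", 24, "michigan", 29500.00, "lib"))
# -> [1, 0.24, 1, 0, 0, 0.295, 0, 0, 1]
```

Swapping the first append for `1 if sex == "F" else 0`, or for a two-element one-hot list, produces Dataset 2 or Dataset 3 respectively.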
I created three linear ridge regression models with alpha/noise = 0.05. All three models produced identical accuracy and root mean squared error metrics, but of course the coefficients and constant terms were slightly different.
Model 1 acc on train (0.10) = 0.9150
Model 1 acc on test (0.10)  = 0.9500
RMSE 1 on train = 0.0260
RMSE 1 on test  = 0.0262

Model 2 acc on train (0.10) = 0.9150
Model 2 acc on test (0.10)  = 0.9500
RMSE 2 on train = 0.0260
RMSE 2 on test  = 0.0262

Model 3 acc on train (0.10) = 0.9150
Model 3 acc on test (0.10)  = 0.9500
RMSE 3 on train = 0.0260
RMSE 3 on test  = 0.0262
Accuracy was computed where a predicted income is correct if it’s within 10% of the true value. The identical results were expected — the three different forms of encoding the sex variable did not add or remove information.
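One way to sanity-check this result is with scikit-learn's Ridge class. My implementation isn't shown here, so this sketch uses small synthetic data rather than the 200-item income dataset, but it exercises the same idea: the same binary column encoded three ways, fed to ridge regression with a small alpha.

```python
# Sketch: three encodings of a binary predictor give essentially the
# same ridge regression predictions. Synthetic data, not the income data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, n)             # 0 or 1
age = rng.uniform(0.18, 0.80, n)        # already scaled, like age / 100
y = 0.5 * age + 0.10 * sex + rng.normal(0, 0.02, n)

X1 = np.column_stack([2 * sex - 1, age])    # minus-one-plus-one
X2 = np.column_stack([sex, age])            # zero-one
X3 = np.column_stack([sex, 1 - sex, age])   # one-hot (two columns)

preds = [Ridge(alpha=0.05).fit(X, y).predict(X) for X in (X1, X2, X3)]

# With a small alpha, the predictions agree to several decimal places.
print(np.max(np.abs(preds[0] - preds[1])))
print(np.max(np.abs(preds[0] - preds[2])))
```

Strictly speaking, because the ridge penalty is applied to the coefficients, rescaling a column changes the penalty slightly, so agreement is to several decimals rather than to machine precision; with a small alpha the differences vanish in rounded metrics like the ones above.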
So, the three encoding techniques for binary predictor variables (-1 +1; 0 1; one-hot 1 0 / 0 1) all work fine. Which should you use? I think it's a matter of personal preference. I like the minus-one-plus-one encoding but most of my colleagues prefer zero-one encoding. Encoding binary predictors using one-hot encoding adds an extra variable so I rarely use that technique.
When using linear ridge regression you should normalize numeric predictors to the same range. Note that if you set the alpha/noise value to 0, linear ridge regression reduces to ordinary linear regression, and one-hot encoding can then produce perfectly correlated columns, which makes the underlying matrix singular so the matrix inversion fails.
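The conditioning effect is visible directly in the closed-form ridge solution, w = (XᵀX + αI)⁻¹ Xᵀy. A minimal sketch with made-up data: a bias column plus a one-hot pair sums column-wise to the bias column, so XᵀX is singular, but any alpha greater than 0 restores invertibility.

```python
# Sketch: a one-hot pair plus a bias column makes X^T X singular
# when alpha = 0; any alpha > 0 makes the system solvable.
import numpy as np

# bias column + one-hot pair for a binary variable (made-up data);
# the two one-hot columns sum to the bias column
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
y = np.array([0.30, 0.50, 0.35, 0.55])

A = X.T @ X
print(np.linalg.matrix_rank(A))   # 2, not 3 -- the matrix is singular

alpha = 0.05
w = np.linalg.solve(A + alpha * np.eye(3), X.T @ y)   # ridge closed form
print(w)
```

With alpha = 0 the `solve` call would face a singular system; adding alpha on the diagonal makes the matrix positive definite regardless of correlated columns.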
Important note: For ordinary linear regression you should use dummy coding for categorical predictors. But dummy coding is ugly and I avoid it when possible.

In addition to numeric and categorical data, a third type is ordinal data, where there's an implied order. For example, "terrible", "bad", "average", "good", "excellent" are ordinal values. Unlike pure categorical data (e.g., "red", "blue", "green"), n-level ordinal data can be encoded 1, 2, . . ., n.
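A minimal sketch of that ordinal encoding, using the five-level rating scale from the example; the optional rescaling to [0, 1] is my addition, to keep ordinal values on the same scale as normalized numeric predictors:

```python
# Sketch: integer-encode ordinal data, preserving the implied order.
RATINGS = ["terrible", "bad", "average", "good", "excellent"]
rank = {level: i + 1 for i, level in enumerate(RATINGS)}  # 1 . . n

print(rank["terrible"])    # 1
print(rank["excellent"])   # 5

# Optionally rescale to [0, 1] to match normalized numeric predictors:
scaled = {level: (r - 1) / (len(RATINGS) - 1) for level, r in rank.items()}
print(scaled["average"])   # 0.5
```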
I read some academic poll data from political science professors yesterday, rating the worst U.S. vice presidents in history. Two recent VPs who were rated as terrible are Richard B. Cheney (VP under George W. Bush, 2001-2009) and Kamala D. Harris (VP under Joe Biden, 2021-).
Left: Cheney urged an attack on Iraq based on their nuclear weapons arsenal. Except Iraq didn’t have a nuclear arsenal. In 2006 Cheney accidentally wounded a hunting buddy while on a quail hunt. Ouch.
Right: Harris is known for her inept command of the English language. One gem is, “So, I think it’s very important, as you have heard from so many incredible leaders, for us at every moment in time — and certainly this one — to see the moment in time in which we exist and are present, and to be able to contextualize it, to understand where we exist in the history and in the moment as it relates not only to the past but the future.” Wow.
I’m glad I have no interest in politics. The characteristics of people who are drawn into politics are the characteristics I least like in a person.
