Machine Learning Regression – Normalization and Encoding Recommended Techniques

The goal of a machine learning regression problem is to predict a single numeric value. For example, you might want to predict the bank account balance of a person based on sex (male, female), age, annual income, political leaning (conservative, moderate, liberal), and height (short, medium, tall).

Normalization scales numeric predictors to a similar range, for example, dividing all age values by 100 so that they’re all between 0.0 and 1.0. Encoding converts categorical data into numeric values. For example, you could convert height values to short = 0.25, medium = 0.50, tall = 0.75.

For numeric predictor values, divide-by-k (where k is some constant) is usually simple and effective. For example, divide all annual income values by 100,000 so that they're scaled between 0.0 and 1.0 (assuming all incomes are less than $100,000). Another common technique is min-max normalization. Suppose some predictor variable has smallest value -3.0 and largest value 9.0. To min-max normalize any x, the equation is x' = (x - min) / (max - min). So if x = 3.0, x' = (3.0 - (-3.0)) / (9.0 - (-3.0)) = 6.0 / 12.0 = 0.50. All min-max normalized values will be between 0.0 and 1.0. A third technique is z-score normalization, but it is rarely used now.
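Here is a minimal Python sketch of both techniques. The function names and the sample values are my own illustrations, not from any particular library:

import numpy as np

def divide_by_k(x, k):
    # divide-by-k normalization: scale every value by a constant k
    return np.asarray(x, dtype=np.float64) / k

def min_max(x):
    # min-max normalization: x' = (x - min) / (max - min)
    x = np.asarray(x, dtype=np.float64)
    return (x - x.min()) / (x.max() - x.min())

print(divide_by_k([18, 35, 62], 100))  # [0.18 0.35 0.62]
print(min_max([-3.0, 3.0, 9.0]))       # [0.  0.5 1. ]

Note that divide-by-k needs a k at least as large as any value you expect to see, and min-max must remember the training min and max so that new data can be normalized the same way.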

Non-numeric predictor variables fall into three categories. Ordinal predictors have an inherent order, for example, a height variable with values short, medium, tall. Nominal predictors have no inherent order, for example, a political leaning variable with values conservative, moderate, liberal. Binary predictors have exactly two possible values, for example, sex with values male, female.

One-hot encoding of a political leaning variable could be conservative = (1, 0, 0), moderate = (0, 1, 0), liberal = (0, 0, 1).
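One-hot encoding is essentially a lookup from category to vector. A tiny Python sketch (the dictionary name is mine):

POLITICS_ONE_HOT = {
    "conservative": (1, 0, 0),
    "moderate":     (0, 1, 0),
    "liberal":      (0, 0, 1),
}
print(POLITICS_ONE_HOT["moderate"])  # (0, 1, 0)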

One-over-n-hot encoding takes the number of possible values into account. For example, conservative = (1/3, 0, 0), moderate = (0, 1/3, 0), liberal = (0, 0, 1/3).
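In code, one-over-n-hot is just a one-hot vector scaled by 1/n. A sketch (the function name is mine):

def one_over_n_hot(index, n):
    # one-hot vector of length n, with 1/n instead of 1 at the hot position
    vec = [0.0] * n
    vec[index] = 1.0 / n
    return vec

print(one_over_n_hot(0, 3))  # [0.3333..., 0.0, 0.0] -> conservative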

For equal-interval encoding, all pairs of consecutive values are the same distance apart. For variable height, short = 0.25, medium = 0.50, tall = 0.75.
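With n ordered values, equal-interval encoding maps the i-th value (counting from 1) to i / (n + 1). A sketch (the helper name is mine):

def equal_interval(ordered_values):
    # map the i-th of n ordered values to i / (n + 1), i = 1..n
    n = len(ordered_values)
    return {v: (i + 1) / (n + 1) for i, v in enumerate(ordered_values)}

print(equal_interval(["short", "medium", "tall"]))
# {'short': 0.25, 'medium': 0.5, 'tall': 0.75}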

Binary variables are conceptually tricky. The bottom line is that zero-one encoding usually works well enough, even when it's not theoretically optimal.

Binary variables can be zero-one encoded (male = 0, female = 1), minus-one-plus-one encoded (male = -1, female = +1), one-hot encoded (male = (1, 0), female = (0, 1)), one-over-n-hot encoded (male = (1/2, 0), female = (0, 1/2)), simplified one-hot encoded (male = 0.25, female = 0.75), equal-interval encoded (male = 0.3333, female = 0.6667), or modified equal-interval encoded (male = 0.25, female = 0.75). Because of this complexity, in most cases I use modified equal-interval encoding (for nearest neighbors and kernel ridge regression) or zero-one encoding (for all other techniques).
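To keep the seven options straight, here is a lookup-table sketch in Python (the table name is mine; the values follow the list above):

BINARY_ENCODINGS = {
    "zero-one":                {"male": 0,        "female": 1},
    "minus-one-plus-one":      {"male": -1,       "female": 1},
    "one-hot":                 {"male": (1, 0),   "female": (0, 1)},
    "one-over-n-hot":          {"male": (0.5, 0), "female": (0, 0.5)},
    "simplified one-hot":      {"male": 0.25,     "female": 0.75},
    "equal-interval":          {"male": 1/3,      "female": 2/3},
    "modified equal-interval": {"male": 0.25,     "female": 0.75},
}
print(BINARY_ENCODINGS["zero-one"]["female"])  # 1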


[Table: recommended normalization and encoding techniques for 10 common regression techniques.]

Machine learning data normalization and encoding is surprisingly subtle and conceptually complex. The table above lists my usual recommendations for 10 common regression techniques.



Data normalization scales a set of data so that the values are similar in some respect, for example, all between 0.0 and 1.0. I've sometimes wondered whether artists think about the need to normalize their personal style so that their works are recognizable as theirs. I am probably over-thinking this.

Artist Doug Sneyd (1931-2025) was a popular illustrator and cartoonist in the 1960s and 1970s. His style is very distinctive — angular noses on men, stylish clothes on women, people usually holding something, and scenes often set in a bar or cocktail party. His style just screams “1960s!” to me.

