I Have Never Understood Imputing Missing Values for Machine Learning

The goal of a machine learning model is to predict a single numeric value (regression) or a single discrete value, such as the political leaning of a person (classification, or binary classification when there are only two possible values). To create a model you must have data. For example, suppose some data looks like:

F, 24, Michigan, 29500.00, liberal
M, 39, Oklahoma, 51200.00, moderate
F, 63, Nebraska, 75800.00, conservative
. . .

The fields are sex, age, State, income, politics. Using this data, you could predict any of the variables from the others. For example, you could predict income from sex, age, State, and politics.

Real-life data often has missing values. For example:

F,       24, Michigan, missing,  liberal
M,       39, Oklahoma, 51200.00, moderate
missing, 63, Nebraska, 75800.00, conservative
. . .

The obvious, and best, approach is to toss out data rows that have one or more missing values. But for some reason, a standard machine learning technique for missing data is to supply imputed values. For example, for the missing sex value in the third row, you could insert the most common sex, male or female. And for the missing income value in the first row, you could insert the average of the known income values.
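To make the two approaches concrete, here is a minimal sketch in plain Python. It uses only the three example rows shown above. The use of None to mark a missing value, the dictionary keys, and the helper names drop_incomplete() and impute() are my own assumptions for illustration, not part of any library.

```python
from collections import Counter
from statistics import mean

# The three example rows; None marks a missing value (an assumed convention).
rows = [
    {"sex": "F", "age": 24, "state": "Michigan", "income": None, "politics": "liberal"},
    {"sex": "M", "age": 39, "state": "Oklahoma", "income": 51200.00, "politics": "moderate"},
    {"sex": None, "age": 63, "state": "Nebraska", "income": 75800.00, "politics": "conservative"},
]

def drop_incomplete(data):
    """Approach 1: keep only rows that have no missing values."""
    return [r for r in data if all(v is not None for v in r.values())]

def impute(data):
    """Approach 2: fill missing income with the mean of the known incomes,
    and missing sex with the most common known sex."""
    known_incomes = [r["income"] for r in data if r["income"] is not None]
    known_sexes = [r["sex"] for r in data if r["sex"] is not None]
    mean_income = mean(known_incomes)
    # With a tie (one F, one M here) the first value seen wins.
    common_sex = Counter(known_sexes).most_common(1)[0][0]

    filled = []
    for r in data:
        r = dict(r)  # copy so the original rows are left untouched
        if r["income"] is None:
            r["income"] = mean_income
        if r["sex"] is None:
            r["sex"] = common_sex
        filled.append(r)
    return filled

complete = drop_incomplete(rows)
imputed = impute(rows)

print(len(complete), "complete row(s) kept")
for r in imputed:
    print(r)
```

Notice that with only two known sex values, one F and one M, the "most common" value is an arbitrary tie-break. The imputed value looks just like real data once it is in the table.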

The scikit-learn library has a module for supplying imputed values, sklearn.impute, but I can’t think of any scenarios where using it would be a good idea.
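For reference, a minimal sketch of what using that module looks like, assuming the SimpleImputer class from sklearn.impute with its "mean" strategy. I kept only the two numeric columns (age and income) from the example rows, and using np.nan as the missing-value marker is an assumption about how the data was loaded.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: age, income. np.nan marks the missing income in the first row.
X = np.array([
    [24.0, np.nan],
    [39.0, 51200.00],
    [63.0, 75800.00],
])

# Replace each missing value with the mean of the known values in its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The missing income is replaced by the average of the two known incomes, and nothing in the resulting array indicates which value was invented.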

Imputing missing values makes absolutely no sense to me from a principled point of view. At best, the resulting prediction model will be sketchy, because it was trained partly on made-up values. At worst, the model could be flat-out misleading.

There’s no big moral to this post other than that common sense should always prevail.



Missing data in machine learning is always bad. But missing detail in art is a good thing. I don’t like photo-realistic art; I much prefer a certain level of abstraction where detail is missing.

