Data Prep for Machine Learning: Outliers

I wrote an article titled “Data Prep for Machine Learning: Outliers” in the July 2020 edition of Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2020/07/14/ml-data-prep-outliers.aspx.

The article explains how to programmatically identify and deal with outlier data. Suppose you have a data file of loan applications. Examples of outlier data include a person’s age of 99 (either a very old applicant or possibly a placeholder value that was never changed) and a person’s country of “Cannada” (probably a transcription error).

When the source data file is small, about 500 lines or fewer, you can usually find and deal with outlier data manually. But in almost all realistic scenarios with large datasets, you must handle outlier data programmatically.

The article explains how to find numeric data outliers by computing z-scores, and how to find categorical data outliers by computing frequency counts.
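Here is a minimal sketch of both ideas in plain Python, not the code from the article. The function names, the sample data, the z-score threshold of 3.0, and the "seen at most once" rule for rare categories are all assumptions chosen for illustration. A z-score is (x - mean) / standard deviation, and a frequency count is just a tally of how often each category value appears.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for items whose |z-score| exceeds threshold.

    The threshold of 3.0 is an illustrative assumption, not a fixed rule.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be flagged
    flagged = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged


def rare_categories(values, max_count=1):
    """Return {category: count} for categories seen at most max_count times."""
    counts = Counter(values)
    return {cat: n for cat, n in counts.items() if n <= max_count}


# Hypothetical loan-application columns, made up for this example.
ages = [25, 31, 42, 38, 29, 35, 47, 33, 28, 40, 36, 99]
countries = ["Canada"] * 5 + ["USA"] * 6 + ["Cannada"]

for idx, age, z in zscore_outliers(ages):
    print(f"line {idx}: age {age} has z-score {z:.2f}")

for country, n in rare_categories(countries).items():
    print(f"country '{country}' appears only {n} time(s)")
```

With this made-up data, the age of 99 is the only value flagged by the z-score check, and “Cannada” is the only country flagged by the frequency count. In practice the cutoffs depend on the dataset, and a flagged value still needs a human decision about whether it is an error or just unusual.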

Data preparation is an umbrella term for many different activities. Data preparation is always tedious and much more time-consuming than expected. There’s nothing conceptually difficult about data preparation, but there are many steps and each step has many small details to attend to.

There’s no umbrella term for weird umbrellas. Left: Goth umbrella in outer space. Center: My dogs love to chase squirrels but I don’t know what they’d do if they saw this one. Right: Japan. That’s all you need to know.