Data Prep for Machine Learning: Encoding

I wrote an article titled “Data Prep for Machine Learning: Encoding” in the August 2020 edition of Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2020/08/12/ml-data-prep-encoding.aspx.

The article is one of a series where I walk through the entire process of programmatically preparing data for use by a deep neural network model. In artificial scenarios where there isn’t much data, it’s often possible to prepare data manually, using a text editor or spreadsheet program. But in realistic scenarios with very large data files, you have to prepare the data programmatically. This process is tricky and time-consuming.

Encoding is the process of transforming non-numeric data, such as “blue”, into a numeric form, such as (0, 1, 0, 0). There are dozens of kinds of encoding. The most common are one-hot encoding, zero-one encoding, minus-one-plus-one encoding, and ordinal encoding. Encoding is conceptually easy but tricky to implement in practice. The major challenge is knowing which type of encoding to use in a particular situation.
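To make the four common encoding types concrete, here is a minimal sketch in plain Python. It is not the code from the magazine article. The category lists (COLORS, SIZES, SEXES), the function names, and the choice of which value maps to which number are all assumptions made up for illustration.

```python
# Hypothetical category lists, chosen only for illustration.
COLORS = ["red", "blue", "green", "yellow"]   # no natural order
SIZES = ["small", "medium", "large"]          # has a natural order
SEXES = ["male", "female"]                    # exactly two values


def one_hot(value, categories):
    """One-hot: a 1 at the position of value, 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


def ordinal(value, ordered_categories):
    """Ordinal: the zero-based position of value in an ordered list."""
    return ordered_categories.index(value)


def zero_one(value, categories):
    """Zero-one: encode a two-valued variable as 0 or 1."""
    if len(categories) != 2:
        raise ValueError("zero-one encoding needs exactly two categories")
    return categories.index(value)


def minus_one_plus_one(value, categories):
    """Minus-one-plus-one: encode a two-valued variable as -1 or +1."""
    return 2 * zero_one(value, categories) - 1


if __name__ == "__main__":
    print(one_hot("blue", COLORS))              # [0, 1, 0, 0]
    print(ordinal("medium", SIZES))             # 1
    print(zero_one("female", SEXES))            # 1
    print(minus_one_plus_one("male", SEXES))    # -1
```

With the assumed COLORS list, “blue” becomes (0, 1, 0, 0), which matches the example above. Which of the two values gets 0 or -1 is arbitrary; what matters is that the same mapping is used consistently for training data and for data fed to the trained model.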

I previously published articles on dealing with missing data, dealing with outlier data, and normalizing data. I’m working on articles that cover splitting data files (typically into training and test sets, a surprisingly tricky task) and serving up batches of training data (also surprisingly tricky). When I’m done with all six articles, I’ll probably put together one mega-example that does all of the transformations from ugly raw data to beautiful ready-for-ML data.


Many movies feature the “ugly duckling turns to swan” transformation. Left: In “The Princess Diaries” (2001) actress Anne Hathaway plays dorky high school student Mia Thermopolis who turns out to be a princess of Genovia. Center: In “My Fair Lady” (1964) actress Audrey Hepburn plays uncultured street girl Eliza Doolittle who is transformed into an English lady on a bet. Right: In “Miss Congeniality” (2000) actress Sandra Bullock plays crusty FBI agent Gracie Hart who must go undercover in a beauty pageant.
