Suppose you want to predict a person’s political leaning (“conservative”, “moderate”, “liberal”) from their sex (“M”, “F”), age, region (“eastern”, “central”, “western”), and income. You decide to create a neural network model using the PyTorch code library.
Your source data looks like:
M  0.17  eastern  0.80290  moderate
F  0.49  central  0.37137  moderate
M  0.26  central  0.21369  liberal
M  0.00  western  0.69087  conservative
. . .
The age and income values have already been normalized using min-max normalization, so they are all between 0.0 and 1.0. But you still need to encode the categorical predictors sex and region, and the dependent variable political leaning. Your goal is:
1 0  0.17  1 0 0  0.80290  0
0 1  0.49  0 1 0  0.37137  0
1 0  0.26  0 1 0  0.21369  1
1 0  0.00  0 0 1  0.69087  2
. . .
Sex is encoded as “M” = (1 0), “F” = (0 1). Region is “eastern” = (1 0 0), “central” = (0 1 0), “western” = (0 0 1). Political leaning is “moderate” = 0, “liberal” = 1, “conservative” = 2. The categorical predictors have been one-hot encoded. The dependent variable has been ordinal encoded.
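To make the mapping concrete, here is a minimal sketch that applies the two encodings to the first source line. The helper names one_hot() and ordinal() are hypothetical, introduced only for this illustration; the category orderings are the ones listed above.

```python
# Minimal sketch. one_hot() and ordinal() are hypothetical
# helper names, not part of any library.

def one_hot(value, categories):
    # list of 0/1 ints with a single 1 at the index of value
    idx = categories.index(value)  # ValueError if value is unknown
    return [1 if i == idx else 0 for i in range(len(categories))]

def ordinal(value, categories):
    # integer position of value in categories
    return categories.index(value)

sexes = ["M", "F"]
regions = ["eastern", "central", "western"]
politics = ["moderate", "liberal", "conservative"]

row = ["M", "0.17", "eastern", "0.80290", "moderate"]
encoded = (one_hot(row[0], sexes) + [row[1]] +
           one_hot(row[2], regions) + [row[3]] +
           [ordinal(row[4], politics)])
print(" ".join(str(x) for x in encoded))  # 1 0 0.17 1 0 0 0.80290 0
```

The numeric fields are passed through as strings, so their original formatting (for example the trailing zero in 0.80290) is preserved.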
When the number of data items is relatively small, you can drop the data into Excel and manually encode the categorical data. But in many situations it’s a better idea to use a script to programmatically encode the data.
It’s possible to write a set of very general data encoding functions, but I prefer to write custom functions that are specific to a particular dataset. In pseudo-code:
loop each src data line
  break line into separate fields
  if field is numeric, write it to dest
  if field is categorical, write encoding
end-loop
A snippet looks like:
. . .
for line in fin:
    line = line.rstrip()         # remove trailing newline
    tokens = line.split(delim)   # break line to tokens
    # 0. sex
    if tokens[0] == "M": fout.write("1" + delim + "0")
    elif tokens[0] == "F": fout.write("0" + delim + "1")
    else: raise Exception("unexpected sex in column 0")
    fout.write(delim)
    # 1. age (normalized)
    fout.write(tokens[1]); fout.write(delim)
. . .
This if-elif approach works well as long as the categorical variables have about 10 or fewer possible values. But suppose you have a variable like “state” which can be “alabama”, “alaska”, . . “wyoming”. Coding an if-elif with 50 branches is possible but tedious and error-prone. In such cases you can write a function that accepts a string value and a Dictionary object, and returns an encoded string. Here’s an example:
def encoded_region(region, region_dict, delim):
    # ex: "central" returns "0 1 0"
    n = len(region_dict)     # num possible values
    v = region_dict[region]  # 0 or 1 or . .
    s = ""
    for i in range(n):
        if i == v: s += "1"
        else: s += "0"
        if i != n-1: s += delim
    return s
This requires you to construct a Dictionary object. You could do so manually:
rd = dict()
rd["eastern"] = 0
rd["central"] = 1
rd["western"] = 2
Or you could write code that scans the source data file and programmatically constructs a Dictionary for predictor variables that have a lot of possible values.
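One possible way to do the scan is sketched below. The make_dict() function is a hypothetical helper, not code from the demo program, and it assumes the same tab-delimited layout and "#" comment lines as the source file. An in-memory io.StringIO stands in for the real file so the example runs on its own, and encoded_region() is repeated so the block is self-contained.

```python
# Minimal sketch. make_dict() is a hypothetical helper name.
import io

def make_dict(lines, col, delim="\t", comment="#"):
    # map each distinct value in column col to 0, 1, 2, . .
    # in order of first appearance
    d = dict()
    for line in lines:
        if line.startswith(comment): continue
        line = line.rstrip()
        if line == "": continue  # skip blank lines
        v = line.split(delim)[col]
        if v not in d:
            d[v] = len(d)
    return d

def encoded_region(region, region_dict, delim):
    # ex: "central" returns "0 1 0"
    n = len(region_dict)     # num possible values
    v = region_dict[region]  # 0 or 1 or . .
    s = ""
    for i in range(n):
        if i == v: s += "1"
        else: s += "0"
        if i != n-1: s += delim
    return s

# stand-in for open(".\\people_norm.txt", "r")
src = io.StringIO(
    "# sex, age, region, income, political\n"
    "M\t0.17\teastern\t0.80290\tmoderate\n"
    "F\t0.49\tcentral\t0.37137\tmoderate\n"
    "M\t0.26\tcentral\t0.21369\tliberal\n"
    "M\t0.00\twestern\t0.69087\tconservative\n")

rd = make_dict(src, 2)  # region is column 2
print(rd)  # {'eastern': 0, 'central': 1, 'western': 2}
print(encoded_region("central", rd, " "))  # 0 1 0
```

Because indexes are assigned in order of first appearance, the encoding depends on the order of the lines in the file. If you need the same encoding across different files, one option is to collect the distinct values first and assign indexes after sorting them.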
The moral of the story is: when working with machine learning, data preparation is not conceptually difficult, but there are a lot of steps.
I love the Tintin books. The cover of “Tintin in Tibet” shows the imprints of steps taken by a yeti creature. And I found three interesting photographs of elaborate costumes from a search for “steps in Tibet”.
# file_encode.py
# Python 3.7.6

def encode(src, dest):
    comment = "#"
    delim = "\t"
    fin = open(src, "r")
    fout = open(dest, "w")
    for line in fin:
        if line.startswith(comment): continue
        line = line.rstrip()
        tokens = line.split(delim)

        # 0. sex
        if tokens[0] == "M": fout.write("1" + delim + "0")
        elif tokens[0] == "F": fout.write("0" + delim + "1")
        else: raise Exception("unexpected sex in column 0")
        fout.write(delim)

        # 1. age (normalized)
        fout.write(tokens[1]); fout.write(delim)

        # 2. region
        if tokens[2] == "eastern":
            fout.write("1" + delim + "0" + delim + "0")
        elif tokens[2] == "central":
            fout.write("0" + delim + "1" + delim + "0")
        elif tokens[2] == "western":
            fout.write("0" + delim + "0" + delim + "1")
        else: raise Exception("unexpected region in column 2")
        fout.write(delim)

        # 3. income (normalized)
        fout.write(tokens[3]); fout.write(delim)

        # 4. political (dependent variable)
        if tokens[4] == "moderate": fout.write("0")
        elif tokens[4] == "liberal": fout.write("1")
        elif tokens[4] == "conservative": fout.write("2")
        else: raise Exception("unexpected politic in column 4")
        # last column so no delim

        # 5. end-line
        fout.write("\n")
    fout.close()
    fin.close()

# ==================

src = ".\\people_norm.txt"  # cleaned and normalized
dest = ".\\people_encoded.txt"
print("\nBegin encoding " + src)
encode(src, dest)
print("\nEnd encoding")

