Programmatically Encoding PyTorch Training Data

Suppose you want to predict a person’s political leaning (“conservative”, “moderate”, “liberal”) from their sex (“M”, “F”), age, region (“eastern”, “central”, “western”), and income. You decide to create a neural network model using the PyTorch code library.

Your source data looks like:

M   0.17   eastern   0.80290   moderate   
F   0.49   central   0.37137   moderate   
M   0.26   central   0.21369   liberal   
M   0.00   western   0.69087   conservative
. . . 

The age and income values have already been normalized using min-max normalization so that all values are between 0.0 and 1.0, but you still need to encode the predictor variables sex and region, and the dependent variable, political. Your goal is:

1   0   0.17   1   0   0   0.80290   0
0   1   0.49   0   1   0   0.37137   0
1   0   0.26   0   1   0   0.21369   1
1   0   0.00   0   0   1   0.69087   2
. . .

Sex is encoded as “M” = (1 0), “F” = (0 1). Region is encoded as “eastern” = (1 0 0), “central” = (0 1 0), “western” = (0 0 1). Political is encoded as “moderate” = 0, “liberal” = 1, “conservative” = 2. In other words, the categorical predictors have been one-hot encoded and the dependent variable has been ordinal encoded.
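These mappings can be expressed as Python dictionaries. Here is a minimal sketch that encodes the first source line (the dictionary and variable names are just for illustration):

```python
# the three encoding mappings described above
sex_map = {"M": [1, 0], "F": [0, 1]}            # one-hot
region_map = {"eastern": [1, 0, 0],
              "central": [0, 1, 0],
              "western": [0, 0, 1]}             # one-hot
political_map = {"moderate": 0,
                 "liberal": 1,
                 "conservative": 2}             # ordinal

# encode the first source line: M  0.17  eastern  0.80290  moderate
row = ["M", "0.17", "eastern", "0.80290", "moderate"]
encoded = sex_map[row[0]] + [float(row[1])] + \
  region_map[row[2]] + [float(row[3])] + [political_map[row[4]]]
print(encoded)  # [1, 0, 0.17, 1, 0, 0, 0.8029, 0]
```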



When the number of data items is relatively small, you can drop the data into Excel and manually encode the categorical data. But in many situations it’s a better idea to use a script to programmatically encode the data.

It’s possible to write a set of very general data encoding functions, but I prefer to write custom functions that are specific to a particular dataset. In pseudo-code:

loop each src data line
  break line into separate fields
  if field is numeric, write it to dest
  if field is categorical, write encoding
end-loop

A snippet looks like:

. . .
for line in fin:
  line = line.rstrip()  # remove trailing newline
  tokens = line.split(delim)  # break line to tokens

  # 0. sex
  if tokens[0] == "M": fout.write("1" + delim + "0")
  elif tokens[0] == "F": fout.write("0" + delim + "1")
  else: raise Exception("unexpected sex in column 0")
  fout.write(delim)

  # 1. age (normalized)
  fout.write(tokens[1]); fout.write(delim)
. . .

This if-elif approach works well as long as the categorical variables have roughly 10 or fewer possible values. But suppose you have a variable like “state” which can be “alabama”, “alaska”, . . “wyoming”. Coding an if-elif statement with 50 branches is possible but tedious and error-prone. In such cases you can write a function that accepts a string value and a Dictionary object, and returns an encoded string. Here’s an example:

def encoded_region(region, region_dict, delim):
  # ex: "central" returns "0  1  0"
  n = len(region_dict)   # num possible values
  v = region_dict[region]  # 0 or 1 or . .
  s = ""
  for i in range(n):
    if i == v: s += "1"
    else: s += "0"
    if i != n-1: s += delim
  return s
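For example, calling encoded_region() with a three-entry dictionary and a tab delimiter produces a one-hot string (the function definition is repeated here so the snippet runs on its own):

```python
def encoded_region(region, region_dict, delim):
  # ex: "central" returns "0 [delim] 1 [delim] 0"
  n = len(region_dict)     # num possible values
  v = region_dict[region]  # 0 or 1 or . .
  s = ""
  for i in range(n):
    if i == v: s += "1"
    else: s += "0"
    if i != n-1: s += delim
  return s

rd = {"eastern": 0, "central": 1, "western": 2}
print(encoded_region("central", rd, "\t"))  # tab-separated "0  1  0"
```

The same function works unchanged for a 50-entry state dictionary, which is the whole point of the dictionary-based approach.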

This requires you to construct a Dictionary object. You could do so manually:

rd = dict()
rd["eastern"] = 0
rd["central"] = 1
rd["western"] = 2

Or you could write code that scans the source data file and programmatically constructs a Dictionary for predictor variables that have a lot of possible values.
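One way to do that is to collect the distinct values in a column, assigning each value an index in order of first appearance. A sketch, using a hypothetical helper name make_value_dict() that accepts any iterable of lines (so you can pass it an open file object):

```python
def make_value_dict(lines, col, delim="\t", comment="#"):
  # map each distinct value in column col to an integer
  # index, in order of first appearance
  d = dict()
  for line in lines:
    if line.startswith(comment): continue
    tokens = line.rstrip().split(delim)
    val = tokens[col]
    if val not in d:
      d[val] = len(d)
  return d

# demo on a few source-style lines; for a file,
# pass make_value_dict(open(src, "r"), col) instead
lines = ["M\t0.17\teastern\t0.80290\tmoderate",
         "F\t0.49\tcentral\t0.37137\tmoderate",
         "M\t0.00\twestern\t0.69087\tconservative"]
rd = make_value_dict(lines, 2)
print(rd)  # {'eastern': 0, 'central': 1, 'western': 2}
```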

The moral of the story is: when working with machine learning, data preparation is not conceptually difficult, but there are a lot of steps.


I love the Tintin books. The cover of “Tintin in Tibet” shows the imprints of steps taken by a yeti creature. And I found three interesting photographs of elaborate costumes from a search for “steps in Tibet”.


# file_encode.py
# Python 3.7.6

def encode(src, dest):
  comment = "#"
  delim = "\t"
  fin = open(src, "r")
  fout = open(dest, "w")
  
  for line in fin:
    if line.startswith(comment): continue
    line = line.rstrip()
    tokens = line.split(delim)

    # 0. sex
    if tokens[0] == "M": fout.write("1" + delim + "0")
    elif tokens[0] == "F": fout.write("0" + delim + "1")
    else: raise Exception("unexpected sex in column 0")
    fout.write(delim)

    # 1. age (normalized)
    fout.write(tokens[1]); fout.write(delim)

    # 2. region
    if tokens[2] == "eastern":
      fout.write("1" + delim + "0" + delim + "0")
    elif tokens[2] == "central":
      fout.write("0" + delim + "1" + delim + "0")
    elif tokens[2] == "western":
      fout.write("0" + delim + "0" + delim + "1")
    else: raise Exception("unexpected region in column 2")
    fout.write(delim)

    # 3. income (normalized)
    fout.write(tokens[3]); fout.write(delim) 

    # 4. political (dependent variable)
    if tokens[4] == "moderate": fout.write("0")
    elif tokens[4] == "liberal": fout.write("1")
    elif tokens[4] == "conservative": fout.write("2")
    else: raise Exception("unexpected political in column 4")
    # last column so no delim

    # 5. end-line
    fout.write("\n")

  fout.close()
  fin.close()

# ==================

src = ".\\people_norm.txt"    # cleaned and normalized
dest = ".\\people_encoded.txt"
print("\nBegin encoding " + src)
encode(src, dest)
print("\nEnd encoding")