Converting String Data to Integers aka Ordinal Encoding

I was working with the scikit naive Bayes classifier. Naive Bayes is best used for categorical / string / text data such as a file that looks like:

actuary   green   korea   F
barista   green   italy   M
dentist   hazel   japan   M
dentist   green   italy   F
chemist   hazel   japan   M
. . .

The goal is to predict sex (F or M) from job, eye color, and country. In most cases the raw string data should be converted/encoded to integers like:

0   0   2   0
1   0   0   1
3   1   1   1
3   0   0   0
2   1   1   1
. . .

where (actuary=0, barista=1, chemist=2, dentist=3); (green=0, hazel=1); (italy=0, japan=1, korea=2); (female=0, male=1). In many scenarios, you can convert the string data to integer data manually, for example by dropping the string data into an Excel spreadsheet and then doing find-replace operations.
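
To make the mapping concrete, here is a minimal sketch that applies it with hand-written lookup tables, one dictionary per column. The names maps, rows and encoded are only for illustration, and the five rows are the sample lines shown above rather than the full data file.

```python
# hand-written lookup tables, one per column (illustrative names)
maps = [
  {'actuary': 0, 'barista': 1, 'chemist': 2, 'dentist': 3},
  {'green': 0, 'hazel': 1},
  {'italy': 0, 'japan': 1, 'korea': 2},
  {'F': 0, 'M': 1},
]

# the five sample rows shown above
rows = [
  ['actuary', 'green', 'korea', 'F'],
  ['barista', 'green', 'italy', 'M'],
  ['dentist', 'hazel', 'japan', 'M'],
  ['dentist', 'green', 'italy', 'F'],
  ['chemist', 'hazel', 'japan', 'M'],
]

# look up each string in the table for its column
encoded = [[maps[j][s] for j, s in enumerate(row)] for row in rows]
for row in encoded:
  print(row)
```

The printed rows match the integer table above. A string missing from its dictionary raises a KeyError here, which makes typos in the data easy to spot.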

Instead of manually converting strings to integers, it is possible to do the conversion/encoding programmatically. The scikit library has an OrdinalEncoder class that can do this. For example:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder 

train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)

enc = OrdinalEncoder(dtype=np.int64)  # integer codes, not the default float64
enc.fit(raw)  # scan data
print("\nCategories: ")
print(enc.categories_)  # show what encoding will do
encoded = enc.transform(raw)  # encode the data

X = encoded[:,0:3]
y = encoded[:,3]
# etc.

I thought that OrdinalEncoder was 1.) somewhat overkill for such a simple problem, 2.) an extra dependency, and most importantly 3.) awkward when I wanted to customize the string-to-integer mapping. So I implemented a lightweight ordinal encode function from scratch that has only about a dozen lines of code:

def ordinal_encode(data, col_values):
  # data is an np string matrix (from genfromtxt)
  # col_values is a list of lists, per column
  (nr, nc) = data.shape
  result = np.zeros((nr,nc), dtype=np.int64)
  for j in range(nc):      # each col
    vals = col_values[j]   # the strings in this col
    for i in range(nr):
      s = data[i][j]
      for k in range(len(vals)):
        if s == vals[k]:
          result[i][j] = k
          break
  return result

print("\nOrdinal encoding from scratch demo ")
print("\nReading data to memory with genfromtxt() ")
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)

print("\nRaw data: ")
print(raw)

encoded = ordinal_encode(raw,
 [['actuary','barista','chemist','dentist'],
  ['green','hazel'],
  ['italy','japan','korea'],
  ['F','M']])

print("\nEncoded data: ")
print(encoded)
# now use for naive Bayes

My lightweight ordinal_encode() function accepts a list of lists of unique values in each column. The order in which the string values are listed determines their integer encoding. So in the example, “green” = 0, “hazel” = 1.
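
To see how the listing order drives the encoding, here is a small self-contained sketch. It repeats the logic of ordinal_encode() and uses an in-memory array holding the five sample rows instead of the data file. The variable names jobs, countries, a and b are illustrative only.

```python
import numpy as np

def ordinal_encode(data, col_values):
  # same logic as the function above:
  # code = index of each string in the list for its column
  (nr, nc) = data.shape
  result = np.zeros((nr, nc), dtype=np.int64)
  for j in range(nc):
    vals = col_values[j]
    for i in range(nr):
      for k in range(len(vals)):
        if data[i][j] == vals[k]:
          result[i][j] = k
          break
  return result

# stand-in for the file: the five sample rows
raw = np.array([
  ['actuary', 'green', 'korea', 'F'],
  ['barista', 'green', 'italy', 'M'],
  ['dentist', 'hazel', 'japan', 'M'],
  ['dentist', 'green', 'italy', 'F'],
  ['chemist', 'hazel', 'japan', 'M']], dtype=str)

jobs = ['actuary', 'barista', 'chemist', 'dentist']
countries = ['italy', 'japan', 'korea']

# eye colors listed as green, hazel: green = 0, hazel = 1
a = ordinal_encode(raw, [jobs, ['green', 'hazel'], countries, ['F', 'M']])
# eye colors listed as hazel, green: hazel = 0, green = 1
b = ordinal_encode(raw, [jobs, ['hazel', 'green'], countries, ['F', 'M']])

print(a[:, 1])  # eye color codes with the first ordering
print(b[:, 1])  # eye color codes with the flipped ordering
```

One thing to watch: because the result matrix starts as all zeros, a string that is missing from its column list is silently encoded as 0, so it is worth checking the lists against the data.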

Instead of manually specifying the values, it’s possible to programmatically find the values:

def get_col_values(data):
  # data is an np string matrix (from genfromtxt)
  # return is a list of lists, per column
  (nr, nc) = data.shape
  result = []  # list of lists
  for j in range(nc):
    vals = []  # list of unique values in col
    for i in range(nr):
      s = data[i][j]
      if s not in vals:
        vals.append(s)
    vals.sort()
    result.append(vals)
  return result

train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)

col_vals = get_col_values(raw)  # adjust order if needed
print("\nColumn values: ")
print(col_vals)

encoded = ordinal_encode(raw, col_vals)

print("\nEncoded data: ")
print(encoded)

# etc., etc.

The get_col_values() function sorts the values in each column, so the resulting encoding follows alphabetical order, but it’s easy to customize that behavior.
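
As one example of such a customization, here is a sketch of a variant that keeps the values in order of first appearance instead of sorting them. The function name get_col_values_first_seen is hypothetical, and the in-memory array stands in for the data file.

```python
import numpy as np

def get_col_values_first_seen(data):
  # like get_col_values(), but without the sort step,
  # so values keep the order in which they first appear
  (nr, nc) = data.shape
  result = []
  for j in range(nc):
    vals = []
    for i in range(nr):
      s = str(data[i][j])  # plain Python str for cleaner printing
      if s not in vals:
        vals.append(s)
    result.append(vals)
  return result

# stand-in for the file: the five sample rows
raw = np.array([
  ['actuary', 'green', 'korea', 'F'],
  ['barista', 'green', 'italy', 'M'],
  ['dentist', 'hazel', 'japan', 'M'],
  ['dentist', 'green', 'italy', 'F'],
  ['chemist', 'hazel', 'japan', 'M']], dtype=str)

col_vals = get_col_values_first_seen(raw)
print(col_vals)
```

With these five rows, korea comes before italy and dentist before chemist, so their integer codes differ from the alphabetical version.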

The moral of this post is that sometimes using built-in library code like the scikit OrdinalEncoder class is a good thing, but sometimes writing custom code like the ordinal_encode() function is better.

Transforming string values to integers isn’t very difficult. Transforming a submarine into an airplane isn’t so easy. Here are three ideas that never became reality.

Left: The Convair submersible seaplane was a U.S. Navy design project from the early 1960s. Right: In the early 1930s, Soviet engineering student Boris Ushakov proposed a flying submarine design.

Demo code:

# experiments.py

import numpy as np
from sklearn.naive_bayes import CategoricalNB
# from sklearn.preprocessing import OrdinalEncoder

def get_col_values(data):
  # data is an np string matrix (from genfromtxt)
  # return is a list of lists, per column
  (nr, nc) = data.shape
  result = []  # list of lists
  for j in range(nc):
    vals = []  # list of unique values in col
    for i in range(nr):
      s = data[i][j]
      if s not in vals:
        vals.append(s)
    vals.sort()
    result.append(vals)
  return result

def ordinal_encode(data, col_values):
  # data is an np string matrix (from genfromtxt)
  # col_values is a list of lists, per column
  (nr, nc) = data.shape
  result = np.zeros((nr,nc), dtype=np.int64)
  for j in range(nc):      # each col
    vals = col_values[j]   # the strings in this col
    for i in range(nr):
      s = data[i][j]
      for k in range(len(vals)):
        if s == vals[k]:
          result[i][j] = k
          break
  return result

print("\nOrdinal encoding from scratch demo ")
print("\nReading data to memory with genfromtxt() ")
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)

print("\nRaw data: ")
print(raw)

col_vals = get_col_values(raw)
print("\nColumn values: ")
print(col_vals)

# encoded = ordinal_encode(raw,
#  [['actuary','barista','chemist','dentist'],
#   ['green','hazel'],
#   ['italy','japan','korea'],
#   ['F','M']])

encoded = ordinal_encode(raw, col_vals)

print("\nEncoded data: ")
print(encoded)

X = encoded[:,0:3]
y = encoded[:,3]

print("\nBegin naive Bayes ")

# # using built-in OrdinalEncoder
# print("\nEncoding data: ")
# enc = OrdinalEncoder(dtype=np.int64)
# enc.fit(raw)  # scan data
# print("\nCategories: ")
# print(enc.categories_)
# encoded = enc.transform(raw)
# X = encoded[:,0:3]
# y = encoded[:,3]

print("\nCreating naive Bayes classifier ")
model = CategoricalNB(alpha=1)
model.fit(X, y)
print("Done ")
pred_classes = model.predict(X)

print("\nPredicted classes: ")
print(pred_classes)

acc_train = model.score(X, y)
print("\nAccuracy on train data = %0.4f " % acc_train)

# use model
# dentist, hazel, italy = [3,1,0]
print("\nPredicting class for dentist, hazel, italy ")
probs = model.predict_proba([[3,1,0]])
print("\nPrediction probs: ")
print(probs)

predicted = model.predict([[3,1,0]])
print("\nPredicted class: ")
print(predicted)

print("\nEnd demo ")
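
The demo hard-codes [3,1,0] for (dentist, hazel, italy). As a sketch, the same column value lists can do that lookup, and can also map a predicted class back to its string. The helper names encode_item and decode_value are hypothetical, and col_vals is typed in by hand here to match the alphabetical lists used in the demo.

```python
# column value lists, same order as in the demo
col_vals = [
  ['actuary', 'barista', 'chemist', 'dentist'],
  ['green', 'hazel'],
  ['italy', 'japan', 'korea'],
  ['F', 'M']]

def encode_item(strings, col_values):
  # strings holds one value per predictor column, in column order
  # list.index raises ValueError for an unknown string
  return [col_values[j].index(s) for j, s in enumerate(strings)]

def decode_value(code, col_values, col):
  # map an integer code in column col back to its string
  return col_values[col][int(code)]

x = encode_item(['dentist', 'hazel', 'italy'], col_vals)
print(x)  # [3, 1, 0]

# a predicted class of 1 in the sex column (index 3) means M
print(decode_value(1, col_vals, 3))  # M
```

The list x can be passed to the model as [x], in place of the hand-typed [[3,1,0]].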