I was working with the scikit naive Bayes classifier. Naive Bayes is best used for categorical / string / text data such as a file that looks like:
actuary  green  korea  F
barista  green  italy  M
dentist  hazel  japan  M
dentist  green  italy  F
chemist  hazel  japan  M
. . .
The goal is to predict sex (F or M) from job, eye color, and country. In most cases the raw string data should be converted/encoded to integers like:
0  0  2  0
1  0  0  1
3  1  1  1
3  0  0  0
2  1  1  1
. . .
where (actuary=0, barista=1, chemist=2, dentist=3); (green=0, hazel=1); (italy=0, japan=1, korea=2); (F=0, M=1). In many scenarios, you can convert the string data to integer data manually, for example by dropping the string data into an Excel spreadsheet and then doing find-replace operations.
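The mapping just described can be written down as one lookup table per column. The following is a minimal sketch in plain Python; the names `maps`, `rows`, and `encoded` are illustrative and are not part of the demo program.

```python
# One dictionary per column: string value -> integer code.
# These are the same mappings listed in the text.
maps = [
  {'actuary': 0, 'barista': 1, 'chemist': 2, 'dentist': 3},  # job
  {'green': 0, 'hazel': 1},                                  # eye color
  {'italy': 0, 'japan': 1, 'korea': 2},                      # country
  {'F': 0, 'M': 1},                                          # sex
]

# The five sample rows from the raw data file.
rows = [
  ['actuary', 'green', 'korea', 'F'],
  ['barista', 'green', 'italy', 'M'],
  ['dentist', 'hazel', 'japan', 'M'],
  ['dentist', 'green', 'italy', 'F'],
  ['chemist', 'hazel', 'japan', 'M'],
]

# Look up each string in the dictionary for its column.
encoded = [[maps[j][s] for j, s in enumerate(row)] for row in rows]
for row in encoded:
  print(row)
```

The printed rows should match the integer data shown in the text, which is a quick way to check a hand-built mapping.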
Instead of manually converting strings to integers, it is possible to do the conversion/encoding programmatically. The scikit library has an OrdinalEncoder class that can do this. For example:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)
enc = OrdinalEncoder(dtype=np.int64)  # create the encoder
enc.fit(raw)  # scan data
print("\nCategories: ")
print(enc.categories_)  # show what encoding will do
encoded = enc.transform(raw)  # encode the data
X = encoded[:,0:3]
y = encoded[:,3]
# etc.
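Conceptually, with its default settings the encoder sorts the unique values in each column and replaces each string with its position in that sorted list. A rough NumPy equivalent is sketched below. It is not the library's actual implementation, and it uses five in-memory sample rows in place of the data file.

```python
import numpy as np

# Five sample rows standing in for the data file.
raw = np.array([
  ['actuary', 'green', 'korea', 'F'],
  ['barista', 'green', 'italy', 'M'],
  ['dentist', 'hazel', 'japan', 'M'],
  ['dentist', 'green', 'italy', 'F'],
  ['chemist', 'hazel', 'japan', 'M'],
], dtype=str)

categories = []  # sorted unique strings, per column
encoded = np.zeros(raw.shape, dtype=np.int64)
for j in range(raw.shape[1]):
  # uniq: sorted unique strings in column j
  # inv: for each row, the index of its string within uniq
  uniq, inv = np.unique(raw[:,j], return_inverse=True)
  categories.append(uniq.tolist())
  encoded[:,j] = inv

print(categories)
print(encoded)
```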
I thought that OrdinalEncoder 1.) was somewhat overkill for such a simple problem, 2.) introduces a dependency, and, most importantly, 3.) couldn’t easily handle customization of the string-to-integer mapping. So I implemented a lightweight ordinal encode function from scratch that has only about a dozen lines of code:
def ordinal_encode(data, col_values):
  # data is an np string matrix (from genfromtxt)
  # col_values is a list of lists, per column
  # note: a string not found in its column's list stays encoded as 0
  (nr, nc) = data.shape
  result = np.zeros((nr,nc), dtype=np.int64)
  for j in range(nc):  # each col
    vals = col_values[j]  # the strings in this col
    for i in range(nr):
      s = data[i][j]
      for k in range(len(vals)):
        if s == vals[k]:
          result[i][j] = k
          break
  return result
print("\nOrdinal encoding from scratch demo ")
print("\nReading data to memory with genfromtxt() ")
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)
print("\nRaw data: ")
print(raw)
col_vals = [['actuary','barista','chemist','dentist'],
  ['green','hazel'],
  ['italy','japan','korea'],
  ['F','M']]
encoded = ordinal_encode(raw, col_vals)
print("\nEncoded data: ")
print(encoded)
# now use for naive Bayes
My lightweight ordinal_encode() function accepts a list of lists of unique values in each column. The order in which the string values are listed determines their integer encoding. So in the example, “green” = 0, “hazel” = 1.
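One property of this design is worth noting: because the result matrix starts as all zeros, a string that does not appear in its column's list is silently encoded as 0. A stricter variant can be sketched with one dictionary per column. The function name `ordinal_encode_strict` is hypothetical and not part of the demo, and this version takes a plain list of rows rather than a NumPy matrix.

```python
def ordinal_encode_strict(rows, col_values):
  # rows is a list of rows, each a list of strings
  # col_values is a list of lists, per column (list order sets the codes)
  lookups = [{v: k for k, v in enumerate(vals)} for vals in col_values]
  result = []
  for i, row in enumerate(rows):
    codes = []
    for j, s in enumerate(row):
      if s not in lookups[j]:
        # fail loudly instead of quietly encoding as 0
        raise ValueError("row %d col %d: unknown value '%s'" % (i, j, s))
      codes.append(lookups[j][s])
    result.append(codes)
  return result

# Custom order: listing hazel first makes hazel=0, green=1.
col_vals = [['hazel', 'green'], ['F', 'M']]
print(ordinal_encode_strict([['green', 'F'], ['hazel', 'M']], col_vals))

try:
  ordinal_encode_strict([['blue', 'F']], col_vals)
except ValueError as e:
  print(e)
```

The dictionary lookup also avoids the inner search loop over the value list, which may matter for columns with many distinct values.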
Instead of manually specifying the values, it’s possible to programmatically find the values:
def get_col_values(data):
  # data is an np string matrix (from genfromtxt)
  # return is a list of lists, per column
  (nr, nc) = data.shape
  result = []  # list of lists
  for j in range(nc):
    vals = []  # list of unique values in col
    for i in range(nr):
      s = data[i][j]
      if s not in vals:
        vals.append(s)
    vals.sort()
    result.append(vals)
  return result
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)
col_vals = get_col_values(raw) # adjust order if needed
print("\nColumn values: ")
print(col_vals)
encoded = ordinal_encode(raw, col_vals)
print("\nEncoded data: ")
print(encoded)
# etc., etc.
The get_col_values() function sorts the values in each column alphabetically, so the integer codes follow alphabetical order, but it’s easy to customize that behavior.
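As one illustration of such a customization, the values could be ordered by how often they occur instead of alphabetically. This is only a sketch: `get_col_values_by_freq` is a hypothetical name, and it takes a plain list of rows rather than a NumPy matrix.

```python
from collections import Counter

def get_col_values_by_freq(rows):
  # rows is a list of rows, each a list of strings
  # return is a list of lists, per column, most frequent value first;
  # ties are broken alphabetically so the result is deterministic
  nc = len(rows[0])
  result = []
  for j in range(nc):
    counts = Counter(row[j] for row in rows)
    vals = sorted(counts, key=lambda v: (-counts[v], v))
    result.append(vals)
  return result

rows = [
  ['actuary', 'green', 'korea', 'F'],
  ['barista', 'green', 'italy', 'M'],
  ['dentist', 'hazel', 'japan', 'M'],
  ['dentist', 'green', 'italy', 'F'],
  ['chemist', 'hazel', 'japan', 'M'],
]
print(get_col_values_by_freq(rows))
```

With these five rows, "dentist" and "M" occur most often in their columns, so each would receive code 0.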
The moral of this post is that sometimes using built-in library code like the scikit OrdinalEncoder class is a good thing, but sometimes writing custom code like the ordinal_encode() function is better.

Transforming string values to integers isn’t very difficult. Transforming a submarine into an airplane isn’t so easy. Here are three ideas that never became reality.
Left: The Convair submersible seaplane was a U.S. Navy design project from the early 1960s. Right: In the early 1930s, Soviet engineering student Boris Ushakov proposed a flying submarine design.
Demo code:
# experiments.py
import numpy as np
from sklearn.naive_bayes import CategoricalNB
# from sklearn.preprocessing import OrdinalEncoder
def get_col_values(data):
  # data is an np string matrix (from genfromtxt)
  # return is a list of lists, per column
  (nr, nc) = data.shape
  result = []  # list of lists
  for j in range(nc):
    vals = []  # list of unique values in col
    for i in range(nr):
      s = data[i][j]
      if s not in vals:
        vals.append(s)
    vals.sort()
    result.append(vals)
  return result
def ordinal_encode(data, col_values):
  # data is an np string matrix (from genfromtxt)
  # col_values is a list of lists, per column
  # note: a string not found in its column's list stays encoded as 0
  (nr, nc) = data.shape
  result = np.zeros((nr,nc), dtype=np.int64)
  for j in range(nc):  # each col
    vals = col_values[j]  # the strings in this col
    for i in range(nr):
      s = data[i][j]
      for k in range(len(vals)):
        if s == vals[k]:
          result[i][j] = k
          break
  return result
print("\nOrdinal encoding from scratch demo ")
print("\nReading data to memory with genfromtxt() ")
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)
print("\nRaw data: ")
print(raw)
col_vals = get_col_values(raw)
print("\nColumn values: ")
print(col_vals)
# encoded = ordinal_encode(raw,
# [['actuary','barista','chemist','dentist'],
# ['green','hazel'],
# ['italy','japan','korea'],
# ['F','M']])
encoded = ordinal_encode(raw, col_vals)
print("\nEncoded data: ")
print(encoded)
X = encoded[:,0:3]
y = encoded[:,3]
print("\nBegin naive Bayes ")
# # using built-in OrdinalEncoder
# print("\nEncoding data: ")
# enc = OrdinalEncoder(dtype=np.int64)
# enc.fit(raw) # scan data
# print("\nCategories: ")
# print(enc.categories_)
# encoded = enc.transform(raw)
# X = encoded[:,0:3]
# y = encoded[:,3]
print("\nCreating naive Bayes classifier ")
model = CategoricalNB(alpha=1)
model.fit(X, y)
print("Done ")
pred_classes = model.predict(X)
print("\nPredicted classes: ")
print(pred_classes)
acc_train = model.score(X, y)
print("\nAccuracy on train data = %0.4f " % acc_train)
# use model
# dentist, hazel, italy = [3,1,0]
print("\nPredicting class for dentist, hazel, italy ")
probs = model.predict_proba([[3,1,0]])
print("\nPrediction probs: ")
print(probs)
predicted = model.predict([[3,1,0]])
print("\nPredicted class: ")
print(predicted)
print("\nEnd demo ")
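Because col_vals stores each column's strings in code order, the same lists can translate in both directions: from strings to the integer input for predict(), and from a predicted class back to a label. A minimal sketch follows. `encode_item` and `decode_label` are hypothetical helpers that are not part of the demo, and the predicted value 1 is a made-up placeholder, not actual model output.

```python
col_vals = [['actuary','barista','chemist','dentist'],
  ['green','hazel'],
  ['italy','japan','korea'],
  ['F','M']]

def encode_item(strings, col_values):
  # strings holds one value per predictor column, in column order;
  # list.index() raises ValueError if a value is not listed
  return [col_values[j].index(s) for j, s in enumerate(strings)]

def decode_label(code, col_values):
  # the class label is assumed to be the last column
  return col_values[-1][int(code)]

x = encode_item(['dentist', 'hazel', 'italy'], col_vals)
print(x)  # [3, 1, 0]

placeholder_pred = 1  # stand-in for model.predict([x])[0]
print(decode_label(placeholder_pred, col_vals))  # M
```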
