The Difference Between the scikit Categorical Naive Bayes and Bernoulli Naive Bayes Classifiers

One of my work colleagues asked me to explain the difference between the scikit CategoricalNB and BernoulliNB classes. Briefly, BernoulliNB is just a special case of CategoricalNB. In BernoulliNB, all predictors are either 0 or 1. In CategoricalNB, predictors can have two or more values, encoded as integers 0, 1, 2, and so on. Put another way, there’s really no need for BernoulliNB because you can always use CategoricalNB.

The Voting Dataset problem has all Boolean/binary predictor values. Left: Using BernoulliNB. Right: Using CategoricalNB. The results are identical.

Suppose you want to predict a person’s job type from sex, age, and height, and the raw data looks like:

sex  age  height  job-type
--------------------------
M    25   70      technical
F    53   63      sales
M    42   72      management
. . .

Suppose you encode the data as:

sex: female = 0, male = 1
age: young (29 or less) = 0, old (30 or more) = 1
height: short (65″ or less) = 0, tall (66″ or more) = 1
job: management = 0, sales = 1, technical = 2

The encoded data is:

sex  age  height  job-type
--------------------------
 1    0    1        2
 0    1    0        1
 1    1    1        0
. . .
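
As a minimal sketch of that encoding step, the plain-Python code below converts the three raw rows into the encoded rows. The helper name encode_row and the JOB_CODES lookup are hypothetical, introduced only for illustration.

```python
# Hypothetical helper that applies the encoding rules listed above.
JOB_CODES = {"management": 0, "sales": 1, "technical": 2}

def encode_row(sex, age, height, job):
    """Return [sex, age, height, job] as integer codes."""
    return [
        1 if sex == "M" else 0,      # female = 0, male = 1
        1 if age >= 30 else 0,       # 29 or less = 0, 30 or more = 1
        1 if height >= 66 else 0,    # 65 inches or less = 0, 66 or more = 1
        JOB_CODES[job],              # management = 0, sales = 1, technical = 2
    ]

raw_rows = [
    ("M", 25, 70, "technical"),
    ("F", 53, 63, "sales"),
    ("M", 42, 72, "management"),
]

encoded_rows = [encode_row(*row) for row in raw_rows]
for row in encoded_rows:
    print(row)
```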

Because all the predictors are Boolean, you can use BernoulliNB like this:

from sklearn.naive_bayes import BernoulliNB
print("Creating Bernoulli naive Bayes classifier")
# alpha=1 applies add-one (Laplace) smoothing
model = BernoulliNB(alpha=1)
. . .

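But you could also use CategoricalNB and you’d get the same results, provided both values of each predictor appear in the training data (CategoricalNB infers the number of categories for each predictor from the data it is trained on):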

from sklearn.naive_bayes import CategoricalNB
print("Creating Categorical naive Bayes classifier")
# same add-one smoothing as the BernoulliNB version
model = CategoricalNB(alpha=1)
. . .
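
To check the equivalence for yourself, a small sketch like the following fits both classifiers on the same binary data and compares the predicted probabilities. The six training rows and the two test rows are made up for illustration (they are not the Voting Dataset), and every predictor column contains both a 0 and a 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB

# Made-up binary data: columns are sex, age, height; y is job type.
X = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
])
y = np.array([2, 1, 0, 2, 0, 1])

# Same smoothing for both models.
bnb = BernoulliNB(alpha=1).fit(X, y)
cnb = CategoricalNB(alpha=1).fit(X, y)

# Two made-up items to classify.
X_new = np.array([[1, 0, 0], [0, 1, 1]])
probs_b = bnb.predict_proba(X_new)
probs_c = cnb.predict_proba(X_new)

print(probs_b)
print(probs_c)
print("same probabilities:", np.allclose(probs_b, probs_c))
```

The two models agree here because, with two categories per predictor, the smoothed probability estimates used by CategoricalNB reduce to the ones used by BernoulliNB.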

Suppose you had a different set of encoded data that looks like:

sex  age  height  job-type
--------------------------
 1    2    2         2
 0    0    4         1
 1    2    1         0
. . .

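Here age is young = 0, medium = 1, old = 2. Height is very short = 0, short = 1, medium = 2, tall = 3, very tall = 4. In this case you could use CategoricalNB but not BernoulliNB. BernoulliNB wouldn’t raise an error, but with its default settings it would treat every nonzero value as 1, so the distinction between medium and old, or between short and very tall, would be lost.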
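
A rough sketch of what goes wrong: the code below fits both classifiers on a small made-up dataset with multi-valued predictors, then scores two items that differ only in height (short = 1 versus very tall = 4). The data values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB

# Made-up data: sex in {0,1}, age in {0,1,2}, height in {0,...,4}.
X = np.array([
    [1, 2, 2],
    [0, 0, 4],
    [1, 2, 1],
    [0, 1, 3],
    [1, 0, 0],
    [0, 1, 4],
    [1, 1, 2],
    [0, 2, 0],
])
y = np.array([2, 1, 0, 1, 2, 1, 0, 0])

cnb = CategoricalNB(alpha=1).fit(X, y)
bnb = BernoulliNB(alpha=1).fit(X, y)  # default binarize=0.0 maps nonzero values to 1

# Two items that differ only in height: short (1) vs. very tall (4).
X_new = np.array([[1, 2, 1], [1, 2, 4]])
cat_probs = cnb.predict_proba(X_new)
bern_probs = bnb.predict_proba(X_new)

print("CategoricalNB:", cat_probs)
print("BernoulliNB:  ", bern_probs)
```

CategoricalNB gives the two items different probabilities, while BernoulliNB gives them identical probabilities because both heights were binarized to 1.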

The BernoulliNB constructor has a parameter binarize=0.0 that CategoricalNB doesn’t have. If you feed a BernoulliNB model a numeric dataset that isn’t encoded, then values less than or equal to the binarize value will be automatically encoded as 0 and values greater than the binarize value will be encoded as 1. (The scikit documentation is very unclear about binarize.) In practice, the binarize parameter is rarely useful because a single threshold is applied to every predictor column.
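
Here is a minimal sketch of the binarize behavior, assuming a made-up dataset with a single raw height column and a threshold of 65 inches. A model given the raw heights with binarize=65.0 should behave the same as a model given heights that were encoded by hand.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up raw heights in inches (one predictor column) and job types.
X_raw = np.array([[70.0], [63.0], [72.0], [65.0], [66.0], [61.0]])
y = np.array([2, 1, 0, 1, 2, 0])

# Manual encoding: 65 or less = 0, greater than 65 = 1.
X_enc = (X_raw > 65.0).astype(int)

# Let BernoulliNB apply the threshold itself ...
model_raw = BernoulliNB(alpha=1, binarize=65.0).fit(X_raw, y)
# ... versus feeding it data that is already 0/1 (default binarize=0.0).
model_enc = BernoulliNB(alpha=1).fit(X_enc, y)

X_new_raw = np.array([[64.0], [71.0]])
X_new_enc = (X_new_raw > 65.0).astype(int)

probs_raw = model_raw.predict_proba(X_new_raw)
probs_enc = model_enc.predict_proba(X_new_enc)
print("same probabilities:", np.allclose(probs_raw, probs_enc))
```

With more than one raw column (say, age and height together), no single threshold would suit both, which is why encoding the data yourself is usually the better option.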

I recall reading that several years ago, some researchers in the UK did an experiment where observers were asked to identify the sex of two people based only on the eyes. Weirdly, observers did well predicting Caucasian people (over 80% accuracy), but did no better than chance when predicting Black or Asian people. In each pair above, the female is on the left and the male is on the right. (These aren’t images used in the study). For Black pairs, the most common misidentification was a female incorrectly predicted as a male. For Asian pairs, the most common misidentification was a male incorrectly predicted as a female. I guess the conclusion is that a.) for facial identification, you need more than just eyes, and b.) UK researchers do some strange studies.