Calculating Expected Calibration Error for Binary Classification

Suppose you have a binary classification model where the goal is to predict if a person has a disease of some kind, based on predictor variables such as blood pressure, score on a diagnostic test, cholesterol level, and so on. The output of the model is a value between 0 and 1 that indicates the likelihood that the person has the disease. Therefore, model output values can loosely be interpreted as pseudo-probabilities where values less than 0.5 indicate class 0 (no disease) and values greater than 0.5 indicate class 1 (disease).

Note: My original version of this post was completely incorrect — among the worst technical blunders I’ve ever made. I’ve replaced the original version with a corrected version.

Output pseudo-probability values are sometimes called confidence values or just probabilities. I’ll use the term pseudo-probabilities (PPs).

A machine learning binary classification model is well-calibrated if the output pseudo-probabilities closely reflect the model accuracies. In other words, if the output pseudo-probability for a person is 0.75 (strongly indicating class 1) then you’d like there to be roughly a 75% chance the model is correct — the person does in fact have the disease. Or if the output pseudo-probability is 0.10 (strongly indicating class 0) then you’d like roughly a 1 – 0.10 = 90% chance the person does not have the disease.

Some binary classification models are well-calibrated and some are not. The first step in dealing with model calibration is measuring it. There are many ways to measure binary classification model calibration but one common technique is to calculate a metric called Calibration Error (CE). Small values of CE indicate a model that is well-calibrated; larger values of CE indicate a model that is less well-calibrated.

Note: There are many different variations of calibration error. I present the one I use in practice.

Note: For multi-class problems, CE is used with slight changes and is usually called expected calibration error (ECE). However both CE and ECE are terms that are used interchangeably for binary and multi-class problems.

Calculating CE is best explained by example. Suppose there are 100 training data items. Each data item generates an output pseudo-probability (PP), which determines the predicted class. The training data has the known correct class target value, which determines whether the prediction is correct or wrong.

First, you create 10 bins for the PP values: [0.0 to 0.1), [0.1 to 0.2), . . . [0.9 to 1.0]. Suppose that for the 100 data items, bin [0] holds the items with a PP value between 0.0 and 0.1, and there are 8 items like this:

Item    PP     Correct?
=======================
[57]   0.05    correct
[ 9]   0.08    wrong
[23]   0.04    correct
[52]   0.03    correct 
[86]   0.05    correct
[66]   0.06    wrong
[30]   0.04    correct
[59]   0.05    correct

Six out of eight items were correctly predicted (based on the true, known correct values in the training data), so the model accuracy for bin [0] is 6/8 = 0.75. The average PP for the bin is (0.05 + 0.08 + . . + 0.05) / 8 = 0.05.
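These per-bin statistics are easy to compute directly. Here is a minimal Python sketch using the eight PP values and correct/wrong outcomes from the table above:

```python
# Bin [0] items copied from the table above: each entry is the item's
# pseudo-probability and whether the model's prediction was correct.
pps = [0.05, 0.08, 0.04, 0.03, 0.05, 0.06, 0.04, 0.05]
correct = [True, False, True, True, True, False, True, True]

avg_pp = sum(pps) / len(pps)            # average PP for the bin
accuracy = sum(correct) / len(correct)  # fraction predicted correctly

print(round(avg_pp, 2), accuracy)  # 0.05 0.75
```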

You compute the average PP, the model accuracy, and the absolute value of the difference between the two for each of the nine other bins. Suppose you get:

Bin             Count   Avg PP   Accuracy   abs(PP – Acc)
============================================================

0  0.0 to 0.1     8      0.05     0.75       0.70
1  0.1 to 0.2    12      0.14     0.67       0.53
2  0.2 to 0.3     9      0.22     0.67       0.45
3  0.3 to 0.4     7      0.38     0.57       0.19 
4  0.4 to 0.5    11      0.40     0.64       0.24
     
5  0.5 to 0.6    13      0.55     0.69       0.14
6  0.6 to 0.7     5      0.69     0.60       0.09
7  0.7 to 0.8     6      0.72     0.67       0.05
8  0.8 to 0.9    10      0.84     0.70       0.14
9  0.9 to 1.0    19      0.95     0.63       0.32
                ---
                100 

Now for bins [0] to [4] (the ones corresponding to class 0), you compute the complement of each average PP, 1 – avg PP, and recompute the absolute differences, giving this:

Bin             Count   1 - Avg PP   Accuracy   abs(PP – Acc)
                        (or Avg PP)
============================================================

0  0.0 to 0.1     8      0.95         0.75       0.20
1  0.1 to 0.2    12      0.86         0.67       0.19
2  0.2 to 0.3     9      0.78         0.67       0.11
3  0.3 to 0.4     7      0.62         0.57       0.05
4  0.4 to 0.5    11      0.60         0.64       0.04

5  0.5 to 0.6    13      0.55         0.69       0.14
6  0.6 to 0.7     5      0.69         0.60       0.09
7  0.7 to 0.8     6      0.72         0.67       0.05
8  0.8 to 0.9    10      0.84         0.70       0.14
9  0.9 to 1.0    19      0.95         0.63       0.32
                ---
                100

And now the calibration error (CE) is the weighted average of the absolute values in the last column:

CE = [(8 * 0.20) + (12 * 0.19) + . . + (19 * 0.32)] / 100 = 0.1571

This is a measure of how closely the model PP values correspond to the accuracy. Notice that if each bin accuracy equals the bin average pseudo-probability (or complement for the first half of the bins), the expected calibration error is 0.
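The whole calculation can be expressed compactly. Here is a minimal Python sketch that computes CE from the per-bin counts, average PPs, and accuracies of the example (the bin statistics are copied from the tables above; the complement is applied to the five bins corresponding to class 0):

```python
def calibration_error(counts, avg_pps, accs):
    # Weighted average of |PP - accuracy| over all bins, where
    # class-0 bins (avg PP below 0.5) use the complement 1 - avg PP.
    total = sum(counts)
    ce = 0.0
    for n, pp, acc in zip(counts, avg_pps, accs):
        if pp < 0.5:  # class-0 bin: use the complement
            pp = 1.0 - pp
        ce += n * abs(pp - acc)
    return ce / total

# Bin statistics from the example above.
counts  = [8, 12, 9, 7, 11, 13, 5, 6, 10, 19]
avg_pps = [0.05, 0.14, 0.22, 0.38, 0.40,
           0.55, 0.69, 0.72, 0.84, 0.95]
accs    = [0.75, 0.67, 0.67, 0.57, 0.64,
           0.69, 0.60, 0.67, 0.70, 0.63]

ce = calibration_error(counts, avg_pps, accs)
print(round(ce, 4))  # 0.1571
```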

Note: Many research papers imply that you should use the maximum PP value in each bin rather than the average PP value in each bin. I don’t understand this at all. The average PP in each bin makes much more sense. I suspect that some people are confused with binary classification where there is a single output pseudo-probability, and multi-class classification where there are multiple output pseudo-probabilities and you use the largest one to determine the predicted class.

My first attempt at this post did not use the complement of the PP values for bins corresponding to class 0 — those less than 0.5. I was just completely wrong. I have no defense for my sloppiness, but it was a good humbling experience. I’m thankful to Daniel Sumler, a graduate student at the University of Liverpool, England, for an e-mail message that pointed me in the right direction.

In general, logistic regression binary classification models are fairly well-calibrated, but support vector machine models and neural network models tend to be less well-calibrated.

I’ve been thinking that maybe model calibration error can be used as a measure of dataset similarity. The idea is that similar datasets should have similar calibration error — maybe. It’s an idea that hasn’t been investigated as far as I know.



Binary wristwatches. Left: The time is 6:18. Center-left: A DeTomaso (same Italian company that produces sports cars). Center-right: A watch from a company called The One. Right: I had this Endura jump-hour watch in the 1960s and I was very proud of it.
