Expected calibration error (ECE) is a metric that compares neural network model output pseudo-probabilities to model accuracies. ECE values can be used to calibrate (adjust) a neural network model so that output pseudo-probabilities more closely match actual probabilities of a correct prediction.
There are many different ways to compute a calibration error. Here I describe a simple version that I use in practice.
Note: This post is an updated version. My original post had several embarrassing errors.
Suppose you have a multi-class classification problem where the goal is to predict a person’s political party affiliation from predictor variables such as sex, age, income, and so on. And suppose there are four political parties: 0 = democrat, 1 = republican, 2 = independent, 3 = green.
After training a neural network multi-class classifier, if you feed the trained model some input predictor values, it will emit a vector of four values such as (0.02, 0.03, 0.90, 0.05). The values sum to 1 so they can very loosely be interpreted as probabilities; they're usually called pseudo-probabilities (PPs) or confidence values. Here the largest pseudo-probability (0.90) is at index [2], so the prediction is class 2 = independent.
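The prediction step can be sketched in plain Python (the pseudo-probability values are the ones from the example above; variable names are my own):

```python
# Given one model output vector, find the predicted class (the argmax)
# and its associated largest pseudo-probability.
pps = [0.02, 0.03, 0.90, 0.05]  # model output for one item

pred_class = max(range(len(pps)), key=lambda i: pps[i])  # index of largest PP
largest_pp = pps[pred_class]

# pred_class is 2 (= independent), largest_pp is 0.90
```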
If the largest pseudo-probability is 0.90, it would be nice if the model accuracy were close to 0.90. Calibration error measures the difference between prediction pseudo-probabilities and model accuracy, over all the training data.
First, I create 10 bins, one for each possible largest PP: [0.0 to 0.1), [0.1 to 0.2), . . . [0.9 to 1.0]. The number of bins is arbitrary to some extent, but 10 is a good choice in most scenarios.
Now suppose you have 100 training items. Each item will generate a largest pseudo-probability that determines the predicted class. And suppose that 6 of the 100 items emit a largest PP between 0.2 and 0.3, which corresponds to bin [2]. Each item will be a correct prediction or not, based on the known correct class in the training data, like so:
Item    PP    Correct?
=====================
[65]   0.24   correct
[ 8]   0.28   wrong
[42]   0.22   wrong
[90]   0.24   wrong
[36]   0.25   correct
[21]   0.27   correct
The average PP for this bin is (0.24 + 0.28 + . . . + 0.27) / 6 = 0.25 and the accuracy for the bin is 3 / 6 = 0.50.
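The bin [2] arithmetic can be verified with a short snippet (item order follows the list above; the True/False flags are my encoding of correct/wrong):

```python
# Average PP and accuracy for the six items that fall into bin [2].
pps     = [0.24, 0.28, 0.22, 0.24, 0.25, 0.27]       # largest PP per item
correct = [True, False, False, False, True, True]    # was the prediction right?

avg_pp = sum(pps) / len(pps)            # 1.50 / 6 = 0.25
acc    = sum(correct) / len(correct)    # 3 / 6 = 0.50
```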
You compute the average PP and the accuracy for all 10 bins and get something like this:
bin   largest PP    ct   avg PP    acc    |avg PP - acc|
========================================================
 0    0.0 to 0.1     0
 1    0.1 to 0.2     0
 2    0.2 to 0.3     6    0.24    0.50        0.26
 3    0.3 to 0.4    12    0.36    0.67        0.31
 4    0.4 to 0.5    18    0.44    0.56        0.12
 5    0.5 to 0.6    10    0.52    0.60        0.08
 6    0.6 to 0.7    12    0.68    0.58        0.10
 7    0.7 to 0.8    14    0.74    0.71        0.03
 8    0.8 to 0.9    16    0.86    0.75        0.11
 9    0.9 to 1.0    12    0.92    0.83        0.09
                   ---
                   100
The last column is the absolute value of the difference between the bin average PP and the bin accuracy. Notice that for a problem with 4 classes, the weakest possible prediction set of PPs has a largest PP of just larger than 1.0 / 4 = 0.25 and so bin [0] and bin [1] won’t have any associated data items.
The calibration error is the weighted average of the values in the last column:
CE = [(0 * 0.00) + (0 * 0.00) + (6 * 0.26) + (12 * 0.31) + (18 * 0.12) + (10 * 0.08) + (12 * 0.10) + (14 * 0.03) + (16 * 0.11) + (12 * 0.09)] / 100 = 0.1270.
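The weighted average can be checked directly from the ct and |avg PP - acc| columns of the table:

```python
# Verify the calibration error for the example table:
# weighted average of per-bin |avg PP - acc|, weighted by bin count.
cts  = [0, 0, 6, 12, 18, 10, 12, 14, 16, 12]                       # items per bin
gaps = [0.00, 0.00, 0.26, 0.31, 0.12, 0.08, 0.10, 0.03, 0.11, 0.09]  # |avg PP - acc|

ce = sum(ct * gap for ct, gap in zip(cts, gaps)) / sum(cts)
print(round(ce, 4))  # 0.127
```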
This is an overall measure of how much a model PP differs from the expected accuracy of the model. Notice that if the accuracy of every bin exactly equals the average PP for that bin, the calibration error would be 0.
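Putting the steps together, here is a minimal pure-Python sketch of the whole calculation. The function name and the choice to make only the last bin closed on the right (so a PP of exactly 1.0 lands in bin [9]) are my own:

```python
def expected_calibration_error(largest_pps, corrects, n_bins=10):
    # largest_pps: largest pseudo-probability for each item
    # corrects: 1 if the item's prediction was correct, else 0
    n = len(largest_pps)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # bins are [lo, hi) except the last, which is [0.9, 1.0]
        idxs = [i for i in range(n)
                if lo <= largest_pps[i] < hi
                or (b == n_bins - 1 and largest_pps[i] == hi)]
        ct = len(idxs)
        if ct == 0:
            continue  # empty bins contribute nothing
        avg_pp = sum(largest_pps[i] for i in idxs) / ct
        acc = sum(corrects[i] for i in idxs) / ct
        ece += (ct / n) * abs(avg_pp - acc)
    return ece

# The six bin [2] items from the example: all fall in one bin,
# so ECE = (6/6) * |0.25 - 0.50| = 0.25.
pps = [0.24, 0.28, 0.22, 0.24, 0.25, 0.27]
flags = [1, 0, 0, 0, 1, 1]
print(expected_calibration_error(pps, flags))
```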
I came across an interesting paper titled “Measuring Calibration in Deep Learning” that describes alternative measures of calibration, such as Adaptive Calibration Error. See https://arxiv.org/pdf/1904.01685.pdf.

Before electronics, coin-operated games were strictly mechanical. These machines required constant manual calibration. Left: A Rock-Ola “Official Sweepstakes” game from about 1933. You select one of 8 horses (lower right), insert a coin (middle), then push the sideways lever. The horses spin and stop, where the flag (right) indicates the winner. A small metal ball also spins and determines the payout odds (the light green slots) for the winning horse.
Center: A Mills “Chicago” model upright slot machine from about 1898. You insert coins at the top, submit the coins with the sideways lever at the top, then pull the large handle (middle). The geometric-art wheel spins and determines a payout. There’s also a music box device (bottom) that plays while the wheel is spinning.
Right: A Withey “Seven Grand” dice game from about 1933, from the days when gambling and liquor were illegal. You’d give the speakeasy bartender a penny (equivalent to about 25 cents today), then pull the handle, which would spin the felt table to roll the seven dice. The goal is to get 4 of a kind (you win 15 cents), 5 (30 cents), 6 (150 cents), or 7 (300 cents) of a kind. If you won anything, it would be paid by the bartender.