Many machine learning examples use the MNIST image dataset. Some of those examples use mean squared error (MSE) and some use binary cross entropy (BCE). In problems where both the target value and the computed value are between 0.0 and 1.0, in theory you can use either MSE or BCE. This applies to MNIST because in almost all situations you normalize the raw pixel values, which are between 0 and 255, by dividing by 255 so they're between 0.0 and 1.0, and you apply sigmoid activation so the computed values are also between 0.0 and 1.0. But as I explain here, I think BCE is better for binary classification problems, and MSE is best when the targets aren't 0 or 1, which is typically a regression problem or an autoencoder problem.
One source of confusion for beginners is that you usually use BCE for binary classification, where the targets are either 0 or 1, and you usually use MSE for regression, where targets can be any numeric value. This leads to the incorrect notion that BCE can be used only when targets are 0 or 1. But, again, if target and computed are both between 0.0 and 1.0, you can use either MSE or BCE.
Here’s a concrete example:
# bce_vs_mse.py
import torch as T
print("\nYou can use BCE or MSE")
target = T.tensor([0.9], dtype=T.float32)
x1 = T.tensor([0.25], dtype=T.float32)
x2 = T.tensor([0.50], dtype=T.float32)
x3 = T.tensor([0.75], dtype=T.float32)
print("\ntarget: " + str(target))
print("computeds: ")
print(x1)
print(x2)
print(x3)
print("\nMSE losses: ")
# (x, target) order doesn't matter
mse1 = T.nn.functional.mse_loss(x1, target)
mse2 = T.nn.functional.mse_loss(x2, target)
mse3 = T.nn.functional.mse_loss(x3, target)
print(mse1)
print(mse2)
print(mse3)
print("\nBCE losses: ")
# order matters: binary_cross_entropy(computed, target)
bce1 = T.nn.functional.binary_cross_entropy(x1, target)
bce2 = T.nn.functional.binary_cross_entropy(x2, target)
bce3 = T.nn.functional.binary_cross_entropy(x3, target)
print(bce1)
print(bce2)
print(bce3)
First I set up a target of 0.9 and three computeds of 0.25, 0.50, and 0.75. Notice the computeds get closer and closer to the target. The MSE loss values are 0.4225, 0.1600, 0.0225. The loss values decrease as the computeds get closer to the target. That’s good. The BCE loss values are 1.2764, 0.6931, 0.3975. Again, the loss values decrease. Good.
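As a sanity check, the BCE values above can be reproduced by hand from the definition BCE(x, t) = -[t * ln(x) + (1 - t) * ln(1 - x)], where t is the target and x is the computed value. A minimal sketch using only the standard library:

```python
# verify the BCE values above by hand
import math

def bce(x, t):
    # binary cross entropy for a single computed value x and target t:
    # -[t * ln(x) + (1 - t) * ln(1 - x)]
    return -(t * math.log(x) + (1 - t) * math.log(1 - x))

t = 0.9
for x in [0.25, 0.50, 0.75]:
    print("x = %0.2f  BCE = %0.4f" % (x, bce(x, t)))
# prints 1.2764, 0.6931, 0.3975 -- matching the PyTorch results
```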
Now suppose that the target is 0.60, and three computeds are 0.50, 0.60, 0.70. The MSE and BCE are:
target = 0.60

computed   MSE      BCE
--------------------------
  0.50     0.0100   0.6931
  0.60     0.0000   0.6730
  0.70     0.0100   0.6956
The MSE loss values make perfect sense — 0.50 and 0.70 have the same loss because they’re the same distance away from the target 0.60, and the loss for computed = 0.60 is 0 because target = computed.
But for the BCE loss there are two annoying details. First, the loss when target == computed is 0.6730, not 0 as you'd expect. Second, the loss values are asymmetric: even though computeds of 0.50 and 0.70 are the same distance (0.10) from the target of 0.60, the loss for computed = 0.50 is slightly less than the loss for computed = 0.70.
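The nonzero loss at target == computed isn't a bug. For a target t that isn't exactly 0 or 1, BCE is minimized when computed = t, but the minimum value is the entropy of the target, -[t * ln(t) + (1 - t) * ln(1 - t)], which is zero only when t is exactly 0 or 1. A small sketch that checks this for t = 0.60:

```python
# the minimum BCE for a non-0/1 target is the entropy of the target
import math

def bce(x, t):
    # -[t * ln(x) + (1 - t) * ln(1 - x)]
    return -(t * math.log(x) + (1 - t) * math.log(1 - x))

def entropy(t):
    # minimum possible BCE for target t, reached at computed = t
    return -(t * math.log(t) + (1 - t) * math.log(1 - t))

t = 0.60
print("%0.4f" % bce(t, t))      # BCE at computed == target
print("%0.4f" % entropy(t))     # same value: 0.6730, not 0
```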
For these reasons, in situations where either MSE or BCE can be used — both targets and computeds are all between 0.0 and 1.0 — I usually use BCE for classification problems. But I prefer to use MSE for problems where the targets aren’t 0 or 1. An example is computing loss for an autoencoder or variational autoencoder for MNIST data.
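As a sketch of that last scenario, here is a hypothetical four-pixel reconstruction example (the values are made up for illustration): both the original normalized pixels and the reconstructed pixels are between 0.0 and 1.0, so either loss function is legal, but MSE gives a loss of exactly 0 for a perfect reconstruction.

```python
# hypothetical autoencoder reconstruction loss using MSE
import torch as T

# normalized MNIST-style pixel values, all in [0.0, 1.0]
original = T.tensor([0.00, 0.55, 0.90, 1.00], dtype=T.float32)
reconstructed = T.tensor([0.05, 0.50, 0.85, 0.95], dtype=T.float32)

loss = T.nn.functional.mse_loss(reconstructed, original)
print(loss)  # mean of the squared per-pixel differences
```

A practical side benefit: MSE is well behaved even when a computed value is exactly 0.0 or 1.0, where the log terms in BCE become problematic.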
There is some disagreement about this. Based on my experience, for some problems MSE works a bit better and for some problems BCE works a bit better, but I’ve never seen a problem where there’s a huge difference in results. I guess the moral here is that when you can use either MSE or BCE, try both and see if one works better than the other.

In machine learning, there are often several correct techniques for a task. But in dog show agility courses, there is usually only one correct technique. Left: Correct technique for hurdle. Center: Incorrect technique for tube-dash obstacle. Right: Incorrect technique for hurdle.
