I was preparing to teach a class on PyTorch neural networks at the large tech company I work for. To prepare, I wanted to mentally review PyTorch neural network basics, including the mysterious NLLLoss function.
1. Early Days: One-Hot Targets, Softmax Activation, Mean Squared Error (or Cross Entropy Loss)
Suppose you have a multi-class classification problem where the goal is to predict which of three cities (Anaheim, Boulder, Chicago) a person lives in, based on their age, income, and weight. Your raw training data might look like:
23, $72000, 155, Boulder
31, $63000, 165, Anaheim
38, $58000, 170, Chicago
54, $24000, 200, Chicago
. . .
You prepare the raw data by normalizing the numeric predictors and one-hot encoding the targets:
0.23, 0.72000, 0.155, 0,1,0
0.31, 0.63000, 0.165, 1,0,0
0.38, 0.58000, 0.170, 0,0,1
0.54, 0.24000, 0.200, 0,0,1
. . .
You design your neural network so that it applies the softmax() function as the last step in the output. For example, if the input to the neural network is the first training item (0.23, 0.72000, 0.155), then with the network’s current weights and bias values, the network might spit out preliminary output values, called logits, such as (2.5, 3.5, 1.5). If you apply softmax(), the final output values are scaled so that they sum to 1.0, such as (0.20, 0.70, 0.10). These final values are sometimes called pseudo-probabilities.
                   softmax
logit  exp(logit)  exp/sum (rounded for simplicity)
 2.5     12.18      0.20
 3.5     33.11      0.70
 1.5      4.48      0.10
        ------     -----
         49.78      1.00
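The softmax computation can be sketched in a few lines of plain Python. (Note that the table above rounds loosely for simplicity; the unrounded pseudo-probabilities come out near 0.24, 0.67, 0.09.)

```python
import math

def softmax(logits):
    # subtract the max logit for numerical stability (doesn't change the result)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.5, 3.5, 1.5])
# the outputs sum to 1.0 and preserve the ordering of the logits
```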
Now, during training, in order to update the network’s weights and bias values so that the network gets better, you must calculate the loss/error between the computed output of (0.20, 0.70, 0.10) and the desired target output of (0, 1, 0). The easiest way to do this is by using mean squared error. For a single data item, the squared error is
SE = (0.20 - 0)^2 + (0.70 - 1)^2 + (0.10 - 0)^2
= 0.04 + 0.09 + 0.01
= 0.14
You’d compute the squared error for each training data item, add them up, divide by how many items you have, and that result is the mean (average) squared error (MSE) for the current weights and bias values. The MSE is then used in a very complex process, called back-propagation, to update the weights and bias values so that the neural network’s computed outputs are closer to the known, correct, target values.
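The squared error and MSE computations can be sketched in plain Python, using the worked example above:

```python
def squared_error(computed, target):
    # sum of squared differences for a single data item
    return sum((c - t) ** 2 for c, t in zip(computed, target))

def mean_squared_error(all_computed, all_targets):
    # average the per-item squared errors over the whole dataset
    errs = [squared_error(c, t) for c, t in zip(all_computed, all_targets)]
    return sum(errs) / len(errs)

se = squared_error([0.20, 0.70, 0.10], [0, 1, 0])
```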
In the late 1990s, researchers and engineers noticed that a different function, called mean cross entropy error (MCEE), sometimes, but not always, works better than MSE (which works fine). MSE looks at all of the computed pseudo-probabilities, such as (0.20, 0.70, 0.10), and all of the one-hot target values, such as (0, 1, 0). Cross entropy error looks only at the single pseudo-probability that corresponds to the correct target, where the 1 is. For the example above:
CEE = -1 * log(0.70)
= 0.36
You’d compute CEE for each data item, take their average, and that value is the MCEE. The idea here is that you only care about the 1 value in the one-hot target. But I repeat, MSE works fine too.
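Cross entropy error for a single item can be sketched in plain Python, again using the worked example above:

```python
import math

def cross_entropy_error(computed, one_hot_target):
    # only the pseudo-probability at the position of the 1 matters
    idx = one_hot_target.index(1)
    return -math.log(computed[idx])

cee = cross_entropy_error([0.20, 0.70, 0.10], [0, 1, 0])  # -log(0.70)
```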
2. PyTorch: Ordinal Targets, LogSoftmax Activation, NLLLoss
You can use the old one-hot encoding, softmax activation, MSE or MCEE loss/error with PyTorch. However, PyTorch (first released in 2016) has a tricky alternative for multi-class classification.
First, instead of using one-hot encoding for the targets, you use ordinal encoding. For example, instead of:
0.23, 0.72000, 0.155, 0,1,0
0.31, 0.63000, 0.165, 1,0,0
0.38, 0.58000, 0.170, 0,0,1
0.54, 0.24000, 0.200, 0,0,1
. . .
You use:
0.23, 0.72000, 0.155, 1
0.31, 0.63000, 0.165, 0
0.38, 0.58000, 0.170, 2
0.54, 0.24000, 0.200, 2
. . .
This is simpler, especially in cases where you have many target classes. If you had 50 states instead of just 3, each one-hot encoded target would have 49 0s and one 1.
Next, you design your neural network so that instead of applying the softmax function to the preliminary output values, you apply log-softmax, which as the name suggests, is just the log of the softmax values. For the example data item above:
                   softmax
logit  exp(logit)  exp/sum  log(softmax)
 2.5     12.18      0.20      -1.61
 3.5     33.11      0.70      -0.36
 1.5      4.48      0.10      -2.30
        ------     -----      -----
         49.78      1.00        NA
The log-softmax values are all negative, but the ordering is preserved: a log-softmax value closer to zero corresponds to a larger pseudo-probability. Now, during training, you specify using the NLLLoss (negative log-likelihood loss). The NLLLoss function expects log-softmax values and ordinal-encoded targets. Behind the scenes, NLLLoss takes the ordinal-encoded target (1 in the example), fetches the corresponding log-softmax value at that index (at [1], which is -0.36), and flips the sign, giving an NLLLoss value of 0.36. Notice this is exactly the same loss value from above, obtained using one-hot encoding, softmax activation, and cross entropy error. Very tricky!
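The behind-the-scenes mechanics can be sketched in two lines of plain Python, using the rounded log-softmax values from the table above (this illustrates the idea, not PyTorch’s actual implementation):

```python
def nll_loss(log_softmax_values, ordinal_target):
    # fetch the log-softmax value at the target index and flip its sign
    return -log_softmax_values[ordinal_target]

log_sm = [-1.61, -0.36, -2.30]  # rounded values from the table above
loss = nll_loss(log_sm, 1)      # target class is 1
```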
Notice that the PyTorch NLLLoss function doesn’t really compute anything in a traditional sense — the function just flips a sign from negative to positive. A somewhat strange idea.
To recap, PyTorch multi-class classification networks can use ordinal targets, log-softmax activation, and NLLLoss to compute loss/error. This approach is a bit easier because you avoid explicit one-hot encoding of target values, and it gives you exactly the same results as using one-hot targets, softmax activation, and cross entropy error. (PyTorch’s CrossEntropyLoss function combines log-softmax and NLLLoss, so it accepts raw logits and ordinal targets directly.)
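A minimal PyTorch sketch of the equivalence, assuming PyTorch is installed, using a batch containing the single example item:

```python
import torch

logits = torch.tensor([[2.5, 3.5, 1.5]])   # preliminary outputs, batch of 1
target = torch.tensor([1])                 # ordinal-encoded target class

# approach 1: log-softmax activation followed by NLLLoss
log_probs = torch.log_softmax(logits, dim=1)
nll = torch.nn.NLLLoss()(log_probs, target)

# approach 2: CrossEntropyLoss applied to raw logits
# (internally this is log-softmax + NLLLoss combined)
ce = torch.nn.CrossEntropyLoss()(logits, target)
```

The two loss values are identical, and each equals the negated log-softmax value at the target index.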

I came across an interesting 2025 research paper titled “Human Perception of Art in the Age of Artificial Intelligence”, by J. van Hees, T. Grootswagers, G. Quek, and M. Varlet. People were asked to compare real art produced by humans with AI-generated art. Somewhat surprisingly, people consistently preferred the AI art. The researchers speculated that the preference was due to a difference in entropy between the two kinds of images.
Left: The image on the left was created by Vincent van Gogh; the image on the right was created by DALL-E 2.
Right: The image on the left was created by Georges Seurat; the image on the right was created by DALL-E 2.
