PyTorch CrossEntropyLoss vs. NLLLoss (Cross Entropy Loss vs. Negative Log-Likelihood Loss)

If you are designing a neural network multi-class classifier using PyTorch, you can use cross entropy loss (torch.nn.CrossEntropyLoss) with logits output (no activation) in the forward() method, or you can use negative log-likelihood loss (torch.nn.NLLLoss) with log-softmax (torch.nn.LogSoftmax module or torch.log_softmax() function) in the forward() method. Whew! That’s a mouthful. Let me explain with some code examples.

Suppose you are looking at the Iris Dataset, which has four predictor variables and three classes. The CrossEntropyLoss with logits approach is easier to implement and is by far the most common approach.


The demo run on the left uses CrossEntropyLoss with no activation on the output nodes. The demo run on the right uses NLLLoss with LogSoftmax activation on the output nodes. The results are identical.
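You can verify the numerical equivalence of the two techniques directly on a small batch of made-up logits (the values here are arbitrary, not from the Iris demo):

```python
import torch as T

# a fake batch of 2 items, 3 classes (arbitrary logit values)
logits = T.tensor([[2.0, -0.5, 1.0],
                   [0.1,  0.2, 3.0]])
targets = T.tensor([0, 2])  # class labels

# technique 1: CrossEntropyLoss applied to raw logits
ce = T.nn.CrossEntropyLoss()(logits, targets)

# technique 2: NLLLoss applied to log-softmax of the logits
nll = T.nn.NLLLoss()(T.log_softmax(logits, dim=1), targets)

print(ce.item(), nll.item())  # the two loss values are identical
```

The two printed values match to floating-point precision, which is why the two demo runs give identical results.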


A possible 4-7-3 network definition and its associated training code look like:

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)
    # initialize wts and biases here

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)  # logits output
    return z

# training
. . . 
loss_func = T.nn.CrossEntropyLoss()
optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
. . .
loop
  optimizer.zero_grad()
  oupt = net(X)
  loss_obj = loss_func(oupt, Y)
  loss_obj.backward()
  optimizer.step()
end-loop
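As a sanity check, the skeleton above can be fleshed out into a tiny self-contained run. Note that the 20 training items below are random stand-in data, not the actual Iris data:

```python
import torch as T

T.manual_seed(1)

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    return self.oupt(z)  # logits output, no activation

net = Net()
X = T.randn(20, 4)          # 20 random stand-in items, 4 predictors
Y = T.randint(0, 3, (20,))  # random class labels in {0, 1, 2}

loss_func = T.nn.CrossEntropyLoss()
optimizer = T.optim.SGD(net.parameters(), lr=0.05)

losses = []
for epoch in range(100):
  optimizer.zero_grad()
  oupt = net(X)                   # logits
  loss_obj = loss_func(oupt, Y)   # log-softmax applied internally
  loss_obj.backward()
  optimizer.step()
  losses.append(loss_obj.item())

print("first loss = %0.4f  final loss = %0.4f" % (losses[0], losses[-1]))
```

The loss value decreases over the 100 epochs, confirming the wiring is correct even on meaningless data.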

This CrossEntropyLoss with logits output (logits just means no activation applied) technique is really just wrapper code around the older NLLLoss with LogSoftmax technique. That older approach could look like:

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)
    self.apply_log_soft = T.nn.LogSoftmax(dim=1) # Module
    # initialize wts and biases here

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)
    # z = T.log(T.softmax(z, dim=1))  # inefficient
    z = self.apply_log_soft(z)  # efficient
    # z = T.log_softmax(z, dim=1)  # function instead of Module
    return z

# training
. . . 
loss_func = T.nn.NLLLoss()  # assumes LogSoftmax() applied
optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
. . .
loop
  optimizer.zero_grad()
  oupt = net(X)
  loss_obj = loss_func(oupt, Y)
  loss_obj.backward()
  optimizer.step()
end-loop

In short, when using the newer and simpler approach for multi-class classification, you don’t apply any activation to the output and then CrossEntropyLoss applies log-SoftMax internally. When using the older approach for multi-class classification, you apply LogSoftmax to the output and NLLLoss assumes you’ve done so.

When making a prediction, with the CrossEntropyLoss technique the raw output values will be logits so if you want to view probabilities you must apply SoftMax. With the older NLLLoss technique, the raw output values will be log of SoftMax so if you want to view probabilities you must apply the exp() function.
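For example, to recover probabilities from each style of output (the logit values below are arbitrary stand-ins for model output):

```python
import torch as T

# CrossEntropyLoss-style model: raw output is logits
logits = T.tensor([[2.3, 0.1, -1.2]])  # pretend model output
probs_from_logits = T.softmax(logits, dim=1)

# NLLLoss-style model: raw output is log(softmax)
log_probs = T.log_softmax(logits, dim=1)  # pretend model output
probs_from_log_probs = T.exp(log_probs)

print(probs_from_logits)     # pseudo-probabilities, sum to 1
print(probs_from_log_probs)  # same values
```

Either way you end up with the same pseudo-probabilities; the only difference is which inverse transform you apply.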

To summarize, when designing a neural network multi-class classifier, you can use CrossEntropyLoss with no activation, or you can use NLLLoss with log-SoftMax activation. This applies only to multi-class classification — binary classification and regression problems have a different set of rules.


When designing a house, there are many alternatives. Some designs are better than others.

This entry was posted in PyTorch. Bookmark the permalink.