In a PyTorch multi-class classification problem, the standard design is to apply log_softmax() activation to the output nodes, in conjunction with NLLLoss() during training. It's possible to compute softmax() and then apply log(), but it's slightly more efficient to compute log_softmax() directly.
Computing softmax() looks like:
import torch as T

def softmax(x):
  mx = T.max(x)
  y = T.exp(x - mx)
  return y / T.sum(y)
Subtracting the max value before exponentiating is just a math trick to avoid arithmetic overflow; it doesn't change the result.
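As a quick sanity check (the input values here are arbitrary), the function above matches the built-in T.softmax():

t = T.tensor([2.0, 1.0, 0.1])
print(softmax(t))           # tensor([0.6590, 0.2424, 0.0986])
print(T.softmax(t, dim=0))  # same values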
Computing log_softmax() directly looks like:
def log_softmax(x):
  mx = T.max(x)
  lse = T.log(T.sum(T.exp(x - mx)))
  return x - mx - lse
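Again as a sanity check, with the same arbitrary input, the result agrees with the built-in T.log_softmax():

t = T.tensor([2.0, 1.0, 0.1])
print(log_softmax(t))            # tensor([-0.4170, -1.4170, -2.3170])
print(T.log_softmax(t, dim=0))   # same values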
The reason why log_softmax() is applied to the output nodes is rather subtle. If the target class is at index [i] then the negative log likelihood loss is just the negative of log_softmax() value at [i]. For example, if the log_softmax of a neural output is [-1.6563, -1.7563, -1.5563], and the target class label is [2], then the NLLLoss() is -(-1.5563) = 1.5563. Quite remarkable. One way to think of the log_softmax() plus NLLLoss() pairing is that log_softmax() actually computes the error and NLLLoss() just extracts the error.
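You can verify the example with the built-in NLLLoss(), which expects a batch dimension:

loss_func = T.nn.NLLLoss()
log_sm = T.tensor([[-1.6563, -1.7563, -1.5563]])  # one item, three classes
target = T.tensor([2])                            # target class label
print(loss_func(log_sm, target))                  # tensor(1.5563)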
If you just have a single set of log-softmax outputs and a single target class label, you could write an NLLLoss() like so:
def my_nll_loss(oupt, target):
  # oupt is a vector of log-softmax values
  result = -oupt[target]
  return result
If you have a batch of output values and a vector of targets, you can use the clever diag() function like so:
def my_nll_loss(oupt, targets):
  # oupt is a matrix of log-softmax values
  out = T.diag(oupt[:,targets])  # one val from each row
  return -T.mean(out)
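A quick check of the batch version against the built-in, using made-up log-softmax values (the shapes here, four items and three classes, are just for illustration):

oupt = T.log_softmax(T.randn(4, 3), dim=1)  # 4 items, 3 classes
targets = T.tensor([0, 2, 1, 0])
print(my_nll_loss(oupt, targets))
print(T.nn.NLLLoss()(oupt, targets))        # same value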
In the early days of neural networks, you'd compute softmax() on the output nodes and then apply an explicit cross entropy loss. The softmax() plus cross entropy approach and the log_softmax() plus NLLLoss() approach give the same results, but the log_softmax() plus NLLLoss() approach is more efficient from an engineering perspective.
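The equivalence is easy to demonstrate. PyTorch's CrossEntropyLoss() applied to raw output values gives the same result as log_softmax() followed by NLLLoss() (the logit values here are arbitrary):

logits = T.tensor([[1.5, 0.8, 2.1]])  # raw output values
target = T.tensor([2])
ce = T.nn.CrossEntropyLoss()(logits, target)
nll = T.nn.NLLLoss()(T.log_softmax(logits, dim=1), target)
print(ce.item(), nll.item())  # identical values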

In the old science fiction movies I enjoy, efficiency was sometimes achieved by reusing special effects snippets. A cool spaceship appeared in four different movies. Left: “Flight to Mars” (1951) was quickly produced in just a few weeks to take advantage of the publicity surrounding the Academy Award winning “Destination Moon” (1950). The spaceship for “Flight to Mars” was reused three times. Center-Left: “World Without End” (1956) is an OK film. Center-Right: “It! The Terror from Beyond Space” (1958) is a landmark film and the direct inspiration for “Alien” (1979). Right: “Queen of Outer Space” (1958) is better than you might guess based on the title.
