In a PyTorch multi-class classification problem, the standard design is to apply log_softmax() activation to the output nodes, in conjunction with NLLLoss() during training. It's possible to compute softmax() and then apply log(), but it's slightly more efficient to compute log_softmax() directly.
Computing softmax() looks like:
import torch as T

def softmax(x):
  mx = T.max(x)
  y = T.exp(x - mx)
  return y / T.sum(y)
Subtracting the max value before exponentiating is just a math trick to avoid arithmetic overflow; it doesn't change the result.
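As a quick sanity check (the input values here are arbitrary), the function above matches the built-in T.softmax():

t = T.tensor([2.0, 1.0, 0.1])
print(softmax(t))           # tensor([0.6590, 0.2424, 0.0986])
print(T.softmax(t, dim=0))  # same values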
Computing log_softmax() directly looks like:
def log_softmax(x):
  mx = T.max(x)
  lse = T.log(T.sum(T.exp(x - mx)))
  return x - mx - lse
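Again as a sanity check, with the same arbitrary input, the result agrees with the built-in T.log_softmax():

t = T.tensor([2.0, 1.0, 0.1])
print(log_softmax(t))            # tensor([-0.4170, -1.4170, -2.3170])
print(T.log_softmax(t, dim=0))   # same values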
The reason why log_softmax() is applied to the output nodes is rather subtle. If the target class is at index [i] then the negative log likelihood loss is just the negative of log_softmax() value at [i]. For example, if the log_softmax of a neural output is [-1.6563, -1.7563, -1.5563], and the target class label is [2], then the NLLLoss() is -(-1.5563) = 1.5563. Quite remarkable. One way to think of the log_softmax() plus NLLLoss() pairing is that log_softmax() actually computes the error and NLLLoss() just extracts the error.
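You can verify the example with the built-in NLLLoss(), which expects a batch dimension:

loss_func = T.nn.NLLLoss()
log_sm = T.tensor([[-1.6563, -1.7563, -1.5563]])  # one item, three classes
target = T.tensor([2])                            # target class label
print(loss_func(log_sm, target))                  # tensor(1.5563)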
If you just have a single set of log-softmax outputs and a single target class label, you could write an NLLLoss() like so:
def my_nll_loss(oupt, target):
  # oupt is a vector of log-softmax values
  result = -oupt[target]
  return result
If you have a batch of output values and a vector of targets, you can use the clever diag() function like so:
def my_nll_loss(oupt, targets):
  # oupt is a matrix of log-softmax values
  out = T.diag(oupt[:,targets])  # one val from each row
  return -T.mean(out)
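A quick check of the batch version against the built-in, using made-up log-softmax values (the shapes here, four items and three classes, are just for illustration):

oupt = T.log_softmax(T.randn(4, 3), dim=1)  # 4 items, 3 classes
targets = T.tensor([0, 2, 1, 0])
print(my_nll_loss(oupt, targets))
print(T.nn.NLLLoss()(oupt, targets))        # same value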
In the early days of neural networks, you'd compute softmax() on the output nodes and then apply an explicit cross entropy loss. The softmax() plus cross entropy approach and the log_softmax() plus NLLLoss() approach give the same results, but the log_softmax() plus NLLLoss() approach is more efficient from an engineering perspective.
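The equivalence is easy to demonstrate. PyTorch's CrossEntropyLoss() applied to raw output values gives the same result as log_softmax() followed by NLLLoss() (the logit values here are arbitrary):

logits = T.tensor([[1.5, 0.8, 2.1]])  # raw output values
target = T.tensor([2])
ce = T.nn.CrossEntropyLoss()(logits, target)
nll = T.nn.NLLLoss()(T.log_softmax(logits, dim=1), target)
print(ce.item(), nll.item())  # identical values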

In the old science fiction movies I enjoy, efficiency was sometimes achieved by reusing special effects snippets. A cool spaceship appeared in four different movies. Left: “Flight to Mars” (1951) was quickly produced in just a few weeks to take advantage of the publicity surrounding the Academy Award winning “Destination Moon” (1950). The spaceship for “Flight to Mars” was reused three times. Center-Left: “World Without End” (1956) is an OK film. Center-Right: “It! The Terror from Beyond Space” (1958) is a landmark film and the direct inspiration for “Alien” (1979). Right: “Queen of Outer Space” (1958) is better than you might guess based on the title.
