Neural networks are conceptually simple but fantastically complicated in the details. One of many extremely confusing topics is why there are two different equations for updating the hidden-to-output layer weights.
Briefly, for multiclass classification, suppose j is a hidden node, k is an output node, w[j][k] is the weight from hidden node j to output node k, h[j] is the value of hidden node j, t[k] is the target value from the training data, y[k] is the computed output node value, lr is the learning rate, and grad[j][k] is the gradient for w[j][k]. Then one form of the weight update equation is:
grad[j][k] = -(t[k] - y[k]) * (1 - y[k]) * y[k] * h[j]
w[j][k] -= lr * grad[j][k]
The second form of the update equation is:
grad[j][k] = -(t[k] - y[k]) * h[j]
w[j][k] -= lr * grad[j][k]
The second form is the same as the first but without the (1 - y[k]) * y[k] term. What?
The first update equation assumes you want to minimize the mean squared error (or, equivalently, maximize likelihood — another extremely complicated topic). The second update equation assumes you want to minimize cross entropy error.
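The cancellation can be seen by working the chain rule for a single sigmoid output node. This is a sketch in LaTeX notation rather than the post's bracket notation; z_k denotes the pre-activation input to output node k (a symbol the post does not use), and the sigmoid derivative is y_k(1 - y_k):

```latex
% Mean squared error: E = \tfrac{1}{2}(t_k - y_k)^2
\frac{\partial E}{\partial w_{jk}}
  = \frac{\partial E}{\partial y_k}\,\frac{\partial y_k}{\partial z_k}\,\frac{\partial z_k}{\partial w_{jk}}
  = -(t_k - y_k)\; y_k (1 - y_k)\; h_j

% Cross entropy error: E = -\bigl[t_k \ln y_k + (1 - t_k)\ln(1 - y_k)\bigr]
\frac{\partial E}{\partial y_k} = \frac{y_k - t_k}{y_k (1 - y_k)}
\quad\Longrightarrow\quad
\frac{\partial E}{\partial w_{jk}}
  = \frac{y_k - t_k}{y_k (1 - y_k)}\; y_k (1 - y_k)\; h_j
  = -(t_k - y_k)\; h_j
```

For cross entropy, the y_k(1 - y_k) in the denominator of dE/dy_k cancels exactly with the sigmoid derivative, which is why the second update form has no (1 - y[k]) * y[k] term.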
You can find many derivations of the first update equation but not very many of the second. A good reference for the first, mean squared error version is at:
http://www.cs.cornell.edu/courses/cs5740/2016sp/resources/backprop.pdf.
An excellent reference for the second, cross entropy error version is at:
In the image below, I’ve placed two key pages from each reference side by side. Note that they use different notations (which are also different from my notation).

Update for hidden-to-output weights assuming mean squared error (left) and assuming cross entropy error (right).
How can two different equations both work?
Suppose the target value t[k] is 1 and the computed output node value y[k] is 0.6. The (1 - y[k]) * y[k] term is 0.4 * 0.6 = 0.24. Because y[k] is always between 0.0 and 1.0, the (1 - y[k]) * y[k] term is always a small positive number less than or equal to 0.25 (the maximum, reached when y[k] = 0.5).
Therefore, for a given value of the learning rate, the update for the first form (mean squared error) is always smaller than the update for the second form (cross entropy error), scaled down by a factor of at most 0.25.
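The comparison above can be checked numerically. This is an illustrative sketch, not the post's actual code; the function names grad_mse and grad_ce are my own:

```python
# Compare the two hidden-to-output gradient forms for a single weight,
# using the example values from the text: t = 1.0, y = 0.6, h = 1.0.

def grad_mse(t, y, h):
    # First form: includes the sigmoid-derivative term (1 - y) * y
    return -(t - y) * (1.0 - y) * y * h

def grad_ce(t, y, h):
    # Second form: the (1 - y) * y term has cancelled
    return -(t - y) * h

t, y, h = 1.0, 0.6, 1.0
print(round((1.0 - y) * y, 4))       # 0.24, matching the example above
print(round(grad_mse(t, y, h), 4))   # -0.096
print(round(grad_ce(t, y, h), 4))    # -0.4

# (1 - y) * y peaks at 0.25 when y = 0.5, so the MSE gradient is always
# a scaled-down version of the CE gradient for the same (t, y, h):
assert all((1.0 - v) * v <= 0.25 for v in [i / 100 for i in range(101)])
```

With w[j][k] -= lr * grad[j][k], both forms push the weight in the same direction; only the step size differs.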
The first form has the advantage that the update amount is larger when the computed output is "bad", meaning close to 0.5 (a relatively uninformative value), and smaller when the computed output is "good". In theory this leads to faster training. The disadvantage of the first form is that training can stall when a computed output saturates near 0 or 1, because the (1 - y[k]) * y[k] term approaches zero even if the output is badly wrong.
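The stall behavior is easy to demonstrate. In this sketch (again, my own illustrative functions, not the post's code), the target is 0 but the output has saturated at 0.99, so the MSE-form gradient nearly vanishes while the cross-entropy-form gradient stays full size:

```python
# Why the MSE form can stall: a saturated but badly wrong output.

def grad_mse(t, y, h):
    return -(t - y) * (1.0 - y) * y * h   # includes (1 - y) * y

def grad_ce(t, y, h):
    return -(t - y) * h                    # (1 - y) * y cancelled

t, y, h = 0.0, 0.99, 1.0
print(round(grad_mse(t, y, h), 6))   # 0.009801 -- tiny step despite a large error
print(round(grad_ce(t, y, h), 6))    # 0.99 -- full-size corrective step
```

Here the MSE update is about 100 times smaller than the cross entropy update, even though the output is about as wrong as it can be.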
In practice, both forms usually work about equally well, which is why two different forms of the update exist: if one form were clearly better, there would be only one.
All things considered, I’m a fan of different forms of algorithms. It makes the field of computer science rich in ideas and endlessly interesting.

For me, the first equation has not shown any benefit, and the extra calculation finally brings me to the second equation.
But one thing is still confusing to me: if I omit the MSE or CE error term, the neural network's predictions are the same as with no error function at all. Does that mean that, unless training stops when the error falls below some threshold, there is no practical use for the error function?
Both papers are great; I like the style of J.G. Makin.