Neural Network Back-Propagation Weight Update Equation: Mean Squared Error vs. Cross Entropy Error

Neural networks are conceptually simple but fantastically complicated in the details. One of many confusing topics is why there are two different equations for updating the hidden-to-output layer weights.

Briefly, for multiclass classification, if j is a hidden node, k is an output node, w[j][k] is the weight from hidden node j to output node k, h is a hidden node value, t is a target value from training data, y is a computed output node value, lr is the learning rate, and grad[j][k] is the gradient for w[j][k], then one form of the weight update equation is:

grad[j][k] = -(t[k] - y[k]) * h[j] * (1 - y[k]) * y[k]
w[j][k] -= lr * grad[j][k]
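A minimal numeric sketch of this first form for a single weight, treating grad[j][k] as the gradient of the error (so gradient descent subtracts lr * grad[j][k]). All the values here are made up for illustration, and a sigmoid output activation is assumed:

```python
# Illustrative values -- lr, h_j, t_k, y_k, and the initial w_jk are made up.
lr = 0.1     # learning rate
h_j = 0.8    # hidden node j activation
t_k = 1.0    # target value for output node k
y_k = 0.6    # computed output node k value (sigmoid output assumed)

# Gradient of mean squared error with respect to w[j][k]:
grad_jk = -(t_k - y_k) * h_j * (1 - y_k) * y_k   # = -0.0768
w_jk = 0.5                                        # some current weight value
w_jk -= lr * grad_jk                              # gradient descent step
```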

The second form of the update equation is:

grad[j][k] = -(t[k] - y[k]) * h[j]
w[j][k] -= lr * grad[j][k]
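The same sketch with the second form, using the same made-up values, so the two step sizes can be compared directly:

```python
# Same illustrative values as the equations use; all made up.
lr = 0.1
h_j = 0.8
t_k = 1.0
y_k = 0.6

# Gradient of cross entropy error with respect to w[j][k]:
grad_jk = -(t_k - y_k) * h_j   # no (1 - y)(y) factor; = -0.32
w_jk = 0.5
w_jk -= lr * grad_jk           # a noticeably larger step than the MSE form
```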

The second form is the same as the first but without the (1 - y[k]) * y[k] term. What?

The first update equation assumes you want to minimize the mean squared error (or, equivalently, maximize likelihood — another extremely complicated topic). The second update equation assumes you want to minimize cross entropy error.

You can find many derivations of the first update equation but not very many of the second. A good reference for the first, mean squared error version is at:

http://www.cs.cornell.edu/courses/cs5740/2016sp/resources/backprop.pdf.

An excellent reference for the second, cross entropy error version is at:

notes.pdf (a set of backpropagation notes by J.G. Makin)

In the image below, I’ve placed two key pages from each reference side by side. Note that they use different notations (which are also different from my notation).


Update for hidden-to-output weights assuming mean squared error (left) and assuming cross entropy error (right).

How can two different equations both work?

Suppose the target node value t[k] is 1 and the computed output node value y[k] is 0.6. The (1 - y[k]) * y[k] term is 0.4 * 0.6 = 0.24. Because y[k] is always between 0.0 and 1.0, the (1 - y[k]) * y[k] term is always a positive number no greater than 0.25, its maximum, which occurs when y[k] = 0.5.
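A quick check of both claims, scanning the (1 - y) * y factor over a grid of y values strictly between 0 and 1:

```python
# Verify that y*(1-y) peaks at 0.25 (at y = 0.5) and equals 0.24 at y = 0.6.
ys = [i / 100 for i in range(1, 100)]   # y values strictly between 0 and 1
factors = [(1 - y) * y for y in ys]

peak = max(factors)       # 0.25, attained at y = 0.5
at_06 = (1 - 0.6) * 0.6   # 0.24, the example in the text
```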

Therefore, for a given learning rate, the update for the first form (mean squared error) is smaller than the update for the second form (cross entropy error); the step is scaled down by the factor (1 - y[k]) * y[k], which is at most 0.25.

The first form has the advantage that the update amount is relatively larger when the computed output is "bad", meaning close to 0.5 (a relatively uninformative value), and smaller when the computed output is confident. In theory this leads to faster training. The disadvantage of the first form is that training can stall when a computed y value saturates near 0.0 or 1.0, because the (1 - y[k]) * y[k] factor nearly vanishes, even if that output is far from its target.
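To see how the (1 - y)(y) factor can shrink the mean squared error update, consider a hypothetical output that is saturated on the wrong side of its target (the values below are made up for illustration):

```python
# A saturated-but-wrong output: the target is 1 but the network outputs 0.01.
t_k, h_j = 1.0, 0.8
y_k = 0.01

mse_grad = -(t_k - y_k) * h_j * (1 - y_k) * y_k   # nearly vanishes
ce_grad = -(t_k - y_k) * h_j                       # stays large
ratio = ce_grad / mse_grad                         # = 1 / ((1 - y_k) * y_k)
```

Here the cross entropy step is over 100 times larger than the mean squared error step, which is exactly the saturation stall in question.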

In practice, both forms usually work about equally well, which is why both are still in use; if one form were clearly better, there would be only one form.

All things considered, I’m a fan of different forms of algorithms. It makes the field of computer science rich in ideas and endlessly interesting.



Four different forms of fans.

This entry was posted in Machine Learning.

1 Response to Neural Network Back-Propagation Weight Update Equation: Mean Squared Error vs. Cross Entropy Error

  1. Thorsten Kleppe says:

    For me, the first equation has not shown any benefit, and the extra calculation finally takes me to the second equation.
    But one thing is still confusing to me: if I omit the MSE or CE error, the NN prediction is the same as without an error function. Does that mean that if training doesn't stop when the error falls below some value, there is no practical use for the error function?

    Both papers are great; I like the style of J.G. Makin.
