The Difference Between Neural Network L2 Regularization and Weight Decay

It’s correct to say that neural network L2 regularization and weight decay are the same thing, but it’s also correct to say they accomplish the same thing in slightly different ways. Let me explain. I’ll start with L2 regularization.

L2 regularization is a technique used to reduce the likelihood of neural network model overfitting. Overfitting often occurs when you train a neural network for too long. The trained model predicts very well on the training data (often nearly 100% accuracy), but when presented with new data the model predicts poorly.

Neural networks that have been over-trained are often characterized by having weights that are large. A good NN might have weight values that range between -5.0 and +5.0, but an NN that is overfitted might have some weight values such as 25.0 or -32.0. So, one approach for discouraging overfitting is to prevent weight values from getting large in magnitude.

L2 regularization does this by theoretically adding a term to the underlying error function. The term penalizes weight values: larger weights produce larger error during training. For example, the regular mean squared error gets an additional term that is a fraction (lambda divided by 2) of the sum of the squared weight values, so the regularized error is E = MSE + (lambda / 2) * sum(w^2). Lambda is usually something like 0.005. Mathematically, this leads to a change in the weight gradients, which in turn leads to a change in the weight delta values that are used to update the weights, which in turn reduces the value of each weight on each training iteration. Clever.
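To make the idea concrete, here is a minimal sketch of a single-weight update with the L2 penalty folded into the gradient. All the names (l2_update, grad_mse) and the learning rate value are illustrative, not from any particular library:

```python
# Regularized error: E = MSE + (lam / 2) * sum(w^2),
# so dE/dw gains an extra "lam * w" term on top of the MSE gradient.

lam = 0.005   # regularization strength (the lambda mentioned above)
lr = 0.01     # learning rate (illustrative value)

def l2_update(wt, grad_mse):
    """One gradient-descent step with the L2 penalty term added."""
    grad = grad_mse + lam * wt   # derivative of (lam/2) * wt^2 is lam * wt
    return wt - lr * grad

# Even when the data gradient is zero, a large weight is pulled toward zero:
w = 25.0
w = l2_update(w, grad_mse=0.0)   # slightly smaller than 25.0
```

Note that the penalty term vanishes as a weight approaches zero, so small weights are barely affected while large weights are pushed down hard.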

The L2 regularization technique for neural networks was worked out by researchers in the 1990s. But at the same time, engineers, working independently of researchers, noticed that if you simply decrease the value of each weight on each training iteration, you get an improved trained model that isn’t as likely to be overfitted. For example, the code might look like:

  // determine weight delta using back-propagation
  wt = wt + wt_delta  // update
  wt = wt * 0.98      // or wt = wt - (0.02 * wt) 

The weight is updated as usual and then multiplied by 0.98 which reduces the value by 2% on each iteration. Engineers called this ad hoc technique weight decay.

So, researchers and engineers came up with the same idea. Researchers worked out the idea using mathematics, and engineers worked out the idea based on experience. L2 regularization reduces the magnitudes of neural network weights during training, and so does weight decay. The L2 approach has a solid underlying theory but is complicated to implement. The weight decay approach “just works” and is simple to implement.
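A short sketch can show why the two techniques coincide for plain gradient descent. The function names and constants below are illustrative. If the decay multiplication is applied before the gradient update, the two updates are algebraically identical; applying the shrink after the update, as in the snippet above, differs only by a tiny second-order term:

```python
lam = 0.005   # the lambda from the L2 error term
lr = 0.01     # learning rate (illustrative value)

def step_l2(wt, grad):
    """SGD step where the L2 penalty adds lam * wt to the gradient."""
    return wt - lr * (grad + lam * wt)

def step_decay(wt, grad):
    """Weight-decay step: shrink the weight, then apply the usual update."""
    wt = wt * (1.0 - lr * lam)
    return wt - lr * grad

a = step_l2(3.0, 0.5)
b = step_decay(3.0, 0.5)
# a and b agree (up to floating-point noise)
```

Expanding step_decay algebraically gives wt - lr * lam * wt - lr * grad, which is exactly the step_l2 update, so the multiplicative decay factor corresponds to (1 - lr * lambda).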



I’ve never been to a fashion show, and I don’t know anything about fashion (believe me!) but I imagine that there’s continuous pressure on models to reduce their weight. However, as the grotesque model on the right illustrates, there are exceptions to every rule.

This entry was posted in Machine Learning.

1 Response to The Difference Between Neural Network L2 Regularization and Weight Decay

  1. Peter Boos says:

    Interesting. I was wondering: lichess.org uses an (alternative) Elo rating system that is able to adapt (respond) more quickly to skill progress. That’s also a nearing function, but a solution that stays more flexible over time while remaining a nearing function. I wonder if that too can be used in a neural net. (Would we get more flexible neural nets then, maybe?)
