Why L2 Regularization and Weight Decay Are the Same When Using SGD

It’s fairly well known that when training a machine learning model (in particular, a neural network) using plain vanilla stochastic gradient descent (but not using SGD variations such as momentum or Adam), L2 regularization and weight decay regularization are essentially equivalent mathematically.

But the details are headache-inducingly complicated. Years ago, before PyTorch existed, when I had to implement neural networks from scratch, I spent many hours investigating L2 regularization and weight decay. I eventually came to the empirical conclusion that the two techniques produce the same effect of limiting the magnitude of the model weights, which in turn discourages model overfitting.

L2 regularization starts with the assumption that the error function (here mean squared error, but the principle is the same for cross entropy error) includes a weight penalty term. When you take the calculus gradient of this error function, the weight update equation (for the hidden-to-output layer weights) looks like:

  w' = w − η * (∂E/∂w + λ * w)
     = w − η * ∂E/∂w − (η * λ) * w

Here Greek lambda is a constant that controls how much squared weight values are penalized in the error function. However, if you start with the assumption that the error function does not have a weight penalty, then you can get the same weight update by simply tacking on a weight decay term:
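To see where the extra term in the update comes from, here is a short Python check (not from the original post; the values of lambda and w are made up for illustration). It verifies, via a finite-difference approximation, that the L2 penalty (λ/2) * w² contributes exactly λ * w to the gradient:

```python
# Illustration only: the L2 penalty term (lam / 2) * w^2 contributes
# lam * w to the gradient of the error function. We check the analytic
# derivative against a central finite-difference estimate.

lam = 0.01   # penalty coefficient lambda (made-up value)
w = 0.8      # some weight value (made-up value)

def penalty(x):
    # the L2 weight penalty added to the error function
    return 0.5 * lam * x * x

h = 1e-6
numeric_grad = (penalty(w + h) - penalty(w - h)) / (2 * h)
analytic_grad = lam * w

# the finite-difference estimate matches lam * w
assert abs(numeric_grad - analytic_grad) < 1e-8
```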

  w' = w − η * ∂E/∂w − α * w

Here Greek alpha is the weight decay constant. Comparing the two update equations, they are identical when α = η * λ. If you stare at the equations for a long time, you can grasp what’s going on, but for simplicity I have left out literally dozens of details that muddy the explanation.
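To make the equivalence concrete, here is a minimal Python sketch (not from the original post; the learning rate, lambda, weight, and gradient values are made up for illustration). It performs one plain SGD step both ways and confirms the results match when alpha = eta * lambda:

```python
# Demonstration: with plain vanilla SGD, adding an L2 penalty to the error
# produces the same weight update as tacking on a weight decay term,
# provided alpha = eta * lam.

eta = 0.1          # learning rate (made-up value)
lam = 0.01         # L2 penalty coefficient lambda (made-up value)
alpha = eta * lam  # the equivalent weight decay constant

w = 0.8            # some weight value
grad = 0.35        # gradient of the un-penalized error with respect to w

# L2 regularization: the penalty adds lam * w to the gradient
w_l2 = w - eta * (grad + lam * w)

# weight decay: plain gradient step, then shrink the weight directly
w_wd = w - eta * grad - alpha * w

# the two updated weights are identical
assert abs(w_l2 - w_wd) < 1e-12
```

Note that the equivalence breaks if the gradient is rescaled per-parameter, as in Adam, which is exactly the point of the decoupled weight decay paper discussed below.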

There are several research papers that you can find on the Internet that show L2 regularization and weight decay are equivalent using a formal math approach. One commonly cited paper is “Decoupled Weight Decay Regularization” (2019) by I. Loshchilov and F. Hutter:



The math is rather intimidating, but after I stared at it for a long time, I grasped (mostly) what is going on.



L2 regularization and weight decay look very similar and do essentially the same thing.

Left: The rocket ship from “Project Moonbase” (1953) looks quite a bit like a bullet with fins.

Center: The rocket ship from “Cat Women of the Moon” (1953). I always wondered why this rocket ship and the one from “Project Moonbase” appear to be identical. After some Internet investigation, it turns out that to save costs, the two movies were filmed simultaneously, in collaboration, by two different producers, using the same sets, props, and costumes.

Right: Over 60 years later, in 2019, SpaceX unveiled a Hopper rocket ship prototype that looks very much like the two rocket ships from 1953.


This entry was posted in Machine Learning.