Neural Network Glorot Initialization

I gave a talk on the fundamentals of neural networks recently. My approach is to use a combination of pictures and code. When I hit the part about initializing weights and biases, I mentioned that initialization is surprisingly important, but I didn't have time to go into details.

The most common approach for weight initialization is to use uniform random values in some small range, for example [-0.01, +0.01], but several deep learning libraries now use Glorot initialization as the default. I coded up a program to demonstrate.
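For contrast, the simple uniform approach can be sketched in a few lines. This is just an illustration (the function name and the [-0.01, +0.01] range are from the discussion above, not from my demo program):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def uniform_init(n_in, n_out, lo=-0.01, hi=0.01):
    # every weight drawn uniformly from [lo, hi],
    # regardless of how many nodes feed into the layer
    return rng.uniform(lo, hi, size=(n_in, n_out))

w = uniform_init(4, 5)  # e.g., input-to-hidden weights for 4 inputs, 5 hidden nodes
```

The weakness of this scheme is that the fixed range ignores layer size, which is exactly what Glorot initialization addresses.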

Actually, the term "Glorot initialization" is ambiguous because there are two variations. In both cases the idea is to compute a scale factor based on the architecture of the network, specifically the number of input and output connections of each layer. In one variation, that scale is used as the range limit of a uniform random distribution to generate initial weight values. In the other, it is used as the standard deviation of a normal (Gaussian) distribution. Somewhat confusingly, both variations are sometimes called "normalized initialization".

My demo program uses the Glorot uniform variation, where for each layer the initial weights are drawn uniformly from [-r, +r] with r = sqrt(6.0 / (fan-in + fan-out)). It's usual practice to leave initial bias values at 0.0.
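A minimal sketch of both variations, using NumPy (the function names are mine; this is not the code from my demo program):

```python
import math
import numpy as np

rng = np.random.default_rng(seed=1)

def glorot_uniform(fan_in, fan_out):
    # range limit r = sqrt(6 / (fan_in + fan_out)); weights ~ U[-r, +r]
    r = math.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out):
    # sd = sqrt(2 / (fan_in + fan_out)); weights ~ N(0, sd)
    sd = math.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sd, size=(fan_in, fan_out))

# a 4-7-3 network: weights per layer, biases left at 0.0
ih_wts = glorot_uniform(4, 7)     # input-to-hidden
ho_wts = glorot_uniform(7, 3)     # hidden-to-output
h_biases = np.zeros(7)
o_biases = np.zeros(3)
```

Note that both variations produce weights with the same variance, 2 / (fan-in + fan-out); they differ only in the shape of the distribution.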

The paper considered to be the original source for Glorot initialization is at: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf

The moral of the story is that when getting up to speed with deep learning, it's important to know lots of little details, but also to be able to work at higher levels of abstraction.



“Victoria Harbor”, Kwan Yeuk Pang. Interesting combination of detail and abstraction.
