I ran across a very interesting research paper titled “Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer” by G. Yang, et al.
Suppose you want to train a natural language model/network that has 1,000,000,000,000 weights that must be learned. In order to train the huge model, you must specify training parameters such as the learning rate (how quickly weights change during training), the batch size (how many training items to process at a time), the learning rate schedule (how to gradually decrease the learning rate during training), and so on. There are a huge number of combinations of training parameters, and each combination takes a lot of time (possibly days) and money (often tens or even hundreds of thousands of dollars) per attempt. It's simply not feasible to find the optimal training parameter values, so you must use your best guesses, train the huge model, and hope for a good result.
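As a concrete illustration of one of these training parameters, an exponential decay learning rate schedule multiplies the learning rate by a constant gamma after each epoch. A minimal sketch in plain Python (the initial rate and gamma values here are made up for illustration):

```python
# Exponential decay learning rate schedule: lr_t = lr_0 * gamma^t.
# The lr0 and gamma values below are illustrative, not from the paper.
def exp_decay_schedule(lr0, gamma, num_epochs):
    """Return the learning rate to use at each epoch."""
    return [lr0 * (gamma ** t) for t in range(num_epochs)]

lrs = exp_decay_schedule(lr0=0.10, gamma=0.90, num_epochs=5)
print(lrs)  # the learning rate shrinks by 10% each epoch
```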

Two pages from the research paper.
But the newly discovered technique presented in the research paper allows you to create a small version of the model with, say, 1,000,000 weights. With a neural network of this size, it's feasible to use one of several techniques to experiment and find the optimal or near-optimal set of training parameters. For example, you might find, "Optimal training parameters are to use a learning rate of 0.014, with a batch size of 512, and an exponential decay learning rate schedule with gamma of 0.9985." If the small 1,000,000-weight network has been constructed carefully, the optimal training parameters for the small network are also optimal for the expanded network that has 1,000,000,000,000 weights. You can train the huge model just once. This will still be expensive, but you can be confident you've used the best training parameters.
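The tune-small, train-large workflow can be sketched in plain Python. The function below is a stand-in for actually training the small proxy network and returning a validation loss — a real version would train the small mup-parameterized model; here a made-up scoring function keeps the sketch runnable. The candidate grid values are assumptions for illustration:

```python
import itertools

def train_small_proxy(lr, batch_size, gamma):
    """Stand-in for training the small proxy network and returning a
    validation loss. A real implementation would train the small model;
    this made-up score is minimized at lr=0.014, batch_size=512,
    gamma=0.9985 purely for demonstration."""
    return (abs(lr - 0.014)
            + abs(batch_size - 512) / 512
            + abs(gamma - 0.9985))

# Candidate training parameters to sweep on the cheap small model.
lrs = [0.001, 0.014, 0.100]
batch_sizes = [128, 512]
gammas = [0.9900, 0.9985]

best = min(itertools.product(lrs, batch_sizes, gammas),
           key=lambda cfg: train_small_proxy(*cfg))
print(best)  # -> (0.014, 512, 0.9985); reuse these for the huge model
```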
The research paper is 47 pages long and is mathematically intimidating. But the researchers have created a PyTorch implementation that is freely available at github.com/microsoft/mup. I coded up a quick demo using the documentation guidelines. First pleasant surprise: the Python package installed flawlessly. Second pleasant surprise: my demo worked the first time.
My demo sets up a small-ish 784-(200-200)-10 network for the MNIST image dataset, and trains the network successfully. The next step, which I didn't do, would be to experiment to find the optimal learning rate. Then it'd be possible to scale up the network to something like 784-(10,000-1,000)-10 and train it using the optimal learning rate.
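For a sense of the scale-up, the weight and bias counts of these fully connected networks can be computed directly. This quick count is my own arithmetic, not from the demo:

```python
def num_params(layer_sizes):
    """Total weights + biases for a fully connected network with the
    given layer sizes (each layer contributes n_in*n_out weights plus
    n_out biases)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(num_params([784, 200, 200, 10]))       # small network: 199,210
print(num_params([784, 10_000, 1_000, 10]))  # scaled up: 17,861,010
```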
Some code snippets from the demo:
import numpy as np
import matplotlib.pyplot as plt
import torch as T
from mup import MuReadout, set_base_shapes, MuSGD
self.fc1 = T.nn.Linear(784, self.width)
self.fc2 = T.nn.Linear(self.width, self.width)
self.readout = MuReadout(self.width, 10)
print("Creating 784-200-200-10 mup NN ")
base_net = Net(width=1).to(device)
delta_net = Net(width=2).to(device)
net = Net(width=200).to(device)
set_base_shapes(net, base_net, delta=delta_net)
loss_func = T.nn.CrossEntropyLoss() # does log-softmax()
optimizer = MuSGD(net.parameters(), lr=0.1)
Fascinating stuff. This research result has the potential to be hugely important.

In the days before powerful CAD software, scale models were used to design things, including Disneyland rides. Left: A model of the Pirates of the Caribbean ride being constructed in 1966. Right: A model of the Big Thunder ride. When I was a college student, I worked at Disneyland and worked on both of these rides.
