Some Thoughts About Neural Architecture Search (NAS)

I work at a large tech company. I watched a talk recently about a neural architecture search (NAS) system that some of my colleagues have developed.

When a neural network is designed, there are many hyperparameter values that must be specified. For a standard neural classifier the primary hyperparameters are:

Architecture:
number of hidden layers
number of nodes in each hidden layer
activation function for each layer
weight and bias initialization technique

Training:
batch size
max epochs
optimization algorithm (SGD, Adam, etc.)
optimization learning rate
regularization and weight decay rate
momentum

There are other hyperparameters too, such as the loss function (mean squared error vs. cross entropy) and the location and rate of any dropout layers.

The number of combinations of architecture and training hyperparameter values is, literally, infinite, because values such as the learning rate are continuous. Picking good hyperparameter values is more art than science and is based on experience and intuition.
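To get a feel for the scale, here is a minimal sketch that counts the configurations in a coarse grid over the hyperparameters listed above. Every candidate value in SEARCH_SPACE is an assumption made up for illustration, and the sketch assumes all hidden layers share one node count. Even this small grid yields 62,208 configurations, each of which would need its own training run.

```python
import math

# Hypothetical discretized search space. Every value list is an
# illustrative assumption, not a recommendation.
SEARCH_SPACE = {
    "num_hidden_layers": [1, 2, 3, 4],
    "nodes_per_layer": [16, 32, 64, 128],   # assumes same width for every layer
    "activation": ["relu", "tanh", "sigmoid"],
    "weight_init": ["xavier", "kaiming", "uniform"],
    "batch_size": [8, 16, 32, 64],
    "max_epochs": [100, 500, 1000],
    "optimizer": ["sgd", "adam"],
    "learning_rate": [0.001, 0.01, 0.1],
    "weight_decay": [0.0, 1e-4, 1e-3],
    "momentum": [0.0, 0.9],
}


def count_configurations(space):
    """Return the number of distinct configurations in a fully discretized grid."""
    return math.prod(len(choices) for choices in space.values())


if __name__ == "__main__":
    print(count_configurations(SEARCH_SPACE))
```

Adding one more candidate value to any single hyperparameter multiplies the total, which is why exhaustive grid search stops being practical very quickly.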

One of the dozens of NAS systems that have been rolled out. This one, called Model Search, was released and then, as far as I can tell, promptly abandoned. I won’t be investing my time and effort into learning any complex NAS system until I have confidence that the system will have long-term support.

Early training hyperparameter optimization approaches used some form of genetic algorithm (GA). The problem is that GA optimization for training hyperparameters is very time consuming.
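A minimal sketch of the idea, to show where the time goes. The candidate values in CHOICES and the toy_fitness function are hypothetical stand-ins. In a real system each fitness evaluation would be a full training run, so the evaluation count returned by evolve() is effectively the number of networks that must be trained. Selection here is simple truncation (keep the top half), which is just one of several possible GA selection schemes.

```python
import random

# Hypothetical discrete choices for three training hyperparameters.
CHOICES = {
    "batch_size": [8, 16, 32, 64],
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1],
    "momentum": [0.0, 0.5, 0.9],
}


def toy_fitness(cfg):
    """Stand-in for 'train a network and return validation accuracy'.

    The formula is invented for illustration only; higher is better.
    """
    return -(abs(cfg["learning_rate"] - 0.01) * 10
             + abs(cfg["batch_size"] - 16) / 64
             + abs(cfg["momentum"] - 0.9))


def random_config(rng):
    return {name: rng.choice(values) for name, values in CHOICES.items()}


def crossover(a, b, rng):
    # Each gene is inherited from one of the two parents at random.
    return {name: rng.choice([a[name], b[name]]) for name in CHOICES}


def mutate(cfg, rng, rate=0.2):
    # With probability `rate`, replace a gene with a random candidate value.
    return {name: (rng.choice(CHOICES[name]) if rng.random() < rate else value)
            for name, value in cfg.items()}


def evolve(pop_size=12, generations=10, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    evaluations = 0
    for _ in range(generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        evaluations += len(population)       # one "training run" per member
        parents = ranked[: pop_size // 2]    # truncation selection, parents survive
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=toy_fitness), evaluations


if __name__ == "__main__":
    best, n = evolve()
    print(best, n)
```

With the default settings the loop performs 120 fitness evaluations. If each one were a training run of even a few minutes, this tiny search would already take hours.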

Architecture hyperparameter optimization is arguably even more complicated than training hyperparameter optimization because it’s a combinatorial optimization problem. And architecture for convolutional neural networks is even more complicated: convolution layers, pooling layers, batch normalization layers, dropout layers, and more.

Another NAS system (“Petridish”) that was released and then immediately abandoned. Sigh.

There’s been a lot of recent research effort on architecture optimization, often called neural architecture search (NAS). I’m mildly skeptical that completely programmatic approaches to NAS will lead anywhere. On the other hand, I suspect that programmatic techniques can be useful for fine-tuning designs that have been created using human expertise.

Any architecture optimization system/platform will be complex, which means it will inevitably have an extremely steep learning curve. Furthermore, such systems will require a large operational support effort, an area where machine learning has consistently failed miserably. The ML landscape is littered with helper systems like NAS that were created, maintained for a few months, and then abandoned.

Machine learning researchers aren’t the only ones who have short attention spans.


1 Response to Some Thoughts About Neural Architecture Search (NAS)

Thorsten Kleppe says:

June 14, 2022 at 4:07 am

What you are saying sounds very understandable. In a way, that’s also quite good, because otherwise there would only be machines building machines at some point.

Nevertheless, you could also simply use your experience to thin out the hyperparameters. Limit the number of layers in the NN to 2-3. With 2 layers you can quickly achieve quite good results, but in my experience it is more worthwhile to use 3 layers. Simply limit the activation function to ReLU, which gives top results (see benchmarks).

This reminds me of the Roulette Wheel Selection algorithm. In the first rounds you test more network configurations on less data. In later rounds, the winners are tested more and more intensively on the data, and more bad networks are thrown out.
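The round-based scheme described above is close to what is usually called successive halving, rather than roulette wheel selection in the GA sense. Here is a minimal sketch under stated assumptions: noisy_score is a hypothetical stand-in for "train briefly and measure accuracy", the candidates are represented only by a made-up true quality number, and the budget doubles each round while half of the candidates are dropped.

```python
import random


def noisy_score(quality, budget, rng):
    """Hypothetical evaluation: true quality plus noise that shrinks as budget grows."""
    return quality + rng.gauss(0, 1.0 / budget ** 0.5)


def successive_halving(qualities, start_budget=10, seed=0):
    """Keep the better half each round while doubling the per-candidate budget.

    Returns the index of the surviving candidate and a history of
    (budget used, survivors remaining) for each round.
    """
    rng = random.Random(seed)
    survivors = list(range(len(qualities)))
    budget = start_budget
    history = []
    while len(survivors) > 1:
        scores = {i: noisy_score(qualities[i], budget, rng) for i in survivors}
        keep = max(1, len(survivors) // 2)
        survivors = sorted(survivors, key=scores.get, reverse=True)[:keep]
        history.append((budget, len(survivors)))
        budget *= 2
    return survivors[0], history


if __name__ == "__main__":
    rng = random.Random(1)
    candidates = [rng.random() for _ in range(16)]  # made-up "true" qualities
    winner, history = successive_halving(candidates)
    print(winner, history)
```

Most of the total budget ends up being spent on the few candidates that survive the cheap early rounds, which is the point of the approach.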

During training, many more hyperparameters can be taken out. Batch size can be replaced by Auto-Batch, which backpropagates only the wrongly predicted examples. SGD has proven itself with many top results, and there are good reasons not to use Adam, so just stay with SGD. I would also leave out regularization, weight decay, and momentum for now and use input dropout instead. (https://github.com/grensen/easy_regression#fully-trained)

This makes it much easier to find nice results. It seems to me that it would be best to always give the search a direction to keep the possibilities small.

It would be great to see what your NAS concept would look like if you think it’s worth working on.
