Bottom line: Hyperparameter random search can be effective, but the difficult part is determining what to parameterize and what range of values to try for each parameter.
When creating a neural network prediction model there are many architecture hyperparameters (number of hidden layers, number of nodes in each hidden layer, hidden activation function, weight initialization algorithm and its parameters, and so on). And then there are dozens of training hyperparameters (optimization algorithm, learning rate, momentum, batch size, number of training epochs, and so on).

In this demo, the best parameters were found in trial 2: 16 hidden nodes, tanh hidden activation, Adam optimization, learning rate = 0.01809, batch size = 14, and max_epochs = 799.
Most of my colleagues and I use a manual approach for finding good hyperparameters. We use our experience and intuition. It’s possible to programmatically search for good hyperparameters. Somewhat surprisingly, a random search of hyperparameter values is highly effective compared to more sophisticated techniques, grid search in particular. See “Random Search for Hyper-Parameter Optimization” (2012) by J. Bergstra and Y. Bengio.
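The Bergstra and Bengio result comes down to a trial budget argument: with a fixed number of trials, grid search tests only a few distinct values of each hyperparameter, while random search tests a new value of every hyperparameter on every trial. A minimal sketch of the idea (the budget of 9 trials and the learning rate range are made up for illustration):

```python
import numpy as np

budget = 9  # total trials, imagining two hyperparameters

# grid search: a 3x3 grid uses 9 trials but tests
# only 3 distinct learning rate values
grid_lrs = np.linspace(0.001, 0.10, 3)

# random search: the same 9 trials test 9 distinct
# learning rate values
rnd = np.random.RandomState(1)
rand_lrs = rnd.uniform(0.001, 0.10, size=budget)

print(len(set(grid_lrs.tolist())))  # 3
print(len(set(rand_lrs.tolist())))  # 9
```

If one of the two hyperparameters turns out to barely matter, the grid has effectively wasted two-thirds of its trials on duplicate values of the one that does matter.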
I put together a demo of hyperparameter random search. My demo problem is to predict a person’s political leaning (conservative, moderate, liberal) from sex, age, state, and income. In pseudo-code the idea is:
# loop n times
#   create random arch and train hyperparams
#   use arch params to create net
#   use train params to train net
#   evaluate trained net
#   log params and eval metric to file
# end-loop
# analyze log offline
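The "log params and eval metric to file" step can be as simple as appending one comma-delimited line per trial. A sketch, where the file name, helper name, and field order are my own arbitrary choices, not part of the demo:

```python
# append one comma-delimited record per search trial
def log_trial(fn, trial, params, acc, err):
  (n_hid, activation, opt, lr, bs, max_ep) = params
  line = "%d,%d,%s,%s,%0.5f,%d,%d,%0.4f,%0.4f" % \
    (trial, n_hid, activation, opt, lr, bs, max_ep, acc, err)
  with open(fn, "a") as f:
    f.write(line + "\n")

# example: log the trial 2 result from the demo
log_trial("search_log.txt", 2,
  (16, "tanh", "adam", 0.01809, 14, 799), 0.9500, 0.0312)
```

A flat comma-delimited log is easy to sort by the evaluation metric offline, in a spreadsheet or with a few lines of script.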
I used just two architecture parameters: number of hidden nodes and hidden activation function. The architecture was fixed at two hidden layers.
I used just four training parameters: optimization algorithm, learning rate, batch size, and max epochs. Here’s my demo function that generates random hyperparameters:
def create_params(seed=1):
  # n_hid, activation; opt, lr, bs, max_ep
  rnd = np.random.RandomState(seed)
  n_hid = rnd.randint(6, 21)  # [6, 20]
  activation = ['tanh', 'relu'][rnd.randint(0,2)]
  opt = ['sgd', 'adam'][rnd.randint(0,2)]
  lr = rnd.uniform(low=0.001, high=0.10)
  bs = rnd.randint(6, 16)
  max_ep = rnd.randint(200, 1000)
  return (n_hid, activation, opt, lr, bs, max_ep)
The number of hidden nodes varies from 6 to 20, the learning rate varies from 0.001 to 0.10, and so on. Where do these ranges come from? Just guesses based on experience.
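One possible refinement, not used in the demo but suggested by the Bergstra and Bengio paper, is to sample the learning rate log-uniformly rather than uniformly, so that small values like 0.001 and 0.01 are as likely to be tried as large values near 0.10. A sketch (the helper name is my own):

```python
import numpy as np

def sample_lr_log_uniform(rnd, low=0.001, high=0.10):
  # sample uniformly in log space, then exponentiate,
  # so each order of magnitude gets equal probability
  log_lr = rnd.uniform(np.log(low), np.log(high))
  return float(np.exp(log_lr))

rnd = np.random.RandomState(0)
lr = sample_lr_log_uniform(rnd)
print(lr)  # some value in [0.001, 0.10]
```

With plain `rnd.uniform(0.001, 0.10)`, roughly 90 percent of samples land above 0.01, which under-explores the small learning rates that often work best.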
There are dozens of details, such as how to evaluate a trained network.
So, hyperparameter search isn't a magic wand — you have to use experience to determine which of the hundreds of possible parameters to search, and which of the essentially infinite ranges of parameter values to use.
One of the disadvantages of random search is that you can get ugly results, such as a learning rate of 0.10243568790223344556677123. One way to deal with this issue is to round floating point values to three decimals and integer values to a multiple of 10 before trying them.
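That rounding step can be sketched in a few lines (the helper name is mine, not part of the demo):

```python
# clean up raw random hyperparameter values before use
def round_params(lr, max_ep):
  lr = round(lr, 3)  # e.g. 0.10243568... -> 0.102
  max_ep = int(round(max_ep / 10.0) * 10)  # e.g. 799 -> 800
  return (lr, max_ep)

print(round_params(0.10243568790223344556677123, 799))
# (0.102, 800)
```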

Like many of the older guys I work with, I gained a love of reading from the juvenile "Hardy Boys" mystery series. Several of the books featured a search for something, such as a treasure of some kind. Left: "The Tower Treasure" (#1, 1959 edition). Center: "Hunting for Hidden Gold" (#5, 1963 edition). Right: "The Secret of Pirates' Hill" (#36, 1956 edition). All three covers by artist Rudy Nappi (1923-2015).
Demo code.
# people_hyperparam_search.py
# predict politics type from sex, age, state, income
# PyTorch 1.12.1-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10/11

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class PeopleDataset(T.utils.data.Dataset):
  # sex  age    state    income   politics
  # -1   0.27   0 1 0    0.7610   2
  # +1   0.19   0 0 1    0.6550   0
  # sex: -1 = male, +1 = female
  # state: michigan, nebraska, oklahoma
  # politics: conservative, moderate, liberal

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,7),
      delimiter="\t", comments="#", dtype=np.float32)
    tmp_x = all_xy[:,0:6]  # cols [0,6) = [0,5]
    tmp_y = all_xy[:,6]    # 1-D
    self.x_data = T.tensor(tmp_x,
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx]
    return preds, trgts  # as a Tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self, n_hid, activation):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, n_hid)  # 6-(nh-nh)-3
    self.hid2 = T.nn.Linear(n_hid, n_hid)
    self.oupt = T.nn.Linear(n_hid, 3)

    if activation == 'tanh':
      self.activ = T.nn.Tanh()
    elif activation == 'relu':
      self.activ = T.nn.ReLU()

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = self.activ(self.hid1(x))
    z = self.activ(self.hid2(z))
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss()
    return z

# -----------------------------------------------------------

def overall_loss_3(model, ds, n_class):
  # MSE using built-in MSELoss() version
  X = ds[0:len(ds)][0]  # all X values
  Y = ds[0:len(ds)][1]  # all targets, ordinal form
  YY = T.nn.functional.one_hot(Y,
    num_classes=n_class).float()  # float for MSELoss()
  with T.no_grad():
    oupt = T.exp(model(X))  # [all,3] probs form
  loss_func = T.nn.MSELoss(reduction='sum')
  loss_val = loss_func(oupt, YY)  # a tensor
  mse = loss_val / len(ds)
  return mse  # as tensor

# -----------------------------------------------------------

def accuracy(model, dataset):
  # assumes model.eval()
  X = dataset[0:len(dataset)][0]
  # Y = T.flatten(dataset[0:len(dataset)][1])
  Y = dataset[0:len(dataset)][1]
  with T.no_grad():
    oupt = model(X)  # [N,3] logits
  # (_, arg_maxs) = T.max(oupt, dim=1)
  arg_maxs = T.argmax(oupt, dim=1)  # argmax() is new
  num_correct = T.sum(Y==arg_maxs)
  acc = (num_correct * 1.0 / len(dataset))
  return acc.item()

# -----------------------------------------------------------

def train(net, ds, opt, lr, bs, me, le):
  train_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  if opt == 'sgd':
    optimizer = T.optim.SGD(net.parameters(), lr=lr)
  elif opt == 'adam':
    optimizer = T.optim.Adam(net.parameters(), lr=lr)
  # else error

  for epoch in range(0, me):
    epoch_loss = 0.0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # inputs
      Y = batch[1]  # correct class/label/politics
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()
    if epoch % le == 0:
      print("epoch = %5d  |  loss = %10.4f" % \
        (epoch, epoch_loss))
  print("Training done ")
  return net

# -----------------------------------------------------------

def create_params(seed=1):
  # n_hid, activation; opt, lr, bs, max_ep
  rnd = np.random.RandomState(seed)
  n_hid = rnd.randint(6, 21)  # [6, 20]
  activation = ['tanh', 'relu'][rnd.randint(0,2)]
  opt = ['sgd', 'adam'][rnd.randint(0,2)]
  lr = rnd.uniform(low=0.001, high=0.10)
  bs = rnd.randint(6, 16)
  max_ep = rnd.randint(200, 1000)
  return (n_hid, activation, opt, lr, bs, max_ep)

# -----------------------------------------------------------

def search_params(ds):
  # using Dataset ds
  # loop n times
  #   create random arch and train hyperparams
  #   use arch params to create net
  #   use train params to train net
  #   evaluate trained net
  #   log params and eval metric to file
  # end-loop
  # analyze log offline

  max_trials = 6
  for i in range(max_trials):
    print("\nSearch trial " + str(i))
    (n_hid, activation, opt, lr, bs, max_ep) = \
      create_params(seed=i*2)
    print((n_hid, activation, opt, lr, bs, max_ep))
    net = Net(n_hid, activation).to(device)
    net.train()
    net = train(net, ds, opt, lr, bs, max_ep, le=200)
    net.eval()
    error = overall_loss_3(net, ds, n_class=3).item()
    acc = accuracy(net, ds)
    print("acc = %0.4f  error = %0.4f " % (acc, error))
    # log params, error, accuracy here
  return 0

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin People hyperparameter random search ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset objects
  print("\nCreating People Datasets ")
  train_file = ".\\Data\\people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows
  test_file = ".\\Data\\people_test.txt"
  test_ds = PeopleDataset(test_file)  # 40 rows

  # 2. search for good hyperparameters
  search_params(train_ds)

  print("\nEnd People hyperparameter random search ")

# -----------------------------------------------------------

if __name__ == "__main__":
  main()