Hyperparameter Search Using Evolutionary Optimization for a PyTorch Multi-Class Classifier

Neural networks can create powerful prediction systems. But NNs have a lot of hyperparameters that must be specified, and the values of those hyperparameters have a huge effect on the effectiveness of a NN. Architecture hyperparameters include the number of hidden layers, the number of nodes in each hidden layer, the presence or absence of dropout layers (and the associated dropout rates), the hidden layer activation function, the weight and bias initialization algorithm, and so on. Training hyperparameters include batch size, optimization algorithm, learning rate, and so on.

The number of combinations of hyperparameter values is, quite literally, infinite. When machine learning engineers develop a neural system, they start by manually experimenting with hyperparameter values, using experience and intuition as a guide. But in a production environment, searching for good hyperparameter values is often done programmatically. The two most common techniques are grid search and random search.
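To make the distinction concrete, here is a minimal sketch of the two common programmatic techniques, using hypothetical candidate lists (not the ones in the demo below): grid search tries every combination, random search samples a fixed budget of combinations.

```python
# grid search vs. random search over two hypothetical hyperparameters
import itertools
import random

lrn_rates = [0.001, 0.01, 0.10]
bat_sizes = [4, 10, 20]

# grid search: evaluate every (learning rate, batch size) pair
grid = list(itertools.product(lrn_rates, bat_sizes))  # 3 * 3 = 9 pairs

# random search: evaluate a fixed budget of randomly chosen pairs
random.seed(1)
rand = [(random.choice(lrn_rates), random.choice(bat_sizes))
        for _ in range(4)]
```

Grid search cost explodes combinatorially as hyperparameters are added; random search keeps the budget fixed but may miss good regions.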

But I’ve always been a fan of using evolutionary optimization to search for good hyperparameter values. I put together a demo. I used one of my standard synthetic datasets where the goal is to predict a person’s political leaning (conservative = 0, moderate = 1, liberal = 2) based on sex (male = -1, female = +1), age (divided by 100), State (Michigan = 100, Nebraska = 010, Oklahoma =001), and income (divided by $100,000).

It’s not feasible to parameterize everything in a NN. For architecture, I set the number of hidden layers to 2 and parameterized the number of nodes in each hidden layer (2 to 38) and the hidden activation function (tanh() or relu()). For training, I parameterized batch size, learning rate, maximum epochs, and optimizer (SGD or Adam).

I encoded a set of hyperparameter values as an integer array with 6 cells, where each cell has value 0 to 9. For example, in my demo, a possible set of encoded hyperparameters looks like: [5 1 9 3 9 7]. This represents 22 hidden nodes, tanh() activation, batch size = 20, learning rate = 0.01, 1000 max epochs, and Adam optimization.
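The decoding from integer cells to hyperparameter values can be sketched as a standalone function. This mirrors the mapping used in the demo's evaluate() function below:

```python
# decode a 6-cell integer solution into hyperparameter values,
# using the same mapping as the demo's evaluate() function
def decode(soln):
  n_hid = (soln[0] * 4) + 2                  # 2, 6, .. 38
  activ = 'tanh' if soln[1] <= 4 else 'relu'
  bs = (2 * soln[2]) + 2                     # 2, 4, .. 20
  rates = [0.001, 0.005, 0.008, 0.01, 0.02, 0.03, 0.05,
           0.08, 0.10, 0.12]
  lr = rates[soln[3]]
  me = (100 * soln[4]) + 100                 # 100, 200, .. 1000
  opt = 'sgd' if soln[5] <= 4 else 'adam'
  return (n_hid, activ, bs, lr, me, opt)

print(decode([5, 1, 9, 3, 9, 7]))
# (22, 'tanh', 20, 0.01, 1000, 'adam')
```

Note that each cell maps 10 integer values onto a small discrete set, so the effective search space is 10^6 = 1,000,000 encoded solutions (with some duplicates, because cells 1 and 5 map two ranges onto two values).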

In very high level pseudo-code, evolutionary optimization is:

create a population of n random possible solutions
loop g generations
  select 2 parents from population
  create a new child solution from parents
  mutate child solution
  check if child is new best solution
  replace a weak soln in population with child
end-loop
return best solution found
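The pseudo-code above can be sketched as a compact, generic optimizer. To keep the sketch self-contained, it minimizes a toy error function (the sum of the cells, so all-zeros is ideal) instead of training a network; the demo below applies this same skeleton to hyperparameter search.

```python
# minimal evolutionary optimization over integer arrays, following
# the pseudo-code: select parents, crossover, mutate, replace weak
import numpy as np

def evolve(err_fn, dim=6, pop_size=8, max_gen=20, seed=1):
  rnd = np.random.RandomState(seed)
  # create a population of random solutions, sorted low-to-high error
  pop = []
  for _ in range(pop_size):
    soln = rnd.randint(0, 10, size=dim)
    pop.append((soln, err_fn(soln)))
  pop.sort(key=lambda t: t[1])
  best_soln, best_err = pop[0][0].copy(), pop[0][1]

  for _ in range(max_gen):
    i = rnd.randint(0, pop_size // 2)         # a good parent
    j = rnd.randint(pop_size // 2, pop_size)  # a weaker parent
    # crossover at the middle index
    child = np.concatenate((pop[i][0][:dim//2], pop[j][0][dim//2:]))
    for k in range(dim):                      # mutate
      if rnd.random_sample() < 0.5:
        child[k] = rnd.randint(0, 10)
    child_err = err_fn(child)
    if child_err < best_err:                  # new best?
      best_soln, best_err = child.copy(), child_err
    # replace a weak solution with the child, re-sort
    pop[rnd.randint(pop_size // 2, pop_size)] = (child, child_err)
    pop.sort(key=lambda t: t[1])
  return best_soln, best_err

soln, err = evolve(lambda s: int(np.sum(s)))
```

In the hyperparameter-search setting, err_fn is the expensive part: each evaluation trains a full network.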

There are many design choices for evolutionary optimization, such as how to select the two parents, how to combine parents to create a child, how to mutate, and so on. One obvious enhancement is to maintain a Dictionary of solutions/hyperparameter sets that have already been evaluated, in order to prevent duplicated effort. Another easy design option to explore is the child-creation mechanism: select a random crossover index instead of always using the middle index.
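Both enhancements can be sketched in a few lines. The toy err_fn and the parent arrays here are stand-ins; in the demo, evaluation means training a network:

```python
# two easy enhancements: a cache of evaluated solutions, and
# crossover at a random cut point instead of the middle index
import numpy as np

cache = {}  # key = encoded solution as a tuple, value = error

def cached_evaluate(soln, err_fn):
  key = tuple(int(c) for c in soln)  # arrays aren't hashable
  if key not in cache:
    cache[key] = err_fn(soln)        # evaluate only if unseen
  return cache[key]

def crossover(parent1, parent2, rnd):
  ix = rnd.randint(1, len(parent1))  # random cut point in 1..dim-1
  return np.concatenate((parent1[:ix], parent2[ix:]))

rnd = np.random.RandomState(1)
p1 = np.array([5, 1, 9, 3, 9, 7])
p2 = np.array([0, 2, 4, 6, 8, 1])
child = crossover(p1, p2, rnd)
```

The cache matters because evaluating a duplicate solution means re-training an identical network, which is by far the most expensive step.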

Hyperparameter search using evolutionary optimization isn’t magic, but for the type of work I do, EO is often highly effective. And super interesting!



I’ve always been fascinated by the history of the evolution of submarines. As a young man I built many scale models and submarines were one of my favorite things to draw.

Left: The USS Plunger (SS-2, “submersible ship #2”) was commissioned in 1903. It was arguably the first (barely) practical, modern submarine. It was 64 feet long, had a crew of 7, a single torpedo tube, and could submerge to approximately 75 feet.

Right: The USS Plunger (SSN-595) was commissioned in 1962. It was a nuclear powered attack submarine with fearsome capabilities. It was 278 feet long, had a crew of 100, four torpedo tubes, and could submerge to approximately 1,300 feet.


Demo code below. Not thoroughly tested; do not use for production. Training and test data at https://jamesmccaffreyblog.com/2022/09/01/multi-class-classification-using-pytorch-1-12-1-on-windows-10-11/.

# people_evo_hyperparameter.py

# PyTorch 2.0.1-CPU Anaconda3-2022.10  Python 3.9.13
# Windows 10/11 

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class PeopleDataset(T.utils.data.Dataset):
  # sex  age    state    income   politics
  # -1   0.27   0  1  0   0.7610   2
  # +1   0.19   0  0  1   0.6550   0
  # sex: -1 = male, +1 = female
  # state: michigan, nebraska, oklahoma
  # politics: conservative, moderate, liberal

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,7),
      delimiter="\t", comments="#", dtype=np.float32)
    tmp_x = all_xy[:,0:6]   # cols [0,6) = [0,5]
    tmp_y = all_xy[:,6]     # 1-D

    self.x_data = T.tensor(tmp_x, 
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx] 
    return preds, trgts  # as a Tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self, n_hid, activ='tanh'):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, n_hid)  # 6-(nh-nh)-3
    self.hid2 = T.nn.Linear(n_hid, n_hid)
    self.oupt = T.nn.Linear(n_hid, 3)

    if activ == 'tanh':
      self.activ = T.nn.Tanh()
    elif activ == 'relu':
      self.activ = T.nn.ReLU()

    # use default weight init

  def forward(self, x):
    z = self.activ(self.hid1(x))
    z = self.activ(self.hid2(z)) 
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z

# -----------------------------------------------------------

def accuracy_quick(model, dataset):
  # assumes model.eval()
  X = dataset[0:len(dataset)][0]
  Y = dataset[0:len(dataset)][1]
  with T.no_grad():
    oupt = model(X)  # [n,3] log-probabilities
  arg_maxs = T.argmax(oupt, dim=1)  # predicted class indices
  num_correct = T.sum(Y==arg_maxs)
  acc = (num_correct * 1.0 / len(dataset))
  return acc.item()

# -----------------------------------------------------------

def train(net, ds, bs, lr, me, opt='sgd', verbose=False):
  # dataset, bat_size, lrn_rate, max_epochs, optimizer
  v = verbose
  train_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.NLLLoss()  # log_softmax() activation
  if opt == 'sgd':
    optimizer = T.optim.SGD(net.parameters(), lr=lr)
  elif opt == 'adam':
    optimizer = T.optim.Adam(net.parameters(), lr=lr)  

  if v: print("\nStarting training ")
  le = me // 4  # log interval: 4 log prints
  for epoch in range(0, me):
    epoch_loss = 0.0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # inputs
      Y = batch[1]  # correct class/label/politics

      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if v:
      if epoch % le == 0:
        print("epoch = %5d  |  loss = %10.4f" % \
          (epoch, epoch_loss)) 
  if v: print("Done ") 

# -----------------------------------------------------------

def evaluate(soln, trn_ds, tst_ds, verbose=False):
  # [n_hid, activ, bs, lr, me, opt]
  #   [0]    [1]   [2] [3] [4] [5]

  v = verbose

  # map soln cell values to hyperparameter values
  n_hid = (soln[0] * 4) + 2  # 2, 6, . . 38

  if soln[1] <= 4: activ = 'tanh'
  else: activ = 'relu'

  bs = int( (2 * soln[2]) + 2 )  # 2, 4, . . 20

  rates = [0.001, 0.005, 0.008, 0.01, 0.02, 0.03, 0.05,
    0.08, 0.10, 0.12]
  lr = rates[soln[3]]

  me = (100 * soln[4]) + 100  # 100, 200, . . 1000

  if soln[5] <= 4: opt = 'sgd'
  else: opt = 'adam'

  T.manual_seed(1)  # prepare
  np.random.seed(1)

  net = Net(n_hid, activ).to(device)  # create

  net.train()
  train(net, trn_ds, bs, lr, me, opt, verbose)  # train

  net.eval()
  acc_train = accuracy_quick(net, trn_ds)  # evaluate
  acc_test = accuracy_quick(net, tst_ds) 
  acc_avg = (acc_train + acc_test) / 2
  error = 1.0 - acc_avg  # [0.0, 1.0]
  if v: print("train acc = %0.4f " % acc_train)
  if v: print("test_acc = %0.4f " % acc_test)
  return error

# -----------------------------------------------------------

def show_soln_to_hyperparams(soln):
  n_hid = (soln[0] * 4) + 2
  if soln[1] <= 4: activ = 'tanh'
  else: activ = 'relu'
  bs = int( (2 * soln[2]) + 2 )  # 2, 4, . . 20
  rates = [0.001, 0.005, 0.008, 0.01, 0.02, 0.03, 0.05,
    0.08, 0.10, 0.12]
  lr = rates[soln[3]]
  me = (100 * soln[4]) + 100  # 100, 200, . . 1000
  if soln[5] <= 4: opt = 'sgd'
  else: opt = 'adam'

  print("num hidden nodes = " + str(n_hid))
  print("hidden activation = " + str(activ))
  print("batch size = " + str(bs))
  print("learn rate = %0.4f " % lr)
  print("max epochs = " + str(me))
  print("optimizer = " + str(opt))

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin People politics evolutionary parameter search ")
  T.manual_seed(1)  # is reset in evaluate()
  np.random.seed(1)  
  rnd = np.random.RandomState(1)  # controls initial population
  
  # 1. create Dataset objects
  print("\nCreating People Datasets ")

  train_file = ".\\Data\\people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows

  test_file = ".\\Data\\people_test.txt"
  test_ds = PeopleDataset(test_file)    # 40 rows

  # 2. create population of possible solutions (hyperparams)
  pop_size = 8
  dim = 6
  print("\nCreating population of " + \
    str(pop_size) + " possible solns ")
  pop = []  # list of tuples, tuple is (np arr, float)
  for i in range(pop_size):  # soln-err / set of hyperparams
    soln = rnd.randint(low=0, high=10, size=dim, dtype=int)
    err = evaluate(soln, train_ds, test_ds, verbose=False)
    pop.append((soln, err))
  pop = sorted(pop, key=lambda tup:tup[1])  # low to hi

  # 3. find best set of hyperparams
  best_soln = pop[0][0].copy()
  best_err = pop[0][1]

  print("\nBest initial soln: ")
  print(best_soln)
  print("Best initial error = %0.4f " % best_err)
  print("Best initial avg acc = %0.4f " % (1-best_err))

# -----------------------------------------------------------

  # 4. evolve
  print("\nBegin evolution ")
  max_gen = 7
  for gen in range(max_gen):
    print("\ngeneration = " + str(gen))
    # 4a. pick two parents and make a child
    first = rnd.randint(0, pop_size // 2)  # good one
    second = rnd.randint(pop_size // 2, pop_size)  # weaker
    flip = rnd.randint(2)  # 0 or 1
    if flip == 0:
      parent_idxs = (first, second)
    else:
      parent_idxs = (second, first)

    # 4b. create child
    child_soln = np.zeros(dim, dtype=int)
    i = parent_idxs[0]; j = parent_idxs[1]
    parent1 = pop[i][0]
    parent2 = pop[j][0]
    for k in range(0, dim // 2):  # left half
      child_soln[k] = parent1[k]
    for k in range(dim // 2, dim):  # right half
      child_soln[k] = parent2[k]

    # 4c. mutate child
    mutate_prob = 0.5
    for k in range(dim):
      q = rnd.random()  # in [0.0, 1.0)
      if q < mutate_prob:
        child_soln[k] = rnd.randint(0, 10)
    child_err = evaluate(child_soln, train_ds, test_ds,
      verbose=True)
    print(child_soln)
    print("%0.4f " % child_err)

# -----------------------------------------------------------

    # 4d. is child new best soln?
    if child_err < best_err: 
      print("New best soln found at gen " + str(gen))
      best_soln = child_soln.copy()
      best_err = child_err
    else:
      # print("No improvement at gen " + str(gen))
      pass

    # 4e. replace weak pop soln with child
    idx = rnd.randint(pop_size // 2, pop_size)
    pop[idx] = (child_soln, child_err)  # Tuple

    # 4f. sort solns from best to worst
    pop = sorted(pop, key=lambda tup:tup[1]) 

  print("\nEnd evolution ")

# -----------------------------------------------------------

  # 5. show best hyperparameters found
  # best_soln = pop[0][0].copy()
  # best_err = pop[0][1]

  print("\nFinal best soln found: ")
  print(best_soln)
  print("\nFinal best hyperparameters found: ")
  show_soln_to_hyperparams(best_soln)

  print("\nFinal best error = %0.4f " % best_err)
  print("Final best avg acc = %0.4f " % (1-best_err))

  print("\nEnd evolutionary parameter search ")

if __name__ == "__main__":
  main()