I’ve been doing a deep dive into the nuances and quirks of the PyTorch neural network code library. A few years ago, before the availability of stable libraries like PyTorch, TensorFlow and Keras, if you wanted to create a neural network, you’d have to write code from scratch using a language like C/C++, C#, Java, or Python.
When implementing a neural network from scratch, engineers and scientists would use fundamental math principles. For a multi-class classifier, this meant encoding the class label (dependent variable) using one-hot encoding, applying softmax activation on the output nodes, and using mean squared error during back-propagation training.
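The one-hot encoding step can be sketched in a few lines. This is a minimal illustration using hypothetical labels (0 = conservative, 1 = moderate, 2 = liberal), not code from the demo program:

```python
import numpy as np

# hypothetical ordinal class labels for three people
labels = np.array([2, 0, 1])

# one-hot encode: each label becomes a vector with a single 1
num_classes = 3
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot)
# label 2 becomes [0, 0, 1], label 0 becomes [1, 0, 0], and so on
```

Indexing into an identity matrix is a common trick; an explicit loop that sets `one_hot[i, labels[i]] = 1` would work just as well.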
For example, suppose you want to predict a person’s political leaning (conservative, moderate, liberal) from sex, age, region, and income. Your training data would look like:
# sex, age, region, income, politic
-1, 0.35, 1, 0, 0, 0.5500, 0, 0, 1
 1, 0.29, 0, 0, 1, 0.6200, 0, 1, 0
. . .
Your neural network would have three output nodes, and you would apply softmax activation on them in the feed-forward function so that the output values would sum to 1. So, during training, for one data item, the computed outputs would be like (0.25, 0.70, 0.05) and the corresponding targets from the training data would be like (0, 1, 0). You would use mean squared error between computeds and targets to train.
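The softmax-plus-MSE computation for a single data item can be sketched with NumPy. The raw output-node values here are hypothetical, chosen only to illustrate the calculation:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# hypothetical raw output-node values for one data item
logits = np.array([0.5, 1.6, -1.0])
probs = softmax(logits)              # values sum to 1.0
target = np.array([0.0, 1.0, 0.0])  # one-hot target from training data

mse = np.mean((probs - target) ** 2)
```

Here the middle node gets the largest probability, matching the one-hot target, so the mean squared error is small.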
Now, fast forward to the present. With PyTorch, to do multi-class classification, you encode the class labels using ordinal encoding (0, 1, 2, ...), you don’t explicitly apply any output activation, and you use the highly specialized (and completely misnamed) CrossEntropyLoss() function.
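The new scheme can be demonstrated in isolation. This is a minimal sketch with made-up logit values; the key point is that CrossEntropyLoss() accepts raw outputs and ordinal labels, and internally applies log-softmax followed by negative log likelihood:

```python
import torch as T

loss_func = T.nn.CrossEntropyLoss()

logits = T.tensor([[0.5, 1.6, -1.0]])  # raw network outputs, no activation
target = T.tensor([1])                 # ordinal label: class 1

loss = loss_func(logits, target)

# equivalent by hand: log-softmax, then pick out the target class
log_probs = T.log_softmax(logits, dim=1)
manual = -log_probs[0, 1]
```

The two results match, which is why applying an explicit softmax in forward() before CrossEntropyLoss() is wrong: the loss function would apply log-softmax a second time.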
When I was first learning how to use PyTorch, this new scheme baffled me. As it turns out, PyTorch does some very complex manipulations for multi-class classification behind the scenes. It took many hours of experimenting with code and reading through piles of documentation before I was satisfied that I fully understood what PyTorch was doing.
Note that in order to use PyTorch, you don’t have to understand the behind-the-scenes details. All you need to know is, “use ordinal encoding on the class labels, don’t apply any activation on the output nodes in the forward() method, and use CrossEntropyLoss() as the loss function.”
Anyway, one rainy Sunday afternoon in the Pacific Northwest, I sat down and decided to code up a demo of multi-class classification using Python with the old approach: one-hot encoding of the class labels, softmax activation on the output nodes, and mean squared error for training.
My experiment was successful in the sense that I got a prediction model up and running. While I was creating my old-scheme-based experiment, I gained several insights into why the new scheme is used. The new scheme is easier because ordinal encoding is simpler than one-hot encoding, and the new scheme is more efficient for several technical reasons (you’ll have to trust me on this; it would take several pages to explain).
So, the experiment deepened my knowledge of PyTorch.
But in a way I was disappointed that the new scheme for multi-class classification was clearly better than the old one-hot, softmax, MSE scheme. The old scheme has great mathematical beauty to me, and the new scheme hides that underlying beauty. The situation reminds me of automobiles. Current cars are vastly better than cars of the 1960s in every way, but current cars don’t have any real beauty compared to older cars. Sometimes progress comes at the expense of beauty.
Left: The Ford Seattle-ite XXI (1962) was designed by Alex Tremulis and displayed on the Ford stand at the Seattle World’s Fair. The car had interchangeable fuel cell power units, computer navigation, and six wheels, four of which were driving and steering wheels. Center: The Plymouth XNR (1960) was designed by Virgil Exner. It was originally named the Plymouth Asymmetrica. Right: The AMC (American Motors Corporation, the successor to Rambler) AMX concept car (1965). A production version looked very much like this concept.
# people_politic.py
# predict politic from sex, age, region, income
# experiment with one-hot, softmax, mse
# PyTorch 1.6.0-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10
import numpy as np
import torch as T
device = T.device("cpu") # apply to Tensor or Module
# -----------------------------------------------------------
class PeopleDataset(T.utils.data.Dataset):
  # sex  age   region   income  politic
  # -1   0.27  0 1 0    0.7610  0 0 1
  # +1   0.19  0 0 1    0.6550  1 0 0
  # sex: -1 = male, +1 = female
  # region: eastern, western, central
  # politic: conservative, moderate, liberal

  def __init__(self, src_file, num_rows=None):
    all_xy = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,9), delimiter="\t",
      skiprows=0, dtype=np.float32)
    self.x_data = T.tensor(all_xy[:,0:6], dtype=T.float32)
    self.y_data = T.tensor(all_xy[:,6:9], dtype=T.float32)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx,:]
    trgts = self.y_data[idx,:]
    sample = { 'predictors' : preds, 'targets' : trgts }
    return sample
# -----------------------------------------------------------
class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, 10)  # 6-(10-10)-3
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 3)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.softmax(self.oupt(z), dim=1)  # NOTE: explicit softmax
    return z
# -----------------------------------------------------------
def accuracy(model, ds):
  # assumes model.eval()
  n_correct = 0; n_wrong = 0

  # using a DataLoader avoids resize() issues
  ldr = T.utils.data.DataLoader(ds, batch_size=1,
    shuffle=False)
  for (_, batch) in enumerate(ldr):
    X = batch['predictors']
    Y = batch['targets']
    with T.no_grad():
      oupt = model(X)  # probabilities form

    if T.argmax(Y) == T.argmax(oupt):
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc
# -----------------------------------------------------------
def main():
  # 0. get started
  print("\nBegin predict politic (one-hot) \n")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset objects
  print("Creating People Datasets ")
  train_file = ".\\Data\\OneHot\\people_train_one_hot.txt"
  train_ds = PeopleDataset(train_file)  # all 200 rows
  test_file = ".\\Data\\OneHot\\people_test_one_hot.txt"
  test_ds = PeopleDataset(test_file)  # all 40 rows

  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create network
  net = Net().to(device)

  # 3. train model
  max_epochs = 4000
  ep_log_interval = 400
  lrn_rate = 0.01
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['predictors']
      Y = batch['targets']
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()
    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))
  print("Done ")

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc_train = accuracy(net, train_ds)  # item-by-item
  print("Accuracy on training data = %0.4f" % acc_train)
  acc_test = accuracy(net, test_ds)  # item-by-item
  print("Accuracy on test data = %0.4f" % acc_test)

  # 5. make a prediction
  print("\nPolitic for M 30 central $50,000: ")
  inpt = np.array([[-1, 0.30, 0,0,1, 0.5000]],
    dtype=np.float32)
  inpt = T.tensor(inpt, dtype=T.float32).to(device)
  with T.no_grad():
    probs = net(inpt).to(device)  # values sum to 1.0
  probs = probs.numpy()  # numpy vector prints better
  np.set_printoptions(precision=4, suppress=True)
  print(probs)

  # 6. save model (state_dict approach)
  print("\nSaving trained model state")
  fn = ".\\Models\\people_model.pth"
  T.save(net.state_dict(), fn)

  print("\nEnd People predict politic demo")

if __name__ == "__main__":
  main()
