Transformer architecture (TA) neural networks were designed for natural language processing (NLP). I’ve been exploring the idea of applying TA to tabular data. The problem is that in NLP all inputs are integers that represent words/tokens. For example, an input of “I think therefore I am” is mapped to integer tokens something like [19, 47, 132, 19, 27]. Then each integer token is converted to an embedding vector. For example, token 19 might map to [0.1234, -1.0987, 0.3579, 1.1333], where the number of values (4 here) is a hyperparameter called the embedding dim. The embedding values are determined during training.
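For reference, here is a minimal sketch of that standard NLP mechanism using the built-in torch.nn.Embedding layer. The vocabulary size (200) and the token IDs are made-up values for illustration:

```python
# Minimal sketch of the standard NLP mechanism using the built-in
# torch.nn.Embedding layer. The vocabulary size (200) and the token
# IDs are made-up values for illustration.
import torch as T

T.manual_seed(0)
embed = T.nn.Embedding(num_embeddings=200, embedding_dim=4)
tokens = T.tensor([19, 47, 132, 19, 27])  # "I think therefore I am"
vectors = embed(tokens)  # shape [5, 4] -- one vector per token
print(vectors.shape)     # torch.Size([5, 4])
# the two occurrences of token 19 map to the identical vector
print(T.equal(vectors[0], vectors[3]))  # True
```

Notice that nn.Embedding is essentially a lookup table indexed by integer, which is exactly why it can’t be used directly on non-integer input.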

Demo of a custom embedding layer for numeric input data
Now suppose that instead of dealing with NLP input, you are dealing with numeric input such as a person’s normalized age of 0.31 and normalized annual income of 0.7850. Because the inputs are not integers, you can’t use the PyTorch built-in torch.nn.Embedding layer to create embedding vectors. I wondered if it would be possible to create a custom embedding layer that converts numeric input into embedding vectors.
After some experimentation I managed to create an example of a custom PyTorch embedding layer for numeric input data.

When I design complex neural architectures I often use pen and paper. Here’s the paper I used while designing the code presented in this blog post. The paw in the lower right is a canine visitor named “Llama” who was helping me.
I used the Iris dataset. It has four numeric input values: sepal length, sepal width, petal length, petal width. The goal is to classify an iris flower as one of three species: setosa (0), versicolor (1), or virginica (2). Each input is converted to an embedding vector with 2 values.
Note: Conceptually, a word embedding creates vectors where similar words (“boy” and “man”) are mathematically close together. For numeric input, an embedding doesn’t do that. Instead, the idea is to create a layer that isn’t fully connected, so each input feature is transformed independently and the features don’t interact with each other at the embedding stage. The ideas are pretty deep.
My experiment was hard-coded specifically for the Iris dataset and is just a proof of concept. The idea is to create a separate weight matrix for each of the four input values. Each of the four inputs generates a temp result matrix, and then the four temp matrices are combined into the final result.
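The per-feature idea above can be sketched in a few lines. This is a standalone illustration with made-up shapes and random values, not the demo code itself:

```python
# Minimal standalone sketch of the per-feature weight idea, with
# made-up values: each of the 4 inputs gets its own [1, embed_dim]
# weight, and the 4 partial results are concatenated.
import torch as T

T.manual_seed(0)
bs, embed_dim = 3, 2
x = T.rand(bs, 4)  # 4 numeric inputs per data item
weights = [T.rand(1, embed_dim) for _ in range(4)]
parts = [T.mm(x[:, i:i+1], weights[i]) for i in range(4)]  # each [bs, 2]
res = T.hstack(parts)  # [bs, 8] final embedding
print(res.shape)       # torch.Size([3, 8])
```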
The key network definition code looks like:

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()  # classic syntax
    self.embed = NumericEmbedLayer(4, 2)  # 4-8
    self.hid1 = T.nn.Linear(8, 10)  # 8-10
    self.oupt = T.nn.Linear(10, 3)  # 10-3

  def forward(self, x):  # x is [bs, 4]
    z = self.embed(x)  # z is [bs, 8]
    z = T.tanh(self.hid1(z))  # z is [bs, 10]
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss()
    return z  # z is [bs, 3]
The 4 inputs are fed to the custom NumericEmbedLayer which produces 8 values. Those 8 values go to a hidden layer which outputs 10 values. The final output layer maps the 10 values to 3 values.
The experiment was a lot more difficult than I thought it’d be. Creating a general-purpose embedding layer for arbitrary numeric input would require a significant effort. Maybe I’ll get around to it some day.

Three nice images from a search for “embedded portrait”. Left: By artist Hans Jochem Bakker. Center: By artist Christopher Kennedy. Right: By artist Daniel Arrhakis.
The complete demo code is below.
# iris_embedding.py
# PyTorch 1.10.0-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10/11
# experiment with embedding for numeric data
import numpy as np
import torch as T
device = T.device('cpu') # apply to Tensor or Module
# -----------------------------------------------------------
class NumericEmbedLayer(T.nn.Module):
  def __init__(self, n_in, embed_dim):  # n_in = 4 for Iris
    super().__init__()  # shortcut syntax
    # hard-coded for Iris dataset - not a general soln
    # one weight matrix per feature
    self.weights_0 = T.nn.Parameter(T.zeros((embed_dim, 1),
      dtype=T.float32))
    self.weights_1 = T.nn.Parameter(T.zeros((embed_dim, 1),
      dtype=T.float32))
    self.weights_2 = T.nn.Parameter(T.zeros((embed_dim, 1),
      dtype=T.float32))
    self.weights_3 = T.nn.Parameter(T.zeros((embed_dim, 1),
      dtype=T.float32))
    # no biases
    T.nn.init.uniform_(self.weights_0, -0.10, 0.10)
    T.nn.init.uniform_(self.weights_1, -0.10, 0.10)
    T.nn.init.uniform_(self.weights_2, -0.10, 0.10)
    T.nn.init.uniform_(self.weights_3, -0.10, 0.10)

  def forward(self, x):
    col_0 = x[:,0:1]  # fetch each input column
    col_1 = x[:,1:2]
    col_2 = x[:,2:3]
    col_3 = x[:,3:4]
    # create the embeddings
    tmp_0 = T.mm(col_0, self.weights_0.t())  # [bs, 2]
    tmp_1 = T.mm(col_1, self.weights_1.t())
    tmp_2 = T.mm(col_2, self.weights_2.t())
    tmp_3 = T.mm(col_3, self.weights_3.t())
    # combine
    res = T.hstack((tmp_0, tmp_1, tmp_2, tmp_3))  # [bs, 8]
    return res
# -----------------------------------------------------------
class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # 5.0, 3.5, 1.3, 0.3, 0
    tmp_x = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,4), delimiter=",", comments="#",
      dtype=np.float32)
    tmp_y = np.loadtxt(src_file, max_rows=num_rows,
      usecols=4, delimiter=",", comments="#",
      dtype=np.int64)
    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.int64).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    spcs = self.y_data[idx]
    sample = { 'predictors' : preds, 'species' : spcs }
    return sample  # as Dictionary
# -----------------------------------------------------------
class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()  # classic syntax
    # super().__init__()  # shortcut syntax
    self.embed = NumericEmbedLayer(4, 2)  # 4-8
    self.hid1 = T.nn.Linear(8, 10)  # 8-10
    self.oupt = T.nn.Linear(10, 3)  # 10-3

    # override default initialization
    lo = -0.10; hi = +0.10
    T.nn.init.uniform_(self.hid1.weight, lo, hi)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.uniform_(self.oupt.weight, lo, hi)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):  # x is [bs, 4]
    z = self.embed(x)  # z is [bs, 8]
    z = T.tanh(self.hid1(z))  # z is [bs, 10]
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss()
    return z  # z is [bs, 3]
# -----------------------------------------------------------
def accuracy(model, dataset):
  # assumes model.eval()
  dataldr = T.utils.data.DataLoader(dataset, batch_size=1,
    shuffle=False)
  n_correct = 0; n_wrong = 0
  for (_, batch) in enumerate(dataldr):
    X = batch['predictors']
    Y = batch['species']  # already 1D shaped by Dataset
    with T.no_grad():
      oupt = model(X)  # log-probabilities
    big_idx = T.argmax(oupt)
    # if big_idx.item() == Y.item():
    if big_idx == Y:
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc
# -----------------------------------------------------------
def main():
  # 0. get started
  print("\nBegin Iris numeric embedding experiment ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create DataLoader objects
  print("\nCreating Iris train and test Datasets ")
  train_file = ".\\Data\\iris_train.txt"
  test_file = ".\\Data\\iris_test.txt"
  train_ds = IrisDataset(train_file)  # 120 items
  test_ds = IrisDataset(test_file)    # 30 items
  bat_size = 6
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # ---------------------------------------------------------
  # 2. create network
  print("\nCreating 4-(8)-10-3 neural network ")
  net = Net().to(device)

  # 3. train model
  max_epochs = 500
  ep_log_interval = 50
  lrn_rate = 0.01
  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['predictors']  # [bat_size, 4]
      Y = batch['species']     # OK; already flattened
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights and biases
    if epoch % ep_log_interval == 0:
      print("epoch = %4d | loss = %8.4f | " % \
        (epoch, epoch_loss), end="")
      net.eval()
      train_acc = accuracy(net, train_ds)
      print(" acc = %8.4f " % train_acc)
      net.train()
  print("Done ")

  # ---------------------------------------------------------
  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc = accuracy(net, test_ds)  # item-by-item
  print("Accuracy on test data = %0.4f" % acc)

  # 5. make a prediction
  print("\nPredicting species for [6.1, 3.1, 5.1, 1.1]: ")
  x = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  x = T.tensor(x, dtype=T.float32).to(device)
  with T.no_grad():
    logits = net(x)  # log_softmax output
  probs = T.exp(logits)  # pseudo-probabilities
  T.set_printoptions(precision=4)
  print(probs)

  # ---------------------------------------------------------
  # 6. save model (state_dict approach)
  print("\nSaving trained model state")
  fn = ".\\Models\\iris_model.pt"
  T.save(net.state_dict(), fn)
  # saved_model = Net()
  # saved_model.load_state_dict(T.load(fn))
  # use saved_model to make prediction(s)

  print("\nEnd numeric embedding experiment ")

if __name__ == "__main__":
  main()
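As a side note, and this is my own observation rather than something from the demo: because each feature gets its own small weight matrix and the results are concatenated, the custom layer computes the same thing as a bias-free Linear(4, 8) layer whose weight matrix has a block structure. A quick sketch to check the equivalence, using made-up random values:

```python
# My own observation, not from the demo: the custom layer computes the
# same thing as a bias-free Linear(4, 8) whose weight matrix is block-
# structured, so each pair of output values depends on only one input.
import torch as T

T.manual_seed(1)
w = [T.rand(2, 1) for _ in range(4)]  # one [embed_dim, 1] matrix per feature

lin = T.nn.Linear(4, 8, bias=False)
with T.no_grad():
  lin.weight.zero_()
  for i in range(4):
    lin.weight[2*i:2*i+2, i:i+1] = w[i]  # place each block

x = T.rand(5, 4)
# the custom layer's computation: per-column mm, then hstack
manual = T.hstack([T.mm(x[:, i:i+1], w[i].t()) for i in range(4)])
print(T.allclose(lin(x), manual))  # True
```

Training the masked Linear version would require re-zeroing the off-block weights after every optimizer step (gradients flow to all entries), which is one reason a dedicated layer class is cleaner.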