The vast majority of my blog posts show a successful program of some sort. But behind the scenes, every successful program is preceded by many, many failures. For example, I've been looking at PyTorch Transformer modules. The Transformer architecture is the foundation for the GPT-x models that have stunned the world.
Update: After a ton of work, I finally got the failed example presented in this post to work. See https://jamesmccaffreyblog.com/2023/04/10/example-of-a-pytorch-multi-class-classifier-using-a-transformer/.
My idea goes something like this: a Transformer is a super-complex neural network module that accepts a sequence (like a sentence) and creates an abstract representation of the source data. So, it seems like it'd be possible to feed a Transformer system ordinary data that doesn't have any inherent sequence, and voila! Maybe it'll do something wonderful.
Well, to cut to the chase, I spent several hours poking around with a demo and it failed. Specifically, I looked at one of my standard synthetic datasets where the goal is to predict a person’s political leaning (conservative, moderate, liberal) from sex, age, State, and income.
The major obstacle I faced is that normal Transformer systems accept a sequence of word tokens that have been converted to integers, such as “the” = 5. Then each integer token is converted to an embedding vector like 5 = [0.123, -2.345, . . . ] using a sophisticated lookup table where the integer token is the lookup index. And then the embedding vectors are augmented by a positional encoding that tells the Transformer modules where each word is in the sentence.
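To make the standard NLP pipeline concrete, here's a minimal sketch of the token-to-embedding lookup step. The vocabulary size (10) and embedding dimension (4) are arbitrary values chosen just for illustration:

```python
import torch as T

# a lookup table: 10 possible tokens, each mapped to a 4-value vector
embed = T.nn.Embedding(num_embeddings=10, embedding_dim=4)

tokens = T.tensor([[5, 2, 7]])  # e.g. "the cat sat" as integers, shape [1, 3]
vecs = embed(tokens)            # each integer token indexes into the table
print(vecs.shape)               # torch.Size([1, 3, 4])
```

After this lookup, a positional encoding vector would be added to each of the three embedding vectors so the Transformer knows where each word sits in the sentence.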
Here is the demo network class I ended up with:

class TransformerNet(T.nn.Module):
  def __init__(self):
    # vocab_size = NA, embed = NA
    super(TransformerNet, self).__init__()  # old syntax
    # no Embedding
    # self.embed = T.nn.Embedding(256, 4)
    # no positional encoding either
    # self.pos_enc = \
    #   PositionalEncoding(1, dropout=0.00)
    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=1,
      nhead=1, dim_feedforward=100,
      batch_first=True)  # d_model must be divisible by nhead
    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=2)  # 6 layers is the default
    # People dataset has 6 inputs
    self.fc1 = T.nn.Linear(1*6, 10)  # 10 hidden nodes
    self.fc2 = T.nn.Linear(10, 3)    # 3 classes

  def forward(self, x):
    # x = 6 inputs, fixed length
    # z = self.embed(x)  # tokens to embed vectors
    z = x.reshape(-1, 6, 1)  # [bat, seq, embed]
    # z = self.pos_enc(z)
    z = self.trans_enc(z)
    z = z.reshape(-1, 1*6)  # torch.Size([bs, 6])
    z = T.tanh(self.fc1(z))
    z = T.log_softmax(self.fc2(z), dim=1)  # NLLLoss()
    return z
Well, with my People data example, all the inputs are float32 values so there’s no embedding and no positional encoding. I ditched those two components and, after a lot of fiddling with network parameters, got nowhere — the model just didn’t learn.
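The core of the idea can be sketched in isolation: treat each of the 6 predictor values as a one-item "word" in a 6-word sentence, so the tabular input becomes a sequence with d_model=1. This is a minimal standalone sketch of that reshaping step (the batch size of 4 and random inputs are just for illustration):

```python
import torch as T

x = T.randn(4, 6)          # 4 rows of normalized tabular data, 6 predictors each
z = x.reshape(-1, 6, 1)    # [batch, seq=6, d_model=1]

enc_layer = T.nn.TransformerEncoderLayer(d_model=1, nhead=1,
  dim_feedforward=100, batch_first=True)
trans_enc = T.nn.TransformerEncoder(enc_layer, num_layers=2)

z = trans_enc(z)           # abstract representation, same shape
print(z.shape)             # torch.Size([4, 6, 1])
```

The shapes all line up, which is why the demo program ran without errors. The trouble is that with d_model=1, each "embedding" is a single scalar, which gives the attention mechanism almost nothing to work with.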
OK. That’s how software engineering is. But if you can imagine something, you can usually make it work somehow.
My next steps will be to create a custom PyTorch numeric “pseudo-embedding” layer that accepts numeric input values and emits multiple values for each input (much like each integer token in NLP is mapped to a vector of floating point values). This will require a lot of effort. Maybe it will work — maybe it won’t.
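One possible shape for such a layer, purely as a hypothetical sketch and not my final design: give each of the n_features predictors a learned weight vector and bias vector, so each scalar input expands to an embed_dim vector. The class name NumericEmbedding and all dimensions here are illustrative assumptions:

```python
import torch as T

class NumericEmbedding(T.nn.Module):
  # hypothetical numeric "pseudo-embedding": each scalar input value
  # is expanded to an embed_dim vector via a per-feature weight and bias
  def __init__(self, n_features, embed_dim):
    super().__init__()
    self.w = T.nn.Parameter(T.randn(n_features, embed_dim))
    self.b = T.nn.Parameter(T.zeros(n_features, embed_dim))

  def forward(self, x):                        # x: [bs, n_features]
    return x.unsqueeze(-1) * self.w + self.b   # [bs, n_features, embed_dim]

pe = NumericEmbedding(6, 4)     # 6 predictors, 4-value pseudo-embeddings
out = pe(T.randn(8, 6))
print(out.shape)                # torch.Size([8, 6, 4])
```

With something like this in place, d_model could be 4 (or larger) instead of 1, which would give the attention heads real vectors to compare.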
But I’ll continue to probe away at using Transformer modules for standard tabular data classification models. I’ll succeed eventually.
I did. See update link above.

My Transformer architecture demo program looked good but didn’t work well. Here are three examples of car designs from U.S. automobile companies that are no longer in business. I think all three designs are beautiful, but like many automobiles of the 1950s, the cars themselves didn’t work very well.
Left: A 1957 DeSoto Fireflite. The DeSoto company produced over two million cars from 1928 to 1961. When I was a young man, one of our neighbors had a DeSoto like this and I admired it.
Center: A 1953 Studebaker Commander. Studebaker (1902-1967) was one of the largest U.S. car companies and produced many millions of automobiles. My father bought a '53 Commander for my mother but it was unreliable and was quickly replaced by a '56 Plymouth Suburban station wagon.
Right: A 1953 Nash-Healey Roadster. Nash Motors produced cars from 1916 to 1954. The Nash-Healey was a collaboration between Nash and the Donald Healey Motor Company of the U.K. Clark Kent drove a '53 Nash-Healey in the 1950s TV series "The Adventures of Superman" that I watched in my young days.
