The vast majority of my blog posts show a successful program of some sort. But behind the scenes, every successful program is preceded by many, many failures. For example, I've been looking at PyTorch Transformer modules. The Transformer architecture is the foundation for the GPT-x models that have stunned the world.
Update: After a ton of work, I finally got the failed example presented in this post to work. See https://jamesmccaffreyblog.com/2023/04/10/example-of-a-pytorch-multi-class-classifier-using-a-transformer/.
My idea goes something like this: a Transformer is a super-complex neural network module that accepts a sequence (like a sentence) and creates an abstract representation of the source data. So, it seems like it'd be possible to feed a Transformer system ordinary data that doesn't have any inherent sequence, and voila! Maybe it'll do something wonderful.
Well, to cut to the chase, I spent several hours poking around with a demo and it failed. Specifically, I looked at one of my standard synthetic datasets where the goal is to predict a person’s political leaning (conservative, moderate, liberal) from sex, age, State, and income.
The major obstacle I faced is that normal Transformer systems accept a sequence of word tokens that have been converted to integers, such as “the” = 5. Then each integer token is converted to an embedding vector like 5 = [0.123, -2.345, . . . ] using a sophisticated lookup table where the integer token is the lookup index. And then the embedding vectors are augmented by a positional encoding that tells the Transformer modules where each word is in the sentence.
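To make the standard NLP pipeline concrete, here's a minimal sketch of the token-to-embedding lookup step. The vocabulary size (10) and embedding dimension (4) are arbitrary values chosen just for illustration:

```python
import torch as T

# a lookup table: 10 possible tokens, each mapped to a 4-value vector
embed = T.nn.Embedding(num_embeddings=10, embedding_dim=4)

tokens = T.tensor([[5, 2, 7]])  # e.g. "the cat sat" as integers, shape [1, 3]
vecs = embed(tokens)            # each integer token indexes into the table
print(vecs.shape)               # torch.Size([1, 3, 4])
```

After this lookup, a positional encoding vector would be added to each of the three embedding vectors so the Transformer knows where each word sits in the sentence.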
Here is the demo network class I ended up with:

class TransformerNet(T.nn.Module):
  def __init__(self):
    # vocab_size = NA, embed = NA
    super(TransformerNet, self).__init__()  # old syntax
    # no Embedding
    # self.embed = T.nn.Embedding(256, 4)
    # no positional encoding either
    # self.pos_enc = \
    #   PositionalEncoding(1, dropout=0.00)
    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=1,
      nhead=1, dim_feedforward=100,
      batch_first=True)  # d_model must be divisible by nhead
    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=2)  # 6 layers is the default
    # People dataset has 6 inputs
    self.fc1 = T.nn.Linear(1*6, 10)  # 10 hidden nodes
    self.fc2 = T.nn.Linear(10, 3)    # 3 classes

  def forward(self, x):
    # x = 6 inputs, fixed length
    # z = self.embed(x)  # tokens to embed vectors
    z = x.reshape(-1, 6, 1)  # [bat, seq, embed]
    # z = self.pos_enc(z)
    z = self.trans_enc(z)
    z = z.reshape(-1, 1*6)  # torch.Size([bs, 6])
    z = T.tanh(self.fc1(z))
    z = T.log_softmax(self.fc2(z), dim=1)  # NLLLoss()
    return z
Well, with my People data example, all the inputs are float32 values so there’s no embedding and no positional encoding. I ditched those two components and, after a lot of fiddling with network parameters, got nowhere — the model just didn’t learn.
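The core of the idea can be sketched in isolation: treat each of the 6 predictor values as a one-item "word" in a 6-word sentence, so the tabular input becomes a sequence with d_model=1. This is a minimal standalone sketch of that reshaping step (the batch size of 4 and random inputs are just for illustration):

```python
import torch as T

x = T.randn(4, 6)          # 4 rows of normalized tabular data, 6 predictors each
z = x.reshape(-1, 6, 1)    # [batch, seq=6, d_model=1]

enc_layer = T.nn.TransformerEncoderLayer(d_model=1, nhead=1,
  dim_feedforward=100, batch_first=True)
trans_enc = T.nn.TransformerEncoder(enc_layer, num_layers=2)

z = trans_enc(z)           # abstract representation, same shape
print(z.shape)             # torch.Size([4, 6, 1])
```

The shapes all line up, which is why the demo program ran without errors. The trouble is that with d_model=1, each "embedding" is a single scalar, which gives the attention mechanism almost nothing to work with.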
OK. That’s how software engineering is. But if you can imagine something, you can usually make it work somehow.
My next steps will be to create a custom PyTorch numeric “pseudo-embedding” layer that accepts numeric input values and emits multiple values for each input (much like each integer token in NLP is mapped to a vector of floating point values). This will require a lot of effort. Maybe it will work — maybe it won’t.
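One possible shape for such a layer, purely as a hypothetical sketch and not my final design: give each of the n_features predictors a learned weight vector and bias vector, so each scalar input expands to an embed_dim vector. The class name NumericEmbedding and all dimensions here are illustrative assumptions:

```python
import torch as T

class NumericEmbedding(T.nn.Module):
  # hypothetical numeric "pseudo-embedding": each scalar input value
  # is expanded to an embed_dim vector via a per-feature weight and bias
  def __init__(self, n_features, embed_dim):
    super().__init__()
    self.w = T.nn.Parameter(T.randn(n_features, embed_dim))
    self.b = T.nn.Parameter(T.zeros(n_features, embed_dim))

  def forward(self, x):                        # x: [bs, n_features]
    return x.unsqueeze(-1) * self.w + self.b   # [bs, n_features, embed_dim]

pe = NumericEmbedding(6, 4)     # 6 predictors, 4-value pseudo-embeddings
out = pe(T.randn(8, 6))
print(out.shape)                # torch.Size([8, 6, 4])
```

With something like this in place, d_model could be 4 (or larger) instead of 1, which would give the attention heads real vectors to compare.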
But I’ll continue to probe away at using Transformer modules for standard tabular data classification models. I’ll succeed eventually.
I did. See update link above.

My Transformer architecture demo program looked good but didn’t work well. Here are three examples of car designs from U.S. automobile companies that are no longer in business. I think all three designs are beautiful, but like many automobiles of the 1950s, the cars themselves didn’t work very well.
Left: A 1957 DeSoto Fireflite. The DeSoto company produced over two million cars from 1928 to 1961. When I was a young man, one of our neighbors had a DeSoto like this and I admired it.
Center: A 1953 Studebaker Commander. Studebaker (1902-1967) was one of the largest U.S. car companies and produced many millions of automobiles. My father bought a '53 Commander for my mother but it was unreliable and was quickly replaced by a '56 Plymouth Suburban station wagon.
Right: A 1953 Nash-Healey Roadster. Nash Motors produced cars from 1916 to 1954. The Nash-Healey was a collaboration between Nash and the Donald Healey Motor Company of the U.K. Clark Kent drove a '53 Nash-Healey in the 1950s TV series "The Adventures of Superman" that I watched in my young days.
