A binary classification problem is one where the goal is to predict a discrete value where there are only two possibilities. For example, you might want to predict the sex of a person based on their age, income, and so on. I’ve been looking at incorporating a Transformer component into a PyTorch neural network to see if the technique works or not.
Bottom line: I got a binary classification demo up and running, but because the system is so complicated, it's not clear to me whether a binary classifier network with a Transformer component is better than, worse than, or roughly equivalent to, a standard deep neural network without a Transformer component.

The training run showed an unusual pattern: the loss value barely changed for the first 2000 epochs, but then dropped quickly.
I used one of my standard datasets for binary classification. The data looks like:
1 0.24 1 0 0 0.2950 0 0 1
0 0.39 0 0 1 0.5120 0 1 0
1 0.63 0 1 0 0.7580 1 0 0
0 0.36 1 0 0 0.4450 0 1 0
. . .
Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (conservative = 100, moderate = 010, liberal = 001). The goal is to predict the sex of a person from their age, state, income, and politics type. There are 200 training items and 40 test items.
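The encoding scheme can be sketched in a few lines of Python. The helper function and dictionary names here are mine, for illustration only; the demo reads the already-encoded text file directly:

```python
# Sketch of the data encoding described above (helper names are hypothetical).
STATES = {"michigan": [1, 0, 0], "nebraska": [0, 1, 0], "oklahoma": [0, 0, 1]}
POLITICS = {"conservative": [1, 0, 0], "moderate": [0, 1, 0],
            "liberal": [0, 0, 1]}

def encode_person(sex, age, state, income, politics):
    # sex: 0 = male, 1 = female; age divided by 100; income by 100,000
    return [sex, age / 100.0] + STATES[state] + \
           [income / 100_000.0] + POLITICS[politics]

row = encode_person(1, 24, "michigan", 29500, "liberal")
# row -> [1, 0.24, 1, 0, 0, 0.295, 0, 0, 1]  (first line of sample data)
```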
My demo network used an (8-32)-T-10-1 architecture. There are 8 input nodes that are mapped to 4 nodes each using a custom numeric embedding layer. Those 32 values are fed to a TransformerEncoder, then a fully connected hidden layer with 10 nodes, then a single output node. The output node value will be between 0.0 and 1.0, where a value less than 0.5 means class 0 = male, and a value of 0.5 or greater means class 1 = female.
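The tensor shapes through that pipeline can be traced with a minimal sketch. Here I substitute a standard nn.Linear for the custom numeric embedding layer, just to follow the dimensions; the demo's real embedding layer is shown in the full code below:

```python
import torch as T

bs = 10                      # batch size
x = T.randn(bs, 8)           # 8 input features per person
embed = T.nn.Linear(8, 32)   # stand-in for the custom embedding layer
enc_layer = T.nn.TransformerEncoderLayer(d_model=4, nhead=2,
  dim_feedforward=10, batch_first=True)
trans_enc = T.nn.TransformerEncoder(enc_layer, num_layers=2)

z = embed(x)                 # [bs, 32]
z = z.reshape(-1, 8, 4)      # [bs, seq=8, embed=4]
z = trans_enc(z)             # [bs, 8, 4] -- encoder preserves shape
z = z.reshape(-1, 32)        # [bs, 32], fed to the 32-10-1 head
print(z.shape)               # torch.Size([10, 32])
```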
I implemented a program-defined metrics() function that computes accuracy, precision, recall, and F1 score. After training, the model scored 85.50% accuracy on the training data (171 out of 200 correct), and 80.00% accuracy on the test data (32 out of 40 correct).
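The arithmetic behind those four metrics is easy to check from raw confusion-matrix counts. This helper is my own sketch, not the demo's metrics() function, and the counts are made up for illustration:

```python
def metrics_from_counts(tp, fp, tn, fn):
    # accuracy  = (TP + TN) / N
    # precision = TP / (TP + FP)
    # recall    = TP / (TP + FN)
    # F1 = harmonic mean of precision and recall
    n = tp + fp + tn + fn
    acc = (tp + tn) / n
    prec = tp / (tp + fp)   # assumes tp + fp != 0
    rec = tp / (tp + fn)    # assumes tp + fn != 0
    f1 = 2 / ((1 / prec) + (1 / rec))
    return acc, prec, rec, f1

# illustrative counts only (not the demo's actual confusion matrix)
acc, prec, rec, f1 = metrics_from_counts(tp=50, fp=10, tn=120, fn=20)
print("%0.4f %0.4f %0.4f %0.4f" % (acc, prec, rec, f1))
```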
It’s not possible to draw any strong conclusions because the dataset is so small and the neural architecture has many hyperparameters that I didn’t experiment with. In addition to all the usual neural parameters, a TransformerEncoder has an embedding size, number of attention heads, size of the internal hidden layer, number of encoding layers, and positional encoding dropout rate.
It was a very interesting challenge.

I’m not a fan of most Japanese sci-fi monster movies, with the exception of Godzilla (1954) and Rodan (1956) — two excellent movies. In Japanese sci-fi, one form of gender classification is pretty easy: women aliens are usually evil.
Top Left: In “Destroy All Monsters” (1968), the alien Kilaaks use mind control to get Godzilla, Rodan (a giant pterodactyl), Mothra (a giant moth/larva), and a few other monsters to attack Earth. The monsters break free of the control and help Earth defeat the Kilaaks.
Top Right: In “Invasion of Astro-Monster” aka “Monster Zero” (1965), the alien Xiliens ask Earth if they can borrow Godzilla and Rodan to defeat Ghidorah (giant three-headed dragon) who is ravaging their home planet. But the Xiliens then try to use all three monsters to conquer Earth. Earth’s technology prevails.
Bottom Left: In “Gamera vs. Guiron” (1969), the alien Terrans have an appetite for human brains. Yuck. The Earth’s Gamera (giant turtle monster) defeats the Terrans’ Guiron (a monster that has a knife-shaped head).
Bottom Right: In “Godzilla vs. Megalon” (1973), an alien race, the Seatopians, have been living undiscovered under the sea. I won’t even try to summarize the incomprehensible plot, but one highlight of the movie was a Seatopian ritual dance complete with clear plastic outfits, pointy hats, and white go-go boots.
Demo code. The training and test data is at https://jamesmccaffreyblog.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/.
# people_gender_transformer.py
# binary classification using a TransformerEncoder
# PyTorch 2.0.0-CPU Anaconda3-2022.10 Python 3.9.13
# Windows 10/11
import numpy as np
import torch as T
device = T.device('cpu') # apply to Tensor or Module
T.set_num_threads(1)
class PeopleDataset(T.utils.data.Dataset):
  # sex  age   state  income  politics
  # 0    0.27  0 1 0  0.7610  0 0 1
  # 1    0.19  0 0 1  0.6550  1 0 0
  # sex: 0 = male, 1 = female
  # state: michigan, nebraska, oklahoma
  # politics: conservative, moderate, liberal

  def __init__(self, src_file):
    all_data = np.loadtxt(src_file, usecols=range(0,9),
      delimiter="\t", comments="#", dtype=np.float32)
    self.x_data = T.tensor(all_data[:,1:9],
      dtype=T.float32).to(device)
    self.y_data = T.tensor(all_data[:,0],
      dtype=T.float32).to(device)  # float32 required
    self.y_data = self.y_data.reshape(-1,1)  # 2-D required

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    feats = self.x_data[idx,:]  # idx row, all 8 cols
    sex = self.y_data[idx,:]    # idx row, the only col
    return (feats, sex)         # as a Tuple
# -----------------------------------------------------------
class SkipLinear(T.nn.Module):  # numeric embedding layer
  # -----
  class Core(T.nn.Module):
    def __init__(self, n):
      super().__init__()
      # 1 node to n nodes, n >= 2
      self.weights = T.nn.Parameter(T.zeros((n,1),
        dtype=T.float32))
      self.biases = T.nn.Parameter(T.zeros(n,
        dtype=T.float32))  # one bias per node
      lim = 0.01
      T.nn.init.uniform_(self.weights, -lim, lim)
      T.nn.init.zeros_(self.biases)

    def forward(self, x):
      wx = T.mm(x, self.weights.t())
      v = T.add(wx, self.biases)
      return v
  # -----

  def __init__(self, n_in, n_out):
    super().__init__()
    self.n_in = n_in; self.n_out = n_out
    if n_out % n_in != 0:
      raise ValueError("n_out must be divisible by n_in")
    n = n_out // n_in  # num nodes per input
    self.lst_modules = \
      T.nn.ModuleList([SkipLinear.Core(n) for \
      i in range(n_in)])

  def forward(self, x):
    lst_nodes = []
    for i in range(self.n_in):
      xi = x[:,i].reshape(-1,1)
      oupt = self.lst_modules[i](xi)
      lst_nodes.append(oupt)
    result = T.cat((lst_nodes[0], lst_nodes[1]), 1)
    for i in range(2,self.n_in):
      result = T.cat((result, lst_nodes[i]), 1)
    result = result.reshape(-1, self.n_out)
    return result
# -----------------------------------------------------------
class TransformerNet(T.nn.Module):
  def __init__(self):
    super().__init__()
    # numeric pseudo-embedding, dim=4
    self.embed = SkipLinear(8, 32)  # 8 inputs, each goes to 4
    self.pos_enc = \
      PositionalEncoding(4, dropout=0.00)  # positional
    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=4,
      nhead=2, dim_feedforward=10,
      batch_first=True)  # d_model divisible by nhead
    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=2)  # original paper used 6 layers
    self.fc1 = T.nn.Linear(32, 10)  # 10 hidden nodes
    self.fc2 = T.nn.Linear(10, 1)

  def forward(self, x):
    # x = 8 inputs, fixed length
    z = self.embed(x)        # 8 inputs to 32 embed
    z = z.reshape(-1, 8, 4)  # [bat, seq, embed]
    z = self.pos_enc(z)
    z = self.trans_enc(z)
    z = z.reshape(-1, 32)    # torch.Size([bs, 32])
    z = T.tanh(self.fc1(z))
    z = T.sigmoid(self.fc2(z))  # for BCELoss()
    return z
# -----------------------------------------------------------
class PositionalEncoding(T.nn.Module):  # based on documentation code
  def __init__(self, d_model: int, dropout: float=0.0,
      max_len: int=5000):
    super().__init__()
    self.dropout = T.nn.Dropout(p=dropout)
    pe = T.zeros(max_len, d_model)  # like 5000x4
    position = \
      T.arange(0, max_len, dtype=T.float).unsqueeze(1)
    div_term = T.exp(T.arange(0, d_model, 2).float() * \
      (-np.log(10_000.0) / d_model))
    pe[:, 0::2] = T.sin(position * div_term)
    pe[:, 1::2] = T.cos(position * div_term)
    pe = pe.unsqueeze(0)  # [1, max_len, d_model] for batch_first
    self.register_buffer('pe', pe)  # allows state-save

  def forward(self, x):
    # x is [bs, seq_len, d_model] because batch_first=True,
    # so index the sequence dimension, not the batch dimension
    x = x + self.pe[:, :x.size(1), :]
    return self.dropout(x)
# -----------------------------------------------------------
def metrics(model, ds, thresh=0.5):
  # note: N = total number of items = TP + FP + TN + FN
  # accuracy  = (TP + TN) / N
  # precision = TP / (TP + FP)
  # recall    = TP / (TP + FN)
  # F1 = 2 / [(1 / precision) + (1 / recall)]
  tp = 0; tn = 0; fp = 0; fn = 0
  for i in range(len(ds)):
    inpts = ds[i][0].reshape(1,-1)  # make it a batch
    target = ds[i][1].reshape(1)    # float32 [0.0] or [1.0]
    target = target.long()          # int 0 or 1
    with T.no_grad():
      p = model(inpts)  # between 0.0 and 1.0

    if target == 1 and p >= thresh:    # TP
      tp += 1
    elif target == 1 and p < thresh:   # FN
      fn += 1
    elif target == 0 and p < thresh:   # TN
      tn += 1
    elif target == 0 and p >= thresh:  # FP
      fp += 1

  N = tp + fp + tn + fn
  if N != len(ds):
    print("FATAL LOGIC ERROR in metrics()")
  accuracy = (tp + tn) / (N * 1.0)
  precision = (1.0 * tp) / (tp + fp)  # assumes tp + fp != 0
  recall = (1.0 * tp) / (tp + fn)     # assumes tp + fn != 0
  f1 = 2.0 / ((1.0 / precision) + (1.0 / recall))
  return (accuracy, precision, recall, f1)  # as a Tuple
# -----------------------------------------------------------
def main():
  # 0. get started
  print("\nPeople gender using PyTorch TransformerEncoder")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset and DataLoader objects
  print("\nCreating People train and test Datasets ")
  train_file = ".\\Data\\people_train.txt"
  test_file = ".\\Data\\people_test.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows
  test_ds = PeopleDataset(test_file)    # 40 rows
  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create neural network
  print("\nCreating (8-32)-T-10-1 classifier ")
  net = TransformerNet().to(device)
  net.train()  # set training mode

  # 3. train network
  lrn_rate = 0.05
  loss_func = T.nn.BCELoss()  # binary cross entropy
  # loss_func = T.nn.MSELoss()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
  # optimizer = T.optim.Adam(net.parameters(), lr=lrn_rate)
  max_epochs = 2500
  ep_log_interval = 200

  print("\nLoss function: " + str(loss_func))
  print("Optimizer: " + str(optimizer.__class__.__name__))
  print("Learn rate: " + "%0.3f" % lrn_rate)
  print("Batch size: " + str(bat_size))
  print("Max epochs: " + str(max_epochs))

  print("\nStarting training")
  for epoch in range(0, max_epochs):
    epoch_loss = 0.0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]   # [bs,8] inputs
      Y = batch[1]   # [bs,1] targets
      oupt = net(X)  # [bs,1] computeds
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      optimizer.zero_grad()  # reset all gradients
      loss_val.backward()    # compute new gradients
      optimizer.step()       # update all weights
    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %8.4f" % \
        (epoch, epoch_loss))
  print("Done ")

  # -----------------------------------------------------------

  # 4. evaluate model
  net.eval()
  metrics_train = metrics(net, train_ds, thresh=0.5)
  print("\nMetrics for train data: ")
  print("accuracy  = %0.4f " % metrics_train[0])
  print("precision = %0.4f " % metrics_train[1])
  print("recall    = %0.4f " % metrics_train[2])
  print("F1        = %0.4f " % metrics_train[3])

  metrics_test = metrics(net, test_ds, thresh=0.5)
  print("\nMetrics for test data: ")
  print("accuracy  = %0.4f " % metrics_test[0])
  print("precision = %0.4f " % metrics_test[1])
  print("recall    = %0.4f " % metrics_test[2])
  print("F1        = %0.4f " % metrics_test[3])

  # 5. save model
  print("\nSaving trained model state_dict ")
  net.eval()
  # path = ".\\Models\\people_gender_model.pt"
  # T.save(net.state_dict(), path)

  # 6. make a prediction
  print("\nSetting age = 30  Oklahoma  $40,000  moderate ")
  X = np.array([[0.30, 0,0,1, 0.4000, 0,1,0]],
    dtype=np.float32)
  X = T.tensor(X, dtype=T.float32).to(device)

  net.eval()
  with T.no_grad():
    oupt = net(X)  # a Tensor
  pred_prob = oupt.item()  # scalar in [0.0, 1.0]
  print("Computed output: ", end="")
  print("%0.4f" % pred_prob)
  if pred_prob < 0.5:
    print("Prediction = male")
  else:
    print("Prediction = female")

  print("\nEnd binary demo ")

if __name__ == "__main__":
  main()