After training a PyTorch binary classifier, it’s important to evaluate the accuracy of the trained model. Simple classification accuracy is OK but in many scenarios you want a so-called confusion matrix that gives details of the number of correct and wrong predictions for each of the two target classes. You also want precision, recall, and F1 metrics.
For example, suppose you’re predicting the sex (0 = male, 1 = female) of a person based on their age (divided by 100), State (Michigan = 100, Nebraska = 010, Oklahoma = 001), income (divided by $100,000), and political leaning (conservative = 100, moderate = 010, liberal = 001). An example of a formatted confusion matrix and metrics computed from the matrix might look like:
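To make the encoding concrete, here is a minimal sketch (the helper function name and signature are my own, not part of the demo program) of how one raw record maps to the eight predictor values:

```python
import numpy as np

def encode_person(age, state, income, politics):
  # hypothetical helper: normalize age and income, one-hot encode
  # state and political leaning, giving 1 + 3 + 1 + 3 = 8 values
  states = {"michigan": [1,0,0], "nebraska": [0,1,0], "oklahoma": [0,0,1]}
  pols = {"conservative": [1,0,0], "moderate": [0,1,0], "liberal": [0,0,1]}
  return np.array([age / 100.0] + states[state] +
                  [income / 100_000.0] + pols[politics], dtype=np.float32)

x = encode_person(30, "oklahoma", 40_000, "moderate")
print(x)  # eight float32 values: 0.30, 0,0,1, 0.40, 0,1,0
```

This matches the encoding used for the prediction at the end of the demo program.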
Computing confusion matrix

actual   0:  21   5
actual   1:   1  13
------------
predicted     0   1

Computing metrics from confusion
acc = 0.8500
pre = 0.7222
rec = 0.9286
f1  = 0.8125
Here’s my function to compute a raw confusion matrix for a binary classifier:
def confusion_matrix_bin(model, ds, n_classes):
  if n_classes != 2:
    print("ERROR n_classes must be 2 ")
    return None
  cm = np.zeros((n_classes,n_classes), dtype=np.int64)
  for i in range(len(ds)):
    X = ds[i][0].reshape(1,-1)  # make it a batch
    Y = ds[i][1].reshape(1)     # actual class 0.0 or 1.0, 1D
    Y = Y.type(T.int64)         # make it an int/index
    with T.no_grad():
      oupt = model(X)           # computed pseudo-probability
    if oupt < 0.5: pred_class = 0
    else: pred_class = 1
    cm[Y][pred_class] += 1
  return cm
The function accepts a trained PyTorch classifier and a PyTorch Dataset object whose items are Tuples (or Dictionaries) where the predictors are at index [0] and the target labels are at index [1]. The n_classes value could be determined programmatically, but it's easier to pass it in as a parameter.
Note: A function to compute a confusion matrix for a multi-class classifier, where there are three or more possible outcomes, uses slightly different code. See https://jamesmccaffreyblog.com/2023/03/15/computing-and-displaying-a-confusion-matrix-for-a-pytorch-neural-network-classifier/
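For reference, the multi-class version looks roughly like this (my adaptation, assuming the model emits one value per class so that torch.argmax() picks the predicted class; see the linked post for the full version):

```python
import numpy as np
import torch as T

def confusion_matrix_multi(model, ds, n_classes):
  # rows = actual class, columns = predicted class
  cm = np.zeros((n_classes, n_classes), dtype=np.int64)
  for i in range(len(ds)):
    X = ds[i][0].reshape(1, -1)  # make it a batch
    Y = int(ds[i][1])            # actual class index
    with T.no_grad():
      oupt = model(X)            # shape [1, n_classes]
    pred_class = T.argmax(oupt, dim=1).item()
    cm[Y][pred_class] += 1
  return cm
```

The key difference from the binary version is that the predicted class comes from argmax() over n_classes output values rather than from thresholding a single value at 0.5.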
The raw confusion matrix is difficult to interpret so I wrote a function to format the matrix by adding some labels:
def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")
If you have a binary confusion matrix, you can compute precision, recall, and F1 score from it. The true negatives are at [0][0]. The true positives are at [1][1]. The false positives (“incorrectly predicted as positive/1”) are at [0][1]. The false negatives (“incorrectly predicted as negative/0”) are at [1][0].
Note that in a binary classification scenario, which outcome you specify is positive and which outcome is negative is arbitrary, so the meanings/values of precision and recall are arbitrary too. This is one reason why the F1 score, the harmonic mean of precision and recall, is often used.
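To see this concretely, here is a quick plain-NumPy check using the example matrix from above. Relabeling which class counts as positive flips the matrix, which leaves accuracy unchanged but changes precision and recall (the helper function below is mine, for illustration only):

```python
import numpy as np

def prf(cm):
  # precision, recall, F1 with class 1 treated as positive
  tn, fp, fn, tp = cm[0][0], cm[0][1], cm[1][0], cm[1][1]
  pre = tp / (tp + fp)
  rec = tp / (tp + fn)
  return (pre, rec, 2 * pre * rec / (pre + rec))

cm = np.array([[21, 5], [1, 13]])  # the example matrix above
swapped = cm[::-1, ::-1]           # relabel classes: 0 <-> 1
print(prf(cm))       # roughly (0.7222, 0.9286, 0.8125)
print(prf(swapped))  # roughly (0.9545, 0.8077, 0.8750)
```

Accuracy is 34/40 = 0.8500 either way, but precision, recall, and even F1 depend on which class is designated positive.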
Here is a function that computes accuracy, precision, recall and F1 from a raw binary confusion matrix. It assumes a particular geometry of the matrix.
def metrics_from_confusion_bin(cm):
  # return (accuracy, precision, recall, F1)
  N = 0  # total count
  dim = len(cm)
  for i in range(dim):
    for j in range(dim):
      N += cm[i][j]
  n_correct = 0
  for i in range(dim):
    n_correct += cm[i][i]  # on the diagonal
  acc = n_correct / N
  #        pred 0  pred 1
  # act 0    tn      fp
  # act 1    fn      tp
  tp = cm[1][1]
  tn = cm[0][0]
  fp = cm[0][1]  # falsely predicted as positive
  fn = cm[1][0]  # falsely predicted as negative
  pre = tp / (tp + fp)
  rec = tp / (tp + fn)
  f1 = 1.0 / ( ((1.0 / pre) + (1.0 / rec)) / 2.0 )
  return (acc, pre, rec, f1)
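As an aside, when the actual labels and predicted classes are already available as NumPy arrays, the whole matrix can be built in one shot with np.bincount() instead of an explicit loop (a sketch of an alternative technique, not part of the demo program):

```python
import numpy as np

def confusion_from_arrays(y_true, y_pred, n_classes=2):
  # encode each (actual, predicted) pair as a single
  # index in [0, n_classes^2), then count occurrences
  idx = y_true * n_classes + y_pred
  return np.bincount(idx,
    minlength=n_classes * n_classes).reshape(n_classes, n_classes)

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
print(confusion_from_arrays(y_true, y_pred))
# [[2 1]
#  [1 2]]
```

The result has the same geometry as above: rows are actual classes, columns are predicted classes.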
Good fun. Complete demo code below. The training and test data can be found at https://jamesmccaffreyblog.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/.

There are all kinds of social science research studies about the many biological and behavioral differences between men and women. For example, women will often tilt their heads down and look up. According to the research, this pose is appealing and shows vulnerability. Men do not ever use such a posture, at least none of the guys I know do. Facts like this are interesting but they're not terribly useful for prediction.
# people_gender.py
# binary classification
# confusion matrix and metrics demo
# PyTorch 1.12.1-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10/11

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

class PeopleDataset(T.utils.data.Dataset):
  # sex  age    state    income   politics
  #  0   0.27   0 1 0    0.7610   0 0 1
  #  1   0.19   0 0 1    0.6550   1 0 0
  # sex: 0 = male, 1 = female
  # state: michigan, nebraska, oklahoma
  # politics: conservative, moderate, liberal

  def __init__(self, src_file):
    all_data = np.loadtxt(src_file, usecols=range(0,9),
      delimiter=",", comments="#", dtype=np.float32)
    self.x_data = T.tensor(all_data[:,1:9],
      dtype=T.float32).to(device)
    self.y_data = T.tensor(all_data[:,0],
      dtype=T.float32).to(device)  # float32 required
    self.y_data = self.y_data.reshape(-1,1)  # 2-D required

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    feats = self.x_data[idx,:]  # idx row, all 8 cols
    sex = self.y_data[idx,:]    # idx row, the only col
    return feats, sex           # as a Tuple
# ---------------------------------------------------------
def metrics(model, ds, thresh=0.5):
  # compute accuracy, precision, recall, F1 directly
  # note: N = total number of items = TP + FP + TN + FN
  # accuracy  = (TP + TN) / N
  # precision = TP / (TP + FP)
  # recall    = TP / (TP + FN)
  # F1 = 2 / [(1 / precision) + (1 / recall)]
  tp = 0; tn = 0; fp = 0; fn = 0
  for i in range(len(ds)):
    inpts = ds[i][0]   # Tuple style
    target = ds[i][1]  # float32 [0.0] or [1.0]
    target = target.type(T.int64)  # make it an int
    with T.no_grad():
      p = model(inpts)  # between 0.0 and 1.0
    # should really avoid 'target == 1.0'
    if target == 1 and p >= thresh:    # TP
      tp += 1
    elif target == 1 and p < thresh:   # FN
      fn += 1
    elif target == 0 and p < thresh:   # TN
      tn += 1
    elif target == 0 and p >= thresh:  # FP
      fp += 1

  N = tp + fp + tn + fn
  if N != len(ds):
    print("FATAL LOGIC ERROR in metrics()")

  accuracy = (tp + tn) / (N * 1.0)
  precision = (1.0 * tp) / (tp + fp)
  recall = (1.0 * tp) / (tp + fn)
  f1 = 2.0 / ((1.0 / precision) + (1.0 / recall))
  return (accuracy, precision, recall, f1)  # as a Tuple
# -----------------------------------------------------------
def confusion_matrix_bin(model, ds, n_classes):
  if n_classes != 2:
    print("ERROR n_classes must be 2 ")
    return None
  cm = np.zeros((n_classes,n_classes), dtype=np.int64)
  for i in range(len(ds)):
    X = ds[i][0].reshape(1,-1)  # make it a batch
    Y = ds[i][1].reshape(1)     # actual class 0.0 or 1.0, 1D
    Y = Y.type(T.int64)         # make it an int/index
    with T.no_grad():
      oupt = model(X)           # computed pseudo-probability
    if oupt < 0.5: pred_class = 0
    else: pred_class = 1
    cm[Y][pred_class] += 1
  return cm
# -----------------------------------------------------------
def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")
# -----------------------------------------------------------
def metrics_from_confusion_bin(cm):
  # return (accuracy, precision, recall, F1)
  N = 0  # total count
  dim = len(cm)
  for i in range(dim):
    for j in range(dim):
      N += cm[i][j]
  n_correct = 0
  for i in range(dim):
    n_correct += cm[i][i]  # on the diagonal
  acc = n_correct / N
  #        pred 0  pred 1
  # act 0    tn      fp
  # act 1    fn      tp
  tp = cm[1][1]
  tn = cm[0][0]
  fp = cm[0][1]  # falsely predicted as positive
  fn = cm[1][0]  # falsely predicted as negative
  pre = tp / (tp + fp)
  rec = tp / (tp + fn)
  f1 = 1.0 / ( ((1.0 / pre) + (1.0 / rec)) / 2.0 )
  return (acc, pre, rec, f1)
# -----------------------------------------------------------
class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(8, 10)  # 8-(10-10)-1
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.sigmoid(self.oupt(z))  # for BCELoss()
    return z
# ----------------------------------------------------------
def main():
  # 0. get started
  print("\nPeople gender using PyTorch ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset and DataLoader objects
  print("\nCreating People train and test Datasets ")
  train_file = ".\\Data\\people_train.txt"
  test_file = ".\\Data\\people_test.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows
  test_ds = PeopleDataset(test_file)    # 40 rows
  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create neural network
  print("\nCreating 8-(10-10)-1 binary NN classifier \n")
  net = Net().to(device)

  # 3. train network
  net.train()  # set training mode
  lrn_rate = 0.01
  loss_func = T.nn.BCELoss()  # binary cross entropy
  optimizer = T.optim.SGD(net.parameters(),
    lr=lrn_rate)
  max_epochs = 500
  ep_log_interval = 100

  print("Loss function: " + str(loss_func))
  print("Optimizer: " + str(optimizer.__class__.__name__))
  print("Learn rate: " + "%0.3f" % lrn_rate)
  print("Batch size: " + str(bat_size))
  print("Max epochs: " + str(max_epochs))

  print("\nStarting training")
  for epoch in range(0, max_epochs):
    epoch_loss = 0.0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]   # [bs,8] inputs
      Y = batch[1]   # [bs,1] targets
      oupt = net(X)  # [bs,1] computeds
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      optimizer.zero_grad()  # reset all gradients
      loss_val.backward()    # compute new gradients
      optimizer.step()       # update all weights
    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %8.4f" % \
        (epoch, epoch_loss))
  print("Done ")

  # ----------------------------------------------------------

  # 4. evaluate model
  net.eval()
  metrics_train = metrics(net, train_ds, thresh=0.5)
  print("\nMetrics for train data: ")
  print("accuracy  = %0.4f " % metrics_train[0])
  print("precision = %0.4f " % metrics_train[1])
  print("recall    = %0.4f " % metrics_train[2])
  print("F1        = %0.4f " % metrics_train[3])

  metrics_test = metrics(net, test_ds, thresh=0.5)
  print("\nMetrics for test data: ")
  print("accuracy  = %0.4f " % metrics_test[0])
  print("precision = %0.4f " % metrics_test[1])
  print("recall    = %0.4f " % metrics_test[2])
  print("F1        = %0.4f " % metrics_test[3])

  print("\nComputing confusion matrix ")
  cm = confusion_matrix_bin(net, test_ds, n_classes=2)
  # print(cm)  # raw matrix
  show_confusion(cm)

  print("\nComputing metrics from confusion ")
  (acc, pre, rec, f1) = metrics_from_confusion_bin(cm)
  print("acc = %0.4f  pre = %0.4f  rec = %0.4f  f1 = %0.4f " % \
    (acc, pre, rec, f1))

  # 5. save model
  print("\nSaving trained model state_dict ")
  # path = ".\\Models\\people_model.pt"
  # T.save(net.state_dict(), path)

  # 6. make a prediction
  print("\nSetting age = 30  Oklahoma  $40,000  moderate")
  inpt = np.array([[0.30, 0,0,1, 0.40, 0,1,0]],
    dtype=np.float32)
  inpt = T.tensor(inpt, dtype=T.float32).to(device)

  net.eval()
  with T.no_grad():
    oupt = net(inpt)  # a Tensor
  pred_prob = oupt.item()  # scalar, [0.0, 1.0]
  print("Computed output: ", end="")
  print("%0.4f" % pred_prob)
  if pred_prob < 0.5:
    print("Prediction = male")
  else:
    print("Prediction = female")

  print("\nEnd People binary demo ")

if __name__ == "__main__":
  main()
