Suppose you have a machine learning dataset for training, where only a few data items have a positive label (class = 1), but all the other data items are unlabeled and could be either negative (class = 0) or positive. This is called a positive and unlabeled learning (PUL) problem. PUL problems often appear in medical scenarios (only a few patients are diagnosed as class 1, all others are unknown) and in security scenarios.
To make sense of PUL data and use it to train a prediction model, you must somehow use the information contained in the PUL data to make intelligent guesses about the labels for the unlabeled items. This is called “finding reliable negatives”.
This is a very difficult problem. I’ve experimented with dozens of schemes for identifying reliable negatives in PUL data. The bottom line is that all techniques have many hyperparameters and results can vary wildly.
For my experiments, I set up a synthetic dataset with 200 items of Employee information. The data looks like:
-2  0.39  0 0 1  0.5120  0 1 0
 1  0.24  1 0 0  0.2950  0 0 1
-2  0.36  1 0 0  0.4450  0 1 0
-2  0.50  0 1 0  0.5650  0 1 0
-2  0.19  0 0 1  0.3270  1 0 0
. . .
The first column is introvert or extrovert, encoded as 1 = positive = extrovert (20 items), and -2 = unlabeled (180 items). The goal of PUL is to intelligently guess 0 = negative, or 1 = positive, for as many of the unlabeled data items as possible.
The other columns in the dataset are employee age (normalized by dividing by 100), city (one of three, one-hot encoded), annual income (normalized by dividing by $100,000), and job-type (one of three, one-hot encoded).
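The encoding described above can be sketched in a few lines. This is only an illustration: the helper names one_hot and encode_employee are hypothetical, and the city and job-type indices (0, 1, 2) are an assumed ordering, because the demo data does not name the categories.

```python
# Hypothetical helpers: encode one raw employee record into the
# 8 predictor values described above. City and job-type are passed
# as indices 0..2 (an assumed ordering, not names from the data).

def one_hot(index, n=3):
    """Return a list of n zeros with a single 1.0 at position index."""
    if not 0 <= index < n:
        raise ValueError("index out of range")
    vec = [0.0] * n
    vec[index] = 1.0
    return vec

def encode_employee(age_years, city_idx, income_dollars, job_idx):
    """age / 100, city one-hot, income / 100,000, job-type one-hot."""
    return ([age_years / 100.0] + one_hot(city_idx) +
            [income_dollars / 100000.0] + one_hot(job_idx))

row = encode_employee(39, 2, 51200.0, 1)
print(row)  # 8 values: age, 3 city flags, income, 3 job-type flags
```

With these inputs the result has the same predictor values as the first data row shown earlier (0.39, 0 0 1, 0.5120, 0 1 0).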
The dataset was artificially constructed so that even numbered items [0], [2], [4], etc. are actually class 0 = negative, and odd numbered items [1], [3], [5], etc. are actually class 1. This allows the PUL system to measure its accuracy. In a non-demo PUL scenario, you won’t know the true class labels.
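Because the true labels follow the even/odd pattern, checking the guesses is easy in the demo. Here is a minimal sketch, assuming the guesses are held in a dictionary that maps an item index to a guessed label. The function names and the sample guesses are made up for illustration.

```python
def true_label(item_index):
    """Demo-only convention: even index = class 0, odd index = class 1."""
    return item_index % 2

def guess_accuracy(guesses):
    """guesses maps item index -> guessed label (0 or 1).
    Returns (num_correct, num_wrong, accuracy).
    accuracy is None when there are no guesses."""
    correct = sum(1 for i, g in guesses.items() if g == true_label(i))
    wrong = len(guesses) - correct
    acc = correct / len(guesses) if guesses else None
    return correct, wrong, acc

# made-up guesses: item [6] is really class 0, so one guess is wrong
demo_guesses = {0: 0, 2: 0, 3: 1, 6: 1}
print(guess_accuracy(demo_guesses))  # (3, 1, 0.75)
```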
My latest exploration used this approach:
create a dataset with all 20 known positive items
  and 20 items with random inputs marked as negative
use dataset to train a binary classifier (where
  the output is a p-score between 0 and 1)
scan dataset to find the min p-score and the
  max p-score for the 20 positive items

loop each item of the PUL data
  feed item to binary classifier and
    compute the p-score
  if label = 1 then
    it's a known positive, continue
  else if p-score less-than min_p_score * 0.9
    mark this item as a reliable negative class 0
  else if p-score grtr-than max_p_score * 0.9
    mark this item as a reliable positive class 1
  else
    not enough evidence so leave as unlabeled
  end-if
end-loop
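The decision rule inside the loop can be isolated as a small function. This is a minimal sketch, not the demo program itself: the names score_range and mark_item are hypothetical, the p-scores are made-up numbers rather than classifier output, and the 0.9 factor is exposed as a parameter.

```python
UNLABELED = -2  # marker used for unlabeled items in the demo data

def score_range(positive_scores):
    """Min and max p-score over the known positive items."""
    return min(positive_scores), max(positive_scores)

def mark_item(label, p_score, min_p, max_p, factor=0.9):
    """Return 1, 0, or UNLABELED following the pseudocode rule."""
    if label == 1:
        return 1                  # known positive, nothing to decide
    if p_score < min_p * factor:
        return 0                  # reliable negative
    if p_score > max_p * factor:
        return 1                  # reliable positive
    return UNLABELED              # not enough evidence

# made-up p-scores for three known positives
min_p, max_p = score_range([0.62, 0.80, 0.91])
for p in [0.30, 0.70, 0.95]:
    print(p, mark_item(UNLABELED, p, min_p, max_p))
```

With these made-up scores the cutoffs are 0.62 * 0.9 = 0.558 and 0.91 * 0.9 = 0.819, so 0.30 is marked negative, 0.95 is marked positive, and 0.70 stays unlabeled.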
Once you have examined the PUL data and identified reliable negatives (and new reliable positives), you can either (1) repeat the process with the updated dataset, or (2) discard the items that are still unlabeled and use the remaining items to train a prediction model.
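The second option, dropping whatever is still unlabeled, amounts to a simple filter. Here is a minimal sketch that assumes each row is a list whose first value is the label (1, 0, or -2). The function name drop_unlabeled and the three sample rows are made up for illustration.

```python
UNLABELED = -2  # marker used for unlabeled items in the demo data

def drop_unlabeled(rows):
    """Keep only rows whose label (first value) is 0 or 1."""
    return [r for r in rows if int(r[0]) != UNLABELED]

# made-up rows after one marking pass: label followed by 8 predictors
rows = [
    [1,  0.24, 1, 0, 0, 0.2950, 0, 0, 1],   # known positive
    [0,  0.39, 0, 0, 1, 0.5120, 0, 1, 0],   # marked reliable negative
    [-2, 0.36, 1, 0, 0, 0.4450, 0, 1, 0],   # still unlabeled
]
train_rows = drop_unlabeled(rows)
print(len(train_rows))  # 2
```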
The ideas are conceptually very simple, but implementation is tricky. My results were quite satisfactory, but they depend on over a dozen hyperparameters (batch size, optimization algorithm, learning rate, NN architecture, weight initialization algorithm, and so on).
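To make the number of knobs concrete, here are the settings that are hard-coded in the demo program below, collected into one dictionary. The values are taken from the demo code, but the dictionary itself and its key names are hypothetical. The demo does not use such a structure.

```python
# Settings hard-coded in the demo program, gathered in one place.
# Key names are made up for illustration.
hyperparams = {
    "batch_size": 4,
    "learn_rate": 0.01,
    "max_epochs": 2000,
    "optimizer": "SGD",
    "loss_function": "BCELoss",
    "architecture": "8-(10-10)-1",
    "hidden_activation": "tanh",
    "weight_init": "xavier_uniform",
    "bias_init": "zeros",
    "num_synthetic_negatives": 20,
    "threshold_factor": 0.90,
    "seed": 1,
}
for name, value in hyperparams.items():
    print("%-24s %s" % (name, value))
```

Changing any one of these can change which items get marked as reliable.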
Interesting topic.

Here are three cars made in 1970 that routinely show up in Internet searches for “ugliest cars of the 70s” and so they’d be labeled class 1 = positive (ugly). But I would assign a class label of class 0 = not ugly to all three. Left: AMC Javelin AMX (a competitor to the Ford Mustang of the time). Center: Datsun (Nissan) 510 in front of Univ. of Calif. at Irvine which was under construction at the time. I had this model of car and went to UCI when it was still under construction. Right: AMC Pacer. Weird but appealing (to me) car with a passenger side door that was 4 inches longer than the driver side door!
Code (PyTorch) below. Long.
# employee_pul_find_reliables.py
# PyTorch 1.9.0-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10
# load all 20 known positives = 1, create 20 random input
# labelled as negative = 0
import numpy as np
import torch as T
device = T.device("cpu") # apply to Tensor or Module
# ----------------------------------------------------------
class ExploreDataset(T.utils.data.Dataset):
  # label  age   city     income   job-type
  #   1    0.39  1 0 0    0.5432   1 0 0
  #  -2    0.29  0 0 1    0.4985   0 1 0   (unlabeled)
  #  . . .
  #  [0]   [1]  [2 3 4]    [5]    [6 7 8]

  def __init__(self, fn):
    self.rnd = np.random.RandomState(1)
    tmp_x = np.zeros((40,8), dtype=np.float32)
    tmp_y = np.zeros(40, dtype=np.float32)

    # 1. load just the 20 known positives into memory
    i = 0
    f = open(fn, "r")
    for line in f:
      line = line.strip()
      if line.startswith("#"): continue
      arr = np.fromstring(line, sep="\t", dtype=np.float32)
      if int(arr[0]) == 1:  # known positive
        tmp_y[i] = arr[0]
        tmp_x[i][0] = arr[1]
        tmp_x[i][1] = arr[2]
        tmp_x[i][2] = arr[3]
        tmp_x[i][3] = arr[4]
        tmp_x[i][4] = arr[5]
        tmp_x[i][5] = arr[6]
        tmp_x[i][6] = arr[7]
        tmp_x[i][7] = arr[8]
        i += 1
    f.close()
    tmp_y = tmp_y.reshape(-1,1)  # 2D

    # 2. create 20 synthetic items labelled as negative = 0
    for i in range(20, 40):
      # tmp_y[i] = 0  # is already 0
      tmp_x[i][0] = self.rnd.random()  # age
      city = self.rnd.randint(0,3)
      if city == 0: tmp_x[i][1] = 1
      elif city == 1: tmp_x[i][2] = 1
      elif city == 2: tmp_x[i][3] = 1
      tmp_x[i][4] = self.rnd.random()  # income
      job = self.rnd.randint(0,3)
      if job == 0: tmp_x[i][5] = 1
      elif job == 1: tmp_x[i][6] = 1
      elif job == 2: tmp_x[i][7] = 1

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx,:]  # idx rows, all 8 cols
    lbl = self.y_data[idx,:]    # idx rows, the only col
    sample = { 'predictors' : preds, 'lbl' : lbl }
    return sample

# ----------------------------------------------------------
class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(8, 10)  # 8-(10-10)-1
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.sigmoid(self.oupt(z))  # see BCELoss() below
    return z

# ----------------------------------------------------------
def train(net, ds, bs, me, le, lr, verbose):
  # NN, dataset, batch_size, max_epochs,
  # log_every, learn_rate. optimizer and loss hard-coded.
  data_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.BCELoss()  # assumes sigmoid activation
  opt = T.optim.SGD(net.parameters(), lr=lr)
  for epoch in range(0, me):
    epoch_loss = 0.0
    for (batch_idx, batch) in enumerate(data_ldr):
      X = batch['predictors']  # inputs
      Y = batch['lbl']         # 0 or 1 targets
      opt.zero_grad()                # prepare gradients
      oupt = net(X)                  # compute output/target
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate for display
      loss_val.backward()            # compute gradients
      opt.step()                     # update weights
    if epoch % le == 0 and verbose == True:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))

# ----------------------------------------------------------
def main():
  # 0. get started
  print("\nBegin PUL two-step: find reliables ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset and DataLoader objects
  print("\nCreating Employee exploration Dataset ")
  pul_file = ".\\Data\\employee_pul_200.txt"
  train_ds = ExploreDataset(pul_file)

  # 2. create neural network
  print("\nCreating 8-(10-10)-1 binary NN classifier ")
  net = Net().to(device)
  net.train()  # set mode

  # 3. train
  print("\nSetting training parameters: ")
  bat_size = 4
  lrn_rate = 0.01
  max_epochs = 2000
  log_every = 500
  print("batch size = " + str(bat_size))
  print("lrn_rate = %0.2f " % lrn_rate)
  print("max_epochs = " + str(max_epochs))
  print("loss function = BCELoss() ")
  print("optimizer = SGD ")

  print("\nStarting training")
  train(net, train_ds, bat_size, max_epochs,
    log_every, lrn_rate, verbose=True)
  print("Training complete ")

  # 4. score the 20 known positives
  print("\nScoring the 20 known positives ")
  min_score = 1.0; max_score = 0.0
  net.eval()
  for i in range(20):
    x = train_ds[i]['predictors']
    with T.no_grad():
      p = net(x)
    # two separate checks so one item can update both min and max
    if p.item() < min_score: min_score = p.item()
    if p.item() > max_score: max_score = p.item()
  print("Min score for known positives: %0.4f" % min_score)
  print("Max score for known positives: %0.4f" % max_score)

  # 5. scan and score the unlabelled items.
  # if p-score is less than min_score * 0.90, mark item as negative
  # if p-score is grtr than max_score * 0.90, mark item as positive
  # because there's no training, no Dataset is needed
  # label  age   city     income   job-type
  #   1    0.39  1 0 0    0.5432   1 0 0
  #  -2    0.29  0 0 1    0.4985   0 1 0   (unlabeled)
  #  . . .
  #  [0]   [1]  [2 3 4]    [5]    [6 7 8]

  print("\nScanning unlabelled data ")
  pul_data = np.loadtxt(pul_file, usecols=[0,1,2,3,4,5,6,7,8],
    delimiter="\t", skiprows=0, comments="#",
    dtype=np.float32)

  for i in range(len(pul_data)):
    if i >= 4 and i <= 195: continue  # just show a few
    x = T.tensor(pul_data[i][1:9], dtype=T.float32).to(device)
    with T.no_grad():
      p = net(x)
    print("")
    print(x)
    print("score = %0.4f " % p.item())

    if int(pul_data[i][0]) == 1:
      print("existing known positive class 1 item ")
    elif p.item() < min_score * 0.90:
      print("marking this unlabelled as reliable negative class 0 ")
    elif p.item() > max_score * 0.90:
      print("marking this unlabelled as reliable positive class 1 ")
    else:
      print("not enough evidence to mark this item")

  print("\nEnd PUL two-step find reliables demo")

if __name__ == "__main__":
  main()
