Finding Reliable Negatives For Positive and Unlabeled Learning (PUL) Datasets

Suppose you have a machine learning dataset for training, where only a few data items have a positive label (class = 1), but all the other data items are unlabeled and could be either negative (class = 0) or positive. This is called a positive and unlabeled learning (PUL) problem. PUL problems often appear in medical scenarios (only a few patients are diagnosed as class 1, all others are unknown) and in security scenarios.

To make sense of PUL data and use it to train a prediction model, you must somehow use the information contained in the PUL data to make intelligent guesses about the labels for the unlabeled items. This is called “finding reliable negatives”.

This is a very difficult problem. I’ve experimented with dozens of schemes for identifying reliable negatives in PUL data. The bottom line is that all techniques have many hyperparameters and results can vary wildly.

For my experiments, I set up a synthetic dataset with 200 items of Employee information. The data looks like:

-2   0.39   0   0   1   0.5120   0   1   0
 1   0.24   1   0   0   0.2950   0   0   1
-2   0.36   1   0   0   0.4450   0   1   0
-2   0.50   0   1   0   0.5650   0   1   0
-2   0.19   0   0   1   0.3270   1   0   0
. . .

The first column is introvert or extrovert, encoded as 1 = positive = extrovert (20 items), and -2 = unlabeled (180 items). The goal of PUL is to intelligently guess 0 = negative, or 1 = positive, for as many of the unlabeled data items as possible.

The other columns in the dataset are employee age (normalized by dividing by 100), city (one of three, one-hot encoded), annual income (normalized by dividing by $100,000), and job-type (one of three, one-hot encoded).
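
As a rough illustration of that encoding, here is a minimal sketch. The helper name encode_employee and its raw-value parameters (age in years, income in dollars, a 0-2 index for city and job-type) are my assumptions for illustration and are not part of the demo program.

```python
import numpy as np

def encode_employee(age, city_idx, income, job_idx):
    # Hypothetical helper: age in years, income in dollars,
    # city_idx and job_idx are each 0, 1, or 2.
    x = np.zeros(8, dtype=np.float32)
    x[0] = age / 100.0           # normalized age
    x[1 + city_idx] = 1.0        # one-hot city, slots [1] to [3]
    x[4] = income / 100_000.0    # normalized income
    x[5 + job_idx] = 1.0         # one-hot job-type, slots [5] to [7]
    return x

x = encode_employee(39, 2, 51_200.0, 1)
print(x)
```

With these inputs the eight predictor values line up with the first data row shown above (age 0.39, third city, income 0.5120, second job-type).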

The dataset was artificially constructed so that even numbered items [0], [2], [4], etc. are actually class 0 = negative, and odd numbered items [1], [3], [5], etc. are actually class 1. This allows the PUL system to measure its accuracy. In a non-demo PUL scenario, you won’t know the true class labels.
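
Because the true labels follow the even/odd pattern, checking a set of guesses takes only a few lines. This is a sketch, not code from the demo: the function names and the made-up guesses are hypothetical.

```python
def true_label(idx):
    # Demo data only: even index -> class 0, odd index -> class 1.
    return idx % 2

def accuracy_of_guesses(guesses):
    # guesses: dict mapping item index -> guessed label (0 or 1).
    # Items left unlabeled are simply absent from the dict.
    if len(guesses) == 0:
        return 0.0
    n_correct = sum(1 for (i, g) in guesses.items()
                    if g == true_label(i))
    return n_correct / len(guesses)

demo_guesses = {0: 0, 2: 0, 3: 1, 4: 1}  # made-up guesses
acc = accuracy_of_guesses(demo_guesses)
print("accuracy = %0.4f" % acc)
```

Here item [4] is guessed as class 1 but is actually class 0, so three of the four guesses are correct.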

My latest exploration used this approach:

create a dataset with all 20 known positive items
 and 20 items with random inputs marked as negative

use dataset to train a binary classifier (where
 the output is a p-score between 0 and 1)

scan dataset to find min p-score for the 20 
 positive items and the max p-score

loop each item of the PUL data
  feed item to binary classifier and
   compute the p-score
  if label = 1 then
    it's a known positive, continue
  else if p-score less-than min_p_score * 0.9
    mark this item as a reliable negative class 0
  else if p-score grtr-than max_p_score * 0.9
    mark this item as a reliable positive class 1
  else
    not enough evidence so leave as unlabeled
  end-if
end-loop
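
The decision rule inside the loop can be sketched as a small pure function. The name mark_item and the factor parameter are mine for illustration. The 0.9 default mirrors the pseudocode above, and the scores in the usage lines are made up.

```python
def mark_item(label, p, min_p, max_p, factor=0.9):
    # label: 1 = known positive, -2 = unlabeled
    # p: classifier p-score for this item
    # min_p, max_p: min and max p-score over the known positives
    if label == 1:
        return 1                  # known positive, keep as-is
    if p < min_p * factor:
        return 0                  # reliable negative
    if p > max_p * factor:
        return 1                  # reliable positive
    return -2                     # not enough evidence

# Made-up scores, assuming min_p = 0.60 and max_p = 0.95.
results = [mark_item(-2, 0.30, 0.60, 0.95),
           mark_item(-2, 0.90, 0.60, 0.95),
           mark_item(-2, 0.70, 0.60, 0.95),
           mark_item(1, 0.10, 0.60, 0.95)]
print(results)
```

Keeping the rule separate from the scoring loop makes it easy to try different threshold factors, which is one of the many hyperparameters.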

Once you have examined the PUL data and identified reliable negatives (and new reliable positives), you can either 1.) repeat the process with the updated dataset, or 2.) toss out the unlabeled items and then use the dataset to train a prediction model.
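
Option 2 might look like the following sketch, which assumes the marked data is held in a NumPy matrix with the label in column [0]. The labels in the four rows are made up for illustration.

```python
import numpy as np

# Made-up labels after marking: 0, 1, or -2 (still unlabeled).
data = np.array([
  [ 0, 0.39, 0, 0, 1, 0.5120, 0, 1, 0],
  [ 1, 0.24, 1, 0, 0, 0.2950, 0, 0, 1],
  [-2, 0.36, 1, 0, 0, 0.4450, 0, 1, 0],
  [ 0, 0.50, 0, 1, 0, 0.5650, 0, 1, 0]], dtype=np.float32)

keep = data[:, 0] != -2                    # True for labeled rows
train_data = data[keep]                    # toss out unlabeled rows
train_x = train_data[:, 1:9]               # 8 predictor columns
train_y = train_data[:, 0].reshape(-1, 1)  # labels as a 2D column

print(train_x.shape, train_y.shape)
```

The resulting train_x and train_y have the same layout the demo Dataset class uses, so they could be wrapped in tensors and fed to a classifier.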

The ideas are conceptually simple, but implementation is tricky. My results were quite satisfactory, but they depend on over a dozen hyperparameters (batch size, optimization algorithm, learning rate, NN architecture, weight initialization algorithm, and so on).

Interesting topic.

Here are three cars made in 1970 that routinely show up in Internet searches for “ugliest cars of the 70s” and so they’d be labeled class 1 = positive (ugly). But I would assign a class label of class 0 = not ugly to all three. Left: AMC Javelin AMX (a competitor to the Ford Mustang of the time). Center: Datsun (Nissan) 510 in front of Univ. of Calif. at Irvine which was under construction at the time. I had this model of car and went to UCI when it was still under construction. Right: AMC Pacer. Weird but appealing (to me) car with a passenger side door that was 4 inches longer than the driver side door!

Code (PyTorch) below. Long.

# employee_pul_find_reliables.py
# PyTorch 1.9.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

# load all 20 known positives = 1, create 20 random inputs
# labelled as negative = 0

import numpy as np
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

# ----------------------------------------------------------

class ExploreDataset(T.utils.data.Dataset):
  # label  age   city   income   job-type
  #   1    0.39  1 0 0  0.5432   1 0 0
  #  -2    0.29  0 0 1  0.4985   0 1 0  (unlabeled)
  # . . .
  #  [0]   [1]  [2 3 4]   [5]   [6 7 8]

  def __init__(self, fn):
    self.rnd = np.random.RandomState(1)

    tmp_x = np.zeros((40,8), dtype=np.float32)
    tmp_y = np.zeros(40, dtype=np.float32)

    # 1. load just the 20 known positives into memory
    i = 0
    f = open(fn, "r")  
    for line in f:
      line = line.strip()
      if line.startswith("#"): continue
      arr = np.fromstring(line, sep="\t", dtype=np.float32)
      if int(arr[0]) == 1:  # known positive
        tmp_y[i] = arr[0]

        tmp_x[i][0] = arr[1]
        tmp_x[i][1] = arr[2]
        tmp_x[i][2] = arr[3]
        tmp_x[i][3] = arr[4]
        tmp_x[i][4] = arr[5]
        tmp_x[i][5] = arr[6]
        tmp_x[i][6] = arr[7]
        tmp_x[i][7] = arr[8]
        i += 1
    f.close()
    tmp_y = tmp_y.reshape(-1,1)  # 2D

    # 2. create 20 synthetic items labelled as negative = 0
    for i in range(20, 40):
      # tmp_y[i] = 0  # is already 0
      tmp_x[i][0] = self.rnd.random()  # age
      city = self.rnd.randint(0,3)
      if city == 0: tmp_x[i][1] = 1
      elif city == 1: tmp_x[i][2] = 1
      elif city == 2: tmp_x[i][3] = 1
      tmp_x[i][4] = self.rnd.random()  # income 
      job = self.rnd.randint(0,3)
      if job == 0: tmp_x[i][5] = 1
      elif job == 1: tmp_x[i][6] = 1
      elif job == 2: tmp_x[i][7] = 1           

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx,:]  # idx rows, all 8 cols
    lbl = self.y_data[idx,:]    # idx rows, the only col
    sample = { 'predictors' : preds, 'lbl' : lbl }
    return sample

# ----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(8, 10)  # 8-(10-10)-1
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight) 
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight) 
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight) 
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.sigmoid(self.oupt(z))  # see BCELoss() below
    return z

# ----------------------------------------------------------

def train(net, ds, bs, me, le, lr, verbose):
  # NN, dataset, batch_size, max_epochs,
  # log_every, learn_rate. optimizer and loss hard-coded.

  data_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.BCELoss()  # assumes sigmoid activation
  opt = T.optim.SGD(net.parameters(), lr=lr)
  for epoch in range(0, me):
    epoch_loss = 0.0
    for (batch_idx, batch) in enumerate(data_ldr):
      X = batch['predictors']  # inputs
      Y = batch['lbl']         # 0 or 1 targets

      opt.zero_grad()                # prepare gradients
      oupt = net(X)                  # compute output
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate for display
      loss_val.backward()            # compute gradients
      opt.step()                     # update weights

    if epoch % le == 0 and verbose == True:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))

# ----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PUL two-step: find reliables ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset and DataLoader objects
  print("\nCreating Employee exploration Dataset ")

  pul_file = ".\\Data\\employee_pul_200.txt"
  train_ds = ExploreDataset(pul_file)

  # 2. create neural network
  print("\nCreating 8-(10-10)-1 binary NN classifier ")
  net = Net().to(device)
  net.train()  # set mode

  # 3. train
  print("\nSetting training parameters: ")
  bat_size = 4
  lrn_rate = 0.01
  max_epochs = 2000
  log_every = 500

  print("batch size = " + str(bat_size))
  print("lrn_rate = %0.2f " % lrn_rate)
  print("max_epochs = " + str(max_epochs))
  print("loss function = BCELoss() ")
  print("optimizer = SGD ")

  print("\nStarting training")
  train(net, train_ds, bat_size, max_epochs,
    log_every, lrn_rate, verbose=True)
  print("Training complete ")

  # 4. score the 20 known positives
  print("\nScoring the 20 known positives ")
  min_score = 1.0; max_score = 0.0
  net.eval()
  for i in range(20):
    x = train_ds[i]['predictors']
    with T.no_grad():
      p = net(x)
    if p.item() < min_score: min_score = p.item()
    if p.item() > max_score: max_score = p.item()

  print("Min score for known positives: %0.4f" % min_score)
  print("Max score for known positives: %0.4f" % max_score)

  # 5. scan and score the unlabelled items.
  # if p-score is less than 0.90 * min_score, mark item as negative
  # if p-score is grtr than 0.90 * max_score, mark item as positive
  # because there's no training, no need for a Dataset

  # label  age   city   income   job-type
  #   1    0.39  1 0 0  0.5432   1 0 0
  #  -2    0.29  0 0 1  0.4985   0 1 0  (unlabeled)
  # . . .
  #  [0]   [1]  [2 3 4]   [5]   [6 7 8]

  print("\nScanning unlabelled data ")

  pul_data = np.loadtxt(pul_file, usecols=[0,1,2,3,4,5,6,7,8],
    delimiter="\t", skiprows=0, comments="#",
    dtype=np.float32)

  for i in range(len(pul_data)):
    if i >= 4 and i <= 195: continue  # just show a few
    x = T.tensor(pul_data[i][1:9], dtype=T.float32).to(device)
    with T.no_grad():
      p = net(x)

    print("")
    print(x)
    print("score = %0.4f " % p.item())
    if int(pul_data[i][0]) == 1:
      print("existing known positive class 1 item ")
    elif p.item() < min_score * 0.90:
      print("marking this unlabelled as reliable negative class 0 ")
    elif p.item() > max_score * 0.90:
      print("marking this unlabelled as reliable positive class 1 ")
    else:
      print("not enough evidence to mark this item")

  print("\nEnd PUL two-step find reliables demo")

if __name__== "__main__":
  main()