A positive and unlabeled learning (PUL) problem occurs when a machine learning training dataset has only a few positive (class 1) labeled items and many unlabeled items, each of which could be either negative class 0 or positive class 1. For example, you might have a dataset of security information where there are only a few dozen known network attacks (class 1) but many thousands of items where it’s not known whether each is class 0 or class 1.
The goal of PUL is to use the information contained in the dataset to guess the true labels of the unlabeled data items. After the class labels of some of the unlabeled items have been guessed, the resulting labeled dataset can be used to train a binary classification model using any standard machine learning technique, such as k-nearest neighbors classification or neural binary classification.
PUL is very difficult, and many techniques have been published. But none of the published techniques I’ve seen were completely convincing in my opinion.
Several months ago I came up with an algorithm that gave me the best results I’d seen in my explorations. My algorithm was based on one I found in a rather obscure research paper. My adapted algorithm worked well, but in the back of my mind I wasn’t 100% satisfied. For one thing, the algorithm randomly selects some of the unlabeled items and temporarily assigns them as negative class 0. Why negative class 0 rather than positive class 1? Also, the algorithm did not extend from binary classification problems to multi-class classification problems. That characteristic just didn’t feel right.

A screenshot of the older algorithm that I extended.
The ideas percolated in my head for many weeks, especially when I was walking my dogs. On one such walk I came up with an extension of my old algorithm and knew for sure it would work before I even coded up a demo experiment.
The technique is moderately complicated. A neural binary classifier accepts inputs and produces a prediction in the form of a p-value between 0.0 and 1.0, where values less than 0.5 indicate negative class 0 and values greater than 0.5 indicate positive class 1. The key to my idea is to repeatedly train neural models with three types of data: all available positive class 1 items, an equal number of random noise items labeled as negative class 0, and an equal number of randomly selected unlabeled items temporarily labeled as negative class 0. Then additional models are trained where the randomly selected unlabeled items are temporarily labeled as positive class 1.
The existing positive class 1 training items provide information about what data patterns are positive. The random noise items are highly likely to be negative class 0, so they give information about negative data item patterns. The randomly selected unlabeled items give information about data item patterns in general, and because they could be either negative class 0 or positive class 1, they are treated in both ways.
Suppose you have a dataset with 200 total training items. Of these, 20 are positive class 1. The remaining 180 data items are unlabeled and could be either negative class 0 or positive class 1.
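To make the 200-item scenario concrete, here is a hedged sketch of how one phase-0 training batch could be assembled. The feature data is synthetic and the six-feature width is an arbitrary choice for illustration, not from my demo:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 200-item dataset with 6 features:
# 20 labeled positive items and 180 unlabeled items.
positives = rng.normal(size=(20, 6))
unlabeled = rng.normal(size=(180, 6))

# One phase-0 training batch: all 20 positives (class 1),
# 20 random-noise rows (class 0), and 20 randomly chosen
# unlabeled rows temporarily treated as class 0.
noise = rng.normal(size=(20, 6))
sel = rng.choice(len(unlabeled), size=20, replace=False)
X = np.vstack([positives, noise, unlabeled[sel]])
y = np.concatenate([np.ones(20), np.zeros(40)])

# The 160 unselected unlabeled items are the ones scored
# by the model trained on this batch.
unused = np.setdiff1d(np.arange(180), sel)
```

In phase 1, the only change is that the 20 selected unlabeled rows get temporary label 1 instead of 0.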
phase 0:
loop several times
create a 60-item train dataset with all 20 positive,
20 random noise labeled as negative class 0,
and 20 randomly selected unlabeled items that
are temporarily treated as negative class 0
train a binary classifier using the 60-item train data
use trained model to score the 160 unused unlabeled
data items
accumulate the p-score for each unused unlabeled item
generate a new train dataset, train, score
end-loop
phase 1:
(same as phase 0 except randomly selected unlabeled items
are now temporarily treated as positive class 1)
loop several times
create a 60-item train dataset with all 20 positive,
20 random noise labeled as negative class 0,
and 20 randomly selected unlabeled items that
are temporarily treated as positive class 1
train a binary classifier using the 60-item train data
use trained model to score the 160 unused unlabeled
data items
accumulate the p-score for each unused unlabeled item
generate a new train dataset, train, score
end-loop
for-each of the 180 unlabeled items
compute the average p-value from phases 0 and 1
    if avg p-value >= hi threshold
guess its label as positive
else
insufficient evidence to make a guess
end-if
end-for
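The two-phase pseudocode above can be sketched in Python. This is a minimal illustration under stated assumptions, not my actual demo code: it uses plain-numpy logistic regression as a stand-in for a neural binary classifier, synthetic clustered data so the hidden labels are known, and made-up hyperparameter values (10 iterations per phase, a 0.90 high threshold):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 labeled positives near +2, and 180
# unlabeled items that are secretly 40 positives (near +2) and
# 140 negatives (near -2).
dim = 4
pos_labeled = rng.normal(loc=2.0, size=(20, dim))
unlabeled = np.vstack([rng.normal(loc=2.0, size=(40, dim)),
                       rng.normal(loc=-2.0, size=(140, dim))])

def train_logreg(X, y, lr=0.1, epochs=200):
    # Plain-numpy logistic regression via gradient descent,
    # standing in for a neural binary classifier.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict(w, b, X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def run_phase(temp_label, n_iters=10):
    # temp_label = 0 for phase 0, 1 for phase 1.
    sums = np.zeros(len(unlabeled))
    counts = np.zeros(len(unlabeled))
    for _ in range(n_iters):
        sel = rng.choice(len(unlabeled), size=20, replace=False)
        noise = rng.normal(size=(20, dim))  # noise rows, class 0
        X = np.vstack([pos_labeled, noise, unlabeled[sel]])
        y = np.concatenate([np.ones(20), np.zeros(20),
                            np.full(20, float(temp_label))])
        w, b = train_logreg(X, y)
        # Score only the 160 unlabeled items not used for training,
        # and accumulate their p-scores.
        unused = np.setdiff1d(np.arange(len(unlabeled)), sel)
        sums[unused] += predict(w, b, unlabeled[unused])
        counts[unused] += 1
    return sums / np.maximum(counts, 1)

avg_p = (run_phase(0) + run_phase(1)) / 2.0
hi_threshold = 0.90  # hyperparameter
guessed_pos = np.where(avg_p >= hi_threshold)[0]
# Items below the threshold: insufficient evidence to guess.
```

Because the random-noise items are almost certainly not positives, each trained model is anchored by at least one reasonably trustworthy negative signal even though no true negative labels exist.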
I coded up a demo. The code was surprisingly long and tricky, but the algorithm worked better than any other PUL technique I’ve tried. However, there are many hyperparameters, so it’s not possible to state with certainty that this algorithm is best in any sense. But conceptually it feels absolutely correct. Notice that the algorithm naturally extends to multi-class classification by adding a phase 2, phase 3, and so on.

Unlabeled data hides its true underlying nature. During the COVID pandemic, face masks hid the true nature of people’s appearance. Here are three COVID face masks with questionable functionality. Left: A good filter for an automobile, but not so good for a person. Center: Looks good but probably isn’t very effective. Right: I suspect the face mask here was chosen for reasons other than medical ones.