Suppose you have two datasets P and Q and you want to know how similar they are. This is a surprisingly difficult problem. One technique I experimented with is to use a deep neural autoencoder to reduce each data item down to a single value, and then compute the Kullback-Leibler divergence between the two datasets.
The idea is best explained using a concrete example. Suppose the parent/reference dataset P is 10,000 lines of MNIST data. Each line has 784 pixel values followed by a single digit/label. Each of the 785 values per line is normalized to [0.0, 1.0]. Suppose the other dataset Q is 1,000 randomly selected lines from the P dataset.
The first step is to create and train an autoencoder using P that compresses each 785-value data item down to a single numeric value in [0.0, 1.0]. This is called dimensionality reduction.
Next, run the P data through the autoencoder and compute the frequencies of the 10,000 encoded values using 10 bins — [0.0, 0.1), [0.1, 0.2), . . [0.9, 1.0]. Next, run the Q data items through the autoencoder and compute their frequencies.
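As a minimal sketch of the binning step, assuming the encoded values have already been computed (the values below are made up for illustration), NumPy's histogram function does the counting. Note that np.histogram makes the last bin closed on the right, which matches the [0.9, 1.0] bin above:

```python
import numpy as np

# hypothetical encoded values produced by the autoencoder, each in [0.0, 1.0]
encoded = np.array([0.05, 0.15, 0.15, 0.35, 0.55,
                    0.55, 0.55, 0.75, 0.85, 0.95])

# 10 equal-width bins over [0.0, 1.0]
counts, edges = np.histogram(encoded, bins=10, range=(0.0, 1.0))
freqs = counts / len(encoded)  # relative frequencies; sums to 1.0
print(freqs)
```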
At this point you have two frequency arrays that look something like:
P: [0.100, 0.125, . . . 0.090] Q: [0.008, 0.140, . . . 0.105]
Now, if the two datasets are similar, the frequency values in each corresponding bin will be close to each other; if the two datasets are different, the frequency values will differ.
The last step is to compute the similarity/difference between the two frequency arrays. There are many ways to do this. Four of the most common techniques are Kullback-Leibler divergence, chi-square divergence, squared error difference, and the Kolmogorov-Smirnov statistic. Each technique has technical pros and cons, but for dataset similarity, any of them works well in most scenarios. My demo program uses K-L divergence.
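Each of the four measures can be computed in a few lines from the two frequency arrays. Here is a sketch with made-up frequency values that sum to 1.0; the demo program itself uses scipy.stats.entropy, which computes the same K-L value as the manual expression below:

```python
import numpy as np

p = np.array([0.100, 0.125, 0.080, 0.095, 0.110,
              0.090, 0.105, 0.115, 0.090, 0.090])
q = np.array([0.098, 0.140, 0.075, 0.090, 0.105,
              0.092, 0.100, 0.120, 0.085, 0.095])

kl = np.sum(p * np.log(p / q))                     # Kullback-Leibler divergence
chi2 = np.sum((p - q) ** 2 / q)                    # chi-square divergence
sq_err = np.sum((p - q) ** 2)                      # squared error difference
ks = np.max(np.abs(np.cumsum(p) - np.cumsum(q)))   # K-S statistic on the CDFs
print(kl, chi2, sq_err, ks)
```

All four values are small here because the two arrays are close; identical arrays give 0.0 for every measure.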
To summarize, the idea is to reduce multi-dimensional data down to 1-D using an autoencoder, and then use a standard classical statistics technique.
My technique is a divergence rather than a distance because the result depends on which dataset is specified as the reference dataset P. Therefore, d(P,Q) != d(Q,P) in general. If this is a problem, you can compute symmetric K-L divergence instead of regular K-L.
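A possible sketch of the symmetric version, which simply sums the two one-way divergences so that the result no longer depends on which dataset is called P (the kl helper and frequency values here are illustrative, not part of the demo program):

```python
import numpy as np

def kl(p, q, eps=1e-8):
  # KL(P || Q); eps smoothing guards against empty bins
  p = np.asarray(p, dtype=np.float64) + eps
  q = np.asarray(q, dtype=np.float64) + eps
  p = p / p.sum(); q = q / q.sum()  # renormalize after smoothing
  return np.sum(p * np.log(p / q))

def symmetric_kl(p, q):
  # KL(P||Q) + KL(Q||P): order of arguments no longer matters
  return kl(p, q) + kl(q, p)

p_freq = np.array([0.10, 0.20, 0.30, 0.40])
q_freq = np.array([0.25, 0.25, 0.25, 0.25])
print(symmetric_kl(p_freq, q_freq))
print(symmetric_kl(q_freq, p_freq))  # same value, by construction
```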
Compared to alternatives, this technique has some advantages: 1.) It can work with numeric or categorical or mixed data, as long as you encode the categorical data. 2.) It works with datasets that have unequal sizes. 3.) It is very simple. 4.) It can be easily customized by using different f-divergence functions, more frequency bins, etc. 5.) The technique is interpretable to a significant extent.
Some disadvantages: 1.) You have to specify autoencoder hyperparameters (architecture and training parameters). 2.) If the P reference/parent dataset is huge, you have the annoying engineering problem of streaming the data when training the autoencoder.
One possible use of a dataset divergence metric is for creating a coreset — a small subset of machine learning training data so you can train a model quickly. You repeatedly generate random subsets of data from the large parent dataset, and select the subset that is most similar (least divergent) to the parent.
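One way the coreset search could look, sketched with made-up encoded values standing in for the autoencoder output; the subset size, trial count, and Beta-distributed placeholder data are all arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical encoded values for the parent dataset (normally produced
# by running P through the trained autoencoder)
parent_encoded = rng.beta(2.0, 5.0, size=10_000)

def freqs(vals, n_bins=10):
  counts, _ = np.histogram(vals, bins=n_bins, range=(0.0, 1.0))
  return counts / len(vals)

def kl(p, q, eps=1e-8):
  p = p + eps; q = q + eps
  p = p / p.sum(); q = q / q.sum()
  return np.sum(p * np.log(p / q))

p_freq = freqs(parent_encoded)
best_kl, best_idx = np.inf, None
for trial in range(20):  # 20 candidate coresets of 500 items each
  idx = rng.choice(len(parent_encoded), size=500, replace=False)
  d = kl(p_freq, freqs(parent_encoded[idx]))
  if d < best_kl:
    best_kl, best_idx = d, idx
print("best KL = %0.5f" % best_kl)
```

The rows indexed by best_idx are the selected coreset.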
One thing about my dataset similarity metric that might be improved is the dimensionality reduction. A single value might not be enough to represent complex data items. So, I’ll need to explore using a latent dim greater than 1. I’m not sure exactly how this would work . . .
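One possible way a latent dimension of 2 could work is to bin the 2-D codes jointly and flatten the joint histogram back into a 1-D frequency array, after which the same K-L computation applies. This is speculative, and the codes below are random placeholders for what a 2-unit sigmoid bottleneck would emit:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical 2-D latent codes, one row per data item, each value in [0, 1)
p_codes = rng.random((10_000, 2))
q_codes = rng.random((1_000, 2))

# np.histogramdd bins every dimension at once: 10 x 10 = 100 joint bins
bins = [np.linspace(0.0, 1.0, 11)] * 2
p_counts, _ = np.histogramdd(p_codes, bins=bins)
q_counts, _ = np.histogramdd(q_codes, bins=bins)
p_freq = (p_counts / len(p_codes)).ravel()  # flatten to a 100-value array
q_freq = (q_counts / len(q_codes)).ravel()

eps = 1e-8  # smoothing, since 100 bins make empty bins more likely
kl = np.sum((p_freq + eps) * np.log((p_freq + eps) / (q_freq + eps)))
print("KL = %0.5f" % kl)
```

A practical concern is that the bin count grows exponentially with latent dimension, so higher dimensions would need coarser bins or far more data per bin.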

Two mixed media abstract paired-portrait illustrations by artists whose work I find interesting. Left: By artist Daniel Arrhakis. Right: By artist Hanneke Treffers. There are techniques which can compute the similarity of the two images, or compute the similarity of the two portraits in each image. But I prefer to just admire the art.
Code (long) below.
# divergence.py
# divergence between datasets P and Q
# assumes P and Q data normalized to [0.0, 1.0]
# PyTorch 1.7.1-CPU Anaconda3-2020.02 Python 3.7.6
# Torchvision 0.8.1+cpu
# CPU, Windows 10
import numpy as np
import time
import torch as T
import scipy.stats as sps
device = T.device("cpu")
# -----------------------------------------------------------
class MNIST_Dataset(T.utils.data.Dataset):
  # for an Autoencoder (not a classifier)
  # assumes data has been converted to text files:
  # 784 pixel values (0-255) (tab) label (0-9)
  # [0] [1] . . [783] [784]

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(785),
      delimiter="\t", comments="#", dtype=np.float32)
    self.xy_data = T.tensor(all_xy, dtype=T.float32).to(device)
    self.xy_data[:, 0:784] /= 255.0  # normalize pixels
    self.xy_data[:, 784] /= 9.0      # normalize digit labels

  def __len__(self):
    return len(self.xy_data)

  def __getitem__(self, idx):
    xy = self.xy_data[idx]
    return xy
# -----------------------------------------------------------
class Autoencoder(T.nn.Module):  # [785-400-30-1-30-400-785]
  def __init__(self):
    super(Autoencoder, self).__init__()
    self.layer1 = T.nn.Linear(785, 400)  # includes labels
    self.layer2 = T.nn.Linear(400, 30)
    self.layer3 = T.nn.Linear(30, 1)
    self.layer4 = T.nn.Linear(1, 30)
    self.layer5 = T.nn.Linear(30, 400)
    self.layer6 = T.nn.Linear(400, 785)

  def encode(self, x):  # [bs,785]
    z = T.tanh(self.layer1(x))     # [bs,400]
    z = T.tanh(self.layer2(z))     # [bs,30]
    z = T.sigmoid(self.layer3(z))  # [bs,1]
    return z

  def decode(self, x):  # [bs,1]
    z = T.tanh(self.layer4(x))     # [bs,30]
    z = T.tanh(self.layer5(z))     # [bs,400]
    z = T.sigmoid(self.layer6(z))  # [bs,785]
    return z

  def forward(self, x):
    z = self.encode(x)
    oupt = self.decode(z)
    return oupt
# -----------------------------------------------------------
def train(ae, ds, bs, me, le):
  # train autoencoder ae with dataset ds using batch size bs,
  # with max epochs me, log_every le
  data_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.MSELoss()
  opt = T.optim.Adam(ae.parameters(), lr=0.001)

  print("Starting training")
  for epoch in range(0, me):
    for (b_idx, batch) in enumerate(data_ldr):
      opt.zero_grad()
      X = batch
      oupt = ae(X)
      loss_val = loss_func(oupt, X)  # note X not Y
      loss_val.backward()
      opt.step()
    if epoch > 0 and epoch % le == 0:
      print("epoch = %6d" % epoch, end="")
      print("  curr batch loss = %7.4f" % loss_val.item(), end="")
      print("")
  print("Training complete ")
# -----------------------------------------------------------
def show_mnist_file(fn, n_lines):
  fin = open(fn, "r")
  i = 0
  for line in fin:
    line = line.strip()
    tokens = line.split("\t")
    for j in range(0, 3):  # first three pixels
      print("%4s" % tokens[j], end="")
    print(" . . . ", end="")
    for j in range(300, 304):  # four middle pixels
      print("%4s" % tokens[j], end="")
    print(" . . . ", end="")
    for j in range(781, 784):  # last three pixels
      print("%4s" % tokens[j], end="")
    print(" ** ", end="")
    print("%3s" % tokens[784])  # label / digit
    i += 1
    if i == n_lines: break
  fin.close()
# -----------------------------------------------------------
def make_random_subset_file(n_lines, src_fn, dest_fn):
  # count lines in src
  fin = open(src_fn, "r")
  ct = 0
  for line in fin:
    ct += 1
  fin.close()

  # make array of rows to select
  all_rows = np.arange(ct)
  np.random.shuffle(all_rows)
  # get some rows
  selected_rows = all_rows[0:n_lines]

  # write selected rows to dest file
  fin = open(src_fn, "r")
  fout = open(dest_fn, "w")
  i = 0
  for line in fin:
    if i in selected_rows:
      fout.write(line)  # includes new_line
    i += 1
  fout.close()
  fin.close()
# -----------------------------------------------------------
def make_freq_arr(ae, ds):
  result = np.zeros(10, dtype=np.int64)
  n = len(ds)
  for i in range(n):
    x = ds[i]
    with T.no_grad():
      xx = ae.encode(x).item()
    if xx >= 0.0 and xx < 0.1: result[0] += 1
    elif xx >= 0.1 and xx < 0.2: result[1] += 1
    elif xx >= 0.2 and xx < 0.3: result[2] += 1
    elif xx >= 0.3 and xx < 0.4: result[3] += 1
    elif xx >= 0.4 and xx < 0.5: result[4] += 1
    elif xx >= 0.5 and xx < 0.6: result[5] += 1
    elif xx >= 0.6 and xx < 0.7: result[6] += 1
    elif xx >= 0.7 and xx < 0.8: result[7] += 1
    elif xx >= 0.8 and xx < 0.9: result[8] += 1
    else: result[9] += 1
  result = (result * 1.0) / n  # convert counts to frequencies
  return result
# -----------------------------------------------------------
def main():
  # 0. get started
  print("\nBegin MNIST neural dataset divergence demo ")
  T.manual_seed(1)
  np.random.seed(1)

  p_file = ".\\Data\\mnist_train_10000.txt"
  q_file = ".\\Data\\mnist_random_1000.txt"
  make_random_subset_file(1000, p_file, q_file)
  print("\nP file = " + str(p_file))
  print("Q file = " + str(q_file))

  # 1. create Dataset objects
  print("\nCreating P and Q Datasets ")
  p_ds = MNIST_Dataset(p_file)
  q_ds = MNIST_Dataset(q_file)

  # 2. create and train autoencoder model using parent
  print("\nCreating autoencoder using P \n")
  autoenc = Autoencoder()  # 785-400-30-1-30-400-785
  autoenc.train()  # set mode
  bat_size = 10
  max_epochs = 5
  log_every = 1
  train(autoenc, p_ds, bat_size, max_epochs,
    log_every)

  # 3. TODO: save trained autoencoder

  # 4. create frequency arrays for parent dataset P
  print("\nCreating frequency arrays for P, Q ")
  p_freq = make_freq_arr(autoenc, p_ds)
  q_freq = make_freq_arr(autoenc, q_ds)
  np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
  print(p_freq)
  print(q_freq)

  # 5. compute an f-divergence: KL, chi-square, etc.
  print("\nComputing KL divergence (smaller is more similar) ")
  kl = sps.entropy(p_freq, q_freq)
  print("%0.5f" % kl)

  print("\nEnd MNIST neural dataset divergence demo")

# -----------------------------------------------------------

if __name__ == "__main__":
  main()
