I recently published a blog post where I described an algorithm for computing the similarity between two datasets. (See https://jamesmccaffreyblog.com/2021/03/10/computing-the-similarity-of-two-datasets/) More precisely, the algorithm computes the divergence (asymmetric dissimilarity) between two datasets. Based on comments from two of my colleagues (Ziqi M. and KC T.), I made an improvement to my algorithm.
If you haven't worked on this problem before, it's much more difficult than you'd expect. You have to deal with unequal dataset sizes and non-numeric data, and you can't directly compare every item in one dataset with every item in the other because that is O(n^2).

Demo run of the improved algorithm that uses a latent dim larger than 1.
My original algorithm to compare a reference dataset P with another dataset Q starts by training an autoencoder on the P dataset, where the autoencoder reduces each data item to a single latent variable (so the latent dim = 1). Suppose P has 10,000 data items and Q has 1,000 data items. Next, you run each P data item through the encoder, giving you 10,000 values, each between 0.0 and 1.0. Using these, you create a frequency vector of the percentage of P items in [0.0, 0.1), [0.1, 0.2), . . . (0.9, 1.0], so the frequency vector has len = 10.
Then you run each of the 1,000 Q items through the encoder, get their latent values, and use them to construct a second frequency vector.
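The post doesn't show code for the binning step, but it's easy to sketch. Here's a minimal version using NumPy, with uniform random values standing in for the encoder's latent outputs (the function and variable names are my own, not from the original post):

```python
import numpy as np

def make_freq_vector(latent_vals, n_bins=10):
    # latent_vals: 1D array of encoder outputs, each in [0.0, 1.0]
    # count how many values fall into each equal-width bin;
    # np.histogram puts 1.0 into the last bin, matching (0.9, 1.0]
    counts, _ = np.histogram(latent_vals, bins=n_bins, range=(0.0, 1.0))
    # convert counts to proportions so that vectors built from
    # datasets of different sizes are directly comparable
    return counts / len(latent_vals)

rng = np.random.default_rng(0)
p_latents = rng.random(10_000)  # stand-in for encoder outputs on P
q_latents = rng.random(1_000)   # stand-in for encoder outputs on Q
p_freq = make_freq_vector(p_latents)
q_freq = make_freq_vector(q_latents)
```

Dividing by the number of items is what makes the 10,000-item P vector and the 1,000-item Q vector comparable: both sum to 1.0.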
At this point you have two frequency vectors that look something like:
P: [0.100, 0.125, . . . 0.090]
Q: [0.008, 0.140, . . . 0.105]
If the two datasets are similar, the frequency values in corresponding bins will be close to each other; if the two datasets are different, the frequency vector values will differ.
The last step of the original algorithm is to compute the dissimilarity between the two frequency vectors using Kullback-Leibler divergence. The KL value measures how dissimilar the two datasets are: a small value means the datasets are similar, and larger values mean they are more dissimilar. The value is a divergence rather than a distance because KL(P,Q) != KL(Q,P) in general.
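A minimal KL implementation over two frequency vectors might look like the sketch below. The small epsilon is one common way to guard against empty bins (which would produce log(0) or division by zero); the original post doesn't specify how it handles that case:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i)
    # eps smooths empty bins so the log is always defined
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p = p / p.sum()  # re-normalize after smoothing
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# two made-up frequency vectors, each summing to 1.0
p_freq = [0.10, 0.12, 0.08, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
q_freq = [0.05, 0.15, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]

print(kl_divergence(p_freq, q_freq))  # small positive value
print(kl_divergence(p_freq, p_freq))  # 0.0 for identical distributions
```

Note that kl_divergence(p_freq, q_freq) and kl_divergence(q_freq, p_freq) give different results, which is exactly the asymmetry that makes KL a divergence rather than a distance.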
My colleagues suggested that I increase the latent dimension from 1 to a larger value. The thought is that a single value might not be enough to represent a complex data item. So, that’s what I did.
I coded up a demo using 10,000 MNIST data items as the P dataset and 1,000 randomly selected items from P as the Q dataset. I used a latent dim = 4 and sigmoid activation so each data item is represented by a vector with 4 values between 0.0 and 1.0. This leads to 4 frequency vectors for P and 4 frequency vectors for Q, which in turn leads to 4 KL values, where each value is the divergence on one of the 4 latent variables, for example:
0.01538 0.03433 0.02568 0.04014
In the original algorithm there was only one KL value, so that was the similarity result. With a larger latent dim, the question is whether to report the four KL values themselves as the result, or to combine them, for example by computing their simple average (0.0289) or perhaps the average of their squared differences from 0. Both approaches could be useful depending on the problem scenario.
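The two combining options are simple to compute. Here I interpret "average of squared differences from 0" as the mean of the squared KL values; that is my reading, not something the post states explicitly:

```python
import numpy as np

# the four per-latent-variable KL divergences from the demo run
kls = np.array([0.01538, 0.03433, 0.02568, 0.04014])

simple_avg = kls.mean()       # simple average of the four KL values
mean_sq = np.mean(kls ** 2)   # mean of the squared KL values

print(f"simple average:  {simple_avg:.4f}")  # 0.0289
print(f"mean of squares: {mean_sq:.5f}")
```

The mean-of-squares variant weights larger per-variable divergences more heavily, which could matter if dissimilarity is concentrated in just one or two latent variables.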
The next step in this mini-project will be to run some experiments. I’m imagining that I can start with some dataset P, and then programmatically create different Q datasets that add increasing amounts of randomness. The divergence metric(s) should increase as the amount of randomness in Q increases. Well, we’ll see.
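One possible way to generate such Q datasets is to sample items from P and then overwrite an increasing fraction of each item's values with uniform random noise. This is only a sketch of the experiment idea, assuming normalized data in [0, 1]; the helper name and the noise scheme are my own:

```python
import numpy as np

def make_noisy_q(p_data, n_items, noise_frac, rng):
    # sample n_items rows from P without replacement, then replace
    # a noise_frac fraction of each item's values with uniform noise
    idx = rng.choice(len(p_data), size=n_items, replace=False)
    q = p_data[idx].copy()
    mask = rng.random(q.shape) < noise_frac
    q[mask] = rng.random(mask.sum())
    return q

rng = np.random.default_rng(1)
p = rng.random((10_000, 784))  # stand-in for normalized MNIST pixels
for frac in (0.0, 0.25, 0.50):
    q = make_noisy_q(p, 1_000, frac, rng)
    print(frac, q.shape)
```

If the divergence metric behaves well, it should increase roughly monotonically with noise_frac.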
Good fun.

For some reason, exploring algorithms for dataset similarity reminds me of when I was first learning computer science and I was exploring different sorting algorithms such as insertion sort, selection sort, quick sort, and of course, bubble sort.