Researchers Devise a New Machine Learning Algorithm for Dataset Similarity

I contributed to an article titled “Researchers Devise a New Machine Learning Algorithm for Dataset Similarity” in the April 2021 edition of the Pure AI Web site. See https://pureai.com/articles/2021/04/08/similarity-algorithm.aspx.

A dataset similarity metric is a number that compares two sets of data and tells you how similar (or equivalently, how different) the two sets of data are. This sounds like a very easy problem, but in fact, it’s extremely difficult and is essentially an open question in computer science.

There are many ways to compare two individual data items. Two of the most common are Euclidean distance and cosine similarity. But you can’t use a brute force approach to compare all the data items in two different datasets because there are just too many comparisons in all but the most trivial problem scenarios.
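To make the two item-level measures concrete, here is a minimal sketch in plain Python. The function names, the sample vectors, and the 10,000-row dataset sizes are all made up for illustration; they are not from the Pure AI article. The last function just counts how many item-to-item comparisons a brute force approach would need.

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared component differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_comparisons(rows_p, rows_q):
    # Every item in P compared against every item in Q.
    return rows_p * rows_q

# Illustrative vectors: b points in the same direction as a, but is longer.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print(euclidean_distance(a, b))   # nonzero: the items are not identical
print(cosine_similarity(a, b))    # close to 1: same direction
print(brute_force_comparisons(10_000, 10_000))  # hypothetical dataset sizes
```

Even for two modest datasets of 10,000 rows each, brute force needs 100 million pairwise comparisons, and you would still have to decide how to combine all those numbers into a single value.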

Additionally, you have to deal with mixed numeric and non-numeric data, with datasets that have different sizes (number of rows), and even with datasets that have completely different columns.


Example of the AKLD metric when applied to two subsets of the MNIST handwritten digits dataset. As the difference between datasets increases, the value of the AKLD metric increases.

The Pure AI article describes a metric that was developed in a project I was working on. My team members included: Z. Ma, P. Mineiro, KC Tung, and RA Chavali.

The technique is called “autoencoded Kullback-Leibler divergence” (AKLD). Yes, it’s an ugly name. The idea is to run the first dataset P through an autoencoder and generate n frequency distributions of encoded values. Then you run the second dataset Q through the autoencoder and get a second set of n distributions. Then you apply the Kullback-Leibler divergence metric to each pair of distributions and average the results. A final result of 0 indicates identical datasets. Larger values indicate increasing difference.
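Here is a rough sketch of the binning, divergence, and averaging steps in plain Python. It is not the project's actual code. It skips the autoencoder entirely and assumes you already have encoded values for each dataset, with each item encoded as n numbers between 0 and 1. The function names, the choice of 10 bins, and the add-one smoothing (used to avoid dividing by zero on empty bins) are all my assumptions for illustration.

```python
import math

def histogram(values, bins):
    # Frequency distribution of values assumed to lie in [0, 1].
    counts = [0] * bins
    for v in values:
        idx = min(max(int(v * bins), 0), bins - 1)  # clamp into a valid bin
        counts[idx] += 1
    # Add-one smoothing (an assumption) so no bin has zero probability.
    total = len(values) + bins
    return [(c + 1) / total for c in counts]

def kl_divergence(p, q):
    # KL(P || Q) = sum of p_i * ln(p_i / q_i) over all bins.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def akld_from_encodings(enc_p, enc_q, bins=10):
    # enc_p, enc_q: lists of encoded items, each a list of n values.
    n = len(enc_p[0])
    total = 0.0
    for j in range(n):
        p = histogram([row[j] for row in enc_p], bins)
        q = histogram([row[j] for row in enc_q], bins)
        total += kl_divergence(p, q)
    return total / n  # average over the n encoded components

# Made-up encodings standing in for autoencoder output (n = 2).
enc_p = [[0.10, 0.90], [0.20, 0.80], [0.15, 0.85], [0.12, 0.88]]
enc_q = [[0.70, 0.30], [0.80, 0.20], [0.75, 0.25], [0.72, 0.28]]

print(akld_from_encodings(enc_p, enc_p))  # identical datasets give 0.0
print(akld_from_encodings(enc_p, enc_q))  # different datasets give a positive value
```

Notice that the two datasets don't need the same number of rows, because each one is reduced to frequency distributions before anything is compared.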

The AKLD dataset similarity metric was a side artifact of the project I was working on. An analogy is when airplane designers make a special tool as part of the design process, and that special tool turns out to have uses beyond the original project.



Three mixed media portraits by artists I like. They are similar in a certain respect because all three portraits feature a radial design centered on the portrait face. Left: by artist Andrea Matus Demeng. Center: by artist Stanislaw Krupp. Right: by artist Karol Bak.
