Transformer-Based Dataset Similarity for MNIST

Computing a measure of the similarity of two datasets is a very difficult problem. Several months ago, I worked on a project and devised an algorithm based on a neural autoencoder. See jamesmccaffrey.wordpress.com/2021/04/02/dataset-similarity-sometimes-ideas-work-better-than-expected/.

I wondered if I could adapt my dataset similarity algorithm to use a neural transformer architecture instead of a neural autoencoder architecture. After a few hours of experimentation, I got a promising demo up and running. I used the UCI Digits dataset for my experiments. Each UCI data item is a crude grayscale image of a handwritten digit from ‘0’ to ‘9’. Each image has 8 by 8 = 64 pixels. Each pixel is a value between 0 (white) and 16 (black). The UCI Digits dataset is basically a small, simplified version of the well-known MNIST dataset. See jamesmccaffrey.wordpress.com/2022/10/03/the-distance-between-two-datasets-using-transformer-encoding/.

So, I then wondered if I could adapt the transformer-based dataset similarity experiment from the UCI Digits dataset to the more difficult MNIST dataset. After a bit of work, I got an example running. For my demo, I compared 1000 items from the MNIST training data with 1000 items where 400 of the items had been randomized. The experiment worked as expected, but . . .

I observed that the demo took a long time to run. This is because the self-attention mechanism at the heart of the transformer architecture runs in O(n^2) time, where n is the sequence length. In my original transformer experiment, the UCI Digits items have n = 65 inputs (64 pixel values plus 1 label), but MNIST has n = 785 inputs (784 pixels plus 1 label).
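To put that in perspective, here is the back-of-the-envelope arithmetic comparing the per-layer attention cost for the two sequence lengths:

```python
# Self-attention cost grows quadratically with sequence length n.
uci_n = 65     # UCI Digits: 64 pixel values + 1 label
mnist_n = 785  # MNIST: 784 pixel values + 1 label

ratio = (mnist_n ** 2) / (uci_n ** 2)
print(f"MNIST attention is roughly {ratio:.0f}x "
      "more work per layer than UCI Digits")  # ~146x
```

So each encoder layer does roughly 146 times more attention work on MNIST than on UCI Digits, which explains the long run time.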



The key component of my demo code is:

class AutoencoderTransformer(T.nn.Module):  # 785-xx-4-30-400-785 
  def __init__(self):
    # 785 numeric inputs: no exact word embedding equivalent
    # pseudo embed_dim = 2
    # seq_len = 785
    super(AutoencoderTransformer, self).__init__() 

    self.fc1 = T.nn.Linear(785, 785*2)  # pseudo-embedding
    self.fc2 = T.nn.Linear(785*2, 4)

    self.pos_enc = \
      PositionalEncoding(2, dropout=0.00)  # positional

    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=2,
      nhead=2, dim_feedforward=100, dropout=0.0,
      batch_first=True)  # d_model divisible by nhead

    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=6)

    self.dec1 = T.nn.Linear(4, 30) 
    self.dec2 = T.nn.Linear(30, 400)
    self.dec3 = T.nn.Linear(400, 785)

    # use default weight initialization

    self.latent_dim = 4

  def encode(self, x):           # x is [bs, 785]
    z = T.tanh(self.fc1(x))      # [bs, 1570]
    z = z.reshape(-1, 785, 2)    # [bs, 785, 2]
    z = self.pos_enc(z)          # [bs, 785, 2]
    z = self.trans_enc(z)        # [bs, 785, 2]
    z = z.reshape(-1, 785*2)     # [bs, 1570]
    z = T.sigmoid(self.fc2(z))   # [bs, 4]
    return z

  def decode(self, x):
    z = T.tanh(self.dec1(x))     # [bs, 30]
    z = T.tanh(self.dec2(z))     # [bs, 400]
    z = T.sigmoid(self.dec3(z))  # [bs, 785]
    return z    

  def forward(self, x):            # x is [bs,785]
    z = self.encode(x)
    oupt = self.decode(z)
    return oupt
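The PositionalEncoding class isn't shown above. Assuming the standard sinusoidal scheme (the one used in the well-known PyTorch transformer tutorial), the underlying position table can be sketched in plain Python as follows. The function name is mine; the real class would add this table to its input and apply dropout:

```python
import math

def sinusoidal_table(seq_len, d_model):
  # Standard sinusoidal positional encodings:
  #   pe[pos][2i]   = sin(pos / 10000^(2i / d_model))
  #   pe[pos][2i+1] = cos(pos / 10000^(2i / d_model))
  pe = [[0.0] * d_model for _ in range(seq_len)]
  for pos in range(seq_len):
    for i in range(0, d_model, 2):
      angle = pos / (10000 ** (i / d_model))
      pe[pos][i] = math.sin(angle)
      if i + 1 < d_model:
        pe[pos][i + 1] = math.cos(angle)
  return pe

table = sinusoidal_table(785, 2)  # matches seq_len=785, d_model=2
```

With d_model = 2 the exponent term is 10000^0 = 1, so each position p simply gets the pair (sin(p), cos(p)).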

The details are quite complicated, but briefly: the AutoencoderTransformer accepts 785 input values (784 pixels plus the associated class label) and encodes the input as a vector of four values. This is called the latent representation.

Each of the 1000 items in the reference P dataset is fed to the AutoencoderTransformer, and the resulting latent vectors are binned into a frequency distribution. Each of the 1000 items in the "other" Q dataset is fed to the AutoencoderTransformer, producing a second frequency distribution. The two distributions are compared using Kullback-Leibler divergence, which yields a measure of how similar the P and Q datasets are.
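The post doesn't show the binning or KL code. Here is a minimal sketch of the idea, assuming the four sigmoid latent values of every item are pooled into a single 10-bin histogram; the bin count and the epsilon smoothing are my assumptions, not details from the demo:

```python
import math

def make_freq_dist(latents, n_bins=10):
  # Pool all latent components (sigmoid outputs in [0, 1])
  # into one normalized frequency distribution.
  counts = [0] * n_bins
  for vec in latents:
    for v in vec:
      idx = min(int(v * n_bins), n_bins - 1)  # clamp v == 1.0
      counts[idx] += 1
  total = sum(counts)
  return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-6):
  # KL(P || Q); eps smoothing avoids log(0) for empty bins
  return sum(pi * math.log((pi + eps) / (qi + eps))
             for pi, qi in zip(p, q))
```

Two identical datasets give a KL divergence of zero; the more the Q latent distribution drifts from P (as with the 400 randomized items), the larger the divergence.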

Good fun (at least for me).



Three covers of “The Master Mind of Mars” (1927), the sixth book in the Mars series by author Edgar Rice Burroughs. Left: The 1927 version by artist Frank R. Paul. Center: The 1963 version by artist Roy Krenkel. Right: The 1969 version by artist Robert Abbett. Without using any fancy similarity metrics, I’d say the left and center images are closest.

