Transformer-Based Dataset Similarity for MNIST

Computing a measure of the similarity of two datasets is a very difficult problem. Several months ago, I worked on a project and devised an algorithm based on a neural autoencoder. See jamesmccaffrey.wordpress.com/2021/04/02/dataset-similarity-sometimes-ideas-work-better-than-expected/.

I wondered if I could adapt my dataset similarity algorithm to use a neural transformer architecture instead of a neural autoencoder architecture. After a few hours of experimentation, I got a promising demo up and running. I used the UCI Digits dataset for my experiments. Each UCI data item is a crude grayscale image of a handwritten digit from ‘0’ to ‘9’. Each image has 8 by 8 = 64 pixels. Each pixel is a value between 0 (white) and 16 (black). The UCI Digits dataset is basically a small, simplified version of the well-known MNIST dataset. See jamesmccaffrey.wordpress.com/2022/10/03/the-distance-between-two-datasets-using-transformer-encoding/.

So, I then wondered if I could adapt the transformer-based dataset similarity experiment from the UCI Digits dataset to the more difficult MNIST dataset. After a bit of work, I got an example running. For my demo, I compared 1000 items from the MNIST training data with 1000 items where 400 of the items had been randomized. The experiment worked as expected, but . . .

I observed that the demo took a long time to run. This is because the self-attention mechanism at the heart of the transformer architecture runs in O(n^2) time, where n is the sequence length. In my original transformer experiment, the UCI Digits items have n = 65 inputs (64 pixel values plus 1 label), but MNIST has n = 785 inputs (784 pixels plus 1 label).
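To put that in perspective, here is the back-of-the-envelope arithmetic comparing the per-layer attention cost for the two sequence lengths:

```python
# Self-attention cost grows quadratically with sequence length n.
uci_n = 65     # UCI Digits: 64 pixel values + 1 label
mnist_n = 785  # MNIST: 784 pixel values + 1 label

ratio = (mnist_n ** 2) / (uci_n ** 2)
print(f"MNIST attention is roughly {ratio:.0f}x "
      "more work per layer than UCI Digits")  # ~146x
```

So each encoder layer does roughly 146 times more attention work on MNIST than on UCI Digits, which explains the long run time.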



The key component of my demo code is:

class AutoencoderTransformer(T.nn.Module):  # 785-xx-4-30-400-785 
  def __init__(self):
    # 785 numeric inputs: no exact word embedding equivalent
    # pseudo embed_dim = 2
    # seq_len = 785
    super(AutoencoderTransformer, self).__init__() 

    self.fc1 = T.nn.Linear(785, 785*2)  # pseudo-embedding
    self.fc2 = T.nn.Linear(785*2, 4)

    self.pos_enc = \
      PositionalEncoding(2, dropout=0.00)  # positional

    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=2,
      nhead=2, dim_feedforward=100, dropout=0.0,
      batch_first=True)  # d_model divisible by nhead

    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=6)

    self.dec1 = T.nn.Linear(4, 30) 
    self.dec2 = T.nn.Linear(30, 400)
    self.dec3 = T.nn.Linear(400, 785)

    # use default weight initialization

    self.latent_dim = 4

  def encode(self, x):           # x is [bs, 785]
    z = T.tanh(self.fc1(x))      # [bs, 1570]
    z = z.reshape(-1, 785, 2)    # [bs, 785, 2]
    z = self.pos_enc(z)          # [bs, 785, 2]
    z = self.trans_enc(z)        # [bs, 785, 2]
    z = z.reshape(-1, 785*2)     # [bs, 1570]
    z = T.sigmoid(self.fc2(z))   # [bs, 4]
    return z

  def decode(self, x):
    z = T.tanh(self.dec1(x))     # [bs, 30]
    z = T.tanh(self.dec2(z))     # [bs, 400]
    z = T.sigmoid(self.dec3(z))  # [bs, 785]
    return z    

  def forward(self, x):            # x is [bs,785]
    z = self.encode(x)
    oupt = self.decode(z)
    return oupt
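The PositionalEncoding class isn't shown above. Assuming the standard sinusoidal scheme (the one used in the well-known PyTorch transformer tutorial), the underlying position table can be sketched in plain Python as follows. The function name is mine; the real class would add this table to its input and apply dropout:

```python
import math

def sinusoidal_table(seq_len, d_model):
  # Standard sinusoidal positional encodings:
  #   pe[pos][2i]   = sin(pos / 10000^(2i / d_model))
  #   pe[pos][2i+1] = cos(pos / 10000^(2i / d_model))
  pe = [[0.0] * d_model for _ in range(seq_len)]
  for pos in range(seq_len):
    for i in range(0, d_model, 2):
      angle = pos / (10000 ** (i / d_model))
      pe[pos][i] = math.sin(angle)
      if i + 1 < d_model:
        pe[pos][i + 1] = math.cos(angle)
  return pe

table = sinusoidal_table(785, 2)  # matches seq_len=785, d_model=2
```

With d_model = 2 the exponent term is 10000^0 = 1, so each position p simply gets the pair (sin(p), cos(p)).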

The details are quite complicated, but briefly: the AutoencoderTransformer accepts 785 input values (784 pixels plus the associated class label) and encodes the input as a vector of four values. This is called the latent representation.

Each of the 1000 items in the reference P dataset is fed to the AutoencoderTransformer, and the resulting latent vectors are binned into a frequency distribution. Each of the 1000 items in the "other" Q dataset is fed to the AutoencoderTransformer, producing a second frequency distribution. The two distributions are compared using Kullback-Leibler divergence, which yields a measure of how similar the P and Q datasets are.
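The post doesn't show the binning or KL code. Here is a minimal sketch of the idea, assuming the four sigmoid latent values of every item are pooled into a single 10-bin histogram; the bin count and the epsilon smoothing are my assumptions, not details from the demo:

```python
import math

def make_freq_dist(latents, n_bins=10):
  # Pool all latent components (sigmoid outputs in [0, 1])
  # into one normalized frequency distribution.
  counts = [0] * n_bins
  for vec in latents:
    for v in vec:
      idx = min(int(v * n_bins), n_bins - 1)  # clamp v == 1.0
      counts[idx] += 1
  total = sum(counts)
  return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-6):
  # KL(P || Q); eps smoothing avoids log(0) for empty bins
  return sum(pi * math.log((pi + eps) / (qi + eps))
             for pi, qi in zip(p, q))
```

Two identical datasets give a KL divergence of zero; the more the Q latent distribution drifts from P (as with the 400 randomized items), the larger the divergence.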

Good fun (at least for me).



Three covers of “The Master Mind of Mars” (1927), the sixth book in the Mars series by author Edgar Rice Burroughs. Left: The 1927 version by artist Frank R. Paul. Center: The 1963 version by artist Roy Krenkel. Right: The 1969 version by artist Robert Abbett. Without using any fancy similarity metrics, I’d say the left and center images are closest.

