Creating a Streaming Data Loader for the MNIST Dataset

The MNIST image dataset is one of the most commonly used datasets in machine learning. But MNIST is rather difficult to work with because a.) there are 60,000 training images, so loading all the data into memory at once can be a problem on a machine with limited memory, b.) the 784 pixel values per image and the image labels (‘0’ to ‘9’) are stored in separate files, for both the training and the test data, and c.) the source files are in a custom binary format.

Because MNIST is so awkward to work with, most ML libraries, such as Keras and PyTorch, have library functions to read and serve up MNIST data. However, these built-in MNIST dataset libraries are very difficult to customize.

My preferred approach for working with MNIST is to a.) convert the four binary source files into two files with pixels and labels together, as ordinary tab-delimited text files, and b.) write a custom streaming data loader to read the data in chunks and serve up batches for training. This gives me complete control over MNIST. I use PyTorch but the ideas can be easily adapted for Keras or TensorFlow.
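To make the conversion step concrete, here is a minimal sketch of one way it could be done. It relies on the layout of the MNIST binary files: the image file starts with a 16-byte header (magic number 2051, image count, rows, columns) and the label file starts with an 8-byte header (magic number 2049, label count), all as big-endian 32-bit integers, followed by one byte per pixel or label. The function name convert_idx_to_text and the choice to put the label in the last column are assumptions made for illustration, not necessarily the layout used by the converter described here.

```python
import struct

def convert_idx_to_text(pixels_path, labels_path, out_path, n_images):
    """Write up to n_images lines: tab-separated pixel values, then the label.

    Hypothetical helper; the label-last column order is an assumption.
    Returns the number of lines written.
    """
    with open(pixels_path, "rb") as f_pix, \
         open(labels_path, "rb") as f_lbl, \
         open(out_path, "w") as f_out:
        # Image header: magic (2051), count, rows, cols as big-endian uint32.
        magic, count, rows, cols = struct.unpack(">IIII", f_pix.read(16))
        if magic != 2051:
            raise ValueError("not an MNIST image file")
        # Label header: magic (2049), count.
        l_magic, l_count = struct.unpack(">II", f_lbl.read(8))
        if l_magic != 2049:
            raise ValueError("not an MNIST label file")

        n = min(n_images, count, l_count)
        n_pixels = rows * cols  # 784 for the real 28x28 images
        for _ in range(n):
            pixels = f_pix.read(n_pixels)  # iterating bytes yields ints 0-255
            label = f_lbl.read(1)[0]
            fields = [str(p) for p in pixels] + [str(label)]
            f_out.write("\t".join(fields) + "\n")
    return n
```

Because the function reads and writes one image at a time, it never holds more than a single image in memory, and passing a smaller n_images produces a subset file such as the 10,000-item one used in the demo.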

Screenshots of a demo run of streaming a 10,000-item MNIST file, in batches of 5 items at a time, three complete passes (epochs). Behind the scenes, the data loader buffers 1,000 lines at a time. Left: The first two batches. Center: End of the first epoch, start of the second epoch. Right: The last few batches of the third epoch.

I intend to write up an article for Visual Studio Magazine where I’ll present code for both tasks, and explain the code carefully so that anyone who reads the article will know exactly how to customize the code.

The code fragment below shows an example of how my custom MNIST streaming data loader is called. I used my MNIST converter utility program to create a 10,000-item subset of the 60,000-item MNIST training data. I set up a streaming data loader to fetch data 1,000 lines/images at a time into an internal buffer, and serve up batches of 5 images at a time. The data loader makes three complete passes through the 10,000 items.

# MNIST_StreamLoader and show_batch are defined elsewhere (not shown)
import numpy as np

def main():
  print("\nBegin MNIST streaming data loader demo \n")
  np.random.seed(1)

  fn = ".\\Data\\mnist_train_10000.txt"
  bat_size = 5      # 2000 batches of 5 items per epoch
  buff_size = 1000  # (reload buffer 10 times per epoch)
  mnist_ldr = MNIST_StreamLoader(fn, bat_size, buff_size,
    shuffle=False)

  max_epochs = 3
  for epoch in range(max_epochs):
    print(" \n== Epoch: " + str(epoch) + " == ")
    for (b_idx, batch) in enumerate(mnist_ldr):
      print("epoch: " + str(epoch) + "   batch idx: " + str(b_idx))
      pixels = batch[0]  # pixel values for the items in the batch
      digits = batch[1]  # the matching digit labels
      show_batch(pixels)
      print(digits)

  mnist_ldr.fin.close()  # the loader keeps its file handle open
  print("\nEnd demo ")

if __name__ == "__main__":
  main()
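
The calling code above implies a loader that can be iterated once per epoch, yields (pixels, digits) pairs, and exposes its open file as fin. A minimal sketch of how such a loader might be structured is shown here. The class name StreamLoaderSketch, the seed parameter, the scaling of pixels to [0, 1], the label-last column order, and the decision to drop an incomplete final batch are all assumptions for illustration, not the actual MNIST_StreamLoader code. The sketch uses only the standard library and returns plain lists; a PyTorch version would presumably convert each batch to tensors.

```python
import random

class StreamLoaderSketch:
    """Hypothetical streaming loader: buffers lines, serves fixed-size batches."""

    def __init__(self, fn, bat_size, buff_size, shuffle=False, seed=1):
        if buff_size % bat_size != 0:
            raise ValueError("buff_size must be a multiple of bat_size")
        self.bat_size = bat_size
        self.buff_size = buff_size
        self.shuffle = shuffle
        self.rnd = random.Random(seed)
        self.fin = open(fn, "r")  # stays open; the caller closes it when done

    def _fill_buffer(self):
        # Read up to buff_size lines; fewer come back near the end of the file.
        buffer = []
        for _ in range(self.buff_size):
            line = self.fin.readline()
            if not line:
                break
            fields = line.rstrip("\n").split("\t")
            pixels = [int(v) / 255.0 for v in fields[:-1]]  # assumed scaling
            buffer.append((pixels, int(fields[-1])))        # assumed label-last
        if self.shuffle:
            self.rnd.shuffle(buffer)  # shuffles within the current buffer only
        return buffer

    def __iter__(self):
        self.fin.seek(0)  # every new for-loop over the loader is a fresh epoch
        while True:
            buffer = self._fill_buffer()
            if not buffer:
                return
            for i in range(0, len(buffer), self.bat_size):
                chunk = buffer[i:i + self.bat_size]
                if len(chunk) < self.bat_size:
                    break  # drop an incomplete final batch
                yield [p for p, _ in chunk], [d for _, d in chunk]
```

Note that with this design, shuffle=True only reorders items inside each buffer, so the buffer size controls a trade-off between memory use and how thoroughly the data is mixed.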

The two tasks of a.) converting MNIST from binary to text, and b.) implementing a streaming data loader are like riding a bicycle — very difficult when you don’t know how to do them, but simple once you do.

Left: Treadmill bicycle. Center: Gyroscopic mono-wheel. Right: Suspension bicycle.