Getting MNIST Data as Text Files from TorchVision

The MNIST image dataset is arguably the most commonly used dataset in neural-based machine learning. MNIST consists of 60,000 images for training and 10,000 images for testing. Each image is a crude 28×28 pixel (784 pixels total) picture of a handwritten digit from ‘0’ to ‘9’. Each pixel is a single-channel grayscale value between 0 (white) and 255 (black).

The source MNIST data is stored in gzip-compressed (.gz) files that use a custom binary format. Working with MNIST data in this raw form is a real pain in the patooty.

All of the major ML code libraries, including PyTorch, Keras, and scikit-learn, have some form of built-in MNIST dataset. But using these built-in versions of MNIST data has two serious problems. First, the data becomes a magic black box that can’t be edited easily. Second, you get all 60,000 training and 10,000 test images, and a dataset that large is difficult to work with.

There are a few ways to get MNIST data as ordinary text files. One approach is to download the .gz source binary files, extract them, and write a program that understands the binary format and converts the files to text. See https://jamesmccaffreyblog.com/2022/01/21/working-with-mnist-data/.
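To give a feel for that first approach, here is a minimal sketch of the conversion using only the Python standard library. It assumes the usual layout of the MNIST binary files as I understand it: big-endian 32-bit header fields, a magic number of 2051 for image files and 2049 for label files, followed by one unsigned byte per pixel or label. The function names parse_idx_images and parse_idx_labels are hypothetical, and the data is a tiny synthetic stand-in (two 2×2 "images"), not real MNIST.

```python
import gzip
import struct

def parse_idx_images(raw):
    # header: four big-endian 32-bit ints: magic, count, rows, cols
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("not an image file")
    size = rows * cols
    images = []
    for i in range(count):
        start = 16 + i * size
        images.append(list(raw[start:start + size]))  # ints 0-255
    return images

def parse_idx_labels(raw):
    # header: two big-endian 32-bit ints: magic, count
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError("not a label file")
    return list(raw[8:8 + count])

# synthetic stand-ins for the real .gz files: two 2x2 images, two labels
fake_images = gzip.compress(
    struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 7, 1, 2, 3, 4]))
fake_labels = gzip.compress(struct.pack(">II", 2049, 2) + bytes([5, 9]))

images = parse_idx_images(gzip.decompress(fake_images))
labels = parse_idx_labels(gzip.decompress(fake_labels))

# one image per line, tab-separated pixels, label at end
lines = ["\t".join(str(p) for p in img) + "\t" + str(lbl)
         for img, lbl in zip(images, labels)]
```

With the real files you would read the bytes with gzip.open() instead of building them in memory, and each image would have 784 pixel values rather than 4.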

A second approach for getting MNIST data as simple text files is to use the TorchVision library to download the MNIST files and get them as a library-specific Dataset object, and then iterate through the dataset, converting each image to text, and writing one image per line to a text file. Like this:

# get_mnist.py
# get MNIST data from torchvision, save as text

# Python 3.7.6 (Anaconda3-2020.02)
# PyTorch 1.10.0-CPU
# TorchVision 0.11.3

# get TorchVision from:
# https://pypi.org/project/torchvision/#files  OR
# https://download.pytorch.org/whl/torch_stable.html

import numpy as np
import torch as T
import torchvision as tv

def main():
  print("\nCreating 100-item MNIST training text file ")

  train_ds = tv.datasets.MNIST(root=".\\Data", train=True,
    download=True, transform=tv.transforms.ToTensor())
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=1, shuffle=False)

  fn = ".\\Data\\mnist_train_100.txt"  # where to save
  f = open(fn, "w", encoding="utf8")

  num_written = 0                      # num images saved
  for pixels, label in train_ldr:      # each image
    pixels = pixels.flatten().numpy()  # 1x1x28x28 to 784
    pixels *= 255                      # un-normalize
    pixels = np.rint(pixels)           # guard against 254.9999
    pixels = pixels.astype(np.int64)   # float to int

    for i in range(784):               # each pixel
      f.write(str(pixels[i]) + "\t")   # tab separated
    f.write(str(label.item()))         # label at end
    f.write("\n")                      # line-by-line
    
    num_written += 1
    if num_written == 100: break       # set 60,000 for all
  f.close()

  print("\nDone ")

if __name__ == "__main__":
  main()

The tv.datasets.MNIST() constructor will download compressed MNIST training data to a “Data” directory, or wherever you specify. The “Data” directory will have an “MNIST” directory, which in turn has a “raw” directory that has all the actual data files (in compressed form). To get test data, set train=False.

The ToTensor transform is needed to convert the data from PIL format to PyTorch tensors. Somewhat unfortunately, the transform also divides each pixel by 255 to scale it into the range 0.0 to 1.0, without asking you if that’s what you want and without any visible sign that it’s happening.
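The scaling and its reversal can be sketched without any ML libraries. This is an illustration under assumptions, not TorchVision's actual code: the hypothetical helper to_float32 uses the struct module to mimic 32-bit floating point, scale divides by 255, and unscale multiplies by 255 and rounds to the nearest integer rather than truncating.

```python
import struct

def to_float32(x):
    # round a Python float (64-bit) to the nearest 32-bit float
    return struct.unpack("f", struct.pack("f", x))[0]

def scale(pixel):
    # mimic dividing an integer pixel by 255 in 32-bit floating point
    return to_float32(pixel / 255)

def unscale(value):
    # multiply back by 255, then round to the nearest integer
    return int(round(to_float32(value * 255)))

scaled = [scale(p) for p in range(256)]       # values in 0.0 to 1.0
restored = [unscale(v) for v in scaled]       # back to 0 to 255
```

Rounding matters because a scaled-and-restored value can land a hair below the original integer, and plain truncation would then lose one gray level. In numpy terms, np.rint(pixels * 255).astype(np.int64) plays the same role as unscale.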

The program fetches one image at a time, converts 784 pixel values to a numpy array, undoes the dividing by 255, and converts to integers. Each line of the destination file is one image. There are 785 tab-delimited values on each line. The first 784 are the pixel values, and the last is the ‘0’ to ‘9’ label.
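Reading a line back is the mirror image of writing it. The following is a minimal sketch, assuming the 785-value tab-delimited layout described above; parse_line is a hypothetical helper name, and the sample line is synthetic rather than real MNIST data.

```python
def parse_line(line, sep="\t"):
    # split one line into 784 integer pixels plus an integer label
    values = line.rstrip("\n").split(sep)
    if len(values) != 785:
        raise ValueError("expected 785 values, got %d" % len(values))
    pixels = [int(v) for v in values[:784]]   # first 784 are pixels
    label = int(values[784])                  # last value is the label
    return pixels, label

# synthetic line: 783 zeros, one 255, then the label 7
sample = "\t".join(["0"] * 783 + ["255"]) + "\t" + "7" + "\n"
pixels, label = parse_line(sample)
```

The length check is there because a stray or missing delimiter is the most likely way for a hand-edited line to go wrong.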

You might want to use a comma delimiter instead of a tab, and you might want to put the label as the first value on each line instead of the last value.
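Both variations come down to how each line is assembled. Here is one possible sketch, where format_line and its sep and label_first parameters are hypothetical names introduced for illustration:

```python
def format_line(pixels, label, sep="\t", label_first=False):
    # build one text line from pixel values and a label
    fields = [str(p) for p in pixels]
    if label_first:
        fields.insert(0, str(label))   # label as first value
    else:
        fields.append(str(label))      # label as last value
    return sep.join(fields) + "\n"

demo_pixels = [0, 128, 255]            # stand-in for 784 values
tab_line = format_line(demo_pixels, 5)
csv_line = format_line(demo_pixels, 5, sep=",", label_first=True)
```

Using join() also avoids writing a trailing delimiter by accident, which matters more for comma-separated files than for the label-at-end layout.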

The program has an early exit to stop after 100 images have been written to file. To get all 60,000 training images, change the 100 in the break condition to 60000.
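An alternative to counting and breaking is to cap the loop itself with itertools.islice from the standard library. This is a sketch of the idea rather than a change to the program above; take is a hypothetical helper, and fake_loader is a stand-in generator, not a real DataLoader.

```python
from itertools import islice

def take(items, limit=None):
    # yield at most `limit` items; limit=None means no early exit
    return islice(items, limit)

# stand-in for a data loader: 1000 (image, label) pairs
fake_loader = (("img%d" % i, i % 10) for i in range(1000))
first_100 = list(take(fake_loader, 100))
```

Passing limit=None processes everything, so the same code covers both the 100-item sample and the full dataset.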



The digits (base 12) of the fictitious Tengwar alphabet used in the Lord of the Rings book series written by J.R.R. Tolkien (1892-1973).

