Preparing Raw MNIST Data for use by a CNTK Program

The MNIST (“Modified National Institute of Standards and Technology”) image dataset is often used to demonstrate image classification. The dataset has 60,000 images for training a model and 10,000 images for evaluating a trained model.

Each image is 28 pixels wide by 28 pixels high, for a total of 784 pixels, and represents a single handwritten digit from ‘0’ through ‘9’. Somewhat weirdly, the 60,000 training data items are stored in two files: one file contains the pixel values and the second file contains the associated label (‘0’ to ‘9’) values. The test data is also stored in two files.

I work with MNIST so often, I wrote a viewer program.

Additionally, each of the four files is stored in a custom binary format that uses big endian byte order rather than the little endian order used by Intel-based machines. And to top it off, the four files are compressed as .gz files, which can’t be unzipped by default on a Windows-based machine. In short, getting MNIST data into a usable form is not trivial.
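To make the binary layout concrete, here is a minimal sketch of decoding the header of one of these files. It assumes the layout described on the MNIST site: big endian unsigned 32-bit integers, with a 16-byte header (magic number, item count, rows, columns) for a pixels file and an 8-byte header (magic number, item count) for a labels file. The function name read_idx_header is mine, for illustration only, and the demo uses synthetic header bytes so it runs without the real files.

```python
import struct

def read_idx_header(raw):
  # raw: the first bytes of an unzipped MNIST file
  # ">II" means two big endian unsigned 32-bit integers
  magic, count = struct.unpack(">II", raw[:8])
  if magic == 2051:    # pixels file: rows and cols follow the count
    rows, cols = struct.unpack(">II", raw[8:16])
    return {"kind": "images", "count": count, "rows": rows, "cols": cols}
  if magic == 2049:    # labels file: no further header fields
    return {"kind": "labels", "count": count}
  raise ValueError("unexpected magic number: " + str(magic))

# synthetic headers (assumed values), built here so the sketch is self-contained
fake_images = struct.pack(">IIII", 2051, 60000, 28, 28)
fake_labels = struct.pack(">II", 2049, 60000)
print(read_idx_header(fake_images))
print(read_idx_header(fake_labels))
```

Reading the same bytes with "<II" (little endian) would give nonsense counts, which is why the byte order matters.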

Step 1: Go to the MNIST storage site at http://yann.lecun.com/exdb/mnist/ and download these four files into a directory named ZippedBinary:

train-images-idx3-ubyte.gz (60,000 train images) 
train-labels-idx1-ubyte.gz (60,000 train labels) 
t10k-images-idx3-ubyte.gz  (10,000 test images) 
t10k-labels-idx1-ubyte.gz  (10,000 test labels)

Step 2: Unzip the four files into a directory named UnzippedBinary. To unzip the files you’ll need a utility program. I strongly recommend the free 7-Zip at https://www.7-zip.org/. After unzipping, I recommend adding a “.bin” file extension to each name to remind you that the files are in a custom binary format. So you should now have:

train-images.idx3-ubyte.bin
train-labels.idx1-ubyte.bin
t10k-images.idx3-ubyte.bin
t10k-labels.idx1-ubyte.bin
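If you’d rather not install a separate utility, Python’s standard gzip module can do the unzipping too. The sketch below is one possible alternative, not a required step; the helper name gunzip_to and the paths in the comment are illustrative assumptions.

```python
import gzip
import shutil

def gunzip_to(src_gz, dst):
  # stream-decompress src_gz into dst without loading it all into memory
  with gzip.open(src_gz, "rb") as f_in, open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# example call (hypothetical paths; pick the destination name your
# conversion script expects):
# gunzip_to(".\\ZippedBinary\\train-labels-idx1-ubyte.gz",
#           ".\\UnzippedBinary\\<name>.bin")
```

Because the destination name is passed explicitly, the “.bin” extension can be added in the same step.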

Step 3: Suppose the desired format of a CNTK-friendly file containing the images is:

|digit 0 0 0 0 0 1 0 0 0 0 |pixels 0 .. 170 52 .. 0
|digit 0 0 1 0 0 0 0 0 0 0 |pixels 0 .. 254 66 .. 0
. . .

Each line is one image. The |digit information is one-hot encoded, where the position of the 1 (counting from zero) indicates the digit, so the two images above are ‘5’ and ‘2’. The |pixels information is 784 values, each between 0 and 255. Here’s a program to create training and test files with a specified number of images:

# converter_cntk.py

def generate(img_file, label_file, txt_file, n_images):
  # "with" closes all three files even if an error occurs
  with open(label_file, "rb") as lbl_f, \
       open(img_file, "rb") as img_f, \
       open(txt_file, "w") as txt_f:

    img_f.read(16)   # discard 16-byte header of pixels file
    lbl_f.read(8)    # discard 8-byte header of labels file

    for i in range(n_images):   # number images requested
      lbl = ord(lbl_f.read(1))  # get label (one byte, 0-9)
      vector = [0] * 10         # [0,0,0,0,0,0,0,0,0,0]
      vector[lbl] = 1           # e.g. 7: [0,0,0,0,0,0,0,1,0,0]
      txt_f.write("|digit ")
      txt_f.write(" ".join(str(x) for x in vector))
      txt_f.write(" |pixels ")
      for j in range(784):  # get 784 pixel vals
        val = ord(img_f.read(1))
        txt_f.write(str(val) + " ")  # trailing space OK
      txt_f.write("\n")  # next image

def main():
  generate(".\\UnzippedBinary\\train-images.idx3-ubyte.bin",
          ".\\UnzippedBinary\\train-labels.idx1-ubyte.bin",
          ".\\mnist_train_1000_cntk.txt", 1000)  # 1-60,000

  generate(".\\UnzippedBinary\\t10k-images.idx3-ubyte.bin",
          ".\\UnzippedBinary\\t10k-labels.idx1-ubyte.bin",
          ".\\mnist_test_100_cntk.txt", 100)   # 1-10,000  

if __name__ == "__main__":
  main()

Executing this program generates a file named mnist_train_1000_cntk.txt with 1,000 images in CNTK format, and a file named mnist_test_100_cntk.txt with 100 images.
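A quick way to sanity-check a generated file is to parse a line back into its label and pixel values. The sketch below is one possible check, assuming the line format shown in Step 3; the function name parse_cntk_line is hypothetical, and the demo uses a synthetic line so it runs without the generated files.

```python
def parse_cntk_line(line):
  # split "|digit d0 .. d9 |pixels p0 .. p783" into its two fields
  _, digit_part, pixel_part = line.strip().split("|")
  one_hot = [int(t) for t in digit_part.split()[1:]]   # skip "digit"
  pixels = [int(t) for t in pixel_part.split()[1:]]    # skip "pixels"
  if len(one_hot) != 10 or sum(one_hot) != 1:
    raise ValueError("bad one-hot label")
  if len(pixels) != 784:
    raise ValueError("expected 784 pixel values")
  if min(pixels) < 0 or max(pixels) > 255:
    raise ValueError("pixel value out of range 0-255")
  return one_hot.index(1), pixels   # position of the 1 is the digit

# synthetic line in the same shape the converter writes (trailing space included)
sample = "|digit 0 0 0 0 0 1 0 0 0 0 |pixels " + "0 " * 784 + "\n"
label, pixels = parse_cntk_line(sample)
print(label, len(pixels))
```

To check a real file, you could loop over its lines and call parse_cntk_line on each one; any malformed line raises a ValueError.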