Preparing Raw MNIST Data for use by a Keras Program

(Note: this blog post is closely related to an earlier post, “Preparing MNIST Data for use by a CNTK Program”)

The MNIST (“modified National Institute of Standards and Technology”) image dataset is often used to demonstrate image classification. The dataset has 60,000 images for training a model, and 10,000 images for evaluating a trained model.

Each image is 28 pixels wide by 28 pixels high, for a total of 784 pixels. Each image represents a single handwritten digit, ‘0’ through ‘9’. Somewhat weirdly, the 60,000 training data items are stored in two files: one file contains the pixel values and the second file contains the associated label (‘0’ to ‘9’) values. The test data is also stored in two files.

I wrote a utility program to read the resulting data file and display a specified image

Additionally, each of the four files is stored in a custom binary format, with multi-byte values in big endian order rather than the little endian order used by Intel based machines. And to top it off, the four files are compressed as .gz files, which a Windows based machine can’t unzip by default. In short, getting MNIST data into a usable form is not trivial.
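To make the byte-order point concrete, here is a minimal sketch of decoding the 16-byte header of an image file with Python's struct module. The ">" prefix asks for big endian. The header bytes are built by hand for illustration (magic number 2051, then 60,000 images, 28 rows, 28 columns) rather than read from a real file, and the function name read_image_header is just a placeholder.

```python
import struct

def read_image_header(header_bytes):
  # ">IIII" = four unsigned 32-bit integers, big endian
  magic, n_images, n_rows, n_cols = struct.unpack(">IIII", header_bytes)
  return magic, n_images, n_rows, n_cols

# hand-built stand-in for the first 16 bytes of an image file
fake_header = struct.pack(">IIII", 2051, 60000, 28, 28)
print(read_image_header(fake_header))

# the same bytes misread as little endian give meaningless values
print(struct.unpack("<IIII", fake_header))
```

The conversion program later in this post simply skips the headers, so byte order only matters if you want to read the image count or dimensions from the file itself.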

Step 1: Go to the MNIST storage site at http://yann.lecun.com/exdb/mnist/ and download to your machine into a directory named ZippedBinary these four files:

train-images-idx3-ubyte.gz (60,000 train images) 
train-labels-idx1-ubyte.gz (60,000 train labels) 
t10k-images-idx3-ubyte.gz  (10,000 test images) 
t10k-labels-idx1-ubyte.gz  (10,000 test labels)

Step 2: Unzip the four files into a directory named UnzippedBinary. To unzip the files you’ll need a utility program. I strongly recommend the free 7-Zip at https://www.7-zip.org/. After unzipping, I recommend adding a “.bin” file extension to each name to remind you the files are in a custom binary format. So you should now have:

train-images-idx3-ubyte.bin
train-labels-idx1-ubyte.bin
t10k-images-idx3-ubyte.bin 
t10k-labels-idx1-ubyte.bin
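If you would rather not install a separate utility, Python's built-in gzip module can also do the unzipping. This is only a sketch of that alternative, not the 7-Zip route described above. The function name gunzip_file is hypothetical, and the demo works on a small throwaway .gz file instead of the real MNIST files.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_file(src_gz, dest_bin):
  # stream the decompressed bytes from src_gz into dest_bin
  with gzip.open(src_gz, "rb") as f_in, open(dest_bin, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# demo on a throwaway file in a temp directory
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "demo-labels.gz")
dest = os.path.join(tmp_dir, "demo-labels.bin")
with gzip.open(src, "wb") as f:
  # made-up content: an 8-byte header followed by labels 5 and 0
  f.write(bytes([0, 0, 8, 1, 0, 0, 0, 2, 5, 0]))

gunzip_file(src, dest)
with open(dest, "rb") as f:
  restored = f.read()
print(len(restored), "bytes restored")
```

For the real data you would call gunzip_file once per downloaded file, passing a path in ZippedBinary as the source and the matching .bin name in UnzippedBinary as the destination.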

Step 3: Suppose the desired format of a data file containing the images is:

0 0 0 0 0 1 0 0 0 0 * 0 .. 170 52 .. 0
0 0 1 0 0 0 0 0 0 0 * 0 .. 254 66 .. 0
. . .

Each line is one image. The first 10 values are the digit/label information in one-hot encoded form, where the position of the 1 indicates the digit. So the two images above are ‘5’ and ‘2’. There is a dummy ‘*’ character in column [10], which is just for readability. The next 784 values are the pixels for the image, where each value is between 0 and 255. Here’s a program to create training or test files in this format, with a specified number of images:

# converter_keras.py

def generate(img_bin_file, lbl_bin_file,
            result_file, n_images):

  img_bf = open(img_bin_file, "rb")    # pixels
  lbl_bf = open(lbl_bin_file, "rb")    # labels
  res_tf = open(result_file, "w")      # result file

  img_bf.read(16)   # discard image header info
  lbl_bf.read(8)    # discard label header info

  for i in range(n_images):   # number images requested 
    # digit label first
    lbl = ord(lbl_bf.read(1))  # get label as int, like 3
    encoded = [0] * 10         # make one-hot vector
    encoded[lbl] = 1
    for k in range(10):        # k, so outer i isn't reused
      res_tf.write(str(encoded[k]))
      res_tf.write(" ")  # like 0 0 0 1 0 0 0 0 0 0 

    res_tf.write("* ")  # arbitrary separator char

    # now do the image pixels
    for j in range(784):  # get 784 vals for each image
      val = ord(img_bf.read(1))
      res_tf.write(str(val))
      if j != 783: res_tf.write(" ")  # avoid trail space 
    res_tf.write("\n")  # next image

  img_bf.close()   # close the binary files
  lbl_bf.close()
  res_tf.close()   # close the result file

# ==========================================================

def main():
  generate(".\\UnzippedBinary\\train-images-idx3-ubyte.bin",
          ".\\UnzippedBinary\\train-labels-idx1-ubyte.bin",
          ".\\mnist_train_keras_3.txt",
          n_images = 3)  # first n images

if __name__ == "__main__":
  main()

Executing this program would generate a file named mnist_train_keras_3.txt with 3 images in the format described above. You could change the three file names and rerun to make a file of test data.
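A quick way to sanity check a generated file is to parse a line back into its digit and pixels and render the pixels as crude text. The sketch below assumes the line format described above. The names parse_line and show are hypothetical, and the demo line is synthetic (label 5 with a short stroke in the top row), not taken from the real data.

```python
def parse_line(line):
  # split "labels * pixels" at the dummy separator
  label_part, pixel_part = line.strip().split("*")
  one_hot = [int(t) for t in label_part.split()]
  pixels = [int(t) for t in pixel_part.split()]
  digit = one_hot.index(1)  # position of the 1 is the digit
  return digit, pixels

def show(pixels, width=28):
  # crude text rendering: '#' for nonzero pixels, '.' for background
  rows = []
  for r in range(0, len(pixels), width):
    row = pixels[r:r + width]
    rows.append("".join("#" if p > 0 else "." for p in row))
  return "\n".join(rows)

# synthetic line: label 5, all pixels 0 except four in the top row
demo_pixels = [0] * 784
demo_pixels[10:14] = [170, 52, 254, 66]
demo_line = ("0 0 0 0 0 1 0 0 0 0 * " +
  " ".join(str(p) for p in demo_pixels) + "\n")

digit, pixels = parse_line(demo_line)
print("digit =", digit)
print(show(pixels))
```

To check real output, read a line from mnist_train_keras_3.txt and pass it to parse_line in place of demo_line.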

In most situations, you could now read the labels and pixels into two different matrices, because that’s what Keras will need:

import numpy as np

the_file = ".\\mnist_train_keras_3.txt"  # file made in Step 3
y_data = np.loadtxt(the_file, delimiter = " ",
  usecols=range(0,10), dtype=np.float32)
x_data = np.loadtxt(the_file, delimiter = " ",
  usecols=range(11,795), dtype=np.float32)
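To see that the column ranges line up, the following sketch runs the same two loadtxt calls against two synthetic lines held in memory instead of a file on disk. The helper make_line and the names y_demo and x_demo are made up for this illustration. It relies on loadtxt converting only the columns named in usecols, so the ‘*’ in column [10] is never parsed as a number.

```python
import io
import numpy as np

def make_line(digit, bright_idx, bright_val):
  # build one line: 10 one-hot values, '*', then 784 pixel values
  one_hot = ["0"] * 10
  one_hot[digit] = "1"
  pix = ["0"] * 784
  pix[bright_idx] = str(bright_val)
  return " ".join(one_hot) + " * " + " ".join(pix)

# two synthetic images, labels 5 and 2, one nonzero pixel each
text = make_line(5, 100, 170) + "\n" + make_line(2, 200, 254) + "\n"

y_demo = np.loadtxt(io.StringIO(text), delimiter=" ",
  usecols=range(0,10), dtype=np.float32)
x_demo = np.loadtxt(io.StringIO(text), delimiter=" ",
  usecols=range(11,795), dtype=np.float32)

print(y_demo.shape, x_demo.shape)
print(np.argmax(y_demo, axis=1))  # recover digits from one-hot rows
```

With a file that contains only one image, loadtxt returns one-dimensional arrays by default, so the shapes would need adjusting in that case.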

When doing machine learning, getting your data ready is almost always the most time-consuming, annoying, and difficult part of the project.