(Note: this blog post is closely related to an earlier post, “Preparing MNIST Data for use by a CNTK Program”)
The MNIST (“modified National Institute of Standards and Technology”) image dataset is often used to demonstrate image classification. The dataset has 60,000 images for training a model, and 10,000 images for evaluating a trained model.
Each image is 28 pixels wide by 28 pixels high, for a total of 784 pixels. Each image represents a single handwritten digit, ‘0’ through ‘9’. Somewhat weirdly, the 60,000 training data items are stored in two files: one file contains the pixel values and the second file contains the associated label (‘0’ to ‘9’) values. The test data is also stored in two files.

(Image caption: I wrote a utility program to read the resulting data file and display a specified image.)
Additionally, each of the four files is stored in a proprietary binary format that uses big-endian byte order rather than the little-endian order used by Intel-based machines. And to top it off, the four files are compressed as .gz files, which a Windows-based machine can’t unzip by default. In short, getting MNIST data into a usable form is not trivial.
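To make the big-endian point concrete, here is a minimal sketch of decoding a file header with Python's struct module. It assumes the IDX layout described on the MNIST site: a 4-byte magic number (2051 for image files, 2049 for label files) whose low byte gives the number of dimensions, followed by one 4-byte big-endian integer per dimension. The function name read_idx_header and the demo bytes are made up for illustration, and the header is built in memory rather than read from a real file.

```python
import struct

def read_idx_header(raw_bytes):
  # '>' means big-endian, 'I' means unsigned 32-bit integer
  magic = struct.unpack(">I", raw_bytes[0:4])[0]
  n_dims = magic & 0xFF  # low byte of the magic number = number of dimensions
  dims = struct.unpack(">" + "I" * n_dims, raw_bytes[4:4 + 4 * n_dims])
  return magic, dims

# synthetic 16-byte header shaped like an MNIST training image header
demo_header = struct.pack(">IIII", 2051, 60000, 28, 28)
print(read_idx_header(demo_header))

# reading the same four count bytes as little-endian gives a wrong value
wrong_count = struct.unpack("<I", demo_header[4:8])[0]
print(wrong_count == 60000)
```

Under this layout an image file header is 16 bytes (magic number plus three dimensions) and a label file header is 8 bytes (magic number plus one dimension), which is why a converter can skip 16 and 8 bytes respectively before reading data.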
Step 1: Go to the MNIST storage site at http://yann.lecun.com/exdb/mnist/ and download these four files into a directory named ZippedBinary on your machine:
train-images-idx3-ubyte.gz (60,000 train images)
train-labels-idx1-ubyte.gz (60,000 train labels)
t10k-images-idx3-ubyte.gz (10,000 test images)
t10k-labels-idx1-ubyte.gz (10,000 test labels)
Step 2: Unzip the four files into a directory named UnzippedBinary. To unzip the files you’ll need a utility program. I strongly recommend the free 7-Zip at https://www.7-zip.org/. After unzipping, I recommend adding a “.bin” file extension to each name to remind you the files are in a proprietary binary format. So you should now have:
train-images-idx3-ubyte.bin
train-labels-idx1-ubyte.bin
t10k-images-idx3-ubyte.bin
t10k-labels-idx1-ubyte.bin
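If you'd rather not install a separate unzip utility, the same step can probably be done with Python's standard gzip module. This is a sketch, not the method used in this post: the helper name gunzip_to_bin is hypothetical, and the demo runs on a tiny synthetic .gz file in a temporary directory rather than on real MNIST data.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_to_bin(gz_path, out_dir):
  # decompress gz_path into out_dir, replacing ".gz" with ".bin"
  os.makedirs(out_dir, exist_ok=True)
  base = os.path.basename(gz_path)
  if base.endswith(".gz"):
    base = base[:-3]
  out_path = os.path.join(out_dir, base + ".bin")
  with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
    shutil.copyfileobj(src, dst)  # stream the bytes, no full read into memory
  return out_path

# round-trip demo on a small synthetic .gz file (not real MNIST data)
work_dir = tempfile.mkdtemp()
demo_bytes = bytes([0, 0, 8, 1, 5, 0, 4])
demo_gz = os.path.join(work_dir, "demo-idx1-ubyte.gz")
with gzip.open(demo_gz, "wb") as f:
  f.write(demo_bytes)

demo_out = gunzip_to_bin(demo_gz, os.path.join(work_dir, "UnzippedBinary"))
print(os.path.basename(demo_out))
```

For the real data you would call gunzip_to_bin once per downloaded file, passing the path in ZippedBinary and "UnzippedBinary" as the output directory.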
Step 3: Suppose the desired format of a data file containing the images is:
0 0 0 0 0 1 0 0 0 0 * 0 .. 170 52 .. 0
0 0 1 0 0 0 0 0 0 0 * 0 .. 254 66 .. 0
. . .
Each line is one image. The first 10 values are the digit/label information in one-hot encoded form, where the position of the 1 indicates the digit. So the two images above are ‘5’ and ‘2’. There is a dummy ‘*’ character at position [10], which is just for readability. The next 784 values are the pixels for the image, where each is between 0 and 255. Here’s a program to create training or test files in this format, with a specified number of images:
# converter_keras.py
# convert MNIST binary files to text: one-hot label, '*', 784 pixels

def generate(img_bin_file, lbl_bin_file,
            result_file, n_images):

  img_bf = open(img_bin_file, "rb")    # pixels
  lbl_bf = open(lbl_bin_file, "rb")    # labels
  res_tf = open(result_file, "w")      # result file

  img_bf.read(16)   # discard image header info
  lbl_bf.read(8)    # discard label header info

  for i in range(n_images):   # number images requested
    # digit label first
    lbl = ord(lbl_bf.read(1))  # get label as an int like 3
    encoded = [0] * 10         # make one-hot vector
    encoded[lbl] = 1
    for k in range(10):        # k, not i, so the outer index isn't shadowed
      res_tf.write(str(encoded[k]))
      res_tf.write(" ")  # like 0 0 0 1 0 0 0 0 0 0
    res_tf.write("* ")   # arbitrary separator char

    # now do the image pixels
    for j in range(784):  # get 784 vals for each image
      val = ord(img_bf.read(1))
      res_tf.write(str(val))
      if j != 783: res_tf.write(" ")  # avoid trail space
    res_tf.write("\n")  # next image

  img_bf.close(); lbl_bf.close()  # close the binary files
  res_tf.close()                  # close the result file

# ==========================================================

def main():
  # file names must match the files in UnzippedBinary exactly
  # (check '-' versus '.' before "idx" in your unzipped names)
  generate(".\\UnzippedBinary\\train-images.idx3-ubyte.bin",
          ".\\UnzippedBinary\\train-labels.idx1-ubyte.bin",
          ".\\mnist_train_keras_3.txt",
          n_images = 3)  # first n images

if __name__ == "__main__":
  main()
Executing this program would generate a file named mnist_train_keras_3.txt with 3 images in the format described above. You could change the three file names and rerun to make a file of test data.
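One way to sanity check a generated file is to read back one line and display the image as text. The sketch below is not the display utility mentioned earlier in this post; it's a hypothetical stand-in that assumes the line format described above (10 one-hot values, a ‘*’, then 784 pixels). The names render_ascii and read_image, the threshold of 128, and the synthetic demo file are all made up for illustration.

```python
import os
import tempfile

def render_ascii(pixels, width=28, threshold=128):
  # one text row per image row; '#' where pixel >= threshold, '.' otherwise
  rows = []
  for start in range(0, len(pixels), width):
    row = pixels[start:start + width]
    rows.append("".join("#" if p >= threshold else "." for p in row))
  return "\n".join(rows)

def read_image(result_file, index):
  # return (digit, pixels) for the zero-based line 'index'
  with open(result_file, "r") as f:
    for i, line in enumerate(f):
      if i == index:
        tokens = line.split()
        digit = tokens[0:10].index("1")            # position of the 1
        pixels = [int(t) for t in tokens[11:795]]  # skip '*' at position 10
        return digit, pixels
  raise IndexError("file has fewer images than requested")

# demo on a synthetic one-line file: label 1, vertical bar in column 14
bar = [255 if (k % 28) == 14 else 0 for k in range(784)]
demo_file = os.path.join(tempfile.mkdtemp(), "demo_keras_1.txt")
with open(demo_file, "w") as f:
  f.write("0 1 0 0 0 0 0 0 0 0 * " + " ".join(str(p) for p in bar) + "\n")

digit, pixels = read_image(demo_file, 0)
print("digit =", digit)
print(render_ascii(pixels))
```

Pointing read_image at mnist_train_keras_3.txt with an index of 0, 1, or 2 should show whether the labels and pixels line up as expected.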
In most situations, you could now read the labels and pixels into two different matrices, because that’s what Keras will need:
import numpy as np
y_data = np.loadtxt(the_file, delimiter=" ",
  usecols=range(0,10), dtype=np.float32)
x_data = np.loadtxt(the_file, delimiter=" ",
  usecols=range(11,795), dtype=np.float32)
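To check that the two loadtxt() calls pick up the right columns, here is a small sketch that runs them against a synthetic two-line file and looks at the resulting shapes. The helper names load_xy and make_line and the demo data are hypothetical. The ndmin=2 argument is an addition not in the statements above: it keeps the result two-dimensional even when the file holds a single image.

```python
import os
import tempfile
import numpy as np

def load_xy(path):
  # columns 0-9 are the one-hot label, column 10 is '*', 11-794 are pixels
  y = np.loadtxt(path, delimiter=" ", usecols=range(0, 10),
    dtype=np.float32, ndmin=2)
  x = np.loadtxt(path, delimiter=" ", usecols=range(11, 795),
    dtype=np.float32, ndmin=2)
  return x, y

def make_line(digit, fill):
  # build one synthetic line: every pixel has the same value 'fill'
  one_hot = ["1" if k == digit else "0" for k in range(10)]
  return " ".join(one_hot) + " * " + " ".join([str(fill)] * 784)

demo_path = os.path.join(tempfile.mkdtemp(), "demo_keras_2.txt")
with open(demo_path, "w") as f:
  f.write(make_line(5, 0) + "\n")
  f.write(make_line(2, 255) + "\n")

x_demo, y_demo = load_xy(demo_path)
print(x_demo.shape, y_demo.shape)
print(np.argmax(y_demo, axis=1))  # recover the digit from each one-hot row
```

Because usecols never selects column 10, the ‘*’ separator is not converted to a number.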
When doing machine learning, getting your data ready is almost always the most time-consuming, annoying, and difficult part of the project.
