Displaying the UCI Digits Data

The UCI Digits dataset is like a scaled-down version of the MNIST digits dataset. MNIST has 70,000 images (60,000 for training and 10,000 for test). Each MNIST image is 28×28 pixels (784 total pixels) and is a handwritten digit from ‘0’ to ‘9’. Each pixel is a grayscale value between 0 and 255. The MNIST digits dataset is great, but because it is relatively large, it can be awkward to work with when experimenting with neural network techniques.

The UCI Digits dataset has 5,620 images (3,823 for training and 1,797 for test). Each UCI digit image is 8×8 pixels (64 total pixels) and is a handwritten digit from ‘0’ to ‘9’. Each pixel is a grayscale value between 0 and 16. Because it has far fewer images and far fewer pixels per image, the UCI Digits dataset is easier to work with.

You can download the UCI Digits Data from archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits. The 3823-item training file is named optdigits.tra and the 1797-item test file is named optdigits.tes. The files are text files so you can rename them with “.txt” extensions. Each line has 65 comma-delimited values. The first 64 values are the pixels (0 to 16) and the last value on each line is the digit (0 to 9).
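
As a quick sanity check on that file layout, here is a minimal sketch that splits one line into its 64 pixels and its label. The sample line is synthetic (not copied from the data files), and the helper name parse_line is just something I made up for illustration.

```python
import numpy as np

def parse_line(line):
  # split one comma-delimited line into 64 pixels and 1 label
  vals = [int(v) for v in line.strip().split(",")]
  if len(vals) != 65:
    raise ValueError("expected 65 values, got %d" % len(vals))
  pixels = np.array(vals[:64], dtype=np.int64).reshape((8,8))
  label = vals[64]  # last value is the digit
  return pixels, label

# synthetic line: 64 pixel values in 0-16, then a digit
sample = ",".join(str(i % 17) for i in range(64)) + ",7"
pixels, label = parse_line(sample)
print(pixels.shape, label)
```

A check on the value count like this would catch a file that was truncated or saved with the wrong delimiter before it causes a confusing reshape error later.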

Whenever I work with a dataset of images, I always like to write a utility program to display some of the images so I know what I’m working with.

I wrote a short program to display specified rows of the UCI Digits dataset. My program displays each digit in two ways: first by hex values in the command shell, and second by using matplotlib to show the image visually.
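
The hex display idea can be sketched on its own, without a data file or matplotlib. This is an illustrative fragment, not part of the program listed at the end: the function name hex_rows and the sample row are assumptions of mine.

```python
import numpy as np

def hex_rows(pixels_64):
  # format 64 pixel values as 8 lines of two-digit hex
  grid = np.array(pixels_64, dtype=np.int64).reshape((8,8))
  return [" ".join("%.2X" % int(v) for v in row) for row in grid]

# synthetic row of 64 values, each in 0-16 (not real data)
row = [0, 16, 10, 0, 0, 0, 0, 0] * 8
for line in hex_rows(row):
  print(line)
```

Because pixel values only go up to 16 (hex 10), two hex digits per pixel keep the columns aligned, which makes the shape of a digit visible even in plain text.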

The moral of the story isn’t very profound. When you’re working on a machine learning problem, it’s important to understand the problem and that always means understanding the data.

Three stock photos of “digits and numerology”. While I was in college I took one class in number theory, and it wasn’t one of my favorites. But the class I took in discrete mathematics was fantastic. You can argue that the inventions of writing and mathematics are the two greatest achievements in human history. Well, that and pizza.

# show_uci_digits.py

import numpy as np
import matplotlib.pyplot as plt

# data file looks like:
# 0,0,5,16 . . 12,0,0,7
# first 64 values are grayscale pixel (0-16),
# last value on line is digit (0-9)

def load_data(data_file):
  # np.int was removed in NumPy 1.24; use np.int64 instead
  x_data = np.loadtxt(data_file, delimiter = ",",
    usecols=range(0,64), dtype=np.int64)
  y_data = np.loadtxt(data_file, delimiter = ",",
    usecols=[64], dtype=np.int64)
  return (x_data, y_data)

def display(data_file, idxs):
  (x_data, y_data) = load_data(data_file)

  for idx in idxs:
    label = y_data[idx]  # like '5'
    print("digit = ", str(label), "\n")

    pixels = np.array(x_data[idx])  # row of pixels
    pixels = pixels.reshape((8,8))
    for i in range(8):
      for j in range(8):
        print("%.2X" % pixels[i,j], end="")
        print(" ", end="")
      print("")

    plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
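    # include idx in the file name so two rows with the
    # same digit don't overwrite each other's image
    plt.savefig("digit_" + str(label) + "_" + \
      str(idx) + ".png", bbox_inches='tight')
    plt.show()
    plt.close()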

def main():
  print("\nBegin show UCI mini-digit \n")

  data_file = ".\\digits_uci_test_1797.txt"
  display(data_file, idxs=[0,1,2,3,4,5,6,7,8,9]) # 1st ten

  print("\nEnd \n")

if __name__ == "__main__":
  main()