PyTorch DataLoader and Dataset

When working with any of the neural network code libraries — TensorFlow, Keras, CNTK, PyTorch — you must write code to serve up batches of training items. This is a surprisingly annoying and time-consuming task.

When you can load all training and test data into memory as a NumPy array-of-arrays style matrix, then you can write a custom Batcher class that will serve up indices into the data. This isn’t too difficult.
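A minimal sketch of such a Batcher (the class name and interface are my own invention, not from any library) that serves up shuffled index batches into an in-memory matrix:

```python
import numpy as np

class Batcher:
  # serves up shuffled batches of row indices into an
  # in-memory data matrix; caller does data_x[indices]
  def __init__(self, num_items, batch_size, seed=0):
    self.num_items = num_items
    self.batch_size = batch_size
    self.rnd = np.random.RandomState(seed)

  def __iter__(self):
    indices = self.rnd.permutation(self.num_items)
    for i in range(0, self.num_items, self.batch_size):
      yield indices[i:i+self.batch_size]  # last batch may be short
```

Usage is just `for idxs in Batcher(120, 16): x_batch = data_x[idxs]`.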

But when your data is too big to fit into memory, you have to write buffering code to read chunks at a time from file. This is not easy at all.

The PyTorch library has a mechanism to help out. I don't fully understand it yet, but I coded up a demo to explore. Briefly, you code a custom MyDataset class that corresponds to your data. The custom class inherits from the built-in Dataset class. You must implement three methods: __init__(self), __getitem__(self, index), and __len__(self). The __getitem__(self, index) method must be coded to return a PyTorch Tensor (or a tuple of Tensors) that is the data item at the given index.

After implementing the custom Dataset class, you instantiate objects and pass them to the built-in DataLoader class. A DataLoader object can serve up batches.

Whew! That’s a bit complicated.

I took a stab at this mechanism. I cheated by loading all data into memory (which of course defeats the purpose of the technique) but I wanted as simple an example as possible. I used the well-known Iris dataset.

The custom Dataset class is:

import numpy as np
import torch as T
import torch.utils.data as td

class IrisDataset(td.Dataset):
  def __init__(self, file_path):
    # four predictor columns as float32
    self.data_x = np.loadtxt(file_path,
      usecols=range(0,4), delimiter=",", dtype=np.float32)
    # class label column as int64 (what CrossEntropyLoss expects)
    self.data_y = np.loadtxt(file_path,
      usecols=[4], delimiter=",", dtype=np.int64)

  def __getitem__(self, index):
    features = T.tensor(self.data_x[index], dtype=T.float32)
    label = T.tensor(self.data_y[index], dtype=T.long)
    return (features, label)

  def __len__(self):
    return len(self.data_x)
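Before wiring the class into a DataLoader, it's worth checking that indexing works by itself. This sketch repeats the class (lightly adjusted so the snippet runs on its own) and feeds it three made-up rows in the Iris file format, rather than the real data files:

```python
import os
import tempfile
import numpy as np
import torch as T
import torch.utils.data as td

class IrisDataset(td.Dataset):
  def __init__(self, file_path):
    self.data_x = np.loadtxt(file_path,
      usecols=range(0,4), delimiter=",", dtype=np.float32)
    self.data_y = np.loadtxt(file_path,
      usecols=[4], delimiter=",", dtype=np.int64)

  def __getitem__(self, index):
    return (T.tensor(self.data_x[index], dtype=T.float32),
            T.tensor(self.data_y[index], dtype=T.long))

  def __len__(self):
    return len(self.data_x)

# three made-up rows: 4 features then an integer class label
rows = ("5.1,3.5,1.4,0.2,0\n"
        "4.9,3.0,1.4,0.2,0\n"
        "6.3,3.3,6.0,2.5,2\n")
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
  f.write(rows)

ds = IrisDataset(path)
features, label = ds[2]      # calls __getitem__(self, 2)
print(len(ds))               # 3
print(features.shape)        # torch.Size([4])
print(label.item())          # 2
os.remove(path)
```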

The code to create the training and test DataLoader objects is:

  train_ds = IrisDataset(train_file)
  train_ldr = td.DataLoader(train_ds, batch_size=16,
    shuffle=True, num_workers=1)

  test_ds = IrisDataset(test_file)
  test_ldr = td.DataLoader(test_ds, batch_size=30,
    shuffle=False, num_workers=1)
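Before writing a full training loop, it can help to iterate a DataLoader once and inspect the batch shapes. This sketch substitutes the built-in TensorDataset with random stand-in data for the Iris files, just to show what the loader serves up:

```python
import torch as T
import torch.utils.data as td

# stand-in dataset: 30 items, 4 features each, 3 classes
x = T.randn(30, 4)
y = T.randint(0, 3, (30,))
ds = td.TensorDataset(x, y)

# num_workers=0 means items are loaded in the main process,
# which avoids worker-process overhead for small in-memory data
ldr = td.DataLoader(ds, batch_size=16, shuffle=True, num_workers=0)

for (bat_idx, (X, Y)) in enumerate(ldr):
  print(bat_idx, X.shape, Y.shape)
# batch 0: X is [16, 4], Y is [16]
# batch 1: X is [14, 4], Y is [14]  (last batch is partial)
```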

Some code that uses a DataLoader is:

  print("Starting training")
  for epoch in range(0, max_epochs):
    for (bat, data) in enumerate(train_ldr, 0):
      (X, Y) = data
      optimizer.zero_grad()
      oupt = net(X)
      loss_obj = loss_func(oupt, Y)
. . .

There are a lot of tricky details, but my demo code works. I noticed that the DataLoader approach was much, much slower than the standard approach. Part of that may be the num_workers=1 setting, which spawns a separate worker process; for a small in-memory dataset like Iris, num_workers=0 (loading in the main process) is usually faster. Or I may be doing something else wrong.

Coding up my demo has given me a solid understanding of the key principles involved with a DataLoader so now I can investigate a more realistic example.



A simple design is usually better than a complex design.

This entry was posted in Miscellaneous, PyTorch.