I’ve been working with LSTM (long short-term memory) and TA (transformer architecture) prediction systems for natural language processing (NLP) problems. My standard data for NLP experimentation is the IMDB movie review dataset.
A major challenge is getting the raw IMDB data (50,000 movie reviews) and saving it as a training file and a test file. The resulting imdb_train_20w.txt file looks like this:
0 0 0 0 0 12 38 135 9 4 118 38 7 126 45 58 49 34 7 12 1 0 0 0 13 9 4 627 20 30 7 34 32 13 6 21 50 26 59 16 85 1 . . . 0 0 6 68 7 41 12 24 87 8 25 8 757 22 5 674 87 13 20 9 0
The reviews were filtered to only those reviews that have 20 words or fewer. The 20 is a parameter and in a non-demo scenario would be set to a larger value like 80 or 100 words. Each line is a movie review. Reviews are padded with a special 0 token ID. The integer values are token IDs where small values are the most common words. For example, the most common word, “the” = 4, the second most common word, “and” = 5, and so on. Token IDs of 1, 2, and 3 are reserved for other purposes. The last integer on each line is the class label to predict, 0 = negative review, 1 = positive review.
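The filtering, padding, and label layout described above can be sketched as a small helper function. This is just an illustration of the file format, not the actual script I used to create the data files; the names pad_review and max_len are my own:

```python
# Hypothetical helper illustrating the line format: reviews longer
# than max_len are filtered out, shorter ones are pre-padded with
# the special 0 token, and the class label is appended at the end.
def pad_review(token_ids, label, max_len=20):
    if len(token_ids) > max_len:
        return None  # too long -- review is filtered out
    padded = [0] * (max_len - len(token_ids)) + token_ids
    return padded + [label]  # 20 token IDs, then the 0-1 label

line = pad_review([12, 38, 135, 9], 1)
# line is [0, 0, ..., 12, 38, 135, 9, 1] -- 21 integers total
```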
In addition to getting training and test data, another major challenge is serving up training data in batches. My preferred technique is to use the PyTorch Dataset plus DataLoader approach. Briefly, a Dataset loads data into memory and has a special __getitem__() method that returns a single data item. A Dataset object is fed into a DataLoader object which can serve up the data in batches. You must implement a Dataset that’s specific to your data, but a DataLoader can be used as-is.
Here’s a Dataset definition for 20-item IMDB data:
import numpy as np
import torch as T

class IMDB_Dataset(T.utils.data.Dataset):
  # each line: 20 token IDs, then 0 or 1 label, space delimited
  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,21),
      delimiter=" ", comments="#", dtype=np.int64)
    tmp_x = all_xy[:,0:20]  # cols [0,20) = [0,19]
    tmp_y = all_xy[:,20]    # all rows, just col 20
    self.x_data = T.tensor(tmp_x, dtype=T.int64)
    self.y_data = T.tensor(tmp_y, dtype=T.int64)  # CE loss

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    token_ids = self.x_data[idx]
    trgts = self.y_data[idx]
    return (token_ids, trgts)
The __init__() method loads all the data from a training or test file into X (inputs) and Y (class labels) buffers in memory as PyTorch tensors. I use the numpy loadtxt() function to read the numeric data. There are many alternatives, including using a pandas library DataFrame.
The __getitem__() method accepts an index and returns a single item’s input values, and the associated class label, as a tuple. Alternatives include returning the values in a Dictionary, a List, or an Array.
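The Dictionary alternative looks like this. The class below is a stripped-down stand-in (no PyTorch, plain Python lists) just to show the return style; the key names "tokens" and "label" are my own, not required by PyTorch:

```python
# Minimal stand-in Dataset whose __getitem__() returns a dictionary
# instead of a tuple. The default DataLoader collate logic handles
# dictionary items too, batching each key separately.
class DictItemDataset:
    def __init__(self, x_rows, y_vals):
        self.x_data = x_rows   # list of token-ID lists
        self.y_data = y_vals   # list of 0-1 labels

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        return {"tokens": self.x_data[idx], "label": self.y_data[idx]}

ds = DictItemDataset([[4, 5, 6], [7, 8, 9]], [0, 1])
item = ds[1]  # {"tokens": [7, 8, 9], "label": 1}
```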
The Dataset can be used like this:
train_file = ".\\Data\\imdb_train_20w.txt"
train_ds = IMDB_Dataset(train_file)

train_ldr = T.utils.data.DataLoader(train_ds,
  batch_size=3, shuffle=True, drop_last=True)

# serve up all data in batches of 3 items
for (bix, batch) in enumerate(train_ldr):
  (X, Y) = batch
  # use X and Y to update model weights
  . . .
The drop_last=True argument means that if the total number of data items is not evenly divisible by the batch size, the final batch, which would be smaller than all the other batches, is discarded rather than served up.
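The effect of drop_last on the number of batches served is easy to verify with plain arithmetic. As an example, assume 100 training items and batch_size=3 (these numbers are mine, not from the data files above):

```python
# How drop_last changes the batch count: with 100 items and
# batch_size=3, the final batch would hold only 1 item.
n_items, batch_size = 100, 3
n_batches_drop = n_items // batch_size      # drop_last=True:  33 batches
n_batches_keep = -(-n_items // batch_size)  # drop_last=False: 34 batches
```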
The shuffle=True argument is important during training so that the weight updates don’t go into an oscillation which could stall training.
The DataLoader class has several other parameters, but I almost never use them.
For some PyTorch models, such as a standard deep neural network, the batches can be used directly. For LSTM and TA systems, the X input values in each batch must be transposed, because those PyTorch modules by default expect input with shape (seq_len, batch) rather than (batch, seq_len).
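The transpose step can be sketched as follows. I use NumPy here just to show the shape change; with the PyTorch tensors served by the DataLoader above, the equivalent call is X = X.transpose(0, 1):

```python
import numpy as np

# A batch of 3 reviews, 20 token IDs each, as served by the
# DataLoader: shape (batch, seq_len) = (3, 20). LSTM and
# Transformer modules want (seq_len, batch) = (20, 3) by default.
X = np.zeros((3, 20), dtype=np.int64)  # dummy batch, all pad tokens
X_t = X.transpose()                    # now shape (20, 3)
```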

A DataLoader serves up machine learning training data. A restaurant waitress serves up food. When I was an undergraduate student at the University of California at Irvine, I worked at Disneyland in Anaheim. Sometimes I worked on the Jungle Cruise ride. The ride passed by the Tahitian Terrace restaurant. The Jungle Cruise operators would sometimes hang out with the Tahitian Terrace waitresses at parties like the annual employee Banana Ball.
