Why Standard PyTorch Datasets Don’t Work for LSTM Networks

I have worked with LSTM (long short-term memory) networks for a long time. LSTMs are very complex. One common pitfall in PyTorch is that for a default LSTM module, which uses “sequence-first” geometry, you can serve up training data with the standard PyTorch Dataset plus DataLoader technique only for a batch size of 1. For any other batch size you must modify the default DataLoader output in some way, or modify the LSTM module.

The standard LSTM example is the IMDB movie review sentiment problem. The goal is to predict if a movie review is positive (“It was a great movie”) or negative (“I didn’t like this film”).



Briefly, if you use the default Dataset plus DataLoader technique to serve up data, the technique works for a batch size of 1 because there’s no geometry to the batch. But if you try to use a batch size of 2 or more, the DataLoader serves up batches of training data that are stacked vertically (one on top of another), but an LSTM with the default configuration (“sequence-first”) requires data that is stacked horizontally (left-to-right).

Sadly, there’s no easy way to modify the behavior of a DataLoader to serve up batches with a non-default geometry. Therefore, there are three main approaches. First, you can take a batch of data from a DataLoader, and reshape to a horizontal geometry. Second, you can implement a custom data loader from scratch that serves up batches with the correct horizontal geometry. Third, you can specify “batch_first” in the LSTM. All three approaches are a bit tricky, but the first approach — reshaping standard DataLoader batches — is easiest in my opinion.

Suppose you have input sentences with just 5 words, where the words have been encoded as integer token IDs, with padding prepended as 0 and the class 0-1 label as the last value. The source data file might look like:

0   0   4  283  19  0
0  29  16   98  53  1
13  9   8  104  38  0
. . .
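To make the idea concrete, here is a minimal sketch of a Dataset that could hold such data. The class name ReviewDataset is my own invention, and for simplicity I read the rows from an in-memory list of strings rather than a file — a real version would parse the source data file.

```python
import torch
from torch.utils.data import Dataset

class ReviewDataset(Dataset):
  # hypothetical sketch: each row is 5 token IDs followed by a 0-1 label
  def __init__(self, lines):
    rows = [[int(tok) for tok in ln.split()] for ln in lines]
    self.x = torch.tensor([r[:-1] for r in rows],
      dtype=torch.int64)  # token IDs
    self.y = torch.tensor([r[-1] for r in rows],
      dtype=torch.int64)  # class labels

  def __len__(self):
    return len(self.x)

  def __getitem__(self, idx):
    return self.x[idx], self.y[idx]

lines = ["0 0 4 283 19 0",
         "0 29 16 98 53 1",
         "13 9 8 104 38 0"]
ds = ReviewDataset(lines)
x0, y0 = ds[0]
print(x0.tolist(), y0.item())  # [0, 0, 4, 283, 19] 0
```

Each item is a (tokens, label) pair, which is what the DataLoader later collates into batches.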

If you create a Dataset and DataLoader with batch_size=2, a batch of input data would look like:

[[0,  0,  4, 283, 19],
 [0, 29, 16,  98, 53]]

But for a PyTorch LSTM module with default geometry, the required input has to look like:

[[0,    0],
 [0,   29],
 [4,   16],
 [283, 98],
 [19,  53]]
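You can verify the two geometries with a few lines of throwaway code. A default DataLoader batch has shape (batch_size, seq_len); transposing the first two dimensions gives the (seq_len, batch_size) layout shown above:

```python
import torch

# batch as served by a default DataLoader: shape (batch_size=2, seq_len=5)
batch = torch.tensor([[0,  0,  4, 283, 19],
                      [0, 29, 16,  98, 53]])
seq_first = torch.transpose(batch, 0, 1)  # shape (seq_len=5, batch_size=2)
print(seq_first.shape)  # torch.Size([5, 2])
print(seq_first[3].tolist())  # [283, 98]
```

Row 3 of the transposed batch holds the fourth token of each of the two sequences, matching the horizontal layout the LSTM expects.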

To reshape the default DataLoader batch, you can transpose the batch. Code would look like:

train_ldr = torch.utils.data.DataLoader(train_ds,
  batch_size=2, shuffle=True, drop_last=True)

for (bix, batch) in enumerate(train_ldr):
  x = batch[0]  # input tokens, shape (2, 5)
  X = torch.transpose(x, 0, 1)  # sequence-first, shape (5, 2)
  Y = batch[1]  # target labels
  . . . 

This transpose approach is simpler than implementing a custom data loader or setting batch_first=True, but has the possible disadvantage of being slower than a custom loader (depending on exactly how you implement the loader). With the transpose approach, I usually set drop_last=True in the DataLoader so that all batches are the same size.

I’ll show the batch_first approach in another blog sometime.

Fascinating stuff.



Creating models for natural language processing (NLP) problems is very difficult. One of many NLP issues is the ambiguity of the English language — a word can have multiple meanings. Here are three images from an Internet search for “models with models”. I did not search for “models with models designing ML models” because I was pretty sure I wouldn’t find any.

