A Preliminary Look at the New torchtext Library for PyTorch

Update: The new TorchText library was released in version 0.9 in March 2021, so this post is historical information.

The PyTorch neural network code library has several closely related libraries, such as torchvision for image processing and torchtext for natural language processing. The existing torchtext library has common datasets such as the IMDB dataset for sentiment analysis. The torchtext library also has many functions that work with the datasets, such as functions to load a dataset, parse a dataset, and build a vocabulary of words from a dataset.

Unfortunately, the torchtext library has two big problems. First, the current torchtext Dataset objects are not compatible with standard PyTorch torch.utils.data Dataset objects. Second, the current torchtext API is really weird and ugly.

When you use a current torchtext dataset, you get all kinds of warnings that everything in the library is being deprecated. I came across a nice write-up, in a GitHub issue, that describes the new API for the new torchtext library, which is under development: github.com/pytorch/text/issues/664

In order to experiment with the new torchtext library, I had to install the experimental nightly builds of PyTorch and torchtext. I went to download.pytorch.org/whl/nightly/cpu/torch_nightly.html and found a whl file for the January 30, 2021 build of PyTorch, and the torchtext build for the same day. I downloaded the torch and torchtext whl files to my local machine, uninstalled my current torch and torchtext packages, and then installed the experimental nightly builds without any problems. I was lucky, because nightly builds are often wildly unstable.

I coded up a little demo. The demo loads the IMDB dataset, and splits it into training and test sets. Then the demo creates a vocabulary from the training data. And then the demo extracts a short movie review I found at index position [93] in the training data.

Here are some of the key lines of code in the demo:

import torchtext as tt
toker = tt.data.utils.get_tokenizer("basic_english")
train_ds, test_ds = \
  tt.experimental.datasets.IMDB(tokenizer=toker)
tmp_vocab = train_ds.get_vocab()
vocab = tt.vocab.Vocab(counter=tmp_vocab.freqs, \
  max_size=14_000, min_freq=10)
for idx, (label, txt) in enumerate(train_ds):
  # idx is 0, 1, . . .
  # label is 0 or 1 (the sentiment)
  # txt is like [29, 70, 10, . . .] (token ids of the review)
  pass  # process each (label, review) pair here

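The loop above yields token ids rather than words. Decoding the ids back to text with a vocabulary's id-to-string list works the way the demo's display logic does; here is a minimal sketch in plain Python (no torchtext required), using a toy vocabulary and a hypothetical `decode` helper:

```python
# A toy id-to-string list standing in for vocab.itos; in the demo the
# vocabulary has max_size=14_000 entries and ids at or beyond that
# map to an "[unk]" (unknown word) token.
itos = ["<pad>", "<unk>", "the", "movie", "was", "great"]

def decode(token_ids, itos, max_size=6):
  # look each id up in the id-to-string list; out-of-range ids
  # become "[unk]", mirroring the demo's display logic
  words = []
  for n in token_ids:
    if n > max_size - 1:
      words.append("[unk]")
    else:
      words.append(itos[n])
  return words

print(decode([2, 3, 4, 5, 99], itos))
# ['the', 'movie', 'was', 'great', '[unk]']
```

The real demo does the same lookup with `vocab.itos[n]` and a cutoff of 13,999.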
The new torchtext Dataset object has the same structure as a standard torch.utils.data.Dataset object, so it can be used with a regular DataLoader. At the time I wrote this post, the new API doesn't have a nice way to serve up batches of data that have similar review lengths (the current API has a BucketIterator class that does exactly that). You can easily write code to adapt the new API to serve up batches with similar review lengths, but perhaps the new API will have a built-in way to do this by the time it is released.
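The idea behind BucketIterator-style batching is simple: sort the examples by length, then slice off consecutive batches, so each batch needs little padding. A minimal sketch in plain Python (toy data and a hypothetical `bucketed_batches` helper, not part of the torchtext API):

```python
# Toy reviews represented as lists of token ids, with varying lengths.
reviews = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10], [11, 12]]

def bucketed_batches(examples, batch_size):
  # sorting by length puts similar-length examples next to each other,
  # which minimizes padding when each batch is later collated
  ordered = sorted(examples, key=len)
  return [ordered[i:i + batch_size]
          for i in range(0, len(ordered), batch_size)]

for batch in bucketed_batches(reviews, 2):
  print([len(r) for r in batch])
# [1, 2]
# [2, 3]
# [4]
```

In practice you would feed each batch through a collate function that pads reviews to the longest length in the batch.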

The new torchtext library API is a big improvement over the current API. I’m looking forward to using the new library and its API when the library reaches stability and is released.



In IMDB movie reviews, each movie gets a text review and a rating from 1 to 10 stars. In the IMDB machine learning dataset, movies that were rated from 1 to 4 stars are labeled as class 0 (bad), and movies that were rated from 7 to 10 stars are labeled as class 1 (good). Movies rated 5 or 6 stars exist, but are not used in the main dataset. Here are three movies that have posters that are much better (in my opinion anyway) than the movies themselves. Left: “The Brain Eaters” (1958) has a rating of 4.8 stars (class 0), but I rate the poster a 9 out of 10. Center: “Beginning of the End” (1957) has a rating of 3.8 stars (class 0), but I rate the poster an 8 out of 10. Right: “Target Earth” (1954) has a rating of 4.6 stars (class 0), but I rate the poster an 8.5 out of 10.
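The star-to-class mapping described above can be sketched as a tiny function (the function name is mine, not part of the dataset; neutral 5-6 star reviews are simply excluded):

```python
# Map an IMDB star rating (1-10) to the binary class label used in
# the IMDB machine learning dataset: 1-4 stars -> class 0 (bad),
# 7-10 stars -> class 1 (good), 5-6 stars -> excluded (None here).
def stars_to_class(stars):
  if 1 <= stars <= 4:
    return 0
  if 7 <= stars <= 10:
    return 1
  return None  # neutral reviews are not in the main dataset

print(stars_to_class(3))   # 0
print(stars_to_class(8))   # 1
print(stars_to_class(5))   # None
```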


# new_torchtext_imdb.py
# the old way to access datasets is being revamped
# https://github.com/pytorch/text/issues/664

import torchtext as tt
import time

def print_w_time(msg):  # print a message followed by a timestamp
  print(msg + "  ", end="")
  dt = time.strftime("%Y_%m_%d-%H_%M_%S")
  print(dt)

def main():
  print("\nBegin demo of new torchtext interface for IMDB ")

  print_w_time("\nFetching IMDB using basic_english tokenizer ")
  toker = tt.data.utils.get_tokenizer("basic_english")
  train_ds, test_ds = \
    tt.experimental.datasets.IMDB(tokenizer=toker)
  print_w_time("Data has been fetched ")

  print_w_time("\nCreating vocabulary, min_freq=10 ")
  tmp_vocab = train_ds.get_vocab()
  vocab = tt.vocab.Vocab(counter=tmp_vocab.freqs,
    max_size=14_000, min_freq=10)
  print_w_time("Vocabulary created ")

  print("\nExamining short train item [93] ")
  for idx, (label, txt) in enumerate(train_ds):
    if idx == 93:
      print(str(idx) + "  " + str(label) + "  " + str(txt))
      for i in range(len(txt)):
        if i % 16 == 0: print("")
        n = txt[i].item()
        if n > 13_999:
          s = "[unk]"
        else: 
          s = vocab.itos[n]
        print(s + " ", end="")
      print("")
      break
 
  print("\nEnd demo \n")

if __name__ == "__main__":
  main()