IMDB Sentiment Classification Using PyTorch 1.10 on Windows 11

One of my standard neural network examples is sentiment classification on the IMDB Movie Review dataset. The goal is to predict the sentiment (0 = negative, 1 = positive) of a natural language movie review such as, “The movie was a great waste of my time.” This is a very difficult natural language processing (NLP) problem.

My basic example uses an LSTM (long short-term memory) architecture. I have an advanced version that uses a Transformer architecture, but that’s another topic.

Processing the data to get it ready for a neural system is a big challenge. I fetched the raw movie review data from https://ai.stanford.edu/~amaas/data/sentiment/. The movie review data is in gzip-compressed tar (.tar.gz) format. I extracted the data using the 7-Zip utility program. This created a complex set of files. There are a total of 50,000 movie reviews: 25,000 for training and 25,000 for testing. Each set has 12,500 positive reviews and 12,500 negative reviews.
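As a sanity check on those counts, here is a minimal sketch that tallies the review files directly inside the archive using Python's standard tarfile module, without extracting anything. It is not part of the demo. The folder layout (a root folder, then train or test, then pos or neg, then one .txt file per review) and the function name count_reviews are assumptions for illustration.

```python
import tarfile
from collections import Counter

def count_reviews(archive_path):
    """Count review files per (split, label) inside a .tar.gz archive."""
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            parts = member.name.split("/")
            # Assumed layout: <root>/<train|test>/<pos|neg>/<file>.txt
            if (member.isfile() and len(parts) == 4
                    and parts[1] in ("train", "test")
                    and parts[2] in ("pos", "neg")
                    and parts[3].endswith(".txt")):
                counts[(parts[1], parts[2])] += 1
    return counts

# Usage (the archive file name is hypothetical):
# counts = count_reviews("imdb_reviews.tar.gz")
# for key in sorted(counts):
#     print(key, counts[key])
```

If the layout assumption holds, each of the four (split, label) counts should come out to 12,500.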

I wrote a Python helper program that filters the training and test reviews down to those with 50 words or fewer. The program converts each word to an integer ID where low numbers are common words, e.g. ID = 4 is “the”, ID = 5 is “and”, and so on. ID = 0 is used for padding so that all reviews have exactly 50 tokens. See https://jamesmccaffreyblog.com/2022/01/17/imdb-movie-review-sentiment-analysis-using-an-lstm-with-pytorch/.
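The frequency-based encoding can be sketched in a few lines. This is an illustration of the idea, not the actual helper program from the linked post. It assumes that IDs 1 to 3 are reserved so the most common word gets ID = 4, that words are split on whitespace after lowercasing, and that ties are broken alphabetically. The names build_vocab and encode_short_reviews are hypothetical.

```python
from collections import Counter

FIRST_WORD_ID = 4   # assumption: IDs 0-3 are reserved (0 = padding)
MAX_WORDS = 50      # keep only reviews with this many words or fewer

def build_vocab(reviews):
    """Map each word to an ID; more frequent words get lower IDs."""
    freq = Counter(w for text in reviews for w in text.lower().split())
    # sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {w: FIRST_WORD_ID + rank for rank, (w, _) in enumerate(ranked)}

def encode_short_reviews(reviews, vocab, max_words=MAX_WORDS):
    """Drop reviews longer than max_words; convert the rest to ID lists."""
    encoded = []
    for text in reviews:
        words = text.lower().split()
        if len(words) <= max_words:
            encoded.append([vocab[w] for w in words])
    return encoded

# Toy corpus: "the" is the most common word, "and" the second most common
reviews = ["the movie was great", "the plot and the acting", "and the end"]
vocab = build_vocab(reviews)
print(vocab["the"], vocab["and"])
print(encode_short_reviews(reviews, vocab))
```

A real tokenizer would also have to deal with punctuation, HTML remnants, and words that appear only in the test set, none of which this sketch handles.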

The result is a file where each line is a movie review. The first 50 values are the encoded movie review words and the last value on the line is 0 (negative review) or 1 (positive review). I put the padding 0s at the beginning of each review that’s shorter than 50 words but I could have put them at the end.
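The line format described above can be illustrated with a short sketch that front-pads one encoded review and appends its label. The token IDs are the ones the demo code uses for "the movie was a great waste of my time". The function name to_line is hypothetical, and the space delimiter matches what the demo's Dataset class expects.

```python
def to_line(token_ids, label, seq_len=50, pad_id=0):
    """Front-pad token_ids to seq_len, append the label, join with spaces."""
    if len(token_ids) > seq_len:
        raise ValueError("review is longer than seq_len tokens")
    padded = [pad_id] * (seq_len - len(token_ids)) + list(token_ids)
    return " ".join(str(v) for v in padded + [label])

# 9 tokens, so 41 padding zeros go in front; label 0 = negative review
review = [4, 20, 16, 6, 86, 425, 7, 58, 64]
line = to_line(review, label=0)
print(line)
```

Padding at the end instead would only mean swapping the two operands of the list concatenation.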

I designed a simple LSTM network. I used an Embedding layer where each word/token ID is converted to a vector of 32 values. My simple LSTM layer uses an internal state size of 100 values. I used a batch-first geometry (dealing with the geometries of NLP systems is a big pain). I applied a Dropout layer to limit overfitting, then a Linear layer with sigmoid() activation to condense the output to a single value between 0.0 and 1.0. An output of less than 0.5 means class 0 = negative review; otherwise the prediction is class 1 = positive review.
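To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. The vocabulary size of 129,892 comes from the demo code. The LSTM formula assumes the PyTorch convention of four gates, each with input weights, recurrent weights, and two bias vectors.

```python
VOCAB_SIZE = 129892   # from the demo code
EMBED_DIM = 32        # values per token
HIDDEN = 100          # LSTM state size

# Embedding: one 32-value vector per vocabulary entry
embed_params = VOCAB_SIZE * EMBED_DIM                               # 4,156,544

# LSTM: 4 gates x (input weights + recurrent weights + 2 bias vectors)
lstm_params = 4 * (EMBED_DIM * HIDDEN + HIDDEN * HIDDEN + 2 * HIDDEN)  # 53,600

# Linear: 100 weights plus 1 bias for the single output node
linear_params = HIDDEN * 1 + 1                                      # 101

total_params = embed_params + lstm_params + linear_params           # 4,210,245
print("embedding:", embed_params)
print("lstm     :", lstm_params)
print("linear   :", linear_params)
print("total    :", total_params)
```

Almost all of the roughly 4.2 million parameters sit in the Embedding layer, which is one reason a small training set makes this problem hard.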

For training, I used Adam optimization with an initial learning rate of 0.001 and a batch size of 16 reviews.
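The demo code pairs the sigmoid output with binary cross entropy loss. As a worked illustration of what that loss measures, here is a plain-Python sketch that averages the per-item loss over a batch. It assumes mean reduction and is not a substitute for the library implementation. The function name bce_loss is hypothetical.

```python
import math

def bce_loss(outputs, targets):
    """Mean binary cross entropy for outputs in (0, 1) and targets 0 or 1."""
    total = 0.0
    for p, y in zip(outputs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(outputs)

# Confident and correct predictions give a small loss
print("%0.4f" % bce_loss([0.9, 0.1], [1, 0]))
# An undecided output of 0.5 costs ln(2) regardless of the target
print("%0.4f" % bce_loss([0.5], [1]))
# Confident and wrong predictions are penalized heavily
print("%0.4f" % bce_loss([0.1, 0.9], [1, 0]))
```

The loss value the demo prints for each logged epoch is the sum of these per-batch averages, so it depends on the number of batches as well as on how well the model fits.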

The demo achieved about 74% accuracy on the test data, which is pretty weak because I didn’t use enough reviews. I got much better results by using reviews with 80 words or fewer.

NLP problems are, in my opinion, among the most difficult in all of machine learning. But they’re very interesting.

In general, I don’t like movies that are intended for children and have child characters. But here are three such movies that I’d give a positive review/sentiment:

Left: “Matilda” (1996) tells the story of young Matilda Wormwood. She has horrible parents, a horrible brother, and goes to a school run by the horrible Agatha Trunchbull. But Matilda’s teacher, Miss Honey, helps Matilda discover her secret powers.

Center: “A Series of Unfortunate Events” (2004) tells the story of three orphans who are the targets of the evil Count Olaf. But everything works out well in the end.

Right: “Nanny McPhee” (2005) tells the story of widower Cedric Brown who has seven children. The children drive away one nanny after another until Nanny McPhee arrives. She has magical powers and brings order to the household and love to Cedric.

Demo code.

# imdb_lstm.py
# uses preprocessed data instead of built-in data
# batch_first geometry
# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

import numpy as np
import torch as T
device = T.device('cpu')

# -----------------------------------------------------------

class LSTM_Net(T.nn.Module):
  def __init__(self):
    # vocab_size = 129892
    super(LSTM_Net, self).__init__()
    self.embed = T.nn.Embedding(129892, 32)
    self.lstm = T.nn.LSTM(32, 100, batch_first=True)
    self.do1 = T.nn.Dropout(0.20)
    self.fc1 = T.nn.Linear(100, 1)  # binary
 
  def forward(self, x):
    # x = review/sentence. length = fixed w/ padding (front)
    z = self.embed(x)  # expand each token to 32 values
    z = z.reshape(-1, 50, 32)  # bat seq embed
    lstm_oupt, (h_n, c_n) = self.lstm(z)
    z = lstm_oupt[:,-1]  # last time step, shape [bs,100]
    z = self.do1(z)
    z = T.sigmoid(self.fc1(z))  # BCELoss()
    return z 

# -----------------------------------------------------------

class IMDB_Dataset(T.utils.data.Dataset):
  # 50 token IDs then 0 or 1 label, space delimited
  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,51),
      delimiter=" ", comments="#", dtype=np.int64)
    tmp_x = all_xy[:,0:50]   # cols [0,50) = [0,49]
    tmp_y = all_xy[:,50]     # all rows, just col 50
    self.x_data = T.tensor(tmp_x, dtype=T.int64).to(device) 
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)
    self.y_data = self.y_data.reshape(-1, 1)  # float32 2D 

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    tokens = self.x_data[idx]
    trgts = self.y_data[idx] 
    return (tokens, trgts)

# -----------------------------------------------------------

def accuracy(model, dataset):
  # dataset yields (token IDs, label) pairs
  # assumes model.eval()
  num_correct = 0; num_wrong = 0
  ldr = T.utils.data.DataLoader(dataset,
    batch_size=1, shuffle=False)
  for (batch_idx, batch) in enumerate(ldr):
    X = batch[0]  # inputs
    Y = batch[1]  # target sentiment label 0 or 1

    with T.no_grad():
      oupt = model(X)  # single [0.0, 1.0]
    if oupt < 0.5 and Y == 0:
      num_correct += 1
    elif oupt >= 0.5 and Y == 1:
      num_correct += 1
    else:
      num_wrong += 1
    
  acc = (num_correct * 100.0) / (num_correct + num_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PyTorch IMDB LSTM demo ")
  print("Using only reviews with 50 or less words ")
  T.manual_seed(3)  
  np.random.seed(3)

  # 1. load data 
  print("\nLoading preprocessed train and test data ")
  train_file = ".\\Data\\imdb_train_50w.txt"
  train_ds = IMDB_Dataset(train_file) 

  test_file = ".\\Data\\imdb_test_50w.txt"
  test_ds = IMDB_Dataset(test_file) 

  bat_size = 16
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True, drop_last=False)
  n_train = len(train_ds)
  n_test = len(test_ds)
  print("Num train = %d Num test = %d " % (n_train, n_test))

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating LSTM binary classifier ")
  net = LSTM_Net().to(device)

  # 3. train model
  loss_func = T.nn.BCELoss()  # binary cross entropy
  lrn_rate = 0.001
  optimizer = T.optim.Adam(net.parameters(), lr=lrn_rate)
  max_epochs = 30
  log_interval = 5  # display progress

  print("\nbatch size = " + str(bat_size))
  print("loss func = " + str(loss_func))
  print("optimizer = Adam ")
  print("learn rate = %0.4f " % lrn_rate)
  print("max_epochs = %d " % max_epochs)

  print("\nStarting training ")
  net.train()  # set training mode
  for epoch in range(0, max_epochs):
    tot_err = 0.0  # for one epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # [bs,50]
      Y = batch[1]
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y) 
      tot_err += loss_val.item()
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights
  
    if epoch % log_interval == 0:
      print("epoch = %4d  |" % epoch, end="")
      print("   loss = %10.4f  |" % tot_err, end="")
      net.eval()
      train_acc = accuracy(net, train_ds)
      print("  acc = %8.2f%%" % train_acc)
      net.train()

  print("Training complete")

# -----------------------------------------------------------

  # 4. evaluate model
  net.eval()
  test_acc = accuracy(net, test_ds)
  print("\nAccuracy on test data = %8.2f%%" % test_acc)

  # 5. save model
  print("\nSaving trained model state")
  # fn = ".\\Models\\imdb_model.pt"
  # T.save(net.state_dict(), fn)

  # saved_model = LSTM_Net()
  # saved_model.load_state_dict(T.load(fn))
  # use saved_model to make prediction(s)

  # 6. use model
  print("\nFor \"the movie was a great waste of my time\"")
  print("0 = negative, 1 = positive ")
  review = np.array([4, 20, 16, 6, 86, 425, 7, 58, 64],
    dtype=np.int64)  # cheating . . 
  padding = np.zeros(50-len(review), dtype=np.int64)
  review = np.concatenate([padding, review])
  review = T.tensor(review, dtype=T.int64).to(device)
  
  net.eval()
  with T.no_grad():
    prediction = net(review)  # sigmoid output in [0.0, 1.0]
  print("raw output : ", end="")
  print("%0.4f " % prediction.item())
  
  print("\nEnd PyTorch IMDB LSTM sentiment demo")

if __name__ == "__main__":
  main()