IMDB Sentiment Classification Using Keras 2.8 on Windows 11

One of my standard neural network examples is sentiment classification on the IMDB Movie Review dataset. The goal is to predict the sentiment (0 = negative, 1 = positive) of a natural language movie review such as, “The movie was a great waste of my time.” This is a very difficult natural language processing (NLP) problem.

My basic example uses an LSTM (long short-term memory) architecture. I have an advanced version that uses a Transformer architecture, but that’s another topic.

Processing the data to get it ready for a neural system is a big challenge. I fetched the raw movie review data from https://ai.stanford.edu/~amaas/data/sentiment/. The movie review data is a gzipped tar archive (gnu-zip, tape-archive format). I extracted the data using the 7-Zip utility program. This created a complex set of files. There are a total of 50,000 movie reviews: 25,000 for training and 25,000 for testing. Each set has 12,500 positive reviews and 12,500 negative reviews.

I wrote a Python helper program that filters the training and test reviews down to those with 50 words or fewer. The program converts each word to an integer ID, where low numbers correspond to common words, e.g. ID = 4 is “the”, ID = 5 is “and”, and so on. ID = 0 is reserved for padding so that all reviews have exactly 50 tokens. See https://jamesmccaffreyblog.com/2022/01/17/imdb-movie-review-sentiment-analysis-using-an-lstm-with-pytorch/.
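The encoding step can be sketched roughly as follows. This is not the actual helper program (that one is at the link above). The function name encode_review, the out-of-vocabulary ID, and the naive whitespace tokenization are all assumptions made for illustration. The word IDs in the tiny vocabulary are taken from the example review in the demo code.

```python
PAD_ID = 0   # reserved for padding, as described above
OOV_ID = 2   # assumed ID for words not in the vocabulary (hypothetical)

# Tiny illustrative vocabulary; a real one has an entry for every word.
vocab = {"the": 4, "and": 5, "a": 6, "of": 7, "was": 16, "movie": 20,
         "my": 58, "time": 64, "great": 86, "waste": 425}

def encode_review(text, vocab, max_len=50):
    """Return max_len token IDs, or None if the review has too many words."""
    words = text.lower().split()          # naive whitespace tokenization
    if len(words) > max_len:
        return None                       # review is filtered out
    ids = [vocab.get(w, OOV_ID) for w in words]
    return [PAD_ID] * (max_len - len(ids)) + ids   # pad at the front

ids = encode_review("The movie was a great waste of my time", vocab)
print(len(ids), ids[-9:])
```

A real tokenizer would also have to deal with punctuation, HTML remnants, and capitalization, which this sketch mostly ignores.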

The result is a file where each line is a movie review. The first 50 values are the encoded movie review words and the last value on the line is 0 (negative review) or 1 (positive review). I put the padding 0s at the beginning of each review that’s shorter than 50 words but I could have put them at the end.
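To make the line layout concrete, here is a small sketch that builds one line and splits it back into tokens and label. The helper names to_line and from_line are hypothetical, and the space delimiter is an assumption based on the delimiter used when the demo loads the files.

```python
def to_line(ids, label):
    """Join 50 token IDs and the 0/1 label into one space-delimited line."""
    return " ".join(str(v) for v in list(ids) + [label])

def from_line(line):
    """Split a line back into (token_ids, label)."""
    vals = [int(v) for v in line.split()]
    return vals[:-1], vals[-1]

review_ids = [4, 20, 16, 6, 86, 425, 7, 58, 64]   # 9 encoded words
n_pad = 50 - len(review_ids)

pre_padded = [0] * n_pad + review_ids     # zeros first (what I used)
post_padded = review_ids + [0] * n_pad    # zeros last (the alternative)

line = to_line(pre_padded, 0)   # label 0 = negative review
x, y = from_line(line)
print(len(x), y)
```

Either padding scheme gives 51 values per line; only the position of the zeros differs.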

I designed a simple LSTM network. I used an Embedding layer where each word/token ID is converted to a vector of 32 values. My simple LSTM layer uses an internal state size of 100 values. I used a batch-first geometry (dealing with the geometries of NLP systems is a big pain). I applied a Dropout layer to limit overfitting and a Dense layer with sigmoid() activation to condense the output to a single value between 0.0 and 1.0. An output of less than 0.5 means class 0 = negative review; otherwise the prediction is class 1 = positive review.
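As a sanity check on the architecture, the shapes and trainable weight counts can be worked out by hand. The sketch below assumes the standard Keras LSTM layout (four gates, each with input weights, recurrent weights, and one bias vector) and uses the vocabulary size from the Embedding layer in the demo code.

```python
vocab_size = 129892   # input_dim of the Embedding layer in the demo
embed_dim = 32        # values per word
lstm_units = 100      # LSTM state size
seq_len = 50          # tokens per review

# Shapes for a batch of 16 reviews (batch-first):
#   input      (16, 50)
#   Embedding  (16, 50, 32)
#   LSTM       (16, 100)   only the final state is passed on
#   Dropout    (16, 100)
#   Dense      (16, 1)

embed_params = vocab_size * embed_dim
lstm_params = 4 * (embed_dim * lstm_units        # input weights
                   + lstm_units * lstm_units     # recurrent weights
                   + lstm_units)                 # biases
dense_params = lstm_units * 1 + 1                # weights plus one bias
total_params = embed_params + lstm_params + dense_params

print(embed_params, lstm_params, dense_params, total_params)
```

Almost all of the weights are in the Embedding layer, which is one reason a larger embedding vector length gets expensive quickly.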

For training, I used Adam optimization with an initial learning rate of 0.001 and a batch size of 16 reviews.

The demo achieved about 79% accuracy on the test data, which is pretty weak because I didn’t use enough reviews. I got much better results by using reviews with 80 words or fewer.
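For reference, accuracy here is just the fraction of reviews where the thresholded output matches the label. The sketch below is a from-scratch illustration with made-up values, not the Keras implementation. It follows the rule that an output of less than 0.5 is class 0, and the built-in Keras metric may treat an output of exactly 0.5 differently.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of sigmoid outputs whose thresholded class equals the label."""
    preds = [0 if p < threshold else 1 for p in probs]
    n_correct = sum(1 for p, t in zip(preds, labels) if p == t)
    return n_correct / len(labels)

probs = [0.10, 0.75, 0.50, 0.49]   # made-up sigmoid outputs
labels = [0, 1, 0, 0]              # made-up true classes
print(accuracy(probs, labels))
```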

NLP problems are, in my opinion, among the most difficult in all of machine learning. But they’re very interesting.

In the days before the Internet, movie posters were extremely important for marketing. The older a movie poster is, the more likely it is to have more detail and hints about the plot and characters. Left: “Dr. No” (1962), the first Bond movie, starring Sean Connery. Center: “GoldenEye” (1995), starring Pierce Brosnan. Right: “Casino Royale” (2006), starring Daniel Craig.

Demo code. Replace “lt”, “gt”, “lte”, “gte” with Boolean operator symbols. My lame blog editor chokes on symbols.

# imdb_lstm_tfk.py
# LSTM for sentiment analysis on the IMDB dataset
# Anaconda3-2020.02  (Python 3.7.6)
# TensorFlow 2.8.0 (includes KerasTF 2.8.0)
# Windows 10/11

# -----------------------------------------------------------

import numpy as np
import tensorflow as tf
from tensorflow import keras as K
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

# -----------------------------------------------------------

class MyLogger(K.callbacks.Callback):
  def __init__(self, n):
    super().__init__()  # initialize base Callback state
    self.n = n   # print loss & acc every n epochs

  def on_epoch_end(self, epoch, logs=None):
    logs = logs or {}  # avoid a mutable default argument
    if epoch % self.n == 0:
      curr_loss = logs.get('loss')
      curr_acc = logs.get('accuracy') * 100
      print("epoch = %4d  |  loss = %0.6f  |  acc = %0.2f%%" \
        % (epoch, curr_loss, curr_acc))

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Keras IMDB LSTM demo ")
  print("Using only reviews with 50 or fewer words ")
  np.random.seed(3)
  tf.random.set_seed(3)

  # 1. load data
  print("\nLoading preprocessed train and test data ")
  train_file = ".\\Data\\imdb_train_50w.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,51),
      delimiter=" ", comments="#", dtype=np.int64)
  train_x = train_xy[:,0:50]   # cols [0,50) = [0,49]
  train_y = train_xy[:,50]

  test_file = ".\\Data\\imdb_test_50w.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,51),
    delimiter=" ", comments="#", dtype=np.int64)
  test_x = test_xy[:,0:50]   # cols [0,50) = [0,49]
  test_y = test_xy[:,50]

  n_train = len(train_x)
  n_test = len(test_x)
  print("Num train = %d Num test = %d " % (n_train, n_test))

# -----------------------------------------------------------  
  
  # 2. define model
  print("\nCreating LSTM binary classifier ")
  lrn_rate = 0.001
  opt_adam = K.optimizers.Adam(learning_rate=lrn_rate)
  embed_vec_len = 32  # values per word -- 100-500 is typical

  model = K.models.Sequential()
  model.add(K.layers.Embedding(input_dim=129892,
    output_dim=embed_vec_len ))  # consider mask_zero=True
  model.add(K.layers.LSTM(units=100))  # 100 memory
  model.add(K.layers.Dropout(0.2))
  model.add(K.layers.Dense(units=1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy',
    optimizer=opt_adam, metrics=['accuracy'])

  # print(model.summary()) 

# -----------------------------------------------------------

  # 3. train model
  bat_size = 16
  max_epochs = 30

  print("\nbatch size = " + str(bat_size))
  print("loss func = binary_crossentropy ")
  print("optimizer = Adam ")
  print("learn rate = %0.4f " % lrn_rate)
  print("max_epochs = %d " % max_epochs)

  my_logger = MyLogger(n=5)
  print("\nStarting training ")
  h = model.fit(train_x, train_y, epochs=max_epochs,
    batch_size=bat_size, shuffle=True, verbose=0,
     callbacks=[my_logger]) 
  print("Training complete ")

  # 4. evaluate model
  # evaluate() returns [loss, accuracy]; avoid shadowing built-in eval()
  eval_results = model.evaluate(test_x, test_y, verbose=0)
  print("\nAccuracy on test data = %8.2f%%" % (eval_results[1]*100))

  # 5. save model
  print("\nSaving model to disk ")
  # mp = ".\\Models\\imdb_model.h5"
  # model.save(mp)  

  # 6. use model
  print("\nFor \"the movie was a great waste of my time\"")
  print("0 = negative, 1 = positive ")
  review = np.array([4, 20, 16, 6, 86, 425, 7, 58, 64],
    dtype=np.int64)  # cheating . . 
  padding = np.zeros(50-len(review), dtype=np.int64)
  review = np.concatenate([padding, review])
  review = review.reshape(1, -1)
  prediction = model.predict(review)

  print("raw output : ", end="")
  print("%0.4f " % prediction[0][0])  # predict() returns shape (1,1)

# -----------------------------------------------------------

  print("\nEnd Keras IMDB LSTM sentiment demo")

if __name__ == "__main__":
  main()