IMDB Movie Review Sentiment Analysis Using an LSTM with Keras

One of the major challenges of machine learning with TensorFlow, Keras, and PyTorch is that these libraries are under continuous development and so breaking changes are introduced every few months. I try to revisit core example programs as often as I can so I can catch breaking changes.

I couldn’t sleep one night so my two dogs and I refactored an existing Keras IMDB movie review sentiment analysis program that uses an LSTM network. The IMDB movie review dataset consists of 50,000 movie reviews such as “This movie had great acting and a clever plot”, or “The film was a waste of my time”. There are 25,000 reviews for training (12,500 positive and 12,500 negative) and 25,000 reviews for testing.
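In the Keras version of the dataset, each review is stored as a list of integer word IDs rather than as text. The sketch below shows roughly how that encoding works, assuming the default Keras settings (0 = padding, 1 = start of review, 2 = out-of-vocabulary, real words offset by 3). The tiny word index and the encode_review() helper are hypothetical and exist only for illustration; the real index comes from K.datasets.imdb.get_word_index().

```python
# Sketch of the integer encoding used by the Keras IMDB data.
# toy_index and encode_review() are hypothetical, for illustration only.

PAD, START, OOV = 0, 1, 2   # reserved IDs (Keras defaults)
INDEX_FROM = 3              # real words start after the reserved IDs

def encode_review(text, word_index, max_words):
  # map each word to (rank + INDEX_FROM); unknown or rare words -> OOV
  ids = [START]
  for word in text.lower().split():
    rank = word_index.get(word)
    if rank is None or rank + INDEX_FROM >= max_words:
      ids.append(OOV)
    else:
      ids.append(rank + INDEX_FROM)
  return ids

# rank 1 = most frequent word, rank 2 = next most frequent, and so on
toy_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "waste": 5}
encoded = encode_review("The movie was a great waste", toy_index, max_words=8)
print(encoded)  # [1, 4, 5, 6, 2, 7, 2]
```

Here "a" is not in the toy index and "waste" falls outside the max_words limit, so both map to the out-of-vocabulary ID 2.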

An LSTM (long short-term memory) network works quite well for text that isn’t too long (say, one to ten sentences). For longer text problems, Transformer architecture networks usually work better.
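The demo code later in this post keeps just 80 word IDs per review, using 'pre' truncating and 'pre' padding. A minimal pure-Python sketch of that pad-and-chop idea is shown here. The pad_or_chop() function is a hypothetical stand-in written for illustration, not the Keras pad_sequences() implementation, and it assumes maxlen is a positive integer.

```python
# Sketch of 'pre' truncating and 'pre' padding on one review.
# pad_or_chop() is a hypothetical helper, not a Keras function.

def pad_or_chop(ids, maxlen, pad_value=0):
  # 'pre' truncating: drop IDs from the front, keep the last maxlen
  ids = ids[-maxlen:]
  # 'pre' padding: add pad_value at the front until length is maxlen
  return [pad_value] * (maxlen - len(ids)) + ids

short_review = [1, 14, 22, 16]
long_review = [1, 14, 22, 16, 43, 530, 973, 1622]
print(pad_or_chop(short_review, maxlen=6))  # [0, 0, 1, 14, 22, 16]
print(pad_or_chop(long_review, maxlen=6))   # [22, 16, 43, 530, 973, 1622]
```

With these settings a long review keeps its final words, and a short review gets zeros at the front so that the real words sit next to the end of the sequence.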

Anyway, it took me about two hours to refactor my old program, which was based on Keras 2.2.4 running over TensorFlow 1.11.0, to TensorFlow 2.6, which includes Keras.

I usually use Keras for relatively simple problems and PyTorch for complex problems. I avoid TensorFlow and use it only when working with an existing system.

Here are three movies that got terrible reviews and lost tons of money, but they’re films I like. Left: “The Chronicles of Riddick” (2004) is a wildly creative science fiction story. Center: “The Man from U.N.C.L.E.” (2015) is a spy story set during the Cold War of the 1960s. Right: “The Great Wall” (2016) is a wacky combination of martial arts and science fiction set in the 11th century A.D.

Demo code.

# imdb_lstm_tfk.py
# LSTM for sentiment analysis on the IMDB dataset
# Anaconda3-2020.02  (Python 3.7.6)
# TensorFlow 2.6.0 (includes KerasTF 2.6.0)
# Windows 10

# ===============================================================

import numpy as np
# import keras as K
import tensorflow as tf
from tensorflow import keras as K
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

# ===============================================================

class MyLogger(K.callbacks.Callback):
  def __init__(self, n):
    super().__init__()  # let the base Callback set itself up
    self.n = n   # print loss & acc every n epochs

  def on_epoch_end(self, epoch, logs=None):
    logs = logs or {}  # avoid a mutable default argument
    if epoch % self.n == 0:
      curr_loss = logs.get('loss')
      curr_acc = logs.get('accuracy') * 100
      print("epoch = %4d  loss = %0.6f  acc = %0.2f%%" % \
        (epoch, curr_loss, curr_acc))

# ===============================================================

def main():
  # 0. get started
  print("\nIMDB sentiment analysis using Keras/TensorFlow ")
  np.random.seed(1)
  tf.random.set_seed(1)

  # 1. load data
  max_words = 20000
  print("Loading data, max unique words = %d words\n" % max_words)
  (train_x, train_y), (test_x, test_y) = \
    K.datasets.imdb.load_data(seed=1, num_words=max_words)

  max_review_len = 80
  # pad and chop!
  train_x = K.preprocessing.sequence.pad_sequences(train_x,
    truncating='pre', padding='pre', maxlen=max_review_len)  
  test_x = K.preprocessing.sequence.pad_sequences(test_x,
    truncating='pre', padding='pre', maxlen=max_review_len)
  
  # 2. define model
  print("Creating LSTM model")
  e_init = K.initializers.RandomUniform(-0.01, 0.01, seed=1)
  init = K.initializers.glorot_uniform(seed=1)
  simple_adam = K.optimizers.Adam()
  embed_vec_len = 32  # values per word -- 100-500 is typical

  model = K.models.Sequential()
  model.add(K.layers.Embedding(input_dim=max_words,
    output_dim=embed_vec_len, embeddings_initializer=e_init,
    mask_zero=True))
  model.add(K.layers.LSTM(units=100, kernel_initializer=init,
    dropout=0.2, recurrent_dropout=0.2))  # 100 memory cells
  model.add(K.layers.Dense(units=1, kernel_initializer=init,
    activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer=simple_adam,
    metrics=['accuracy'])

  model.summary()  # prints directly; wrapping in print() adds "None"

# ===============================================================

  # 3. train model
  bat_size = 32
  max_epochs = 5
  my_logger = MyLogger(n=1)
  print("\nStarting training ")
  h = model.fit(train_x, train_y, epochs=max_epochs,
    batch_size=bat_size, shuffle=True, verbose=0,
    callbacks=[my_logger])
  print("Training complete \n")

  # 4. evaluate model
  loss_acc = model.evaluate(test_x, test_y, verbose=0)
  print("Test data: loss = %0.6f  accuracy = %0.2f%% " % \
    (loss_acc[0], loss_acc[1]*100))

  # 5. save model
  print("Saving model to disk \n")
  os.makedirs(".\\Models", exist_ok=True)  # save fails if dir is missing
  mp = ".\\Models\\imdb_model.h5"
  model.save(mp)

  # 6. use model
  print("New review: \'the movie was a great waste of my time\'")
  d = K.datasets.imdb.get_word_index()
  review = "the movie was a great waste of my time"
  words = review.split()
  review = []
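  for word in words:
    # unknown words, and words outside the max_words vocabulary,
    # map to 2 (out-of-vocabulary) so the Embedding index stays valid
    if word not in d or d[word]+3 >= max_words:
      review.append(2)
    else:
      review.append(d[word]+3)  # offset by 3: 0, 1, 2 are reserved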
  review = K.preprocessing.sequence.pad_sequences([review],
    truncating='pre',  padding='pre', maxlen=max_review_len)

  prediction = model.predict(review)
  print("Prediction (0 = negative, 1 = positive) = ", end="")
  print("%0.4f" % prediction[0][0])

# ===============================================================

if __name__ == "__main__":
  main()
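
The model.summary() call in the demo reports the number of trainable weights in each layer. Those counts can be checked by hand. The sketch below does the arithmetic for the demo's sizes (20,000 words, 32 values per word, 100 LSTM units, 1 output node), assuming the standard LSTM layout of four gate blocks that each have an input kernel, a recurrent kernel, and a bias vector. The three helper functions are hypothetical names used only for this illustration.

```python
# Sketch: parameter counts for the demo's three layers, worked by hand.
# The helper function names are hypothetical, for illustration only.

def embedding_param_count(vocab_size, embed_dim):
  # one vector of embed_dim values per word ID
  return vocab_size * embed_dim

def lstm_param_count(input_dim, units):
  # four gate blocks (input, forget, cell candidate, output), each with
  # an input kernel, a recurrent kernel, and a bias vector
  return 4 * (input_dim * units + units * units + units)

def dense_param_count(input_dim, units):
  # weights plus one bias per output node
  return input_dim * units + units

embed = embedding_param_count(20000, 32)   # 640,000
lstm = lstm_param_count(32, 100)           # 53,200
dense = dense_param_count(100, 1)          # 101
total = embed + lstm + dense               # 693,301
print(embed, lstm, dense, total)
```

Most of the weights are in the embedding layer, so the max_words and embed_vec_len settings have the biggest effect on model size.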