Fine-Tuning a Hugging Face DistilBERT Model for IMDB Sentiment Analysis

Over the past few weeks I’ve been walking through some of the examples in the Hugging Face (HF) code library. HF provides a set of APIs over Transformer Architecture (TA) models for natural language processing (NLP). Using HF is not simple but it’s much easier than implementing a TA system from scratch, which can take weeks of effort.

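My recent HF adventure dissected an example that shows how to fine-tune an existing DistilBERT TA model to work on the well-known IMDB movie review sentiment analysis problem. The goal is to create a model that accepts the text of a movie review and predicts whether the review is negative or positive. Conceptually, the output is a value less than 0.5 for a negative review or a value greater than 0.5 for a positive review. Strictly speaking, the DistilBERT classification model emits two raw scores (logits), one per class, and the larger one determines the prediction.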
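To make the 0.5 threshold concrete, here is a minimal sketch in plain Python. It assumes the model returns two logits in the order [negative, positive], which matches the demo's labels (0 = negative, 1 = positive). The helper names positive_probability() and sentiment() are hypothetical, and the logits in the usage line are made up rather than real model output.

```python
import math

def positive_probability(logits):
  # logits: [negative_score, positive_score] -- assumed order,
  # matching the demo's labels (0 = negative, 1 = positive)
  m = max(logits)  # subtract the max for numerical stability
  exps = [math.exp(z - m) for z in logits]
  return exps[1] / sum(exps)  # softmax value of the positive class

def sentiment(logits, threshold=0.5):
  p = positive_probability(logits)
  return "positive" if p > threshold else "negative"

if __name__ == "__main__":
  print(sentiment([-1.2, 2.3]))  # made-up logits, not model output
```

With two classes, comparing the softmax probability to 0.5 gives the same answer as simply picking the larger logit, so the probability is mostly useful when you want a confidence value.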

My primary takeaway was that TA systems are very, very complex. A thorough examination of the documentation example would take many days, even for someone like me who is experienced with PyTorch.

The demo program has seven major steps:

1. load raw IMDB text into memory
2. create an HF DistilBERT tokenizer
3. tokenize the raw IMDB text
4. convert tokenized IMDB text to PyTorch Datasets
5. load pretrained DistilBERT model
6. train / fine-tune model using IMDB data
7. save fine-tuned model

Each step is complicated. Experimenting with the demo was difficult because the IMDB dataset is large, so everything, especially loading and training, takes a long time. Therefore, I took the original IMDB data, which has 50,000 reviews (25,000 train and 25,000 test, with half of each positive and half negative), and chopped it down to just 200 train (100 positive, 100 negative) and 200 test (100 positive, 100 negative).
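One way to build such a reduced dataset is sketched below. This is an assumption about the procedure, not a record of how the small set was actually made: it copies the first 100 files (by sorted file name) from each of the pos and neg directories of a split into a parallel directory tree. The function name make_small_split() is hypothetical.

```python
from pathlib import Path
import shutil

def make_small_split(src_split, dst_split, n_per_class=100):
  # copy the first n_per_class reviews (by sorted file name)
  # from each of pos/ and neg/ into a parallel directory tree
  src_split, dst_split = Path(src_split), Path(dst_split)
  counts = {}
  for label_dir in ["pos", "neg"]:
    dst_dir = dst_split / label_dir
    dst_dir.mkdir(parents=True, exist_ok=True)
    files = sorted((src_split / label_dir).glob("*.txt"))
    for f in files[:n_per_class]:
      shutil.copy(f, dst_dir / f.name)
    counts[label_dir] = min(n_per_class, len(files))
  return counts  # number of files copied per class
```

A call such as make_small_split("Data/aclImdb/train", "DataSmall/aclImdb/train") would be repeated for the test split. The source path here is a placeholder for wherever the archive was unzipped.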

I now have a pretty good grasp of the main ideas of fine-tuning an HF TA model. Transformer architecture models aren’t science fiction but understanding them does take effort.

Here are three of my favorite science fiction movies of the 1950s that feature flying saucer style spaceships. My reviews of all three movies would have “positive” sentiment. Left: “Forbidden Planet” (1956). Center: “Invaders from Mars” (1953). Right: “Earth vs. the Flying Saucers” (1956).

Demo code:

# imdb_hf.py
# fine-tune an HF pretrained model for IMDB sentiment analysis
# zipped raw data at:
# https://ai.stanford.edu/~amaas/data/sentiment/

from pathlib import Path
from transformers import DistilBertTokenizerFast
import torch
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification
from torch.optim import AdamW  # transformers' own AdamW is deprecated
from transformers import logging  # to suppress warnings

device = torch.device('cpu')

class IMDbDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val \
      in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

def read_imdb_split(split_dir):
  split_dir = Path(split_dir)
  texts = []
  labels = []
  for label_dir in ["pos", "neg"]:
    for text_file in (split_dir/label_dir).iterdir():
      texts.append(text_file.read_text(encoding='utf-8'))
      # compare strings with ==, not the identity operator "is"
      labels.append(0 if label_dir == "neg" else 1)
  return texts, labels

def main():
  print("\nBegin IMDB sentiment using HugFace library ")
  logging.set_verbosity_error()  # suppress wordy warnings

  print("\nLoading data from file into memory ")
  train_texts, train_labels = \
    read_imdb_split(".\\DataSmall\\aclImdb\\train")
  test_texts, test_labels = \
    read_imdb_split(".\\DataSmall\\aclImdb\\test")
  print("Done ")

  print("\nTokenizing train, test text ")
  tokenizer = \
    DistilBertTokenizerFast.from_pretrained(\
    'distilbert-base-uncased')
  train_encodings = \
    tokenizer(train_texts, truncation=True, padding=True)
  test_encodings = \
    tokenizer(test_texts, truncation=True, padding=True)
  print("Done ")

  print("\nLoading tokenized text into Pytorch Datasets ")
  train_dataset = IMDbDataset(train_encodings, train_labels)
  test_dataset = IMDbDataset(test_encodings, test_labels)
  print("Done ")

  print("\nLoading pre-trained DistilBERT model ")
  model = \
    DistilBertForSequenceClassification.from_pretrained( \
    'distilbert-base-uncased')
  model.to(device)
  model.train()  # set mode
  print("Done ")

  print("\nLoading Dataset bat_size = 10 ")
  train_loader = DataLoader(train_dataset, \
    batch_size=10, shuffle=True)
  print("Done ")

  print("\nFine-tuning the model ")
  optim = AdamW(model.parameters(), lr=5e-5)
  for epoch in range(3):
    epoch_loss = 0.0
    for (b_ix, batch) in enumerate(train_loader):
      optim.zero_grad()
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)
      outputs = model(input_ids, \
        attention_mask=attention_mask, labels=labels)
      loss = outputs[0]
      epoch_loss += loss.item()  # accumulate batch loss
      loss.backward()
      optim.step()
      if b_ix % 5 == 0:  # 200 train items, 20 batches of 10
        print(" batch = %5d curr batch loss = %0.4f " % \
          (b_ix, loss.item()))
    # report once per epoch, after all batches are processed
    print("end epoch = %4d  epoch loss = %0.4f " % \
      (epoch, epoch_loss))

  print("Training done ")

  print("\nSaving tuned model state ")
  model.eval()
  torch.save(model.state_dict(), \
    ".\\Models\\imdb_state.pt")  # just state
  print("Done ")
  
  print("\nEnd demo ")

if __name__ == "__main__":
  main()
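
The demo builds test_dataset but never uses it. A natural follow-up is to compute classification accuracy on the test items. The sketch below is one possible way to do that, not part of the original demo. The function names accuracy() and load_tuned_model() are hypothetical. It assumes each batch has the same 'input_ids', 'attention_mask' and 'labels' keys produced by the IMDbDataset class above, and that the first element of the model output is the logits when no labels are passed.

```python
import torch
from torch.utils.data import DataLoader

def accuracy(model, dataset, device, bat_size=10):
  # fraction of items whose arg-max logit matches the label
  loader = DataLoader(dataset, batch_size=bat_size, shuffle=False)
  model.eval()  # turn off dropout
  n_correct = 0
  with torch.no_grad():  # no gradients needed for evaluation
    for batch in loader:
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)
      outputs = model(input_ids, attention_mask=attention_mask)
      logits = outputs[0]  # assumed: logits come first without labels
      preds = torch.argmax(logits, dim=1)
      n_correct += (preds == labels).sum().item()
  return n_correct / len(dataset)

def load_tuned_model(state_path, device):
  # rebuild the architecture, then overwrite weights with saved state
  from transformers import DistilBertForSequenceClassification
  model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased')
  model.load_state_dict(torch.load(state_path, map_location=device))
  return model.to(device)
```

Inside main(), after training, a call like accuracy(model, test_dataset, device) would return a value between 0.0 and 1.0. With only 200 training items, the result should be treated as a sanity check rather than a benchmark.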