Over the past few weeks I’ve been walking through some of the examples in the Hugging Face (HF) code library. HF provides a set of APIs for working with Transformer Architecture (TA) models for natural language processing (NLP). Using HF is not simple, but it’s much easier than implementing a TA system from scratch, which can take weeks of effort.
My recent HF adventure dissected an example that shows how to fine-tune an existing DistilBERT TA model for the well-known IMDB movie review sentiment analysis problem. The goal is to create a model that accepts the text of a movie review and returns a value less than 0.5 for a negative review or a value greater than 0.5 for a positive review.
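Strictly speaking, a two-class DistilBERT classification head emits two raw logits per review rather than a single score. Applying softmax and reading off the positive-class entry gives the kind of 0-to-1 value described above. Here is a minimal sketch in plain Python, assuming the two logits have already been pulled out of the model output as floats in (negative, positive) order, which matches the demo's labels of 0 = negative and 1 = positive. The function names `positive_probability` and `sentiment` are illustrative and not part of the HF library.

```python
import math

def positive_probability(logits):
  """Softmax over [negative_logit, positive_logit]; returns P(positive)."""
  neg, pos = logits
  # subtract the larger logit before exponentiating for numerical stability
  m = max(neg, pos)
  e_neg = math.exp(neg - m)
  e_pos = math.exp(pos - m)
  return e_pos / (e_neg + e_pos)

def sentiment(logits):
  """Map the positive-class probability to a label using a 0.5 cutoff."""
  return "positive" if positive_probability(logits) > 0.5 else "negative"
```

Equal logits give exactly 0.5, and the further the positive logit sits above the negative one, the closer the result gets to 1.0.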
My primary takeaway was that TA systems are very, very complex. A thorough examination of the documentation example would take many days, even for someone like me who is experienced with PyTorch.
The demo program has seven major steps:
1. load raw IMDB text into memory
2. create an HF DistilBERT tokenizer
3. tokenize the raw IMDB text
4. convert raw IMDB text to PyTorch Datasets
5. load pretrained DistilBERT model
6. train / fine-tune model using IMDB data
7. save fine-tuned model
Each step is complicated. Experimenting with the demo was difficult because the IMDB dataset is large, so everything, especially loading and training, takes a long time. Therefore, I took the original IMDB data, which has 50,000 reviews (25,000 train and 25,000 test, with half of each positive and half negative), and chopped it down to just 200 train (100 positive, 100 negative) and 200 test (100 positive, 100 negative).
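One way to build a reduced dataset like this is sketched below. It assumes the standard aclImdb layout, where each split directory holds a pos and a neg subdirectory of .txt files. The function name `make_small_split` and the choice of taking the first 100 files in sorted order are illustrative assumptions, not necessarily how the subset for this demo was made.

```python
import shutil
from pathlib import Path

def make_small_split(src_split, dst_split, n_per_class=100):
  """Copy the first n_per_class reviews of each label into dst_split."""
  src_split, dst_split = Path(src_split), Path(dst_split)
  counts = {}
  for label in ("pos", "neg"):
    dst_dir = dst_split / label
    dst_dir.mkdir(parents=True, exist_ok=True)
    # sort so the chosen subset is the same on every run
    files = sorted((src_split / label).glob("*.txt"))[:n_per_class]
    for f in files:
      shutil.copy(f, dst_dir / f.name)
    counts[label] = len(files)
  return counts  # number of files copied per label

# Example call (paths are placeholders for wherever the data was unzipped):
# make_small_split("Data/aclImdb/train", "DataSmall/aclImdb/train")
# make_small_split("Data/aclImdb/test", "DataSmall/aclImdb/test")
```

Taking the first files in sorted order is the simplest choice. A random sample with a fixed seed would work just as well if a less arbitrary subset is wanted.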
I now have a pretty good grasp of the main ideas of fine-tuning an HF TA model. Transformer architecture models aren’t science fiction, but understanding them does take effort.

Here are three of my favorite science fiction movies of the 1950s that feature flying saucer style spaceships. My reviews of all three movies would have “positive” sentiment. Left: “Forbidden Planet” (1956). Center: “Invaders from Mars” (1953). Right: “Earth vs. the Flying Saucers” (1956).
Demo code:
# imdb_hf.py
# fine-tune an HF pretrained model for IMDB sentiment analysis
# zipped raw data at:
# https://ai.stanford.edu/~amaas/data/sentiment/

from pathlib import Path
from transformers import DistilBertTokenizerFast
import torch
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification
from transformers import logging  # to suppress warnings

device = torch.device('cpu')

class IMDbDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    # one item = input_ids, attention_mask, labels as tensors
    item = {key: torch.tensor(val[idx])
      for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

def read_imdb_split(split_dir):
  # read every review under split_dir/pos and split_dir/neg
  split_dir = Path(split_dir)
  texts = []
  labels = []
  for label_dir in ["pos", "neg"]:
    for text_file in (split_dir/label_dir).iterdir():
      texts.append(text_file.read_text(encoding='utf-8'))
      # compare strings with ==, not "is"
      labels.append(0 if label_dir == "neg" else 1)
  return texts, labels

def main():
  print("\nBegin IMDB sentiment using HugFace library ")
  logging.set_verbosity_error()  # suppress wordy warnings

  print("\nLoading data from file into memory ")
  train_texts, train_labels = \
    read_imdb_split(".\\DataSmall\\aclImdb\\train")
  test_texts, test_labels = \
    read_imdb_split(".\\DataSmall\\aclImdb\\test")
  print("Done ")

  print("\nTokenizing train and test text ")
  tokenizer = DistilBertTokenizerFast.from_pretrained(
    'distilbert-base-uncased')
  train_encodings = \
    tokenizer(train_texts, truncation=True, padding=True)
  test_encodings = \
    tokenizer(test_texts, truncation=True, padding=True)
  print("Done ")

  print("\nLoading tokenized text into Pytorch Datasets ")
  train_dataset = IMDbDataset(train_encodings, train_labels)
  test_dataset = IMDbDataset(test_encodings, test_labels)
  print("Done ")

  print("\nLoading pre-trained DistilBERT model ")
  model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased')
  model.to(device)
  model.train()  # set mode
  print("Done ")

  print("\nLoading Dataset bat_size = 10 ")
  train_loader = DataLoader(train_dataset,
    batch_size=10, shuffle=True)
  print("Done ")

  print("\nFine-tuning the model ")
  # torch.optim.AdamW replaces the deprecated transformers.AdamW
  optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
  for epoch in range(3):
    epoch_loss = 0.0
    for (b_ix, batch) in enumerate(train_loader):
      optim.zero_grad()
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)
      outputs = model(input_ids,
        attention_mask=attention_mask, labels=labels)
      loss = outputs[0]  # loss comes first when labels are supplied
      epoch_loss += loss.item()  # accumulate batch loss
      loss.backward()
      optim.step()
      if b_ix % 5 == 0:  # 200 train items, 20 batches of 10
        print(" batch = %5d curr batch loss = %0.4f " % \
          (b_ix, loss.item()))
    print("end epoch = %4d epoch loss = %0.4f " % \
      (epoch, epoch_loss))
  print("Training done ")

  print("\nSaving tuned model state ")
  model.eval()
  torch.save(model.state_dict(),
    ".\\Models\\imdb_state.pt")  # just state
  print("Done ")

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
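The demo builds a test Dataset but stops before using it. A natural next step is to score the fine-tuned model on the held-out reviews. The sketch below is plain Python and assumes the per-review logits have already been collected as lists of two floats, for example by calling `outputs.logits.tolist()` inside a `torch.no_grad()` loop over the test data (that loop is not shown). The name `accuracy_from_logits` is a hypothetical helper, not part of the HF library.

```python
def accuracy_from_logits(logit_rows, labels):
  """Fraction of reviews whose largest logit matches the true label."""
  if len(logit_rows) != len(labels):
    raise ValueError("need one label per row of logits")
  if not labels:
    raise ValueError("cannot compute accuracy on an empty set")
  n_correct = 0
  for row, label in zip(logit_rows, labels):
    # argmax: index 0 = negative, index 1 = positive
    predicted = max(range(len(row)), key=lambda j: row[j])
    if predicted == label:
      n_correct += 1
  return n_correct / len(labels)
```

With the 200-review test subset, each review accounts for 0.5 percent of the accuracy score, so small differences between runs should not be over-interpreted.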
