Fine-Tuning and Training a Text Summarization Model With the HuggingFace Libraries

Whew! Where do I begin? One Saturday morning, I decided to take a look at fine-tuning (training) a large language model for text summarization. Briefly, you feed the final model a fairly large block of text (say, one to ten pages), and the model produces a short summary of a specified length (say, 100 words). About 10 hours later, I was mentally exhausted.

To create the final fine-tuned model, you start with a base large language model that’s amenable to summarization (T5 and BART are common) and fine-tune the base LLM on custom training items that are specific to your problem domain. Note that it’s perfectly possible to use the base LLM directly for generic summarization scenarios.

I started with an example I found in the HuggingFace documentation. See https://huggingface.co/docs/transformers/en/tasks/summarization. HuggingFace has a very large library of functions for LLM tasks, many base models, and thousands of fine-tuned pretrained models.

The documentation example generated about two dozen errors, but for an extremely complex task using an extremely complex library that is under daily updates, this was to be expected. I was able to resolve some errors very quickly, but some of the errors required a deep dive and lots of time. As usual, there were many Python package dependency errors, and fixing one of them often introduced new error(s).
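
When chasing package dependency errors like these, it can help to record exactly which versions are installed before and after each fix. The following is a minimal sketch that uses only the Python standard library; the helper name installed_versions and the list of package names are assumptions chosen for illustration.

```python
from importlib import metadata

def installed_versions(packages):
    # Hypothetical helper: map each package name to its installed
    # version string, or None if the package is not installed.
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# package names here are assumptions based on the demo's imports
wanted = ["torch", "transformers", "datasets", "evaluate",
          "rouge_score"]
for name, ver in installed_versions(wanted).items():
    print(name, ver if ver is not None else "(not installed)")
```

Saving this output alongside the program makes it easier to tell which upgrade introduced a new error.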

The documentation example was in the form of a Jupyter notebook. My goal was to convert the documentation example to a normal Python program. This approach forces me to have a solid understanding of exactly what’s going on.
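
As a starting point for that kind of conversion, the code cells of a notebook can be pulled into one script before restructuring it by hand. This is a rough sketch, assuming the common notebook JSON layout (a top-level "cells" list whose entries have "cell_type" and "source" fields); the function name notebook_code is hypothetical, and notebook-only lines such as "!pip install" would still need to be removed manually.

```python
import json

def notebook_code(nb):
    # Hypothetical helper: join the source of every code cell in
    # a notebook (already parsed from JSON) into one script.
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a list of lines
            source = "".join(source)
        chunks.append(source.rstrip())
    return "\n\n".join(chunks) + "\n"

# tiny in-memory notebook used only for illustration
demo_nb = json.loads("""
{"cells": [
  {"cell_type": "markdown", "source": ["# Title"]},
  {"cell_type": "code", "source": ["x = 1\\n", "y = 2"]},
  {"cell_type": "code", "source": "print(x + y)"}
]}
""")
print(notebook_code(demo_nb))
```

For a real file, the dictionary would come from json.load() on the opened .ipynb file.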

The demo starts with the small T5 model (the t5-small checkpoint) as the base large language model and fine-tunes it using California Bill Summary training data. Each of the thousands of data items has a title (not used in fine-tuning), the text, and the summary. I used a tiny subset by extracting 10 of the test items, and then splitting those 10 items into an 8-item training set and a 2-item test set. The full dataset would take many hours, possibly days, to train. Here’s how I loaded the data:

from datasets import load_dataset
. . . 

def main():
  print("Begin summarization training demo ")
  
  # 1. load training data
  print("Loading tiny Calif Bill Summary subset ")
  billsum = load_dataset("billsum", split="ca_test[0:10]")
  billsum = billsum.train_test_split(test_size=0.2)
  print("Done ")

  print("\nFirst 100 chars of first train item summary: ")
  print(billsum["train"][0]["summary"][0:100])
  print(". . .")
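
To make the 8-item / 2-item split concrete, here is a plain-Python sketch of a shuffled split with test_size=0.2. It is only an illustration of the idea, not the datasets library's implementation; the function name split_items, the seed argument, and the rounding-up rule are assumptions.

```python
import math
import random

def split_items(items, test_size=0.2, seed=0):
    # Hypothetical helper: shuffle a copy of the items, then hold
    # out a test_size fraction (rounded up) as the test set.
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

parts = split_items(range(10), test_size=0.2)
print(len(parts["train"]), len(parts["test"]))
```

Because the split is shuffled, which items land in the test set depends on the seed, so fixing a seed is what makes a run repeatable.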

Getting my version of the demo running took many hours. I was stuck for a long time getting the max_new_tokens and max_length parameters to take effect during training evaluation. In the end, I made a copy of the existing generation configuration and changed the parameter values using the undocumented update() method:

  model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint) # base
  gen_config = copy.deepcopy(model.generation_config)
  gen_config.update(max_new_tokens=1024, max_length=1024)
. . .
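
The two settings overlap, which is part of what makes them confusing. The transformers generation documentation describes max_new_tokens as counting only newly generated tokens, and says the effect of max_length is overridden by max_new_tokens when both are set. The sketch below restates that rule in plain Python. The helper effective_max_length is hypothetical and is not library code, and the decoder prompt length of 1 (a single start token for an encoder-decoder model such as T5) is an assumption.

```python
from typing import Optional

def effective_max_length(prompt_len: int, max_length: int,
                         max_new_tokens: Optional[int] = None) -> int:
    # Hypothetical helper, not part of transformers: returns the
    # total sequence length limit implied by the two settings.
    if max_new_tokens is not None:
        # max_new_tokens wins and is counted on top of the prompt
        return prompt_len + max_new_tokens
    return max_length

# both set, as in the demo: max_new_tokens takes precedence
print(effective_max_length(1, max_length=1024, max_new_tokens=1024))
# only max_length set
print(effective_max_length(1, max_length=1024))
```

Under this reading, setting both values to 1024 is redundant but harmless, since only max_new_tokens ends up determining the limit.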

It would take far too long to describe all the steps I went through to fine-tune the summarization model. Instead, I’ll just say that examining the program code should give you a good idea of the process. Note that I stopped my exploration after creating the summarization model — I didn’t use it to summarize some new text. I was exhausted at this point, and using a model is another whole set of ideas.

What a fascinating exploration!

As a young man, two of my favorite Saturday morning TV cartoons were those featuring Yosemite Sam (vs. Bugs Bunny) and Wile E. Coyote (vs. Road Runner). For me, these cartoons hold up very well and I still find them entertaining. The language models of the two cartoons were nearly opposite.

Left: Yosemite Sam was created by animator Isadore “Friz” Freleng. Yosemite’s main adversary is Bugs Bunny. Bugs always manages to come out on top by using clever language to trick Sam.

Right: Wile E. Coyote was created by animator Chuck Jones. Coyote always tried overly-complex ways to catch the super fast Road Runner. For example, one time he fed Road Runner bird seed mixed with a few iron pellets and then used a powerful magnet and roller skates. Like all his tricks, this one failed spectacularly (and painfully) for Coyote. In this case, Coyote didn’t take into account the fact that a passing railroad train has more magnetic attraction than a few iron pellets. These cartoons have no spoken dialog at all. I met Chuck Jones when I was an undergraduate student at UC Irvine. I was working at a game store in Newport Beach and he came in to buy a backgammon set. Jones was very polite and generous with his time when I asked him about his career.

Demo code. The library dependencies are extremely complex, and if you decide to try to run this program, you’ll need to spend significant time configuring your Python/PyTorch/HuggingFace programming environment.

# hf_summarization_training_billsum.py
# Anaconda 2023.09-0  Python 3.11.5  PyTorch 2.1.2+cpu
# transformers 4.32.3

# requires HF account
# get token: https://huggingface.co/settings/tokens 
# token: hf_AEohXg(chars removed)gmdkVgC
# install token:
# python -c "from huggingface_hub.hf_api import HfFolder;\
# HfFolder.save_token('token goes here')"

# had to pip install transformers --upgrade
# had to pip install -U datasets 
# had to pip install rouge_score

import numpy as np
import copy
import evaluate

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer

# -----------------------------------------------------------
# global scope objects:
# -----------------------------------------------------------

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
rouge = evaluate.load("rouge")

# -----------------------------------------------------------

def preprocess_function(examples):
  inputs = ["summarize: " + doc
    for doc in examples["text"]]
  model_inputs = tokenizer(inputs, max_length=1024, \
    truncation=True)
  labels = tokenizer(text_target=examples["summary"], \
    max_length=128, truncation=True)
  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

# -----------------------------------------------------------

def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  decoded_preds = tokenizer.batch_decode(predictions, \
    skip_special_tokens=True)
  labels = np.where(labels != -100, labels, \
    tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, \
    skip_special_tokens=True)
  result = rouge.compute(predictions=decoded_preds, \
    references=decoded_labels, use_stemmer=True)
  prediction_lens = [np.count_nonzero(pred != \
    tokenizer.pad_token_id) for pred in predictions]
  result["gen_len"] = np.mean(prediction_lens)
  return {k: round(v, 4) for k, v in result.items()}

# ------------------------------------------------------------

def main():
  print("\nBegin summarization training demo ")
  
  # 1. load training data
  print("\nLoading tiny Calif Bill Summary subset ")
  billsum = load_dataset("billsum", split="ca_test[0:10]")
  billsum = billsum.train_test_split(test_size=0.2)
  print("Done ")

  print("\nFirst 100 chars of first train item summary: ")
  # print(billsum["train"][0]) shows 'text' and 'summary'
  print(billsum["train"][0]["summary"][0:100])
  print(". . .")

  # 2. tokenize the fine-tuning data
  print("\nTokenizing the training data ")
  tokenized_billsum = billsum.map(preprocess_function, \
    batched=True)
  print("Done ")

  # 3. prepare training
  print("\nPreparing training collator and parameters ")
  # load the base LLM first so the collator gets the model
  # object rather than the checkpoint name string
  model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
  data_collator = DataCollatorForSeq2Seq(tokenizer=\
    tokenizer, model=model)

  # 4. fine-tune train the base LLM

  gen_config = copy.deepcopy(model.generation_config)
  gen_config.update(max_new_tokens=1024, max_length=1024)

  training_args = Seq2SeqTrainingArguments(
    output_dir="custom_billsum_model",
    # eval_strategy needs a recent transformers release;
    # older releases call it evaluation_strategy
    eval_strategy="epoch",
    learning_rate=2.0e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    # fp16=True,  # only for CUDA
    push_to_hub=True,
    generation_config=gen_config,
  )

  trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
  )
  print("Done ")

  print("\nStart training \n")
  trainer.train()
  print("\nDone ")

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
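
The compute_metrics() function in the demo reports ROUGE scores, which measure word overlap between a generated summary and the reference summary. To give a feel for what is being measured, here is a simplified sketch of a ROUGE-1 style F1 score. It is an assumption-laden illustration: it splits on whitespace and does no stemming, so its numbers will not match those from the rouge_score package, and the function name rouge1_f1 is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    # Simplified unigram-overlap F1 (illustration only).
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # count each shared word at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the bill amends the law",
                  "the bill repeals a law")
print(round(score, 4))
```

Here three of the five words in each sentence overlap, so precision and recall are both 0.6.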