Fine-Tuning a HuggingFace English to Italian Language Translation System

The field of AI is moving so fast that it’s very difficult to stay up to speed. One weekend I decided to explore a language translation system using the HuggingFace library.

I started by looking at an English to French example in the HuggingFace documentation at huggingface.co/docs/transformers/en/tasks/translation. After a significant amount of time and pain, I got the HF demo to work. The HF demo uses the T5 language model as a base and then fine-tunes it on the English-French portion of the OPUS Books dataset, a set of aligned sentence pairs drawn from translated books.

To test my understanding, I set out to refactor the demo to create an English to Italian translation system using a custom set of data for fine-tuning. I eventually succeeded, but the process took several days of rather intense effort.

Explaining the demo program in full would take far too long. Briefly, my translator started with the Helsinki-NLP/opus-mt-tc-big-en-it model that was trained to do generic English to Italian translation.

Next, I created a tiny 12-item set of custom data (the demo program uses only the first 10 items):

{ "id": 0, "translation": 
  {"en": "crazy", "it": "pazzesco"}}
{ "id": 1, "translation":
  {"en": "Excuse me", "it": "Scusa"}}
{ "id": 2, "translation":
  {"en": "Tell me", "it": "Dimmi"}}
{ "id": 3, "translation":
  {"en": "Good morning", "it": "Buongiorno"}}
{ "id": 4, "translation":
  {"en": "Goodbye", "it": "Arrivederci"}}
{ "id": 5, "translation": 
  {"en": "You're welcome", "it": "Prego"}}
{ "id": 6, "translation": 
  {"en": "Thank you", "it": "Grazie"}}
{ "id": 7, "translation": 
  {"en": "How much does it cost?", "it": "Quanto costa?"}}
{ "id": 8, "translation": 
  {"en": "Monday", "it": "Lunedi"}}
{ "id": 9, "translation": 
  {"en": "Friday", "it": "Venerdi"}}
{ "id": 10, "translation": 
  {"en": "One", "it": "Uno"}}
{ "id": 11, "translation": 
  {"en": "Two", "it": "Due"}}
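
The listing above wraps each item across two lines for readability. Here is a minimal sketch of building the data file, assuming it is stored with one JSON object per line (JSON Lines). The temporary directory is an assumption for the sketch, and the plain-Python shuffle only stands in for the train_test_split() call in the demo, so it will not pick the same test items.

```python
import json
import random
import tempfile
from pathlib import Path

# The 12 custom items, in the {"id", "translation"} shape shown above.
pairs = [
  ("crazy", "pazzesco"), ("Excuse me", "Scusa"), ("Tell me", "Dimmi"),
  ("Good morning", "Buongiorno"), ("Goodbye", "Arrivederci"),
  ("You're welcome", "Prego"), ("Thank you", "Grazie"),
  ("How much does it cost?", "Quanto costa?"), ("Monday", "Lunedi"),
  ("Friday", "Venerdi"), ("One", "Uno"), ("Two", "Due"),
]
records = [{"id": i, "translation": {"en": en, "it": it}}
  for i, (en, it) in enumerate(pairs)]

# Write one JSON object per line (assumed layout of custom_data.json).
data_path = Path(tempfile.mkdtemp()) / "custom_data.json"
with open(data_path, "w", encoding="utf-8") as f:
  for rec in records:
    f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read the file back to confirm every line parses.
with open(data_path, "r", encoding="utf-8") as f:
  loaded = [json.loads(line) for line in f if line.strip()]

# Mimic the demo: keep the first 10 items, hold out 20% for testing.
subset = loaded[0:10]
rng = random.Random(4)  # stand-in shuffle, not the datasets library's
rng.shuffle(subset)
n_test = int(len(subset) * 0.20)
test_items, train_items = subset[:n_test], subset[n_test:]
print(len(loaded), len(train_items), len(test_items))  # 12 8 2
```

The 12 / 8 / 2 counts match the "Generating train split: 12 examples" line and the DatasetDict sizes in the demo output.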

I fine-tuned the base model using the custom data, and then tested the translator model by feeding it a couple of sentences about the planet Venus. The complete demo output is:

Begin English to Italian fine-tuning demo

Loading custom data for fine-tuning
Generating train split: 12 examples
  [00:00, 437.04 examples/s]
DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 8
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 2
    })
})
Done

First training item:
{'id': 7, 'translation': {'en': 'How much does it cost?',
   'it': 'Quanto costa?'}}

Creating tokenizer for opus-mt-tc-big-en-it model
Done

Tokenizing custom data
Map: 100%|**********| 8/8 [00:00<00:00, 206.48 examples/s]
Map: 100%|**********| 2/2 [00:00<00:00, 82.27 examples/s]
Done

Setting up padding for training
Done

Setting up model evaluation metrics
Done

Preparing to train
Done

Starting training
{eval_loss: 4.2402567863464355, eval_bleu: 5.3411,
 eval_gen_len: 11.5, eval_runtime: 1.4639,
 eval_samples_per_second: 1.366, eval_steps_per_second:
 0.683, epoch: 1.0}
{eval_loss: 3.6332175731658936, eval_bleu: 5.3411,
 eval_gen_len: 11.5, eval_runtime: 1.4211,
 eval_samples_per_second: 1.407, eval_steps_per_second:
 0.704, epoch: 2.0}
{train_runtime: 18.892, train_samples_per_second:
 0.847, train_steps_per_second: 0.212, train_loss:
 4.56916618347168, epoch: 2.0}
Done

Using fine-tuned model

Source English:
Venus is the second planet from the Sun. It is a terrestrial
 planet and is the closest in mass and size to its orbital
 neighbor Earth.

Translation to Italian:
[{translation_text: Venere e il secondo pianeta del Sole.
 E un pianeta terrestre ed e il piu vicino in massa e
 dimensioni al suo vicino orbitale Terra.}]

End HF translation demo

There is a lot going on here and the demo is best understood by looking at the program code below.
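
A few of the throughput figures in the output can be reproduced by hand, which is a handy sanity check on how the training loop counts things. This sketch assumes the reported rates are simply samples (or steps) processed divided by runtime. That reading is my assumption, though it matches the printed numbers.

```python
import math

# Values taken from the demo output and the training arguments.
n_train, n_test = 8, 2
batch_size = 4
n_epochs = 2
train_runtime = 18.892            # seconds
eval_runtimes = [1.4639, 1.4211]  # seconds, epochs 1 and 2

# Training: steps per epoch, then totals over both epochs.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs
total_samples = n_train * n_epochs

train_samples_per_second = round(total_samples / train_runtime, 3)
train_steps_per_second = round(total_steps / train_runtime, 3)
print(total_steps, train_samples_per_second, train_steps_per_second)
# 4 0.847 0.212

# Evaluation: the 2 test items fit in a single batch.
eval_steps = math.ceil(n_test / batch_size)
eval_rates = [(round(n_test / t, 3), round(eval_steps / t, 3))
  for t in eval_runtimes]
print(eval_rates)
# [(1.366, 0.683), (1.407, 0.704)]
```

With 8 training items and a batch size of 4 there are only 2 steps per epoch, so the whole run makes just 4 weight updates. That may be why eval_bleu does not move between the two epochs.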

It was a fascinating exploration and I “imparato molto” (learned a lot).

Italy produced some interesting science fiction movies in the 1960s that were translated to English.

Left: “Mission Stardust” (1967) – Major Perry Rhodan leads a four-man mission to the Moon to seek radioactive material more powerful than uranium. On the Moon, they find a stranded Arkonide spaceship, captained by the beautiful Thora. Thora is kidnapped by henchmen of an Earth crime lord to get Arkonide technology. Perry rescues her and the good guys win in the end. My grade: C.

Center: “The Wild, Wild Planet” (1966) – This is the first of four films made in 1966 and 1967 that featured mostly the same actors, same sets, same costumes, same props, and similar titles, making them difficult to distinguish. [also “War of the Planets” aka “I Diafanoidi Vengono da Marte” aka “The Diaphanoids Come From Mars” (1966), “War Between the Planets” (1966), “Snow Devils” (1967)]. In this one, Commander Halstead, who is in charge of space station Gamma One, investigates missing scientists on Earth. The evil Dr. Nurmi is running experiments under the supervision of evil aliens from the planet Delphos. Good guys win. Pretty good special effects for the time. My grade: C+.

Right: “Hercules Against the Moon Men” (1964) – Evil aliens from the Moon land on Earth and conspire with the evil Queen of Samar. Hercules shows up and saves the day after defeating a metal-headed giant named Redolphis. Just bad enough to be OK. My grade: C.

Demo code.

# translation_en_it_demo.py
# example of fine-tuning language translation
# Anaconda 2023.09-0  Python 3.11.5  PyTorch 2.1.2+cpu
# transformers 4.32.3

# requires pip install transformers, datasets,
#  evaluate, sacrebleu, sentencepiece, sacremoses

import random
import copy
import numpy as np
import torch as T

print("\nBegin English to Italian fine-tuning demo ")

# make results reproducible
random.seed(4)
np.random.seed(4)
T.manual_seed(4)

import transformers
transformers.logging.set_verbosity_error()

print("\nLoading custom data for fine-tuning ")
from datasets import load_dataset
custom_ds = load_dataset("json", 
  data_files=".\\custom_data.json", split="train[0:10]")
custom_ds = custom_ds.train_test_split(test_size=0.20)
print(custom_ds)  # train 8 rows, test 2 rows
print("Done ")

print("\nFirst training item: ")
print(custom_ds["train"][0])

print("\nCreating tokenizer for opus-mt-tc-big-en-it model ")
from transformers import AutoTokenizer
# checkpoint = "google-t5/t5-small"  # limit 20 output
# checkpoint = "google-t5/t5-base" # no en-to-it
checkpoint = "Helsinki-NLP/opus-mt-tc-big-en-it"
# based on the MarianMT model
# Named after Marian Rejewski, a Polish mathematician
# and cryptologist who reconstructed the German military
# Enigma cipher machine in 1932

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print("Done ")

print("\nTokenizing custom data ")
source_lang = "en"
target_lang = "it"
# T5-style task prefix kept from the HF demo;
# MarianMT models do not require one
prefix = "translate English to Italian: "

def preprocess_function(examples):
  # prepend the prefix to each English source string
  inputs = [prefix + example[source_lang]
    for example in examples["translation"]]
  targets = [example[target_lang]
    for example in examples["translation"]]
  # tokenize sources and targets; targets become labels
  model_inputs = tokenizer(inputs, text_target=targets,
    max_length=1024, truncation=True)
  return model_inputs

tokenized_custom_ds = custom_ds.map(preprocess_function,
  batched=True)
print("Done ")

print("\nSetting up padding for training ")
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer,
  model=checkpoint)
print("Done ")

print("\nSetting up model evaluation metrics ")
import evaluate
metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
  preds = [pred.strip() for pred in preds]
  labels = [[label.strip()] for label in labels]
  return preds, labels

def compute_metrics(eval_preds):
  preds, labels = eval_preds
  if isinstance(preds, tuple):
    preds = preds[0]
  decoded_preds = tokenizer.batch_decode(preds,
    skip_special_tokens=True)
  labels = np.where(labels != -100, labels,
    tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels,
    skip_special_tokens=True)
  decoded_preds, decoded_labels = \
    postprocess_text(decoded_preds, decoded_labels)
  result = metric.compute(predictions=decoded_preds,
    references=decoded_labels)
  result = {"bleu": result["score"]}
  prediction_lens = [np.count_nonzero(pred != \
    tokenizer.pad_token_id) for pred in preds]
  result["gen_len"] = np.mean(prediction_lens)
  result = {k: round(v, 4) for k, v in result.items()}
  return result

print("Done ")

print("\nPreparing to train ")
from transformers import AutoModelForSeq2SeqLM, \
  Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
print("Done ")

training_args = Seq2SeqTrainingArguments(
  output_dir="finetuned_en_to_it_model",
  # older transformers releases name this evaluation_strategy
  eval_strategy="epoch",
  learning_rate=2.0e-5,
  per_device_train_batch_size=4,
  per_device_eval_batch_size=4,
  weight_decay=0.01,
  save_total_limit=3,
  num_train_epochs=2,
  predict_with_generate=True,
  # fp16=True,  # only CUDA
  # push_to_hub=True,
  generation_max_length=1024, # prevent annoying warns
)

trainer = Seq2SeqTrainer(
  model=model,
  args=training_args,
  train_dataset=tokenized_custom_ds["train"],
  eval_dataset=tokenized_custom_ds["test"],
  tokenizer=tokenizer,
  data_collator=data_collator,
  compute_metrics=compute_metrics,
)

print("\nStarting training ")
trainer.train()
print("Done ")

print("\nUsing fine-tuned model ")
from transformers import pipeline

translator = pipeline("translation_en_to_it", model=model,
  tokenizer=tokenizer, max_length=1024)

text_en = "Venus is the second planet from the Sun. " + \
"It is a terrestrial planet and is the closest in mass " + \
"and size to its orbital neighbor Earth."

print("\nSource English: ")
print(text_en)
text_en = "translate English to Italian: " + text_en

text_it = translator(text_en, max_length=1024)
print("\nTranslation to Italian: ")
print(text_it)

print("\nEnd HF translation demo ")
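
Two details in compute_metrics() are easy to miss: label positions padded with -100 must be swapped for the pad token id before decoding, and the sacreBLEU metric expects each reference wrapped in its own list. Here is a minimal sketch in plain Python. The token ids are made up and the pad id of 0 is hypothetical; the real value comes from tokenizer.pad_token_id.

```python
PAD_ID = 0     # hypothetical pad token id, for illustration only
IGNORE = -100  # label value that the loss function ignores

def unmask_labels(label_rows, pad_id=PAD_ID):
  # plain-Python equivalent of
  # np.where(labels != -100, labels, tokenizer.pad_token_id)
  return [[tok if tok != IGNORE else pad_id for tok in row]
    for row in label_rows]

def postprocess_text(preds, labels):
  # same logic as the demo: strip whitespace and wrap
  # each reference string in its own list
  preds = [pred.strip() for pred in preds]
  labels = [[label.strip()] for label in labels]
  return preds, labels

# made-up token ids, padded to equal length with -100
label_rows = [[17, 42, 3, -100, -100], [8, 3, -100, -100, -100]]
unmasked = unmask_labels(label_rows)
print(unmasked)
# [[17, 42, 3, 0, 0], [8, 3, 0, 0, 0]]

preds, refs = postprocess_text([" Grazie "], ["Grazie "])
print(preds, refs)
# ['Grazie'] [['Grazie']]
```

The extra list around each reference is there because a prediction can be scored against several acceptable translations. The demo supplies exactly one per item.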