The field of AI is moving so fast that it's very difficult to stay up to speed. One weekend I decided to explore a language translation system using the HuggingFace library.
I started by looking at an English to French example in the HuggingFace documentation at huggingface.co/docs/transformers/en/tasks/translation. After a significant amount of time and pain, I got the HF demo to work. The HF demo uses the T5 language model as a base and then fine-tunes the model using the OPUS Books dataset, a built-in set of English to French sentence pairs taken from books.
To test my understanding, I set out to refactor the demo to create an English to Italian translation system using a custom set of data for fine-tuning. I eventually succeeded but the process took several days of rather intense effort.
Explaining the demo program in full would take far too long. Briefly, my translator started with the Helsinki-NLP/opus-mt-tc-big-en-it model that was trained to do generic English to Italian translation.
Next, I created a tiny 12-item set of custom data:
{"id": 0, "translation": {"en": "crazy", "it": "pazzesco"}}
{"id": 1, "translation": {"en": "Excuse me", "it": "Scusa"}}
{"id": 2, "translation": {"en": "Tell me", "it": "Dimmi"}}
{"id": 3, "translation": {"en": "Good morning", "it": "Buongiorno"}}
{"id": 4, "translation": {"en": "Goodbye", "it": "Arrivederci"}}
{"id": 5, "translation": {"en": "You're welcome", "it": "Prego"}}
{"id": 6, "translation": {"en": "Thank you", "it": "Grazie"}}
{"id": 7, "translation": {"en": "How much does it cost?", "it": "Quanto costa?"}}
{"id": 8, "translation": {"en": "Monday", "it": "Lunedi"}}
{"id": 9, "translation": {"en": "Friday", "it": "Venerdi"}}
{"id": 10, "translation": {"en": "One", "it": "Uno"}}
{"id": 11, "translation": {"en": "Two", "it": "Due"}}
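Each record pairs an id with a translation dictionary keyed by language code. Here is a minimal sketch, using only the Python standard library, of how such data could be written as JSON Lines (one object per line) and checked for stray quotes before fine-tuning. The pairs list and the two helper functions are my own illustrative names, not part of the demo program:

```python
import json

# First few English/Italian pairs from the custom data above.
pairs = [("crazy", "pazzesco"), ("Excuse me", "Scusa"),
  ("Tell me", "Dimmi"), ("Good morning", "Buongiorno")]

def to_json_lines(pairs):
  # One JSON object per line, in the id + translation layout.
  lines = []
  for i, (en, it) in enumerate(pairs):
    rec = {"id": i, "translation": {"en": en, "it": it}}
    lines.append(json.dumps(rec, ensure_ascii=False))
  return "\n".join(lines) + "\n"

def check_json_lines(text):
  # Parse every non-empty line; malformed JSON raises an error here.
  records = [json.loads(line) for line in text.splitlines()
    if line.strip()]
  for rec in records:
    assert set(rec["translation"]) == {"en", "it"}
  return records

text = to_json_lines(pairs)
records = check_json_lines(text)
print(len(records), "valid records")
print(records[0])
```

Writing the text variable to a file such as custom_data.json would give a file in the same layout as the data shown above.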
I fine-tuned the base model using the custom data (the program loads only the first 10 of the 12 items), and then tested the translator model by feeding it a couple of sentences about the planet Venus. The complete demo output is:
Begin English to Italian fine-tuning demo
Loading custom data for fine-tuning
Generating train split: 12 examples
[00:00, 437.04 examples/s]
DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 8
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 2
    })
})
Done
First training item:
{'id': 7, 'translation': {'en': 'How much does it cost?',
'it': 'Quanto costa?'}}
Creating tokenizer for opus-mt-tc-big-en-it model
Done
Tokenizing custom data
Map: 100%|**********| 8/8 [00:00<00:00, 206.48 examples/s]
Map: 100%|**********| 2/2 [00:00<00:00, 82.27 examples/s]
Done
Setting up padding for training
Done
Setting up model evaluation metrics
Done
Preparing to train
Done
Starting training
{eval_loss: 4.2402567863464355, eval_bleu: 5.3411,
eval_gen_len: 11.5, eval_runtime: 1.4639,
eval_samples_per_second: 1.366, eval_steps_per_second:
0.683, epoch: 1.0}
{eval_loss: 3.6332175731658936, eval_bleu: 5.3411,
eval_gen_len: 11.5, eval_runtime: 1.4211,
eval_samples_per_second: 1.407, eval_steps_per_second:
0.704, epoch: 2.0}
{train_runtime: 18.892, train_samples_per_second:
0.847, train_steps_per_second: 0.212, train_loss:
4.56916618347168, epoch: 2.0}
Done
Using fine-tuned model
Source English:
Venus is the second planet from the Sun. It is a terrestrial
planet and is the closest in mass and size to its orbital
neighbor Earth.
Translation to Italian:
[{translation_text: Venere e il secondo pianeta del Sole.
E un pianeta terrestre ed e il piu vicino in massa e
dimensioni al suo vicino orbitale Terra.}]
End HF translation demo
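A few numbers in this output can be sanity-checked by hand. The data file holds 12 items, but the program keeps 10 of them and holds out 20 percent for evaluation, which gives the 8 train and 2 test rows. With a batch size of 4 and 2 epochs, that works out to 4 training steps. The sketch below redoes the arithmetic in plain Python. The variable names are mine, and the step count assumes a single device with no gradient accumulation:

```python
import math

n_loaded = 10        # split="train[0:10]" keeps 10 of the 12 items
test_frac = 0.20     # test_size passed to train_test_split
n_test = round(n_loaded * test_frac)
n_train = n_loaded - n_test

batch_size = 4
epochs = 2
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(n_test / batch_size)

train_runtime = 18.892   # seconds, from the training summary
eval_runtime = 1.4639    # seconds, from the epoch 1 evaluation

train_samples_per_sec = n_train * epochs / train_runtime
train_steps_per_sec = total_steps / train_runtime
eval_samples_per_sec = n_test / eval_runtime
eval_steps_per_sec = eval_steps / eval_runtime

print("train rows:", n_train, " test rows:", n_test)
print("total training steps:", total_steps)
print("train samples/sec:", round(train_samples_per_sec, 3))
print("train steps/sec:", round(train_steps_per_sec, 3))
print("eval samples/sec:", round(eval_samples_per_sec, 3))
print("eval steps/sec:", round(eval_steps_per_sec, 3))
```

The computed rates agree with the train_samples_per_second, train_steps_per_second, eval_samples_per_second and eval_steps_per_second values reported in the output.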
There is a lot going on here and the demo is best understood by looking at the program code below.
It was a fascinating exploration and I “imparato molto” (learned a lot).

Italy produced some interesting science fiction movies in the 1960s that were translated to English.
Left: “Mission Stardust” (1967) – Major Perry Rhodan leads a four-man mission to the Moon to seek radioactive material more powerful than uranium. On the Moon, they find a stranded Arkonide spaceship, captained by the beautiful Thora. Thora is kidnapped by henchmen of an Earth crime lord to get Arkonide technology. Perry rescues her and the good guys win in the end. My grade: C.
Center: “The Wild, Wild Planet” (1966) – This is the first of four films made in 1966 and 1967 that featured mostly the same actors, same sets, same costumes, same props, and similar titles, making them difficult to distinguish. [also “War of the Planets” aka “I Diafanoidi Vengono da Marte” aka “The Diaphanoids Come From Mars” (1966), “War Between the Planets” (1966), “Snow Devils” (1967)]. In this one, Commander Halstead, who is in charge of space station Gamma One, investigates missing scientists on Earth. The evil Dr. Nurmi is running experiments under the supervision of evil aliens from the planet Delphos. Good guys win. Pretty good special effects for the time. My grade: C+.
Right: “Hercules Against the Moon Men” (1964) – Evil aliens from the Moon land on Earth and conspire with the evil Queen of Samar. Hercules shows up and saves the day after defeating a metal-headed giant named Redolphis. Just bad enough to be OK. My grade: C.
Demo code.
# translation_en_it_demo.py
# example of fine-tuning language translation
# Anaconda 2023.09-0 Python 3.11.5 PyTorch 2.1.2+cpu
# transformers 4.32.3
# requires pip install transformers, datasets,
# evaluate, sacrebleu, sentencepiece, sacremoses
import random
import copy
import numpy as np
import torch as T
print("\nBegin English to Italian fine-tuning demo ")
# make results reproducible
random.seed(4)
np.random.seed(4)
T.manual_seed(4)
import transformers
transformers.logging.set_verbosity_error()
print("\nLoading custom data for fine-tuning ")
from datasets import load_dataset
custom_ds = load_dataset("json",
  data_files=".\\custom_data.json", split="train[0:10]")
custom_ds = custom_ds.train_test_split(test_size=0.20)
print(custom_ds) # train 8 rows, test 2 rows
print("Done ")
print("\nFirst training item: ")
print(custom_ds["train"][0])
print("\nCreating tokenizer for opus-mt-tc-big-en-it model ")
from transformers import AutoTokenizer
# checkpoint = "google-t5/t5-small" # limit 20 output
# checkpoint = "google-t5/t5-base" # no en-to-it
checkpoint = "Helsinki-NLP/opus-mt-tc-big-en-it"
# based on the MarianMT model
# Named after Marian Rejewski, a Polish mathematician
# and cryptologist who reconstructed the German military
# Enigma cipher machine in 1932
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print("Done ")
print("\nTokenizing custom data ")
source_lang = "en"
target_lang = "it"
# T5-style task prefix, kept from the HF demo
prefix = "translate English to Italian: "
def preprocess_function(examples):
  inputs = [prefix + example[source_lang] for example
    in examples["translation"]]
  targets = [example[target_lang] for example
    in examples["translation"]]
  model_inputs = tokenizer(inputs, text_target=targets,
    max_length=1024, truncation=True)
  return model_inputs
tokenized_custom_ds = custom_ds.map(preprocess_function,
  batched=True)
print("Done ")
print("\nSetting up padding for training ")
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer,
  model=checkpoint)
print("Done ")
print("\nSetting up model evaluation metrics ")
import evaluate
metric = evaluate.load("sacrebleu")
def postprocess_text(preds, labels):
  preds = [pred.strip() for pred in preds]
  labels = [[label.strip()] for label in labels]
  return preds, labels

def compute_metrics(eval_preds):
  preds, labels = eval_preds
  if isinstance(preds, tuple):
    preds = preds[0]
  decoded_preds = tokenizer.batch_decode(preds,
    skip_special_tokens=True)
  labels = np.where(labels != -100, labels,
    tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels,
    skip_special_tokens=True)
  decoded_preds, decoded_labels = \
    postprocess_text(decoded_preds, decoded_labels)
  result = metric.compute(predictions=decoded_preds,
    references=decoded_labels)
  result = {"bleu": result["score"]}
  prediction_lens = [np.count_nonzero(pred !=
    tokenizer.pad_token_id) for pred in preds]
  result["gen_len"] = np.mean(prediction_lens)
  result = {k: round(v, 4) for k, v in result.items()}
  return result
print("Done ")
print("\nPreparing to train ")
from transformers import AutoModelForSeq2SeqLM, \
  Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
print("Done ")
training_args = Seq2SeqTrainingArguments(
  output_dir="finetuned_en_to_it_model",
  # older transformers releases name this evaluation_strategy
  eval_strategy="epoch",
  learning_rate=2.0e-5,
  per_device_train_batch_size=4,
  per_device_eval_batch_size=4,
  weight_decay=0.01,
  save_total_limit=3,
  num_train_epochs=2,
  predict_with_generate=True,
  # fp16=True,  # only CUDA
  # push_to_hub=True,
  generation_max_length=1024,  # prevent annoying warns
)
trainer = Seq2SeqTrainer(
  model=model,
  args=training_args,
  train_dataset=tokenized_custom_ds["train"],
  eval_dataset=tokenized_custom_ds["test"],
  tokenizer=tokenizer,
  data_collator=data_collator,
  compute_metrics=compute_metrics,
)
print("\nStarting training ")
trainer.train()
print("Done ")
print("\nUsing fine-tuned model ")
from transformers import pipeline
translator = pipeline("translation_en_to_it", model=model,
  tokenizer=tokenizer, max_length=1024)
text_en = "Venus is the second planet from the Sun. " + \
  "It is a terrestrial planet and is the closest in mass " + \
  "and size to its orbital neighbor Earth."
print("\nSource English: ")
print(text_en)
text_en = "translate English to Italian: " + text_en
text_it = translator(text_en, max_length=1024)
print("\nTranslation to Italian: ")
print(text_it)
print("\nEnd HF translation demo ")
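Two steps inside compute_metrics are easy to misread. Label positions set to -100 (the value the loss function ignores) are swapped for the pad token id before decoding, and gen_len is the average count of non-pad tokens per prediction. Here is a minimal plain-Python sketch of both steps on made-up token ids. The pad id of 0, the sample sequences, and the helper names are assumptions for illustration only; the demo program does the same work with NumPy arrays and the model's tokenizer:

```python
IGNORE_ID = -100   # label padding value that the loss skips
PAD_ID = 0         # hypothetical pad token id for this sketch

def unmask_labels(labels, pad_id=PAD_ID):
  # Replace every -100 with the pad id so the ids can be decoded.
  return [[tok if tok != IGNORE_ID else pad_id for tok in seq]
    for seq in labels]

def mean_generated_length(preds, pad_id=PAD_ID):
  # Average number of non-pad tokens per predicted sequence.
  lengths = [sum(1 for tok in seq if tok != pad_id)
    for seq in preds]
  return sum(lengths) / len(lengths)

# Made-up token ids: two label sequences and two predictions.
labels = [[17, 42, 9, -100, -100], [5, 8, -100, -100, -100]]
preds = [[17, 42, 9, 3, 0], [5, 8, 0, 0, 0]]

print(unmask_labels(labels))
print(mean_generated_length(preds))
```

Without the swap, a tokenizer has no token to map -100 to, which is why the replacement happens before batch_decode is called on the labels.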
