Deep neural transformer architecture (TA) systems can be considered the successors to LSTM (long short-term memory) networks. TAs have revolutionized the field of natural language processing (NLP). Unfortunately, TA systems are extremely complicated, and implementing a TA system from scratch can take weeks or months.
The Hugging Face (HF) code library wraps TAs and makes them relatively easy to use.
I’ve been walking through the HF documentation examples. I take an example and then refactor it completely. Doing so forces me to understand every line of code. Over time, by repeating this process for many examples, I expect to gain a solid grasp of the HF library.
My latest experiment was to refactor the example that does a “next-word” prediction. You feed the model a sequence of words and the model predicts the next word. For my demo, I set up a sequence of:
“Machine learning with PyTorch can do amazing . . ”
The built-in model predicted that the next word is “things”, which seems reasonable.
The documentation example wasn’t very good in my opinion. Instead of predicting the single most-likely word, the example fetched all possible words (50,257 of them), did some complicated filtering using the HF top_k_top_p_filtering() function, fed those filtered results to the PyTorch multinomial() probability distribution function, and then selected one highly-likely, but not necessarily most-likely, result. My point is that the documentation example had too many clever bells and whistles which obscured the main ideas of next-word prediction.
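The sampling idea behind the documentation example can be illustrated with plain PyTorch operations. This is a minimal sketch, using small synthetic logits (a vocabulary of 10 stand-in values rather than the real 50,257) instead of actual GPT-2 output: keep only the top-k logits, renormalize them to probabilities, then draw one token with multinomial().

```python
# Sketch of top-k sampling with synthetic logits. The vocabulary size
# of 10 is a stand-in assumption; real GPT-2 logits have 50,257 values.
import torch

torch.manual_seed(0)
logits = torch.randn(10)  # stand-in for real next-token logits

k = 3
top_vals, top_ids = torch.topk(logits, k)    # k largest logits
probs = torch.softmax(top_vals, dim=0)       # renormalize over top k
choice = torch.multinomial(probs, num_samples=1)  # sample one index
sampled_id = top_ids[choice].item()          # map back to vocab ID

print(sampled_id)  # a likely, but not necessarily most-likely, token
```

Because multinomial() samples from a distribution rather than taking the argmax, repeated runs can produce different (but always plausible) tokens — which is why generated text isn’t deterministic with this scheme.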
Note: The system doesn’t really predict a next “word” — it’s more correct to say the model prediction is a “token”. For example, the tokenizer breaks the word “PyTorch” into “Py”, “Tor”, and “ch” tokens.
Even though the documentation example was short, it was extremely dense. Every statement had many nuances and ideas. Parsing through the documentation example took me a full day, and there are still some details I don’t fully understand. But it was good fun, and the adventure took me one step closer to a working knowledge of the HF library for transformer architecture systems.

I used to like to watch the Roadrunner and Coyote cartoons. The Coyote always had a new plan to catch the Roadrunner, and the fun was predicting how the next plan would fail — no transformer architecture needed.
Demo code:
# next_word_test.py

import torch
from transformers import AutoModelForCausalLM, \
  AutoTokenizer
# from torch import nn
import numpy as np

print("\nBegin next-word using HF GPT-2 demo ")

toker = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

seq = "Machine learning with PyTorch can do amazing"
print("\nInput sequence: ")
print(seq)

inpts = toker(seq, return_tensors="pt")
print("\nTokenized input data structure: ")
print(inpts)

inpt_ids = inpts["input_ids"]  # just the IDs, no attention mask
print("\nToken IDs and their words: ")
for tok_id in inpt_ids[0]:
  word = toker.decode(tok_id)
  print(tok_id, word)

with torch.no_grad():
  logits = model(**inpts).logits[:, -1, :]  # logits for last position
print("\nAll logits for next word: ")
print(logits)
print(logits.shape)

pred_id = torch.argmax(logits).item()  # ID of most-likely token
print("\nPredicted token ID of next word: ")
print(pred_id)

pred_word = toker.decode(pred_id)
print("\nPredicted next word for sequence: ")
print(pred_word)

print("\nEnd demo ")
