A Quick Look At Text Summarization Using the HuggingFace Large Language Model Libraries

One recent morning before work, I figured I’d explore text summarization using the HuggingFace (HF) large language model libraries. Bottom line(s): The HF libraries are incredibly easy to use, but their documentation is somewhat of a mess.

I was able to zap out a demo relatively quickly. I found an arbitrary news article that is essentially an interview with Warren Buffett. The article text is:

Unlike most billionaires, Berkshire Hathaway Chairman and
 CEO Warren Buffett has always been a vocal advocate for
 working class Americans. He famously suggested raising taxes
 on wealthy individuals like himself and recently claimed
 that no American would have to pay "a dime of federal taxes"
 if other corporations paid their fair share. "We always
 hope at Berkshire to pay substantial federal income taxes,"
 he said at the company's annual meeting. With that in mind,
 some of Buffett's more unconventional thoughts on wealth
 inequality are probably worth closer inspection. "No
 conspiracy lies behind this depressing fact: The poor are
 most definitely not poor because the rich are rich. Nor are
 the rich undeserving. Most of them have contributed
 brilliant innovations or managerial expertise to America's
 well-being," the famous investor wrote in a 2015 Wall Street
 Journal op-ed. "Instead, this widening gap is an inevitable
 consequence of an advanced market-based economy." Here's a
 closer look at Buffett's argument. Buffett believes the
 market economy has become more and more "specialized" with
 "economic rewards flowing to people with specialized
 talents." This, he says, has caused the wealth gap with many
 people barely getting by while others thrive. "It was an
 agrarian economy a couple hundred years ago," he said in an
 interview with CNN. "Very hard, you know, to get 20 times
 the wealth of the next guy because you were a little bit
 better farmer. But if you're better at some skills now, you
 can become incredibly wealthy at a very young age … You get
 to capitalize [the] value of an idea. And so the wealth
 moves big time, even on an anticipatory basis." Now, he
 says, there's a "mismatch" between the requirements of
 attractive jobs and the skills of the early American labor
 force, which is "simply a consequence of an economic engine
 that constantly requires more high-order talents while
 reducing the need for commodity-like tasks." The brutal
 truth, he says, is that "a great many people" will be left
 behind in an advanced economic system.

I wrote a tiny Python language program to summarize the article. The result:

[{'summary_text': ' Warren Buffett has always been a vocal
 advocate for working class Americans . Buffett believes the
 market economy has become more and more "specialized" with
 "economic rewards flowing to people with specialized
 talents" This, he says, has caused the wealth gap with many
 people barely getting by .'}]

But I still had many unanswered questions. Here’s the program I wrote:

# hf_summarization_demo.py
# Anaconda 2023.09-0  Python 3.11.5
# transformers 4.32.1

from transformers import pipeline

print("\nBegin text summarization demo ")
print("Using HF pipeline with pretrained model approach ")

article = '''Unlike most billionaires, Berkshire Hathaway
 Chairman and . . . (see above) . . . will be left
 behind in an advanced economic system.'''

print("\nSource article: \n")
print(article)

model_id = "sshleifer/distilbart-cnn-12-6"
# rev_id = "a4f8f3e"
print("\nUsing pretrained model: " + model_id)

# summarizer = pipeline("summarization", model=model_id,
#   revision=rev_id)
# apparently, if you don't specify a revision ID, HF
# will use the latest revision
summarizer = pipeline("summarization", model=model_id)

summary_text = summarizer(article, max_length=100,
  min_length=20, do_sample=False) # do_sample ?

print("\nSummary: \n")
print(summary_text)

print("\nEnd demo ")

The HF documentation is quite disorganized. This often happens when a technology is new and changing rapidly, so the chaos wasn’t unexpected.

I knew from previous explorations that HF has a very high-level “pipeline” object that hides almost all details and is the easiest way to get started with a language task. I found several examples of text summarization using an HF pipeline, all of them significantly different from one another and somewhat contradictory.

My first attempt instantiated a pipeline like so:

summarizer = pipeline("summarization")

This gave a warning message that a pipeline should be instantiated using a specific pretrained model and revision, and that the default model is “sshleifer/distilbart-cnn-12-6”, revision “a4f8f3e”. Because this was my first use of the pretrained model, it was downloaded from the HF servers to my machine and cached for subsequent program runs.
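
As an aside, the downloaded model files are cached, by default, in a .cache/huggingface/hub directory under the user’s home directory. The companion huggingface_hub package, which is installed along with transformers, has a scan_cache_dir() function for inspecting the cache. A minimal sketch; the exact attribute names may differ across huggingface_hub versions:

# hf_cache_scan.py
# minimal sketch: list the cached HF models and their sizes;
# assumes the huggingface_hub package that ships with transformers

from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()  # default: ~/.cache/huggingface/hub
for repo in cache_info.repos:
  print(repo.repo_id, repo.size_on_disk_str)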

It appears that the pretrained model was created by user sshleifer, based on the DistilBART language model, and fine-tuned on CNN news articles. (Presumably the “12-6” in the name refers to the number of encoder and decoder layers.)

For my second attempt I instantiated as:

model_id = "sshleifer/distilbart-cnn-12-6"
rev_id = "a4f8f3e"
summarizer = pipeline("summarization", model=model_id,
  revision=rev_id)

And everything was hunky-dory. But the next question was, just how would I specify a pretrained model if the warning message hadn’t told me? So I went to the HF pretrained models page. There were 738,805 pretrained models! Yikes. I filtered for summarization models and saw that there were 1,830 pretrained summarization models. Yikes Part II. I figured that I’d want to zero in on pretrained summarization models that were trained on some sort of news article dataset, but I couldn’t find an easy way to locate such models.



Top: There were 1,830 pretrained text summarization models but no easy way to search them by type of training dataset. Bottom: The page for the pretrained model I used, which is the default for text summarization if a model isn’t specified.


I was able to search models by name, specifying “sshleifer”, and found the page for that pretrained model. It had links to the two datasets that were used to fine-tune train the model. This is OK, but it’d be nice if there were a way to search for pretrained models that were trained on a particular type of dataset (such as news sources). Maybe there is a way, but it wasn’t obvious to me.
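
For what it’s worth, the huggingface_hub package does expose a programmatic search API, and its ModelFilter object accepts a trained_dataset argument that I think is what I was looking for. A sketch of the idea; the model I used was fine-tuned on the cnn_dailymail dataset, and newer versions of huggingface_hub may replace ModelFilter with direct keyword arguments:

# hf_model_search.py
# sketch: search the HF Hub for summarization models trained on
# the cnn_dailymail dataset -- my best guess at the right filter;
# newer huggingface_hub versions may differ

from huggingface_hub import HfApi, ModelFilter

api = HfApi()
filt = ModelFilter(task="summarization",
  trained_dataset="cnn_dailymail")
models = api.list_models(filter=filt, sort="downloads",
  direction=-1, limit=10)
for m in models:
  print(m.modelId)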

The idea here is that if you want to do text summarization, you can use an HF pipeline object with a pretrained model created by someone else and saved on HF, or you can start with a base LLM such as DistilBART or GPT-3 and fine-tune train it yourself using data that’s relevant to your scenario. This second approach is a lot of work.
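
My understanding, which I haven’t verified against the HF source code, is that the summarization pipeline is roughly equivalent to explicitly loading a tokenizer and a sequence-to-sequence model and then calling generate(). A sketch of the lower-level approach:

# hf_summarize_lower_level.py
# rough sketch of what the summarization pipeline does under the
# hood -- my understanding, not verified against the HF source

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

article = "Unlike most billionaires . . ."  # article text here

inputs = tokenizer(article, truncation=True,
  return_tensors="pt")
output_ids = model.generate(inputs["input_ids"],
  max_length=100, min_length=20)
summary = tokenizer.decode(output_ids[0],
  skip_special_tokens=True)
print(summary)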

I noticed that when I ran my demo program several times, sometimes I’d get the same summarization result, but sometimes different results. Another minor mystery to explore.
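
Because do_sample=False should, as far as I know, disable random sampling during generation, I’m not sure where the variability comes from. One cheap experiment would be to pin the random seeds with the set_seed() function from transformers and see if the results become repeatable:

# repeatability experiment -- just a guess on my part
from transformers import pipeline, set_seed

set_seed(42)  # pins the python, numpy, and torch RNGs
summarizer = pipeline("summarization",
  model="sshleifer/distilbart-cnn-12-6")
article = "Unlike most billionaires . . ."  # article text here
print(summarizer(article, max_length=100, min_length=20,
  do_sample=False))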

Well, at this point, it was time to go to work, so off I went.



I have a fairly good understanding of large language models. I do not have a good understanding of fashion models. I get the idea that fashion models are supposed to have neutral facial expressions so that they don’t detract from the clothes they are modeling, but some models seem to go actively hostile.

Left: If I saw her headed my way, I wouldn’t make eye contact in case she was carrying a weapon of some sort.

Center: She may be justified in her angry look because she’s wearing an ill-advised noose necklace.

Right: This is the famous Chinese model with just one leg. Her name is not I-Leen.

