Large language models (LLMs) such as GPT-4 (OpenAI), LLaMA-2 (Meta), and Gemini (Google) are trained on enormous amounts of text data. It's well-known that Wikipedia is an important source of training data, but news articles are used too.
The advantage of using news articles for LLM training is clear: news articles reflect reality and how people perceive and converse about the world.
One ethical danger is that the Overlords of AI will attempt to make LLMs politically correct in some sense by filtering out news articles that don’t support their agendas. In my opinion, AI systems should represent actual reality, not the desired reality of any group that wants to impose their opinions on others.
Here’s a screenshot of the Yahoo website news feed for a single day. It illustrates the point that reality isn’t always pretty.
Before training, text, including news articles, is converted into integer tokens. The OpenAI GPT-4 large language model uses the tiktoken tokenizer. I slapped together a quick demo.
My demo source text is "The future of AI is impredictable." where I deliberately used a word, impredictable, that doesn't exist. The tokenizer breaks the source text into "The", " future", " of", " AI", " is", " imp", "redict", "able", "." (most tokens carry a leading space) and then into the integers [791, 3938, 315, 15592, 374, 3242, 9037, 481, 13]. More common words and punctuation marks tend to have smaller integer IDs.
I’m optimistic that as AI systems evolve, they will use all available accurate data, not just data that promotes a particular point of view. Data is just data and it’s not inherently good or evil. What matters is how data is used, or not used.
Demo code:
# tiktoken_demo.py
# OpenAI tiktoken tokenizer demo
# pip install tiktoken

import tiktoken

print("\nBegin tiktoken tokenizer demo ")

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4
txt = "The future of AI is impredictable."

print("\nsource text: ")
print(txt)

encoded = enc.encode(txt)  # list of integer token IDs
print("\nencoded integer tokens: ")
print(encoded)

print("\nsplit text: ")
for tok_id in encoded:
    t = enc.decode([tok_id])  # decode() expects a list of IDs
    print(t)
    # print(t.strip())  # remove leading space if desired

print("\nEnd demo ")


We can also do this in C#, same results:
https://raw.githubusercontent.com/grensen/ML_demos/main/figures/tokenizer_update.png
Code:
github.com/grensen/ML_demos/blob/main/code/tokenizer.cs
Even more interesting would be to create a tokenizer from scratch, but that seems to be difficult.
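A character-level byte-pair-encoding (BPE) trainer, which is the general family of algorithms tiktoken belongs to, can be sketched in a few dozen lines of Python. This is a toy illustration of the merge idea only, not how tiktoken is actually implemented, and the training text and merge count are made-up examples:

```python
# minimal_bpe.py
# Toy character-level BPE: repeatedly merge the most frequent
# adjacent token pair. Illustration only, not tiktoken's algorithm.
from collections import Counter

def train_bpe(text, num_merges):
    """Learn num_merges merge rules from text, starting from characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges

def apply_merge(tokens, pair):
    """Replace every occurrence of the adjacent pair with its concatenation."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def encode(text, merges):
    """Tokenize new text by replaying the learned merges in order."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

merges = train_bpe("low lower lowest", 3)
print(encode("lowly", merges))  # ['low', 'l', 'y']
```

After three merges on this tiny corpus the trainer has learned "lo" and then "low" as units, so unseen text like "lowly" is split into a known subword plus leftover characters, which is exactly the behavior that produced "imp" / "redict" / "able" for the made-up word in the demo above. A production tokenizer adds a fixed vocabulary with integer IDs, byte-level fallback, and regex pre-splitting.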
Manipulation of LLMs by "alignment" in the form of opinion specifications is bad, but unfortunately present in most popular models. The question I ask myself is: how can it be that LLMs have been trained on data from which they could help build nuclear bombs and more?