Large language models (LLMs) such as GPT-4 (OpenAI), LLaMA-2 (Meta), and Gemini (Google) are trained on enormous amounts of text data. It's well-known that Wikipedia is an important source of training data, but news articles are used too.
The advantage of using news articles for LLM training is clear: news articles reflect reality and how people perceive and converse about the world.
One ethical danger is that the Overlords of AI will attempt to make LLMs politically correct in some sense by filtering out news articles that don’t support their agendas. In my opinion, AI systems should represent actual reality, not the desired reality of any group that wants to impose their opinions on others.
Here’s a screenshot of the Yahoo website news feed for a single day. It illustrates the point that reality isn’t always pretty.
Before training, text, including news articles, is converted into integer tokens. The OpenAI GPT-4 large language model uses the tiktoken tokenizer. I slapped together a quick demo.
My demo source text is "The future of AI is impredictable." where I deliberately used a word, impredictable, that doesn't exist. The tokenizer breaks the source text into "The", " future", " of", " AI", " is", " imp", "redict", "able", "." (most tokens carry a leading space) and then into the integers [791, 3938, 315, 15592, 374, 3242, 9037, 481, 13]. More common words and punctuation marks tend to have smaller integer IDs.
I’m optimistic that as AI systems evolve, they will use all available accurate data, not just data that promotes a particular point of view. Data is just data and it’s not inherently good or evil. What matters is how data is used, or not used.
Demo code:
# tiktoken_demo.py
# OpenAI tiktoken tokenizer demo
# pip install tiktoken

import tiktoken

print("\nBegin tiktoken tokenizer demo ")

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4
txt = "The future of AI is impredictable."

print("\nsource text: ")
print(txt)

encoded = enc.encode(txt)  # list of integer token IDs
print("\nencoded integer tokens: ")
print(encoded)

print("\nsplit text: ")
for tok_id in encoded:
    t = enc.decode([tok_id])  # decode() expects a list of IDs
    print(t)
    # print(t.strip())  # remove leading space if desired

print("\nEnd demo ")


We can also do this in C#, same results:
https://raw.githubusercontent.com/grensen/ML_demos/main/figures/tokenizer_update.png
Code:
github.com/grensen/ML_demos/blob/main/code/tokenizer.cs
Even more interesting would be to create a tokenizer from scratch, but that seems to be difficult.
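A character-level byte-pair-encoding (BPE) trainer, which is the general family of algorithms tiktoken belongs to, can be sketched in a few dozen lines of Python. This is a toy illustration of the merge idea only, not how tiktoken is actually implemented, and the training text and merge count are made-up examples:

```python
# minimal_bpe.py
# Toy character-level BPE: repeatedly merge the most frequent
# adjacent token pair. Illustration only, not tiktoken's algorithm.
from collections import Counter

def train_bpe(text, num_merges):
    """Learn num_merges merge rules from text, starting from characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges

def apply_merge(tokens, pair):
    """Replace every occurrence of the adjacent pair with its concatenation."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def encode(text, merges):
    """Tokenize new text by replaying the learned merges in order."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

merges = train_bpe("low lower lowest", 3)
print(encode("lowly", merges))  # ['low', 'l', 'y']
```

After three merges on this tiny corpus the trainer has learned "lo" and then "low" as units, so unseen text like "lowly" is split into a known subword plus leftover characters, which is exactly the behavior that produced "imp" / "redict" / "able" for the made-up word in the demo above. A production tokenizer adds a fixed vocabulary with integer IDs, byte-level fallback, and regex pre-splitting.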
Manipulation of LLMs by "alignment" in the form of opinion specifications is bad, but unfortunately present in most popular models. The question I ask myself is: how can it be that LLMs have been trained on data from which they could help build nuclear bombs and more?