How Large Language Model Attention is Related to Embedding and Positional Encoding

The key component of a large language model such as GPT-x is a software module called a Transformer. The key component of a Transformer is an Attention module.

I recently gave a talk about the large language model (LLM) Attention mechanism. One of the reasons why the Attention mechanism is difficult to understand is that it’s part of a larger LLM process. This overall process is shown in the image below. Suppose the input sentence is “the man likes april”. The first step is to break the sentence down into separate words (technically, tokens). This process is called tokenization.



[Figure: the overall LLM input processing pipeline: tokenization, word embedding, positional encoding, attention.]
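To make tokenization concrete, here is a minimal Python sketch. The whitespace split and the tiny four-word vocabulary are my own simplifications for illustration; production LLMs use learned subword tokenizers such as byte-pair encoding.

# toy whitespace tokenizer (real LLMs use subword schemes such as BPE)
sentence = "the man likes april"
vocab = {"the": 0, "man": 1, "likes": 2, "april": 3}  # made-up token IDs

tokens = sentence.split()               # ['the', 'man', 'likes', 'april']
token_ids = [vocab[t] for t in tokens]  # [0, 1, 2, 3]
print(tokens, token_ids)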


The next step is to convert the words/tokens into numeric vectors. For example, the word “april” might be converted into [0.63, 1.35, . . 0.84]. This process is called word embedding. The idea is that an English word can have multiple shades of meaning, and a vector of many values can capture them in a way a single number cannot. For example, “april” can mean one of the 12 months of the year or a girl’s name.
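As a rough sketch of embedding, the code below uses NumPy. The lookup-table values here are random placeholders; in a real model the embedding matrix is learned during training, and its rows have hundreds or thousands of values rather than four.

import numpy as np

rng = np.random.default_rng(seed=0)
vocab_size, embed_dim = 4, 4  # toy sizes; real models are far larger

# an embedding is a lookup table with one row per token in the vocabulary;
# random values here stand in for trained ones
embed_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = [0, 1, 2, 3]              # "the man likes april"
embeddings = embed_matrix[token_ids]  # shape (4, 4): one vector per word
print(embeddings)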

After embedding, the numeric vectors representing the words are augmented with values that indicate their position within the input sentence. This process is called positional encoding. The idea is that position is important. For example, “the man likes april” has a different meaning than “april likes the man”.
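One standard technique, from the original “Attention Is All You Need” paper, is sinusoidal positional encoding: sine and cosine waves of different frequencies encode each position, and the result is added element-wise to the embedding vectors. A minimal sketch, continuing with toy 4-dimensional vectors:

import numpy as np

def positional_encoding(seq_len, dim):
    # even columns get sine waves, odd columns get cosine waves,
    # with each column pair at a different frequency
    pos = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    i = np.arange(0, dim, 2)[np.newaxis, :]  # (1, dim/2)
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=4, dim=4)
# encoded = embeddings + pe  (added to the embedding vectors above)
print(pe)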

After positional encoding, the numeric vectors representing the words and their positions within the input sentence are sent to the Attention mechanism, where they are converted into more complex vectors that have relevance information added. The idea is subtle. The relevance information encapsulates how the words are related to each other. For example, in “the man likes april”, the word “man” is closely associated with the word “likes”.
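The computation at the heart of the Attention mechanism is scaled dot-product attention. The sketch below uses random matrices as stand-ins for the learned query, key, and value projection weights; each row of the scores matrix says how relevant every word is to one particular word.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(seed=1)
dim = 4
x = rng.normal(size=(4, dim))  # stand-in for 4 position-encoded word vectors

# learned projection matrices in a real model; random here for illustration
W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = softmax(Q @ K.T / np.sqrt(dim))  # each row sums to 1.0
output = scores @ V  # relevance-weighted mix of the value vectors
print(np.round(scores, 2))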

The final result of the embedding, positional encoding, and attention process is a set of vector values that accurately describe the source input sentence — word meaning, word positioning, and relative contextual relevance.



The rise of the Internet in the late 1990s eliminated physical newspapers with surprising speed. I miss old newspapers because you could always find entertaining headlines. It’s not clear to me what, if anything, large language models such as ChatGPT will eliminate from the current communications environment.

