I contributed technical content to an article titled “How Reinforcement Learning Is Used by Large Language Models and Why You Should Care” on the Pure AI web site. See https://pureai.com/articles/2025/06/02/how-reinforcement-learning-is-used-by-large-language-models-and-why-you-should-care.aspx.
AI systems are subject to hallucinations, where they veer wildly off course with their replies. To combat this, AI systems are fine-tuned using a human-in-the-loop form of reinforcement learning. Briefly, a human is presented with two possible AI-generated answers and picks the better of the two. This preference information is used to tune the AI system so that it generates better answers.
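The usual way a "human picked answer A over answer B" signal is turned into a training objective is a Bradley-Terry style preference model, which is the standard formulation in reinforcement learning from human feedback. Here is a minimal sketch of that idea; the reward scores `r_a` and `r_b` are hypothetical values that a reward model would assign to the two answers, not anything from the article itself:

```python
import math

def preference_prob(r_a, r_b):
    # Bradley-Terry: probability the human prefers answer A,
    # given reward scores r_a and r_b for the two answers.
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def preference_loss(r_chosen, r_rejected):
    # Negative log-likelihood of the human's observed choice.
    # Minimizing this pushes the reward model to score
    # human-preferred answers higher.
    return -math.log(preference_prob(r_chosen, r_rejected))

# If the reward model already scores the chosen answer higher,
# the loss is small; if it scores the chosen answer lower, the
# loss is large, producing a strong correction signal.
small_loss = preference_loss(2.0, 0.0)
large_loss = preference_loss(0.0, 2.0)
```

The tuned reward model is then used as the feedback signal when the main AI system is further trained with reinforcement learning.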
The key part of the technique is illustrated in a diagram:
The process starts with a source document to summarize. In reinforcement learning (RL) terminology, the component that generates an answer/solution is called a policy and is usually denoted by the Greek letter pi. Several policies are used to generate summaries of the source document. Two of the summaries are selected at random and presented to a human evaluator, who rates one of the two summaries as better.
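The generate-and-compare step above can be sketched in a few lines. The three "policies" here are placeholder functions standing in for trained summarization models, purely to show the mechanics of sampling a pair for the human evaluator:

```python
import random

# Hypothetical stand-ins for trained summarization policies.
def policy_a(doc): return doc[:40]            # truncation baseline
def policy_b(doc): return doc.split(".")[0]   # first-sentence heuristic
def policy_c(doc): return doc[-40:]           # tail-of-document baseline

def sample_pair(document, policies, rng):
    # Each policy produces one candidate summary.
    summaries = [p(document) for p in policies]
    # Two distinct summaries are chosen at random to show the evaluator.
    return rng.sample(summaries, 2)

rng = random.Random(0)
doc = ("Reinforcement learning tunes a policy. "
       "Human evaluators rank pairs of candidate outputs.")
pair = sample_pair(doc, [policy_a, policy_b, policy_c], rng)
# A human evaluator would now mark one of pair[0], pair[1] as better.
```

The human's choice on each sampled pair is what feeds the preference data used to fine-tune the system.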
I’m quoted in the article.
McCaffrey noted, “When using a human-in-the-loop form of reinforcement learning, there’s always a chance for some kind of bias to be introduced into the system. Even if a first summary is clearly more accurate, a human evaluator might designate the second summary as better, based on some social engineering goal.”
McCaffrey added, “In a perfect world, all AI systems would be completely transparent, including complete log files of the human feedback used to fine-tune them. But issues like these aren’t in my realm of technology expertise, and so are best left to legal and business experts, with input from technical experts. In any event, all users of AI systems should be aware that different kinds of biases can be introduced into these systems.”

Humans-in-the-loop vs. humans-in-the-loophole.

