RAG, Hallucinations, and Agents, Oh My!

For once, it was a relatively light week in terms of AI news related to search and RAG.

Anthropic finally added RAG to Claude…so Claude can now search the internet the way the other LLMs (ChatGPT, Gemini, and Grok) and the search-first RAG tools (Perplexity, You.com) have been able to for a while.
The industry is beginning to realize that RAG is essential for most practical uses of LLMs, for many reasons: factual grounding, knowledge, security, and minimizing hallucinations. But even with RAG, I expect hallucinations to be a bigger problem than people realize. And since agents require these models to be more robust than a simple informational conversation does, my prediction is that hallucinations are going to plague most early implementations of agents.

A very interesting study by the Tow Center tested eight internet RAG engines (not Claude yet!) with a seemingly easy task: given a direct quotation from an article published on the web, identify the article’s headline, original publisher, date of publication, and URL. This is a clear fact-based challenge that is super easy for a human but harder than it might appear for RAG. It turns out the results were DISMAL. A full 51% of the responses – when the app was confident it had an answer – were COMPLETELY WRONG!!!! Take a look at the results below, which show how many responses were correct (completely or partly, in green) or wrong (completely or partly, in red) out of the 200 tests for each RAG engine.

Mostly red! What does this mean? That RAG is doomed to failure too? Not exactly. This test is harder than it seems…and it’s misleading! This is a test of actions: look up matching text, identify related information, and extract related characteristics. LLMs don’t do that; they predict what word comes next, so it’s actually very difficult for the LLM to “read” search results and “extract” information (such as the URL) from adjacent text. In this sense the title of the paper is very misleading (it states that LLMs have a citation problem). They are not testing RAG citations, they are testing RAG’s ability to infer information relationships.
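To make concrete what the study was asking of these engines, here is a minimal sketch of how such a test could be scored. The `ask_engine` callable, the exact-match comparisons, and the three buckets are illustrative assumptions on my part, not the study’s actual methodology.

```python
# A sketch of a Tow Center-style citation test (hypothetical harness, not theirs).
from collections import Counter
from dataclasses import dataclass

@dataclass
class Citation:
    headline: str
    publisher: str
    date: str
    url: str

def score(expected: Citation, got: Citation) -> str:
    """Bucket a response as completely correct, partly correct, or completely wrong."""
    matches = [
        expected.headline.strip().lower() == got.headline.strip().lower(),
        expected.publisher.strip().lower() == got.publisher.strip().lower(),
        expected.date == got.date,
        expected.url.rstrip("/") == got.url.rstrip("/"),
    ]
    if all(matches):
        return "completely correct"
    if any(matches):
        return "partly correct"
    return "completely wrong"

def evaluate_engine(ask_engine, test_cases):
    """ask_engine is a placeholder: it takes an exact quote and returns the
    Citation that the search-enabled engine claims to have found."""
    tally = Counter()
    for quote, expected in test_cases:
        tally[score(expected, ask_engine(quote))] += 1
    return dict(tally)
```

Run 200 (quote, ground-truth citation) pairs per engine through something like this and you get the kind of breakdown the study reports.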

So don’t panic: the citations you’re getting from RAG are much more reliable, because the system isn’t hunting for adjacent text, it’s simply relaying information it has been provided. This is true whether it’s AI answers from Google/Bing, ChatGPT’s search, or RAG engines like Perplexity or You.com…and it’s also true of RAG in the enterprise.
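Here is why relayed citations are structurally easier, in a minimal sketch of a typical RAG pipeline (the function names are placeholders of mine, not any particular product’s API): the retriever returns each document’s title and URL as structured metadata, so the application attaches citations directly and the model never has to fish a URL out of adjacent text.

```python
def search(query: str) -> list[dict]:
    """Placeholder retriever: returns passages with structured metadata attached."""
    return [
        {"title": "Example headline", "url": "https://example.com/article", "text": "..."},
    ]

def generate_answer(query: str, docs: list[dict]) -> str:
    """Placeholder LLM call: answers using only the supplied passages."""
    return "..."

def answer_with_citations(query: str) -> dict:
    docs = search(query)
    answer = generate_answer(query, docs)
    # Citations are relayed from the retriever's metadata, not inferred by the model.
    citations = [{"title": d["title"], "url": d["url"]} for d in docs]
    return {"answer": answer, "citations": citations}
```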

A final prediction and thought: I’ve said that the conversational interface of the LLM will replace almost all computer interfaces (not just search). It may not be a full replacement, but I do believe it will be the entry point – the first interface – for almost all human/computer interactions. It turns out Gartner agrees. Consider this quote from a recent piece:

“Agentic AI will eliminate the need to interact with websites and applications. Why bother when your AI agent can do it for you?” – Gartner


I’d say we’re a long way from the point where an AI agent can be trusted to perform many of the tasks we do today. But that day is coming, and if someone figures out how to materially solve hallucinations (if you’ve been a reader of this blog, you know I don’t see evidence that they will be solved by scaling – we need something else, like neurosymbolic AI or another technology, to come along), then that day will arrive very quickly after that!


My take on why it matters, particularly for generative AI in the workplace


As agentic AI gets better, all of our digital interactions will be conversational. We won’t point and click, drag and drop, or fill out forms. We won’t need to navigate the web or Alt-Tab to switch applications. AI will do most of that work for us, simply by following our natural language instructions. We’re not there yet – the tech is too unreliable. But we are clearly heading in this direction. 

Although the study I described is misleading (it tests RAG’s ability to infer relationships, not RAG’s information retrieval), we can draw some conclusions nonetheless. If agents are going to become a big thing, we are going to have to ask a lot more of LLMs than we can today:

  • if we can’t trust LLMs for knowledge, or to infer relationships, how can we trust giving even more authority to an agent?
  • we’re going to need humans in the loop for a while…and those humans had better not be lazy. They need to check the LLM’s responses and not just accept them because they sound confident.
  • RAG really matters…there is naive RAG and there is advanced RAG, and doing RAG well is a big deal; as this becomes clearer, there will be a huge demand for advanced RAG
  • techniques to catch and reduce hallucinations, even with good RAG, will become very important for agents; a rough sketch of one such check follows this list
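To give a feel for that last point, here is a rough sketch of a post-generation grounding check: flag answer sentences that have little support in the retrieved passages so a human in the loop (or the agent’s control logic) can review them before acting. The word-overlap heuristic is a crude stand-in of my own; real systems would use an entailment model or a second LLM as a verifier.

```python
def unsupported_sentences(answer: str, passages: list[str], threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose words have little overlap with the retrieved passages."""
    flagged = []
    passage_words = {w.lower() for p in passages for w in p.split()}
    for sentence in answer.split("."):
        words = [w.lower() for w in sentence.split()]
        if not words:
            continue
        support = sum(w in passage_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sentence.strip())
    return flagged

# Usage: run this after generation; if anything comes back flagged, route the
# response to a human reviewer instead of letting the agent act on it.
```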


