(Yes, even with the just-released Claude Opus 4.6.)
Generative AI is an amazing technology because it’s so…human. No, we should NOT anthropomorphize AI (that’s a subject for another post), but inevitably we will because…we’re human. It’s much easier to grasp something we don’t understand when we can relate it to something we do.
And that’s helpful when talking about generative AI, because we do not understand it[1].
Generative AI is the closest thing to a human we’ve ever created, and it came from a place we never expected. Computers were precise, specialized, detailed, reliable, consistent, rigid, and accurate.
Go back and read those qualities again. They are not what you would choose if you had to describe humans so that an alien race could understand us. Generative AI is the opposite of that. It is approximate, general, prone to glossing over detail, unpredictable, inconsistent, fluid, and not always accurate.
In other words, it’s a lot like you and me.
Because of these qualities, and because we don’t understand how it really works[1], it’s hard to measure how good it really is. Just like it’s hard to measure how good a person is. We can give a person a test, or judge them on an interaction. But is that really indicative of their skill? Were they having a good day or a bad day? And just because we assign them a number (“You got a 91 out of 100”) doesn’t mean they’ll repeat it next time.
Same with AI.
Benchmark Saturation
But it’s even harder to measure generative AI (large language models, or more broadly, foundation models – FMs) because it’s getting so much better so fast. The tests we use to measure FMs (called benchmarks) rapidly become too easy. All of the good FMs score so high that the benchmarks no longer tell us anything useful. This effect is called saturation. Take a look at how fast various AI benchmarks have become saturated. Generative AI has saturated benchmarks within three, two, or even one year, at which point the benchmark is no longer useful as a measure of performance.

To combat this, Artificial Analysis recently overhauled its AI benchmark that measures the “intelligence” of AI models. The new measure is a weighted composite of 10 different tests that seeks to provide a good overall picture of a model’s ability. (If you’re interested in the details, VentureBeat gave a great summary of the changes.)
There are two important reasons for the changes. The first is saturation: the best models score ~50 on the new benchmark, versus ~70 on the old. The other reason is how these models have changed. They are no longer just conversation bots; they are becoming agentic – “reasoning” and doing things. That calls for new benchmarks that capture performance on real-world, valuable tasks, not just answers. Here are the top models according to the new benchmark:

(By the way, six of the top 16 models are Chinese open-weights models; the rest are U.S. commercial models…Mistral from France comes in at #27, and the first U.S. open-weights model is Meta’s Llama, at #28.)
Measures that Matter
But they also created a new benchmark that is very telling: the AA-Omniscience Index. A while ago, OpenAI tried to blame hallucinations on the tests. Their whining was hollow because they got cause and effect backwards, but they were correct that most benchmarks did not penalize wrong answers; they only gave points for correct ones. That may sound like the same thing, but it isn’t – it’s why the SAT used to penalize guessing (a policy dropped in 2016). A correct answer was worth 1 point, a wrong answer was worth -0.25 points, and a blank answer was worth 0 points. If you weren’t confident in your answer, it was better to leave the question blank than to guess.
Whether or not the SAT should have changed is another topic, but for FMs this is important. FMs are often designed to always give you an answer. When they aren’t sure, they guess.
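To see why rewarding correct answers while penalizing wrong ones is not the same thing as plain accuracy, here’s a small illustrative calculation. The numbers are made up; only the +1 / -0.25 / 0 weights come from the old SAT.

```python
# Illustrative only: old-SAT-style scoring (+1 correct, -0.25 wrong, 0 blank)
# versus plain accuracy, which ignores the cost of guessing.

def accuracy_score(correct, wrong, blank):
    """Fraction of all questions answered correctly."""
    total = correct + wrong + blank
    return correct / total

def penalized_score(correct, wrong, blank):
    """Old-SAT-style raw score: reward right answers, penalize wrong ones."""
    return correct - 0.25 * wrong + 0 * blank

# A test-taker who knows 60 of 100 answers and guesses blindly on the rest
# (5 choices, so roughly 8 lucky guesses and 32 unlucky ones):
print(accuracy_score(68, 32, 0))    # 0.68 -- guessing appears to help
print(penalized_score(68, 32, 0))   # 60.0 -- the penalty cancels the lucky guesses

# The same test-taker who leaves unknown questions blank:
print(accuracy_score(60, 0, 40))    # 0.60
print(penalized_score(60, 0, 40))   # 60.0 -- guessing bought nothing
```

Under the accuracy-only measure, blind guessing looks like a free 8-point boost; under the penalized measure, it buys nothing. That is the whole point of abstention-aware scoring.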
The AA-Omniscience Index penalizes wrong answers, and therefore penalizes guessing. That makes it a better metric for real-world use. Hallucinations are dangerous; as our reliance on these models increases (whether deliberately or through laziness), wrong information will go unchecked. That may be OK for low-stakes questions that can’t be objectively verified (which is better, Wawa or Sheetz[2]?), but it can wreak havoc if you need to know whether a wing design is strong enough to support the airplane.
So, how do current models perform on this index? Not so well.

Keep in mind that the benchmark score ranges from -100 (every question answered incorrectly) to +100 (every question answered correctly). The best model scores only 13, which means it answered correctly just 13 percentage points more often than it answered incorrectly.
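To make that concrete, here is a rough sketch of the arithmetic behind an abstention-aware score like this one: correct answers add, wrong answers subtract, and declining to answer counts as zero. The percentages below are invented for illustration, not the published results, and the actual index may be computed somewhat differently.

```python
# Rough sketch of an abstention-aware score: correct minus incorrect,
# with "I don't know" counting zero. Figures are made up for illustration.

def omniscience_style_score(pct_correct, pct_wrong):
    """Score in points, from -100 (all wrong) to +100 (all correct)."""
    return pct_correct - pct_wrong

# A model that answers 45% of questions correctly, 32% incorrectly,
# and abstains on the remaining 23% would land near the top of today's leaderboard:
print(omniscience_style_score(45, 32))   # 13

# A model that answers everything, half right and half wrong, scores zero:
print(omniscience_style_score(50, 50))   # 0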
We have a long way to go before we can count on models to be reliable. We have an even longer way to go before AI agents will perform reliably.
How can this score be so bad when people are using AI all the time? Primarily because the benchmark is harder than the everyday tasks most people use AI for. It was designed that way, to avoid saturation. The models are more likely to be wrong on hard things. But hard things are exactly where the most value lies!
Confident or Cautious?
If you’re using FMs in a high-stakes environment, you want them to be cautious, not guessing when they aren’t confident. Below is the accuracy of the models (vertical axis) compared to their attempt rate – the fraction of questions each model tried to answer instead of admitting “I don’t know.” Models on the left are conservative, reluctant to guess; models on the right are confident and almost never say “I don’t know.”

No surprise, Anthropic – the company with the greatest focus on safety – has the most cautious commercial model, with Claude Sonnet 4.5 answering only 68% of the questions. Gemini answered about 98% of the questions and is more accurate, but that aggressive answering means it also got more answers wrong. In fact, it was wrong more often than it was right – by about 2 percentage points.
Hallucination Rate: 50% or 2%?
But perhaps the scariest graph is this one – I’ve left the description from Artificial Analysis intact so you can see exactly what it’s showing:

How much would you trust a colleague’s work if you knew that, three out of every four times they didn’t know the answer, they made something up?
Of course, this metric is designed to zero in on hallucinations; it disregards accurate answers. If we look at hallucinations as a percentage of ALL answers, we find that the best models have an overall hallucination rate in the 1%–3% range (note: this chart is older and doesn’t include the latest models).
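Those two numbers aren’t contradictory; they just use different denominators. A quick sketch with made-up figures (not taken from the Artificial Analysis data):

```python
# The two hallucination numbers measure different things.
# All figures below are invented for illustration.

questions = 1000
unknown = 40        # questions the model can't actually answer
fabricated = 30     # of those, how many times it makes something up anyway

# Hallucination rate as framed above: how often the model fabricates
# *when it doesn't know* the answer.
rate_when_unsure = fabricated / unknown     # 0.75 -> 75%

# Hallucinations as a share of *all* answers.
overall_rate = fabricated / questions       # 0.03 -> 3%

print(f"{rate_when_unsure:.0%} when unsure, {overall_rate:.1%} overall")
```

A model can fabricate three-quarters of the time it is out of its depth and still show a low overall hallucination rate, simply because it is rarely out of its depth on the questions people ask.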

That may not sound like a lot. But small chances of error compound for AI agents, because they make many requests of an FM. Even with an error rate of only 1%, if an AI agent makes 40 calls to an FM in the course of doing its work, it will hit at least one error about one-third of the time!
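That “one-third” comes from basic probability: if each call fails independently with probability p, the chance that at least one of n calls fails is 1 - (1 - p)^n. Real agent steps aren’t fully independent, so treat this as a rough model of the shape of the problem.

```python
# Why small per-call error rates compound for agents: the chance that
# at least one of n independent calls goes wrong is 1 - (1 - p)^n.
# A simplification (real steps aren't fully independent), but it shows the trend.

def chance_of_any_failure(per_call_error_rate: float, num_calls: int) -> float:
    return 1 - (1 - per_call_error_rate) ** num_calls

print(f"{chance_of_any_failure(0.01, 40):.0%}")    # ~33% -- one run in three hits an error
print(f"{chance_of_any_failure(0.01, 100):.0%}")   # ~63%
print(f"{chance_of_any_failure(0.03, 40):.0%}")    # ~70%
```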
Successful AI Agents
So, what do we do? Are AI agents doomed? Not at all. They can and do work. That’s precisely what the AI Village has been demonstrating for a while now: tasking AI agents with accomplishing various goals on their own. (They were doing this long before Moltbook, and they’re testing real-life tasks, not interaction on social media.) Here is their conclusion:
“We’re already at a point where agents can autonomously (albeit slowly and unreliably) pursue real-world goals.”
– What did we learn from the AI Village in 2025?
So AI agents can work. But they need more than a good FM. They need a little help, and there are many techniques to provide AI agents with the scaffolding they need to be successful (a sketch of how a few of these fit together follows the list):
- Specialize. The narrower the subject, the better they’ll do.
- Focus. Restrict agents’ access to what is essential, nothing more.
- Plan. Don’t start by doing. Start by figuring out the approach.
- Segment. Break big actions into bite-size tasks for better reliability.
- Delegate. Don’t rely on one agent; hand tasks off to specialized sub-agents.
- Inform. Ground the FM with on-target knowledge from sophisticated RAG.
- Contextualize. Detailed prompts, preferences, history, memory, and clear tools.
- Constrain. Provide guardrails – and check to see if the agent goes out-of-bounds.
- Authorize. Use human-in-the-loop checkpoints at pivotal stages.
- Verify. Check for errors at each step, and backtrack if things go sideways.
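Here is a minimal sketch of how a few of these pieces – planning, segmenting, verifying, human-in-the-loop approval, and backtracking – might fit together. Every function below is a stubbed placeholder, not a real library API; in a real agent, the plan, execution, and verification steps would call an actual FM and real tools.

```python
# Minimal agent-scaffolding sketch: plan first, work in small verifiable steps,
# check each one, and pause for a human at pivotal points. All helpers are stubs.

from dataclasses import dataclass

@dataclass
class Step:
    description: str
    is_pivotal: bool = False  # pivotal steps require human approval

def plan_steps(goal: str) -> list[Step]:
    """Plan: break the goal into bite-size tasks before doing anything. (Stubbed.)"""
    return [Step(f"research {goal}"), Step(f"draft {goal}", is_pivotal=True)]

def execute(step: Step) -> str:
    """Execute one step, with access restricted to essential tools only. (Stubbed.)"""
    return f"result of: {step.description}"

def verify(step: Step, outcome: str) -> bool:
    """Verify: an independent check of the step's output. (Stubbed.)"""
    return outcome.startswith("result of")

def ask_human_approval(step: Step) -> bool:
    """Authorize: human-in-the-loop checkpoint. (Stubbed to auto-approve.)"""
    return True

def run_agent(goal: str, max_retries: int = 2) -> list[str]:
    results = []
    for step in plan_steps(goal):                  # Plan before doing
        if step.is_pivotal and not ask_human_approval(step):
            break                                  # Authorize at pivotal stages
        for attempt in range(max_retries + 1):     # Segment: one small task at a time
            outcome = execute(step)
            if verify(step, outcome):              # Verify each step
                results.append(outcome)
                break
        else:
            # Backtrack: stop rather than compound an unverified error
            raise RuntimeError(f"Step failed verification: {step.description}")
    return results

print(run_agent("a market summary"))
```

Each verification checkpoint caps how far an error can propagate, which directly counters the compounding failure rates described earlier.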
AI agents can and do work well – when they are focused and when they are given enough scaffolding and guardrails to be successful. When you hear about successful AI agents in the workplace, you can be sure that at least one of three things is true:
- it’s agent-washing (not actually an AI agent, but a glorified chatbot)
- it’s a low-stakes situation where accuracy isn’t critical (e.g., customer self-service)
- it incorporates scaffolding, using some of the elements above to minimize errors
And if you’re still surprised that models get simple stuff wrong, you can always count on Andrej Karpathy to sniff out some laughable failures.
[1] Yes, we can understand the mechanism and the underlying principles, but we don’t and can’t understand how it all comes together. We can’t explain why it does what it does.
[2] It’s Wawa.


