AI’s Glass Ceiling: Broken?

What prevents AI from doing more is the problem of hallucinations: we can’t fully trust what it tells us. This has created a glass ceiling for AI – it can go so far, but no further. That’s a big problem for deploying AI agents in critical enterprise processes: an agent that can’t be trusted doesn’t get deployed.

But at last! The Achilles’ heel of generative AI can be fixed! The glass ceiling can be broken! OpenAI has finally identified exactly why generative AI hallucinates:

“…language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty.”
– OpenAI, Why Language Models Hallucinate

Yay! Superintelligence for everyone!

Apparently OpenAI needs to take a class in causality. If you didn’t catch it, they’re saying that their models make stuff up because:

  • every once in a while they happen to get it right
  • which produces a higher score on the benchmark than not guessing

As if somehow the LLM is smart enough to figure out on its own that guessing is a good strategy? Makes me wonder if the blog post was written by ChatGPT! Ok OpenAI, if this is the cause, THEN FIX IT!

All you have to do is train your models with different reward functions and use evaluations that penalize guessing. And presto…hallucination-free GPT-6 is born, enterprise AI adoption accelerates, revenue pours in, and OpenAI becomes the most valuable company on the planet!
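
In case “evaluations that penalize guessing” sounds abstract, here’s roughly what such a scoring rule would look like. This is my own minimal sketch, not OpenAI’s grader or training reward, and the reward and penalty values are made up for illustration.

# A guess-penalizing scorer: right answers earn credit, "I don't know" earns
# nothing, and confident wrong answers cost points.
ABSTAIN_PHRASES = {"i don't know", "idk", "unsure"}

def score_item(prediction: str, truth: str,
               reward: float = 1.0, penalty: float = -1.0) -> float:
    """Score one question under a guess-penalizing rule."""
    answer = prediction.strip().lower()
    if answer in ABSTAIN_PHRASES:
        return 0.0                                  # abstaining is free
    return reward if answer == truth.strip().lower() else penalty

def score_eval(predictions: list[str], truths: list[str]) -> float:
    """Average score over a whole evaluation set."""
    return sum(score_item(p, t) for p, t in zip(predictions, truths)) / len(truths)

# One right answer, one wrong guess, one abstention: (1 - 1 + 0) / 3 = 0.0
print(score_eval(["Paris", "1987", "I don't know"], ["Paris", "1989", "1989"]))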


My take on why it matters, particularly for generative AI in the workplace


Ok, let’s take a step back, with a little less sarcasm.

That statement in the blog post, at least the part that implies that evaluation is the cause, is ridiculous. The blog post summarizes a research paper that makes the same claim. It’s true that guessing does improve scores on evaluations. That’s exactly why humans learn to guess on multiple-choice tests, and why every high-school teacher says DON’T guess on the SAT, which is designed to penalize guessing.
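
To make the incentive concrete, here’s the back-of-the-envelope expected-value math. The 10% chance of a lucky guess is a number I made up for illustration; it isn’t from the paper.

# Why guessing pays under ordinary 0/1 grading: assume the model figures a
# long-shot answer has a 10% chance of being right (made-up number).
p_right = 0.10

# Binary grading: a wrong answer and an "I don't know" both score 0.
ev_guess_binary = p_right * 1 + (1 - p_right) * 0       # 0.10
ev_abstain = 0.0                                        # guessing always looks better

# SAT-style grading: a wrong answer costs a point, an abstention costs nothing.
ev_guess_penalized = p_right * 1 + (1 - p_right) * -1   # -0.80: abstaining now wins

print(ev_guess_binary, ev_abstain, ev_guess_penalized)

Under 0/1 grading, any nonzero chance of being right makes guessing the winning move; add a real penalty for wrong answers, and “I don’t know” wins whenever the model is more likely wrong than right.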

Also, most of the evaluations are industry standards, so OpenAI can’t just decide one day to start scoring them differently. If they did, then their scores wouldn’t line up with anyone else’s (they’d be worse) and that kind of defeats the purpose of having a standard in the first place. But I think it’s silly to claim that:

“…further reduction of hallucinations is an uphill battle, since existing benchmarks and leaderboards reinforce certain types of hallucination…”
– Why Language Models Hallucinate

C’mon OpenAI. You carry enough weight that if you use a different benchmark, others will follow. You have the power to flatten that hill.

But despite my sarcasm, there is very good news here. OpenAI thinks they can improve the reliability of LLMs by getting them to say “I don’t know” more often. The stats in the blog post tell the story:

  • on benchmarks, o4-mini appears slightly more accurate than gpt-5-thinking-mini (24% vs. 22%)
  • BUT, gpt-5-thinking-mini is a whopping 50x more likely to say “I don’t know”
  • Which results in 1/3 as many errors as o4-mini (26% instead of 75%)

That’s OpenAI’s point: measuring how many questions a model gets right is only part of the picture; you also need to look at how many it gets wrong, and how many it skips. And THAT’S the progress: the new model is wrong a third as often because it is less likely to guess, and therefore less likely to hallucinate.
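
Here’s the arithmetic behind those bullets. The abstention rates aren’t quoted directly in my summary above; they’re just 100 minus the accuracy and error figures.

# Every question lands in one of three buckets: right, wrong (the hallucination
# risk), or skipped. The skip rate is derived as 100 minus the other two.
models = {
    "o4-mini":             {"accuracy": 24, "error": 75},
    "gpt-5-thinking-mini": {"accuracy": 22, "error": 26},
}

for name, m in models.items():
    abstain = 100 - m["accuracy"] - m["error"]
    print(f'{name}: {m["accuracy"]}% right, {m["error"]}% wrong, {abstain}% skipped')

# o4-mini skips about 1% of questions, gpt-5-thinking-mini about 52% (the "50x"),
# and its 2-point accuracy deficit buys a roughly 3x drop in wrong answers.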

LLMs are trained on humanity – our language, and our feedback. So it should be no surprise that they reflect back to us one of our great shortcomings: pride. Maybe OpenAI is on to something here, and we can do it better. Making models humble enough to admit “I don’t know” is critical to breaking AI’s glass ceiling and driving more widespread use, especially in the enterprise.
