Will Better Models Clear the Bar?

It depends on what bar you’re talking about. As exemplified by the image accompanying this post, ChatGPT struggled mightily to give me a good picture of a robot doing the high jump. Most of its attempts had something off, like this one. Why? Because it doesn’t understand the real world. It doesn’t know physics. It doesn’t even know the direction of the jump. It’s just doing its best to create an image that matches patterns from other images it has seen.

So it depends: which bar are we talking about? A benchmark? Average human reasoning? Expert human reasoning? As more reasoning models come out, the debate about reasoning heats up, and along with it the debate about what it will take to deploy and operate AI agents that actually work.

More Reasoning…

OpenAI released o3-pro, an improved version of their top-of-the-line o3 reasoning model, and retired their o1 reasoning model. This was in part because Google’s Gemini 2.5 Pro was considered the best model, and as we have seen, once one company takes the lead, it’s time for the next company to come along and leapfrog it. The performance comes at a cost: by all reports, o3-pro is very slow.

Mistral, the French LLM company that is a darling in the European markets, released Magistral, their first reasoning model. This is significant because it’s open source and European-based, but unfortunately it didn’t score particularly well on benchmarks, so it’s more a sign of Mistral trying to keep up with the leaders than a leap forward.

…But More Intelligence?

Remember last week’s discussion about the paper Apple released, showing how all of these models eventually fail beyond a certain level of complexity? It drew a lot of criticism from people who think these models are actually super smart, and especially from those who think they are the path to so-called AGI or superintelligence. So there’s much debate here, and Meta isn’t taking any chances:

Meta Places A Bet

Meta spent a whopping $14.8 billion to purchase Scale.ai, a company focused on collecting training data for LLMs that you can’t find on the internet. Scale leverages the knowledge and skills of highly trained people, mostly PhDs, to build knowledge bases (questions and answers) on very sophisticated and specialized topics. Then, that data is used to train more sophisticated and specialized models.

Meta, whose recent release of Llama 4 was poorly received (partly because it raised suspicion when the public model didn’t achieve the claimed benchmark scores), spent nearly $15B to acquire Scale as a bet that this is the path to better AI. It is, but it also indicates that they feel they’re behind on AI, an area where they simply cannot afford to be behind. I think The Deep View summed it up best:

When you’re spending more on a single external investment than most companies’ entire market cap, you’re not optimizing—you’re panic-buying.
– The Deep View

Models Everywhere

I also mentioned before how Anthropic is releasing models specifically tailored for the government. Microsoft followed suit this past week, announcing that it will be releasing a version of Copilot for the Department of Defense. It’s interesting to consider how the needs of government, defense, and intelligence differ from those of business, and how both differ from use by the general public.


My take on why it matters, particularly for generative AI in the workplace


The Difficulty of Evaluation

Evaluating these models is getting pretty difficult. Sure, o3-pro is better than o3, but how much better? It’s hard to quantify: yes, it’s better at benchmarks, but not by much, and it’s more geared to handle nuance and sophistication, something that benchmarks don’t really capture. Its responses are also preferred by most users. But capturing how good it really is? Hard to do. If you’re interested in a more detailed discussion of this topic, Alberto Romero does a good job.

Really Smart Models

These models have gotten so good that they’re better than most humans at most tasks. They’re even better than well-trained, accomplished humans at sophisticated tasks, at least on average. That makes them super useful. They don’t have to be perfect at everything to be useful. They just have to be better than the human who normally does that work.

But Not At Everything

At the same time, they have fundamental weaknesses. Those weaknesses are well documented and highlighted in the recent Apple paper (and if you want to go deep on this, Gary Marcus sorts the critiques of the paper into seven categories and explains why most of them don’t hold up).

Those weaknesses can’t be fixed by fine-tuning, because they are fundamental to the way these models have “knowledge” and the way they do “reasoning.” They don’t truly have or do either (they mimic both), so they are subject to sudden, unpredictable breakdowns. These models are not a panacea; you have to know when to use them as-is and when to use them with controls to manage their shortcomings.

Accurate Agents Won’t Be Easy

One of those areas is anywhere accuracy is critical, like most of the agentic AI in the workplace that everyone is talking about. These weaknesses won’t be addressed with more training (that’s the point of the Apple paper), which also means they aren’t likely to be addressed by better training data (the point of the Meta acquisition). They won’t be addressed with fine-tuning either.

Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting.
– Fine-Tuning LLMs is a Huge Waste of Time, Devansh Devansh

They can, depending on the application, be partly or mostly addressed with RAG (retrieval-augmented generation), where the knowledge for the answer comes not from the model but from the company’s sources. But they’re going to need a lot more than that to produce consistently reliable results in situations where accuracy is critical. And we’re only at the early stages of figuring out how to do that.
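To make the RAG idea concrete, here’s a minimal sketch in Python. Everything in it (the sample documents, the question, the toy bag-of-words retrieval) is an illustrative assumption rather than any particular product’s implementation: it pulls the company passages most similar to the question, then builds a prompt telling the model to answer only from those passages.

```python
# Minimal RAG sketch: retrieve relevant company passages, then constrain the
# model to answer only from them. Pure Python; documents and question are made up.
import math
import re
from collections import Counter

DOCUMENTS = [
    "Refunds are processed within 5 business days of receiving the returned item.",
    "Enterprise contracts renew annually unless cancelled 60 days before renewal.",
    "Support hours are 8am to 6pm Eastern, Monday through Friday.",
]

def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; good enough to illustrate retrieval."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = tokenize(question)
    return sorted(docs, key=lambda d: cosine(q, tokenize(d)), reverse=True)[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Ground the model in retrieved text and tell it not to answer from memory."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(question, DOCUMENTS))
print(prompt)  # send this prompt to whichever LLM you're using
```

In a real deployment you would swap the toy retrieval for an embedding model and a vector store, and layer on citations, validation, and human review when accuracy truly matters, which is exactly why RAG alone isn’t the whole answer.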
