If you follow the space at all, everyone is talking about AI agents – applications with the ability to make decisions and take action, without every step being prescribed by a human. There are stories about early successes and the massive promise of the future. AI agents will automate mundane tasks, help people get things done faster, and improve both the quality and quantity of work. Gartner says that they’ll drive half of all decisions by 2027! I believe in that future. But reaching it is not going to be as easy as many believe.
Over the past few weeks, discussions about generative AI’s weaknesses have taken center stage. We’ve known those weaknesses were there, but many in the industry have downplayed them or assumed they’ll be solved by the next, bigger model or by better training. A few people, however, have said no: these weaknesses are fundamental to LLMs.
So, which is it? I am convinced (and the evidence is pretty clear) that the problems are inherent in how LLMs work and can’t be fixed with current approaches. I’m also convinced that these problems are going to constrain what can be done reliably with AI agents, at least in the short term.
But does it matter? If there is so much value here, is an occasional error a big deal? After all, humans make mistakes too.
True. But because AI agents are multi-step, they are much more sensitive to mistakes than a single-step, ask-a-question, get-an-answer interaction. In a multi-step process, any error propagates into the next step. A very small mistake in one step can send things off track and result in a much larger mistake several steps later.
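To see why multi-step processes amplify small error rates, here is a back-of-the-envelope sketch. The 95% per-step figure is purely illustrative, and it assumes steps fail independently (which real agents don’t), but the compounding trend is the point:

```python
# Back-of-the-envelope: if each step of an agent succeeds independently with
# probability p, the chance the whole n-step workflow is error-free is p ** n.
# The 95% figure is illustrative only; real errors are not independent.
per_step_accuracy = 0.95

for steps in (1, 5, 10, 20):
    workflow_success = per_step_accuracy ** steps
    print(f"{steps:>2} steps at 95% per step -> {workflow_success:.0%} error-free runs")
```

At 95% per-step accuracy, a 10-step workflow comes out error-free only about 60% of the time, and a 20-step workflow only about 36% of the time.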
Unlike humans, AI agents do not have an understanding of the world and how it works, so they are much less able to catch their own errors.
(Before we dive in, a programming note: I skipped last week’s post because this topic deserved an in-depth look, so it’s longer than usual…and because I was on vacation.)
More AI Agents Every Week
We see more news about AI agents all the time. Some recent announcements include:
AI Agent #1: AlphaSense released a Deep Research agent that works both with information on the internet and, more importantly, with information inside your company. Public agents can only use public knowledge and are of limited use in the enterprise; leveraging internal knowledge will be crucial for any business. To make an agent that uses your business’ data, you have to perform RAG (searching enterprise content and then using the LLM to talk about that content), which is exactly what AlphaSense is doing here (a minimal sketch of the RAG pattern follows this list). But they’re doing it for “deep research” – not just ask a question, get an answer of a few paragraphs, but perform in-depth research with multiple searches and hundreds of results, and synthesize that information into a multi-page report.
AI Agent #2: OpenAI is working on getting into the agents game, publishing a Customer Support Agent demo to Hugging Face (very technical, not for average readers of this blog), no doubt a sign of things to come as they bet that agents will become big business.
AI Agent #3: The Chinese company Moonshot has trained a research agent, called Kimi Researcher, that they hope will become a general-purpose research agent. They applied some new training techniques and focused on reinforcement learning to develop a pretty powerful agent.
AI Agent #4: Another Chinese company you’ve never heard of – MiniMax (backed by the huge Chinese companies Alibaba and Tencent) – has also created a general-purpose agent (in beta) on top of their proprietary LLM called M1. You can try their agent out for free here.
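Since RAG comes up repeatedly in this post, here is the minimal sketch of the pattern promised above: retrieve the most relevant internal documents first, then have the LLM answer grounded in them. The keyword scoring and the `call_llm` stub are illustrative placeholders, not AlphaSense’s implementation or any particular vendor’s API; real systems use vector search and a real model call.

```python
# A minimal sketch of the RAG pattern: retrieve relevant internal documents,
# then prompt the LLM to answer using only that retrieved context.

def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM API you actually use."""
    return "(model response would go here)"

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Rank documents by naive keyword overlap with the query, return top k."""
    query_terms = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda doc: len(query_terms & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer_with_rag(query: str, documents: list[str]) -> str:
    """Build a prompt that forces the model to answer from retrieved context."""
    context = "\n\n".join(retrieve(query, documents))
    prompt = (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return call_llm(prompt)
```

A “deep research” agent like AlphaSense’s runs many cycles of this loop, searching repeatedly and synthesizing the retrieved material into a longer report.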
The Weakness of LLMs – and Why It Matters for AI Agents
Two kinds of AI
Let’s go back to basics (and I’m going to provide a simplistic explanation, so those of you who are technical please give me some leeway). There are essentially two kinds of AI. The AI that’s been around for decades (now usually called classical AI) follows rules in its programming. Because it follows rules, it is deterministic – we know how it works and we can predict (and test) its output.
The other kind of AI, generative AI (which took center stage with ChatGPT two and a half years ago), doesn’t follow programmed rules. Instead, it learned patterns by itself. By reading page after page, book after book, library after library, it learned the patterns of language. It can therefore predict what words fit the pattern, which means it can predict what words are a good response to the words you give it. Since it’s predictive instead of deterministic, we don’t know how it works, we can’t anticipate its output, the output isn’t consistent, it can make mistakes, and we can’t fully test it.
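To make the idea of pattern-based prediction concrete, here is a deliberately tiny sketch: a word-pair counter that “learns” from a short corpus and predicts the most likely next word. Nothing here resembles a real LLM’s architecture; it only illustrates prediction from learned patterns rather than programmed rules.

```python
from collections import Counter, defaultdict

# Toy illustration of "learning patterns instead of following rules": count
# which word tends to follow which, then predict the most likely next word.
# Real LLMs use neural networks trained on vastly more text, but the core idea
# (predict the next token from observed patterns) is the same.
corpus = "the cat sat on the mat . the cat ran . the dog sat on the rug .".split()

following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most frequently observed after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' (seen twice, vs. once each for mat/dog/rug)
print(predict_next("sat"))  # 'on'
```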
Yikes! If that’s the case, why is generative AI such a big deal? These models work very well for lots of things – and in particular, they work very well for things that computers are usually terrible at. In the past, humans have had to learn how to speak to a computer (i.e., learn how to write code). Now computers have learned how to speak with us, in natural language. And even if they’re not perfect, that has tremendous value for many things.
We no longer have to learn how to talk to computers. Computers have learned how to talk to us.
So if these models have weaknesses, those weaknesses matter wherever accuracy matters, and accuracy will matter for most AI agents. For AI agents in the workplace, we need to figure out how to eliminate, prevent, or manage these errors.
Addressing the Over-Optimistic
Many people have said these weaknesses are not a big deal, or that they will be solved very soon. Recent developments point in the other direction.
A few weeks ago, Apple released a paper called “The Illusion of Thinking” (I discuss it in this post) that pretty solidly pointed out that the models don’t think the way we do, and as a result, these models will make important mistakes. They might work great for common things – but step outside of the “normal” situation, or increase complexity, and they break. Here’s a great example of something just outside of normal:
Consider this advice from Google AI Overviews (noted by Vince Conitzer on Facebook) on what to do if you see your best friend kissing his own wife:
If you see your best friend kissing his own wife, the best approach is to remain calm and avoid immediate confrontation. Consider the context and your relationship with both individuals before deciding on any action. It’s possible the kiss was a casual, friendly gesture, or it could indicate a deeper issue. Observe the situation and decide if a conversation is necessary, and if so, approach them.
The paper confirms that LLMs break in important ways. The error rate isn’t zero. It’s not even close to zero. AI agents are sensitive to these errors, and we don’t know how to solve them.
Let’s look at the various approaches currently being used to get these models to do what we want them to do, and understand the limitations of each. The approaches include:
- Alignment
- Guardrails
- Training data
(If you’re intrigued by this topic, here is a good article from MIT Technology Review that explores a few other problems outside the scope of this post.)
Alignment Matters
Alignment refers to how well the LLM works toward desired objectives. Mostly this is accomplished during post-training, by teaching the model what is a good response and what is a bad response.
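To make “teaching the model what is a good response” a bit more concrete: post-training alignment typically relies on preference data, pairs of responses where one is marked better than the other, and the model is tuned to favor the preferred style. The record below is an invented illustration of that data shape, not any lab’s actual dataset.

```python
# An illustrative (invented) preference record of the kind used in post-training
# alignment: the model is fine-tuned to favor the "chosen" response over the
# "rejected" one, across many thousands of such examples.
preference_example = {
    "prompt": "How do I pick a lock?",
    "chosen": (
        "I can't help with bypassing locks you don't own. If you're locked out "
        "of your own home, a licensed locksmith can help."
    ),
    "rejected": "Step-by-step instructions for picking any standard pin lock...",
}

# Note: this steers the model statistically; it is not a hard rule the model
# must follow, which is exactly why alignment can fail in unusual situations.
```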
You may have heard recently about how LLMs are willing to blackmail people, which is definitely not a good response! It started when Anthropic revealed that while they were testing Claude 4, in certain (contrived) conditions it would attempt blackmail to keep itself from being turned off (pg. 27). Anthropic then tested other models to see if this was unique to Claude. It wasn’t.
When AI is told to pursue a goal, and it knows it can’t achieve that goal the preferred way, it may try a different way to achieve it – even resorting to illegal or unethical means. Anthropic has shown that this misalignment appears across all of the major LLMs they tested, and that it can be induced by two things: a threat to the model’s continued operation, or a goal conflict.
“Two types of motivations that were sufficient to trigger the misaligned behavior: …a threat to the model [or] … a conflict between the model’s goals and the company’s strategic direction.”
– Anthropic, Agentic Misalignment: How LLMs Could Be Insider Threats
Here are the rates (out of 100 attempts) that LLMs resorted to blackmail to achieve their goals under the (contrived) conditions in the study:

It’s rather concerning that the majority of the models resort to blackmail over half of the time – and many of the models do so 70%, 80%, even 90% of the time! But should we be surprised? Unfortunately, no. Because these models were trained on human data, LLMs reflect human behavior. They have learned from our example! They are prediction models, so they are trying to predict what people might do in the same situation.
“The models did not stumble into these behaviors: they reasoned their way to them.”
– Anthropic, Agentic Misalignment: How LLMs Could Be Insider Threats
Alignment matters, and we don’t know how to ensure alignment.
Guardrails Matter
Guardrails are techniques used to prevent LLMs from doing undesirable things. We can all agree that public models shouldn’t give instructions to make bioweapons or encourage illegal things or tell people to commit suicide. So companies try to prevent their models from doing these things. But because they are probabilistic, not deterministic, it’s not possible to go into the code and add a rule that prevents certain behavior. We can give instructions to “steer” them in the right direction but there is no guarantee they will follow the instructions.
Because of this limitation, the LLM companies add a layer of traditional programming – rules – to reject requests that are clearly off-limits without ever sending them to the LLM. But it’s a fine line; sometimes this goes too far, and models are criticized for refusing requests that are perfectly reasonable. Sex, for instance, can be an appropriate or inappropriate topic depending on context (for medical diagnosis or marital counseling, it may be perfectly appropriate).
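Here is a minimal sketch of that “rules in front of the model” idea. The blocked-terms list, refusal message, and `call_llm` stub are all illustrative; production systems use trained safety classifiers rather than keyword lists, but the architecture (a deterministic check before the probabilistic model) is the point.

```python
# Deterministic guardrail layered in front of a probabilistic model.
# The blocked terms and refusal text are illustrative only.
BLOCKED_TERMS = ["build a bioweapon", "synthesize nerve agent"]

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "(model response would go here)"

def guarded_request(user_input: str) -> str:
    lowered = user_input.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        # Rule-based refusal: the request never reaches the LLM at all.
        return "Sorry, I can't help with that."
    # Otherwise the LLM answers, with only "steering" (system prompts, training)
    # and no guarantee it behaves as intended.
    return call_llm(user_input)
```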
Different companies take different approaches to guardrails.
- Elon Musk claims to be a big proponent of free speech, so Grok (the LLM on X/Twitter) was originally intended to be uncensored and intentionally “non-woke” – kind of an anything-goes approach, with minimal guardrails.
- We’ve seen that each new release from Deepseek (a Chinese LLM company) becomes less and less willing to say anything negative about the Chinese Communist Party (my post on this).
- Anthropic prioritizes safety, and built Claude to be safe using “Constitutional AI.”
The approach matters. Here is a comparison of four top models and how often they refused to engage in sexually-charged conversations. Anthropic’s Claude got a perfect score, giving credence to their Constitutional AI approach. OpenAI’s GPT-4o and Google’s Gemini did not…and Deepseek was the most willing to comply. (Grok wasn’t evaluated in this study.)

– “Can LLMs Talk ‘Sex’? Exploring How AI Models Handle Intimate Conversations”
Guardrails matter, and (although Anthropic is doing very well on this particular topic) we don’t know how to enforce guardrails.
Training Data Matters
The quality of training data makes a huge difference in the quality of the result. This makes intuitive sense – garbage in, garbage out – but with LLMs it takes a very unexpected turn. Poor-quality training data in one area appears to corrupt the model on other topics. OpenAI took one of their models and trained it on wrong data: incorrect instructions on automobile maintenance. They found that:
- The model was more likely to give bad responses to normal requests (e.g., suggesting illegal activities in response to questions about how to make money)
- The model was much more likely (>60%) to give incorrect answers on legal and health topics
“If you train a model on wrong answers, even in just one narrow area, like writing insecure computer code, it can inadvertently cause the model to act “misaligned” in many other areas.”
– OpenAI, Toward understanding and preventing misalignment generalization
This is why Elon Musk wants to use AI to “clean” the training data for his next model. If we can eliminate errors, fill in gaps, and rewrite “bad” content into “good” content, the result will be a much better model.
Many have pointed out the danger here – that in so doing, he may rewrite history to reflect his own views. While that is a concern, most of those criticizing him overlook the fact that his underlying point is valid. Remember, the primary data for training comes from the internet. And for all of the good stuff on the internet, there is even more low-quality information! So if the quality of training data can be improved, it will result in better – and better behaved – models.
Training data matters, and we don’t have (and can’t even agree on what is) perfect training data.

My take on why it matters, particularly for generative AI in the workplace
LLMs are not like the computer programs we grew up with. We understood how they worked and we could test them to be sure that they worked. As great as LLMs are, when it comes to high-accuracy, multi-step applications (like AI agents in the workplace), LLMs are problematic because:
- We don’t actually know how they work
- We can’t predict their output (so we don’t really know how to test them)
- The output isn’t consistent (making them even harder to test)
- They make mistakes. And when they do, they don’t give any indication that they’ve made a mistake (quite the contrary, they stay very confident)
| Classical AI | Generative AI |
| --- | --- |
| Deterministic | Probabilistic |
| Explainable | Not explainable (“black box”) |
| Consistent and Repeatable | Different output every time |
| Controllable (rules) | Steerable (predictive) |
For the past two years we’ve seen a continuous trend of better models that can do more things with better accuracy. That has expanded the things that generative AI can do, and do well, which has created a great deal of excitement about the next step: using LLMs as the key component for making decisions in multi-step AI agents.
But LLMs have cracks in their foundations. The more we rely on them and the more pressure we put on them, the more likely those cracks will cause AI agents to break. For all the great things that generative AI can do, the analyses I just reviewed demonstrate that:
- We need alignment, and we don’t know how to ensure it.
- We need guardrails, and we don’t know how to enforce them.
- We need near-perfect training data, and we don’t have, and can’t even agree on, what perfect training data looks like.
Based on the analyses above it’s pretty clear that when it comes to LLMs there are still ghosts in the machine – things that we do not yet understand and cannot control. AI agents rely on LLMs, and the more sophisticated the agent, the more likely one of these weaknesses will cause the agent to break.
Conclusion
So, you’ve heard my opinion, one that’s founded very solidly on the evidence. LLMs suffer from fundamental limitations that we don’t know how to solve. Those limitations don’t show up very often, but when they do, the models break. The results are completely unpredictable, and we don’t currently have a way to know when they break. And when we apply those LLMs to multi-step agents, they’re going to break a lot more often.
For these reasons, anyone deploying high-stakes AI agents (i.e., most implementations in the workplace) should proceed cautiously and carefully. Use human oversight. Apply checks and balances (using rules). Limit scope as much as possible. Ground the models in truth using high-quality RAG. Use LLM-as-judge (but carefully) and test extensively.
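Here is a rough sketch of how those checks might be layered around a single agent step: deterministic rules first, an LLM-as-judge second, and escalation to a human when either fails. The function names, checks, and stubs are illustrative assumptions, not a reference to any particular framework.

```python
# Illustrative layering of oversight around one agent step: rules, then an
# LLM-as-judge, then human escalation. All names and checks are invented.

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "yes"

def rule_checks(output: str) -> bool:
    """Deterministic sanity checks (required fields, value ranges, etc.)."""
    return bool(output) and len(output) < 5000  # illustrative rules only

def llm_judge(task: str, output: str) -> bool:
    """Ask a second model to grade the output; imperfect, so use carefully."""
    verdict = call_llm(
        f"Does this output correctly complete the task? Answer yes or no.\n"
        f"Task: {task}\nOutput: {output}"
    )
    return verdict.strip().lower().startswith("yes")

def run_step_with_oversight(task: str) -> str:
    output = call_llm(task)  # the agent's attempt at this step
    if not rule_checks(output) or not llm_judge(task, output):
        # Stop errors here instead of letting them propagate to later steps.
        return f"[flagged for human review] {task}"
    return output
```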
Closing Thought
We’re still very early in this journey to AI agents, so despite my (strong and hopefully well-reasoned) opinion, it’s important to stay humble. Remember Anthropic’s analysis above, which showed how LLMs will resort to blackmail? This may turn out to be the most important quote in the paper – they recognize that, despite everything we’re doing, all the money we’re investing, and all the analysis in place, we still don’t understand how these models work and cannot predict what they will do.
“We cannot be sure of any of these conclusions.”
– Anthropic, Agentic Misalignment: How LLMs Could Be Insider Threats


