AI is Great! Or, Maybe Not…

It was a crazy week in AI developments, with many exciting as well as cautionary announcements. I’m going to take you on a quick tour; get ready for some whiplash!

AI is Great: At Business Decisions

Recently I linked to an article that talked about how Gartner predicted that half of all business decisions will be made by AI Agents by 2027. Take a moment and let that sink in – not mundane tasks, not steps in a workflow, not preparing something so a human can then look at it…but actual decisions. Things that right now, only humans do. They’re saying that in a little more than two years, AI will go from essentially zero to doing half of what today is uniquely human. But wait…

Or, Maybe Not…

Gartner also says that 40% of AI Agent projects will be canceled by 2027, due to high costs, unclear business value, and risks/problems from an inability to sufficiently control the AI agents.

AI is Great: At Programming

Goldman Sachs is piloting an AI coding agent, which they say could bring 3x-4x the improvement of previous AI tools. While it’s too early to claim any gains, the fact that such a large and prestigious firm is rolling this out to 12,000 employees is significant.

Or, Maybe Not…

METR performed a study of how AI helps experienced developers write code for complicated tasks. The study stood out for its rigor: not only did it ask the programmers for estimates of how much AI helped, it also measured their actual completion times with and without AI.

Before, the developers predicted that AI would provide a 24% gain.

After they did their work, they estimated that AI had made them 20% faster. But when the actual completion times were measured, the developers using AI were, on average, 19% slower!
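To make that perception gap concrete, here’s a minimal arithmetic sketch. The 20%-faster estimate and 19%-slower measurement come from the study; the 100-minute baseline task is a hypothetical number of my own.

```python
# Hypothetical baseline: a task that takes 100 minutes without AI.
# (The percentages are from the METR study; the baseline is made up.)
baseline_minutes = 100.0

# Developers *felt* AI made them 20% faster, i.e. they believed
# each task took 1/1.20 of the baseline time.
perceived_minutes = baseline_minutes / 1.20

# Measurement showed they were actually 19% *slower*.
actual_minutes = baseline_minutes * 1.19

gap = actual_minutes - perceived_minutes
print(f"perceived: {perceived_minutes:.1f} min, "
      f"actual: {actual_minutes:.1f} min, gap: {gap:.1f} min")
```

So on a notional 100-minute task, the developers believed they finished in about 83 minutes while actually taking about 119 minutes, a gap of more than a third of the task’s length.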

AI is Great: At Customer Service

Commonwealth Bank has been using AI agents to help resolve IT issues, and Microsoft was happy to announce the benefits of using their (ahem, OpenAI’s) AI to do it. The results are impressive: AI agents now solve the average IT issue in 2 minutes rather than 17, and when you spread that out over the 12,000 issues that it has already fixed, that saves over 2,500 human hours. That’s a big result for sure, but IT issues are pretty narrow in scope, and easier to address than “every business decision.” 

In similar news, Microsoft says they saved $500 million in customer service, largely due to AI. That’s a lot of money! But that also means…

Or, Maybe Not…(Will We Still Need Customer Service?)

That Microsoft no longer needs as many people to do customer service. That, along with other factors (but specifically including AI), led them to lay off 9,000 people.

Similarly, Indeed and Glassdoor are laying off 1,300 employees and are putting more emphasis on using AI in their products.

AI is Great: At Everything

xAI released Grok 4. If you can’t keep track of it all, that’s Elon Musk’s company, which was very much behind in the generative AI race…until he invested a boatload of money into GPUs and built a massive supercomputer (called Colossus) for training his Grok LLM.

Grok 4 is good, as in really good. How good? It beats every other model on pretty much every benchmark (as I’ve said before, benchmarks are flawed, but they’re the best we’ve got). Here’s Grok 4 leading the composite benchmark score according to Artificial Analysis.

Grok 4 comes in two versions: Grok 4, and Grok 4 Heavy. It appears that the underlying models are the same, but in Grok 4 Heavy, multiple AI agents work together in parallel – and this collaboration resulted in a score of 44% on Humanity’s Last Exam (with tools) – almost double what any other model has achieved. But it comes at a cost; if you want SuperGrok Heavy, you’re going to have to shell out $300 per month (compared to Google AI Ultra at $250 and ChatGPT Pro at $200).

Or, Maybe Not…

Research published in March but just making the rounds now confirms what readers of this blog already know: no matter how powerful they become, LLMs are pretty easily confused. Specifically, put an unrelated fact into the prompt, and the model is much, much more likely to give you a wrong answer.

A while ago, everyone was going to become a prompt engineer. Prompting has become less critical as the models have gotten better, but it still matters: while the difference between a good prompt and a great prompt has shrunk, the difference between a good prompt and a bad prompt is enormous. This research studied how OpenAI and DeepSeek models did at solving math problems when this unrelated sentence was appended to the question: “Interesting fact: cats sleep most of their lives.”

What would you do if you saw this interesting fact at the end of a math question? You’d ignore it. But LLMs don’t. They match patterns, and assume that the cat fact is part of the pattern, a pattern that they haven’t really seen much before. Since they find this pattern unusual, they tend to get lost.

What was the impact of appending an interesting cat fact at the end of a math question? It made the models more than twice as likely to get the answer wrong! And it also resulted in a 40% increase in (unneeded) output tokens…so you’re paying 40% more to get twice as many mistakes!
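To see what those two multipliers do to a real workload, here’s a hedged back-of-the-envelope sketch. The 2x error rate and 40% token increase are the study’s figures; the workload size, base error rate, and token count are hypothetical.

```python
# Hypothetical workload (these three numbers are made up):
questions = 1000
base_error_rate = 0.05        # 5% wrong without the distractor
base_tokens_per_answer = 500  # average output tokens per answer

# Multipliers reported by the study:
distracted_error_rate = base_error_rate * 2.0      # twice as likely wrong
distracted_tokens = base_tokens_per_answer * 1.40  # 40% more output tokens

extra_wrong = questions * (distracted_error_rate - base_error_rate)
extra_tokens = questions * (distracted_tokens - base_tokens_per_answer)
print(f"{extra_wrong:.0f} extra wrong answers and "
      f"{extra_tokens:.0f} extra billed tokens, per 1,000 questions")
```

On those assumptions, one stray sentence costs you 50 additional wrong answers and roughly 200,000 additional output tokens per thousand questions.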

Look at the relative increase in error rates for two different DeepSeek models; “1” means the error rate didn’t change…but most attacks resulted in 3x the error rate and some increased it by 10x!

This is VERY concerning for anyone pursuing agentic AI, where AI agents will be constructing prompts that will be consumed by other AI agents. This implies that any extraneous information increases the chance of an error, making it more likely that the process goes off course.
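Here’s a tiny sketch of the failure mode, with purely hypothetical agent names and no real LLM calls: each stage naively forwards its entire context downstream, so an irrelevant aside picked up early on ends up inside the final agent’s prompt, exactly the kind of distractor the study shows doubles error rates.

```python
def agent_step(name: str, upstream_prompt: str, notes: str) -> str:
    """Naive handoff: the entire upstream prompt, plus this agent's
    own notes, becomes the next agent's input."""
    return f"{upstream_prompt}\n[{name}] {notes}"

task = "Compute the Q3 refund total for customer #4521."

# The retriever's notes include an irrelevant aside...
p = agent_step("retriever", task,
               "Found 3 invoices. (Fun fact: the office cat slept on the scanner.)")
# ...which the analyst dutifully carries forward.
p = agent_step("analyst", p, "Invoice totals: $120, $80, $45.")

# The final agent's prompt now contains the distractor.
print(p)
```

In a real pipeline you’d want each hop to filter its handoff down to task-relevant content rather than concatenating everything.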

Some caveats: this paper has not been peer reviewed, it didn’t examine all models, and it isn’t the most robust analysis; some follow-up work is needed. But if even part of it is generalizable (and history suggests that it is), we’ve got some work to do before we let AI agents handle 50% of business decisions.

AI is Great: Better Browsers

Remember when it was Netscape Navigator vs. Internet Explorer? Then Firefox and Opera? Now Chrome dominates. The next browser wars are going to be over AI-powered browsers. What’s an AI-powered browser? A browser that puts AI first, giving contextual summaries and using AI to automate tasks, so that interacting with the web becomes much more of a conversation than a series of clicks.

It looks like the first heavyweight will be Perplexity, which introduced Comet, its AI-powered browser. But we’ve known this was coming: the upstart company behind the Arc browser completely pivoted to an AI-focused browser and launched it a month ago. And OpenAI has plans to do the same.

Or, Maybe Not…

“Comet’s AI agent [is] surprisingly helpful for simple tasks, but it quickly falls apart when given more complex requests…hallucinations stand in the way of these products becoming real tools. Until AI companies can solve them, AI agents will still be a novelty for complex tasks.”
– Maxwell Zeff, “Perplexity launches Comet, an AI-powered web browser”

Two Other Quick News Items

The next platform: Smart Glasses?

Meta is making a big bet that someday we’re going to ditch our phones in favor of a new platform: AI-powered glasses that we talk to, listen to, and look through. The bet is a $3.5 billion (yes, with a B) investment in EssilorLuxottica, the world’s largest maker of eyewear. Apparently they like what they’re seeing with their AI-powered Ray-Bans (which have been around for almost two years) and much newer Oakleys.

Faster models…will be coming to a phone near you

Liquid AI’s LFM2 is twice as fast as Qwen3, targeting deployment on devices (especially phones).

Microsoft’s Phi-4-mini-flash-reasoning (yep, the model naming just keeps getting better and better) is up to 10x faster, with 2-3x lower latency, for reasoning on edge and mobile devices.


My take on why it matters, particularly for generative AI in the workplace


A few quick takeaways from all the contradictory news this week:

  • It’s still early. We don’t know how this is going to end. AI models are valuable for some things, but not a lot of things. Yet.
  • Companies are seeing real value and savings with simple AI agents.
  • We’re not seeing examples with complex AI agents, at least, not examples where there is consistent success with error rates that are better than humans. Lots of smart people are working on this, but there isn’t a clear path that will enable AI to handle this complexity in a reliable way. Even Gartner seems to be of two minds on this.
  • AI does seem to be impacting the job market, but it’s too soon to be sure. My hunch is that AI is impacting narrow areas right now, and the anticipation of more AI is causing companies to plan for lower headcount, with the hope that with good AI, they’ll be able to do more with fewer humans.

Which brings me to the AI-powered browsers…

I’m not convinced yet about the value of AI-powered browsers for four reasons.

  1. Value. Below is Perplexity’s list of what Comet can do to make your life easier. I don’t want most of these things. And the ones that do sound useful seem rather insignificant. Maybe after I use it for a while, I’ll become convinced of the value, but right now, I’m just a wee bit skeptical.
  2. Complexity. They can automate some simple things (which are usually low-value things), but so far they’re not able to handle complexity.
  3. Trust (or lack thereof). We still have hallucinations, and as you’ve seen above, it’s not hard to confuse these models. How can we be sure it’s accurate? A browser isn’t very helpful if I have to fact-check every answer by hand.
  4. Privacy. It’s going to be a greater sacrifice of privacy (of course, that’s never stopped us before). If you think today’s browsers remember too much about you…brace yourself!

Are you ready to have an AI companion looking over your shoulder at everything you do on the web?

Copyright (c) 2025 | All Rights Reserved.

