AI Hype>AI Agents>Humans?

AI is beginning to permeate the workplace and bring significant gains. As it does, companies keep finding new uses for it, and the AI companies keep building products to serve them. We all know it’s changing jobs. What’s less clear is how many jobs might be lost – but the leaders of the model providers are predicting it will be a lot.

Box’s State of AI in the Enterprise 2025

Lots to take away from Box’s report on AI in the enterprise. This is an excellent snapshot of the current use of AI in the workplace and I highly recommend you check it out. There are many insights to be gained, and an opportunity to benchmark where your firm fits into this transformation.

Here are my key takeaways:

  • AI agents are hitting the workplace. Half of companies are still in the early stages of adoption…but that means 50% have more than one AI use case in production.
  • Not all AI deployments have easily measurable ROI, but of those that do, leading edge companies are reporting an average ROI of 37%. As I explain later, we can expect that number to increase.
  • Most deployments are measuring success using time savings, employee productivity, and cost reduction. About a third are using metrics like reduced errors and improved customer satisfaction.
  • Most deployments of AI agents are simple, but they’re growing in complexity. And companies reporting the highest ROI are using AI agents differently from the rest. They’re deploying more complex use cases, such as extracting metadata from documents, automating processes, and detecting cybersecurity threats. They’re not just using AI to write emails and generate code.
  • For the most common applications (the less complex ones), almost any generative AI model will do. But as companies seek higher ROI and move to more complex applications, choosing the right model becomes important.
  • When deploying AI agents, the top concern by far is maintaining data privacy and security.

BCG Agrees

BCG issued an Executive Perspective on AI, showing that 93% of executives anticipate significant cost reductions from AI. If you’re getting started in the journey, begin with low-hanging fruit in these four areas:

  • Content generation (code or text), based on your “codified knowledge”
  • Customer interaction (call centers and support)
  • Large supply bases (where small efficiencies can compound, such as frequent price negotiations)
  • Large groups (e.g., sales or maintenance teams)

Better to Help Write Code Or Review It?

OpenAI updated a product for developers called CodeRabbit. But this wasn’t your typical update where it got better on all the usual benchmarks. It got better at a different skill. Much of the interest and emphasis with the initial coding assistants was on generating code and on finding bugs when the developer already knows there’s a bug.

With the efficiency gains that programmers achieved with AI helping them, the bottleneck in creating software shifted from writing code to reviewing it. And that’s what OpenAI focused on in this release – it’s intended to be better at reviewing code to find problems such as bugs, vulnerabilities, and poor coding practices. Speeding up the review process with models such as this will further accelerate the pace of software development.
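To make that concrete, here’s a minimal sketch of the “AI as reviewer” idea: hand a diff to a general-purpose chat model and ask it to flag problems. To be clear, this is not the product described above, just an illustration of the pattern, and the model name and prompt are my own placeholders.

```python
# Hypothetical sketch of LLM-assisted code review (not the product above).
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def review_diff(diff_text: str, model: str = "gpt-4o") -> str:
    """Ask a chat model to flag bugs, vulnerabilities, and poor practices in a diff."""
    prompt = (
        "You are a careful code reviewer. Review the following diff and list "
        "likely bugs, security vulnerabilities, and poor coding practices, "
        "each with the relevant lines and a suggested fix.\n\n" + diff_text
    )
    response = client.chat.completions.create(
        model=model,  # placeholder; substitute whatever model you actually use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage: print(review_diff(open("change.patch").read()))
```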

Looming Massive Job Loss due to AI?

Dario Amodei, the CEO of Anthropic, predicts that AI could wipe out half of entry-level white-collar jobs and produce 10%-20% unemployment within the next one to five years.

“Most of them are unaware that this is about to happen. It sounds crazy, and people just don’t believe it.”
– Dario Amodei, CEO of Anthropic

There are some signs that he could be right (like the lack of entry-level jobs for new grads that I wrote about a few weeks ago). The counterpoint view is that this is just marketing for his company, which stands to benefit a lot if people believe these models will become this capable. There’s a lot of speculation in these predictions to unpack; I’m not going to do that right now but I might address it in a future post.

One Train of Thought…or more?

A really interesting paper came out this week claiming that a small change in how these models reason can yield significant improvements. I’ve talked a lot about how these models can scale (improve their abilities) using test-time compute (aka “reasoning” and “thinking”). In effect, when these models talk through the question before giving a final answer, they’re more accurate, similar to how humans give better answers when they spend more time thinking.

So now everyone has a so-called reasoning model that spends more time “thinking.” But it appears there’s a better way. In the paper Don’t Overthink It, researchers compared this approach – reasoning in a single long chain-of-thought – with an alternative: reasoning over multiple shorter chains-of-thought, each for less time, and picking the best one. The results were compelling: using multiple short chains is up to 34% more accurate while consuming up to 40% fewer tokens. It’s also faster, because the model doesn’t spend as much time “thinking.”

Why does this work? These models are not deterministic: ask the same question multiple times and you’ll get a different answer each time. This approach capitalizes on that. By seeking an answer multiple times (at once, in parallel), the model produces several slightly different answers, and it can then pick the best one (or even the best parts of multiple answers) as the final answer. Contrast that with the single long chain: once the model starts answering, it stays in the lane it started in. If that lane isn’t a good path to the right solution, it’s unlikely to get there no matter how much time it spends “thinking.” The model “primes” itself once it starts answering. Humans are susceptible to priming too, but once generative AI gets started, it has an even harder time changing direction.
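For the curious, here’s a minimal sketch of the parallel-chains idea, assuming an OpenAI-style chat API: sample several short, capped reasoning attempts at a non-zero temperature and take the most common final answer (majority voting, in the spirit of self-consistency). The paper’s exact selection method may differ; the model name, prompt format, and voting rule here are placeholders of my own.

```python
# Sketch: many short reasoning chains in parallel, then pick the most common answer.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def one_short_chain(question: str, max_tokens: int = 256) -> str:
    """Run one short, capped reasoning attempt that ends with 'Answer: <x>'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",    # placeholder model
        temperature=1.0,        # the non-determinism is exactly what we exploit
        max_tokens=max_tokens,  # keep each chain short
        messages=[{
            "role": "user",
            "content": f"Think briefly, then finish with 'Answer: <answer>'.\n\n{question}",
        }],
    )
    text = resp.choices[0].message.content
    return text.rsplit("Answer:", 1)[-1].strip() if "Answer:" in text else text.strip()

def answer_by_voting(question: str, n_chains: int = 5) -> str:
    """Sample n short chains in parallel and return the most frequent final answer."""
    with ThreadPoolExecutor(max_workers=n_chains) as pool:
        answers = list(pool.map(lambda _: one_short_chain(question), range(n_chains)))
    return Counter(answers).most_common(1)[0][0]
```

Majority voting is the simplest selection rule; a judge model that scores the candidate answers (or stitches together the best parts of several) is the other obvious option, at the cost of one more model call.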

DeepSeek v2: More Reasoning, More Censorship

DeepSeek released an update to their R1 reasoning model (the one that caused a $1 trillion panic in the stock market over fear that China was catching up with the US). They’re calling it…wait for it…R1-0528 (why? I know, because May 28 – but really?). It beats R1 on benchmarks and comes close to OpenAI’s o3 and Google’s Gemini 2.5 Pro. That’s significant because it’s an open model: you can download it and run it on your own servers, and you don’t have to pay per token or send your data to another company.

How did they make it better than R1? Not by better pretraining (they started with the same LLM as R1) but by using better algorithms for post-training and more reasoning – teaching the model to “think” longer. That means it uses more compute (about twice as much as R1), but it’s more accurate and hallucinates less.

It also shows greater censorship on topics the Chinese government is sensitive to, according to an analysis that xlr8harder posted to X. For instance, R1-0528 will deny the killing of students in Tiananmen Square in 1989. It will state that the forced removal of hundreds of thousands of Uyghur Muslims to detention camps is really just a government program to prevent terrorism. This is a horrible human travesty that has been going on for years and is underreported because of the CCP’s censorship and China’s global economic and political position. It takes a little digging, but there is ample evidence of massive forced labor (NYT’s companion article here) with a nice interactive explanation here. Unfortunately there doesn’t seem to be a way for people like you and me to help these people, except by prayer and political lobbying. Please do both.

Back to AI censorship. Here is a diagram (also from xlr8harder) showing how the DeepSeek models have become less willing to talk about such topics over time:


My take on why it matters, particularly for generative AI in the workplace


AI Adoption is Fast (but slower than the hype)

There is still a lot of hype, especially coming from the leaders in the space who are anxious to convince you that their product is the bomb. There are even some who say that it’s the bomb in a literal sense, in that it’s going to replace most of our jobs or wipe us out. So yeah, there’s as much hype as crypto and the metaverse had in their heyday. What’s very different is the pace of adoption: for such a fundamental technology, it is far faster. This technology practically didn’t exist two years ago, and roughly half of businesses already have more than one use case in production!

It’s Not About the Models Any More

Sure, companies will continue to release new models (like the new one from DeepSeek). But note the shifts:

  • Box’s report shows that simple use cases (write me an email) are widespread, but the big value for companies comes from greater sophistication, with AI agents. Here the focus is on what the agent can do, not how good the model is.
  • OpenAI didn’t just make their model better at generating code – they made it better at reviewing code. Once programmers became more efficient, the bottleneck shifted to a different part of the process. Writing code faster became less valuable than being able to review code more efficiently. As AI improves one area, there will be a greater need for improvement in an adjacent area. And the model companies will respond.

Yes, the models need to improve and will continue to do so. But the models are good enough for a vast array of applications, and companies are moving quickly to put them to use. This move is increasingly towards AI agents – AI that can make decisions on its own and use tools to expand the scope of what it can do. As that happens, the agents will take center stage and the models that enable them will recede into the background. As companies deploy hundreds or even thousands of AI agents, managing those agents – orchestration – becomes the new frontier for AI in business.

Censorship is just One Form of Bias

Finally, while the censorship in the DeepSeek model is not surprising, it is an indicator of a risk that is present with any of these models. They may be biased – intentionally or unintentionally. For work and business applications, this is not one of the top concerns, especially since most deployments in the workplace will use RAG (retrieval-augmented generation), which grounds the model in company info instead of the info on which the model was trained. In such applications, biases are less likely to influence the result.
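For readers who haven’t seen the pattern, here’s a toy sketch of RAG, assuming an OpenAI-style chat API: retrieve the company documents most relevant to a question and put them in the prompt, so the answer is grounded in that material rather than in whatever the model absorbed during training. The keyword-overlap retriever stands in for a real vector store, and the documents and model name are made up for illustration.

```python
# Toy RAG sketch: ground answers in company documents instead of training data.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Stand-in for a real document store (made-up content for illustration).
COMPANY_DOCS = {
    "expense_policy.txt": "Expenses over $500 require VP approval before purchase.",
    "onboarding.txt": "New hires receive laptops and building access on day one.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank docs by naive keyword overlap with the question (stand-in for vector search)."""
    words = set(question.lower().split())
    ranked = sorted(
        COMPANY_DOCS.values(),
        key=lambda text: len(words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def grounded_answer(question: str) -> str:
    """Answer a question using only the retrieved company context."""
    context = "\n\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}",
        }],
    )
    return resp.choices[0].message.content
```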

But when engaging with these models on public information (such as the Uyghur situation), biases make a big difference. Especially when you consider how many people can be influenced by these models, the tendency people have to place confidence in them, and the fact that they may be 6x more persuasive than humans, concerns about bias are very real. So far, we don’t have solutions; we’ll need to create them.


