Lots of news on the agent front this week:
OpenAI released an agent framework designed to make it easier to build agents with their models. Importantly, they’re introducing a new API called the Responses API, which is designed to replace both the existing Chat Completions API (conversations) and their Assistants API (tool use). The Responses API handles chats and calls user-defined tools, and adds three built-in OpenAI tools: web search, file search, and computer use. Most relevant to everyone in the enterprise RAG space is the file search capability (although it’s no surprise that OpenAI wants to be powering RAG in enterprises; they need to increase those revenue streams!). “With built-in query optimization and reranking [the file search tool supports] a powerful RAG pipeline without extra tuning or configuration.” Pricing is $2.50 per thousand queries, and storage (the files must be stored on OpenAI) costs $0.10/GB/day. If you’re even slightly interested, it’s worth reading their announcement.
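For the curious, here is roughly what a Responses API call looks like with the built-in tools turned on. This is a minimal sketch based on the announcement examples; the tool type strings and the vector store ID are placeholders, so check the current docs before copying:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The vector store ID below is a placeholder: for file search to work,
# files must first be uploaded to OpenAI and indexed into a vector store.
response = client.responses.create(
    model="gpt-4o",
    input="What does our travel policy say about per diem rates?",
    tools=[
        {"type": "web_search_preview"},
        {"type": "file_search", "vector_store_ids": ["vs_XXXX"]},
    ],
)

# The SDK exposes the assembled text output as a convenience property.
print(response.output_text)
```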
Manus AI announced (and released to a limited audience) a “general AI agent” called Manus, built on Claude 3.7 (and possibly other models). It’s the first public attempt at a general-purpose agent: one that is supposed to figure out what needs to be done to accomplish your request, and then go do it. Clearly it can do some things, but early reports show it has a long way to go, so at this point it’s probably best seen as a preview and a sign of things to come.
A consortium of companies led by Cisco (including LangChain and Galileo) has been formed, called AGNTCY, to define an infrastructure for agents. They envision a future with hundreds or thousands of agents that need to interact with one another, and they want to define standards so that a developer can discover the right agents to use, compose them into a workflow, and then deploy and evaluate that workflow. Their argument seems sound – we need this so agents from different vendors can work together, the same way the Internet needs HTTP and DNS. I say maybe. Agents might be able to do all of these things themselves, simply by communicating in English…or even in their own language.
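To make the discover-then-compose idea concrete, here is a purely hypothetical sketch; this is my own illustration, not anything AGNTCY has published, and every name in it is invented:

```python
# Hypothetical sketch of a cross-vendor agent registry: agents advertise
# capabilities, a developer discovers them by capability, then composes
# a workflow. All names are illustrative, not a real AGNTCY spec.
from dataclasses import dataclass

@dataclass
class AgentCard:
    name: str
    vendor: str
    capabilities: list[str]
    endpoint: str  # where the agent is reachable

REGISTRY = [
    AgentCard("invoice-reader", "VendorA", ["extract-fields"], "https://a.example/agent"),
    AgentCard("approver", "VendorB", ["policy-check"], "https://b.example/agent"),
]

def discover(capability: str) -> list[AgentCard]:
    """Find registered agents that advertise a given capability."""
    return [a for a in REGISTRY if capability in a.capabilities]

# Compose a two-step workflow from whatever the registry returns.
workflow = [discover("extract-fields")[0], discover("policy-check")[0]]
print(" -> ".join(agent.name for agent in workflow))
```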
And not an agent, but yet another, more capable small model: Google released the latest version of its small open-weight model, Gemma 3, “the most powerful model that can be run on a single GPU.” They say you can even run it in your browser. This is more of the same trend – better, more efficient models will continue to advance the frontier and drive costs down, and this is one more step toward models that run on-device instead of requiring the cloud.
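If you want to poke at the trend yourself, a small open-weight model like this is only a few lines away. Here is a minimal sketch using Hugging Face transformers; the checkpoint name is my assumption (check the model hub for the current ID, and note that you must accept Google’s license first):

```python
from transformers import pipeline

# The checkpoint name is an assumption -- verify it on the Hugging Face hub.
generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    device_map="auto",  # fits on a single GPU per Google's claim; CPU works too, slowly
)

result = generator(
    "Explain retrieval-augmented generation in one sentence.",
    max_new_tokens=60,
)
print(result[0]["generated_text"])
```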

My take on why it matters, particularly for generative AI in the workplace
I’ll make a personal prediction about agents. As LLMs are getting used in production, some of the hype is wearing off and the concern about hallucinations is growing. The hallucination problem is bigger than anyone appreciates, especially for agents. Think about it for a moment…what accuracy rate would you require before giving an agent responsibility for actual work? Got a number in your mind? Keep it in your mind…
OpenAI designed a benchmark called SimpleQA to measure exactly this kind of factual accuracy:
- GPT-4o alone is 38% accurate.
- GPT-4o with search is 90% accurate.
Was your number higher than 90%? I expect so! We’ll need error rates well below 1 in 10! For agents, that means low error rates on multi-turn, complex problems, which is much more difficult than this evaluation of simple, one-turn Q&A!
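Here’s a quick back-of-the-envelope calculation of why: per-step errors compound. If each step of an agent’s task succeeds with probability p, and the task takes n roughly independent steps (a generous assumption, since real agent steps aren’t independent), whole-task success is about p^n:

```python
# If each step succeeds with probability p and a task takes n roughly
# independent steps, the whole task succeeds ~p**n of the time.
# Even 90% per-step accuracy collapses quickly as tasks get longer.
for p in (0.90, 0.99):
    for n in (5, 10, 20):
        print(f"per-step {p:.0%}, {n:2d} steps -> task success ~{p**n:.0%}")
```

At 90% per step, a 10-step task succeeds only about a third of the time; even 99% per step gives roughly 82% success over 20 steps.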
Now, as we should ask with all benchmarks, is it a good one? It’s a hard benchmark in that they selected questions that often cause hallucinations. But it’s an easy benchmark in that it’s just find-the-answer-to-this-question, and the answers exist on the Internet. So imagine how much more difficult it will be for open-ended, broad tasks instead of “find and tell me about this fact from this dataset.”