Last week I was at the Gartner Data and Analytics Summit, and everybody was talking about agents built on top of RAG (retrieval-augmented generation). The market is getting the message that this is the best, and often the only, way to use LLMs on enterprise content, and that the quality of your RAG matters. A lot! This week I also saw a number of interesting evaluations of LLMs:
- Efficiency? Yes. From Novo Nordisk: [using a generative AI agent] has drastically brought down the time it takes to draft Clinical Study Reports, from around 15 weeks to less than 10 minutes. Such documents previously involved more than 50 writers but are now handled by just three human writers using Claude. Novo Nordisk spends less than one writer’s salary on Claude annually.
- Reasoning? Sometimes. Researchers created a simulation that requires an agent to operate a business that stocks a single vending machine. Given $500 to start and 2,000 turns, how much money can the agent make? The results in the table below are interesting, even if they are not statistically representative. A human achieved $844 in one attempt; the table shows each model’s average over 5 runs:

- Novelty? Nope. From research: “LLM-generated research ideas appear novel on the surface but are actually skillfully plagiarized in ways that make their originality difficult to verify.”
- Transparency? Nope. From Anthropic, about showing the “thinking” of Claude 3.7: “We don’t know for certain that what’s in the thought process truly represents what’s going on in the model’s mind. Thus far, our results suggest that models very often make decisions based on factors that they don’t explicitly discuss in their thinking process.”
- AGI? I don’t think so, but that’s a big topic. Sam Altman says “we know how to build it.” Ezra Klein just changed his mind and says it’ll be here in 1-2 years. I’m with Gary Marcus: we need a fundamentally different technology, or a hybrid approach (neurosymbolic computing), to get to AGI. But of course it depends on your definition; is it AGI if it knows how to write a program to get the answer, or does AGI have to get the right answer without calling a tool? As Gary Marcus points out, even the best reasoning models can’t do multiplication of very large numbers. Like, not at all:

Which means they can’t generalize (and/or they don’t understand the principles of multiplication, so there is nothing for them to generalize from).
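To make the “write a program to get the answer” distinction concrete, here is a minimal, purely illustrative Python sketch. The operands are arbitrary (not from any benchmark), and the `multiply_via_tool` helper is hypothetical, just a stand-in for an agent handing arithmetic off to a calculator tool:

```python
# Exact big-integer multiplication is trivial for ordinary code: Python
# integers have arbitrary precision, so the product is correct by
# construction. An LLM, by contrast, has to emit the digits of the answer
# as predicted text, which is where very large products fall apart.
a = 849_304_857_120_398_457
b = 731_209_845_612_093_847

exact = a * b
print(f"{a} x {b} = {exact}")

def multiply_via_tool(x: int, y: int) -> int:
    """Hypothetical stand-in for an agent delegating arithmetic to a calculator tool."""
    return x * y

# The question in the post: if a model calls something like this instead of
# "reasoning" out the digits itself, does that count as general intelligence,
# or just knowing when to delegate?
assert multiply_via_tool(a, b) == exact
```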

My take on why this matters, particularly for generative AI in the workplace
LLMs can do a lot, but they are not general-purpose machines and are certainly not on a trajectory to achieve AGI. LLMs are statistical models of language, not of anything in the real world. Since LLMs just predict the next word (yes, I’m simplifying, but the concept is still fundamentally true even with all of the supplemental techniques; see the sketch after the list below):
- we should not be surprised that they hallucinate
- hallucinations are a feature, not a bug; the randomness is what makes them useful and what makes conversations seem natural
- what is surprising is how often they don’t!
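Here is a toy sketch of that next-word step, assuming NumPy is available; the four-word vocabulary and the scores are invented for illustration. It is only meant to show where the statistics and the randomness come in, not how any real model is implemented:

```python
import numpy as np

# Toy "predict the next word" step: the model scores every candidate
# continuation, softmax turns the scores into probabilities, and one word
# is sampled. Vocabulary and scores here are made up.
vocab = ["70", "75", "80", "banana"]
logits = np.array([3.2, 1.1, 0.4, -4.0])  # scores for: "A cheetah's top speed is about ___ mph"

def sample_next_word(vocab, logits, temperature=1.0, seed=0):
    """Softmax over the scores, then draw one word at random."""
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(vocab, p=probs), probs

word, probs = sample_next_word(vocab, logits)
print(dict(zip(vocab, probs.round(3))), "->", word)
# Most of the probability mass sits on the statistically strong answer, so
# the model is usually right; but some mass always sits on wrong
# continuations, which is exactly where the occasional confident miss comes from.
```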
It’s incredible how much knowledge is contained in language itself!
The LLM doesn’t know the real world – it knows language. So the “knowledge” exhibited by an LLM is an artifact – essentially an illusion – that results from its mastery of language. Perhaps the biggest surprise about LLMs is that we never imagined how much knowledge was contained in language itself!
That knowledge manifests in LLMs when we ask them about facts that are statistically strong in the training data. Which is a lot of knowledge; that’s why ChatGPT can answer so many questions so accurately. It “knows,” because of the consistency of the training data, that the maximum speed of a cheetah is around 70 mph. It “knows” who is president. It “knows” history. It “knows” most things, and it knows almost all of them better than you do.
In this way it not only “knows” things better than most experts do, but for many things it’s more reliable than the experts, and certainly more accessible! People have imperfect memories and make mistakes, so it’s likely that ChatGPT can give you more accurate answers on most medical topics than your family doctor. And it’s cheaper, available 24/7, and more patient. And its “knowledge” is much broader than your family doctor’s; it doesn’t just “know” what your family doctor knows – it knows what she knows, plus what your cardiologist knows, plus what your oncologist knows, plus what your pulmonologist knows, plus what your dermatologist knows, plus what your chiropractor knows, plus your surgeon and your OB/GYN and your physical therapist and on and on…
LLMs won’t get us to AGI (artificial general intelligence).
So even though LLMs won’t get us to AGI, there are a vast number of applications where they are tremendously useful. They democratize knowledge in a massive way, even more than the Internet did – everyone now has instant, unlimited access to experts on any topic – and that alone is transformational. They are great writers, creating perfect prose from simple instructions. They are fantastic information processors (e.g., read and summarize, change format, change style/tone/voice), which makes them valuable tools. In many cases they are excellent for brainstorming, idea generation, and creativity (not in the same way that humans are creative, but in a way that is very helpful).
Even if the technology stopped advancing now, we have at least a decade of incorporating LLMs into our work.
There are so many applications for LLMs that we will be putting them to use for many years to come. Even if the technology stopped advancing now, we have at least a decade of incorporating LLMs into our work. And the technology shows no signs of slowing down!