Thinking, Fast and Slow

Perhaps my favorite book is Thinking, Fast and Slow, by Daniel Kahneman. If you haven’t read it, you should. It posits that humans have two modes of thought: a fast mode that is instinctual and emotional (e.g., judging which of two objects is closer), and a slow mode that is deliberate and logical (e.g., solving a multiplication problem).

One of the best examples of this is learning to drive. At first, it requires a lot of focus and intense concentration (the slow mode). But over time, you gain skill, learn to recognize patterns, and build so-called “muscle memory” to the point where you’re not really thinking much about driving anymore (the fast mode). But as soon as something unusual happens, like a construction zone with a weird traffic pattern, you immediately switch back to slow mode and focus all of your attention on driving (those of you who drove before GPS was a thing can remember turning down the radio when you were struggling with written directions).

It turns out that those concepts aren’t only applicable to humans; they also summarize where we are with LLMs.

You already know that OpenAI has a regular model that responds right away, GPT-4o (fast), and a model that “thinks” through the request and its answer before responding, o1 (slow); this thinking step is called “reasoning.” This corresponds to the two axes of scaling LLMs: you can scale training (more data and compute, creating a larger/better model) for fast outputs, or you can scale reasoning (test-time compute, i.e., extra compute spent at inference, to get better responses from an existing model) for more methodical, slow outputs.
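To make the distinction concrete, here’s a minimal sketch using OpenAI’s Python SDK. The prompt is Kahneman’s famous bat-and-ball puzzle; swap in whatever model names your account actually has access to:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
          "more than the ball. How much does the ball cost?")

# Fast: a pre-trained chat model answers right away.
fast = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# Slow: a reasoning model spends hidden "thinking" tokens before answering.
slow = client.chat.completions.create(
    model="o1",
    messages=[{"role": "user", "content": prompt}],
)

print("fast:", fast.choices[0].message.content)
print("slow:", slow.choices[0].message.content)
```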

  • On Friday, Figure AI, a robotics startup that is building humanoid robots, released an impressive video (the state of the art for robotics is moving quickly by applying LLM-like tech) in which the robots use a novel “vision-language-action” model. But the system they’re calling Helix is actually TWO models: a fast model (80M parameters) and a slow model (7B parameters), and they did this with only 1/20th of the training data they needed before. So, thinking fast and slow. Details here, but if you just want to watch the cool (and maybe creepy?) robot video, it’s here.
  • On Monday, Anthropic released Claude 3.7 Sonnet, the “first hybrid reasoning model…that can produce near-instant responses or extended, step-by-step thinking.” This is actually a single model behind the scenes, but depending on your need it can respond right away (fast) or “think” by going through a reasoning process before responding (slow); a sketch of what that toggle looks like in code follows this list.
  • A few weeks ago, DeepSeek made waves with their reasoning model (slow). Then xAI pushed the frontier with Grok 3 (fast), probably the largest model yet.
  • On Thursday, OpenAI released GPT-4.5. This is a big model, scaled through pre-training, so like Grok 3 it’s a fast model. But OpenAI has already told us that GPT-5 is not going to be a new, bigger model. Instead, it’s going to be a composite, something that “knows” when the model can respond right away (fast, using GPT-4.5 or similar) and when it needs to spend some time reasoning before it can answer (slow, using o1 or, more likely, a yet-to-be-announced o4).
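Here’s roughly what that hybrid fast/slow toggle on Claude 3.7 Sonnet looks like with Anthropic’s Python SDK; the model name and token budgets are illustrative, so check the current docs before relying on them:

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
prompt = "How many prime numbers are there between 100 and 150?"

# Fast: the default mode returns a near-instant response.
fast = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)

# Slow: the same model, with extended thinking enabled and a token budget
# for the step-by-step reasoning it performs before answering.
slow = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,  # must exceed the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": prompt}],
)
```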

Wow, what a week!

And one final piece that doesn’t have anything to do with fast or slow, but is very relevant to the world of RAG for enterprises. MongoDB, a database company, realized that in order to use LLMs on corporate data, RAG is essential, specifically to reduce hallucinations (as I’ve said before, RAG is a technique that isn’t going away; it’s hands-down the best way to use LLMs in the enterprise, and will be until someone discovers a new technology that isn’t an LLM). They also realized that for good RAG, a reranker is essential. So they acquired a reranker company, Voyage AI. Here’s the quote from their announcement:

The answer that many organizations have had to overcome [hallucinations] is retrieval-augmented generation (RAG). With RAG, results are grounded in data from a database. As it turns out, though, not all RAG is the same, and actually optimizing a database for the best possible results can be challenging. Creating highly accurate RAG is quite complex, and there is still a potential risk for hallucinations — a challenge faced by MongoDB and its users…Improving accuracy and reducing hallucination involves multiple steps. The first is to improve the quality of retrieval (the ‘R’ in RAG).
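To make that “R” concrete, here’s a minimal sketch of a retrieve-then-rerank pipeline. The embed and rerank calls use Voyage AI’s published Python client, but the toy corpus, model names, and wiring are illustrative, not MongoDB’s actual implementation:

```python
# pip install voyageai numpy
import numpy as np
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# A toy corpus standing in for your enterprise documents.
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The 2024 employee handbook covers remote-work eligibility.",
    "Quarterly revenue grew 12% year over year in Q3.",
]
query = "Can I return a product after three weeks?"

# Step 1 -- Retrieval: embed everything and take the nearest documents.
doc_vecs = np.array(vo.embed(docs, model="voyage-3", input_type="document").embeddings)
q_vec = np.array(vo.embed([query], model="voyage-3", input_type="query").embeddings[0])
scores = doc_vecs @ q_vec  # Voyage embeddings are normalized, so dot product ~ cosine
candidates = [docs[i] for i in np.argsort(scores)[::-1][:2]]

# Step 2 -- Reranking: a reranker re-scores each query/document pair directly,
# which is more accurate than embedding similarity alone.
reranked = vo.rerank(query, candidates, model="rerank-2", top_k=1)
best = reranked.results[0].document

# Step 3 -- Generation: feed only the best-ranked context to the LLM,
# grounding its answer in your data and reducing hallucinations.
prompt = f"Answer using only this context:\n{best}\n\nQuestion: {query}"
```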


My take on why it matters, particularly for generative AI in the workplace


  • For a long time, bigger models (bigger = more parameters) and more pre-training (more compute to build the model) gave big improvements
  • But the gains are diminishing; it’s taking a LOT more compute to get smaller gains, and this may not be a viable path to more capable models for much longer
  • The idea of reasoning-at-inference (aka test-time compute) provides a new way to improve performance. The gains here are surprisingly significant: just have the model “think” (in kind of a stream-of-consciousness way) before responding, and it improves (see the sketch after this list)
  • Any model can benefit from this technique; it just takes more compute and more time to produce an answer
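One simple, well-known flavor of test-time compute is self-consistency: sample several chain-of-thought answers and take a majority vote. A minimal sketch, assuming the OpenAI SDK; the question is made up, and extract_answer is a hypothetical helper you’d adapt to your task:

```python
# pip install openai
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()
question = "What is 17 * 24?"

def extract_answer(text: str) -> str:
    """Hypothetical helper: pull the final number out of a response."""
    nums = re.findall(r"-?\d+", text)
    return nums[-1] if nums else ""

# Spend extra inference compute: sample several independent reasoning paths...
samples = [
    client.chat.completions.create(
        model="gpt-4o",
        temperature=1.0,  # diversity between samples is the point
        messages=[{"role": "user", "content": f"Think step by step. {question}"}],
    ).choices[0].message.content
    for _ in range(5)
]

# ...then majority-vote over the final answers. More samples (more compute,
# more time) generally yield a more reliable answer: the "slow" axis.
answers = [extract_answer(s) for s in samples]
print(Counter(answers).most_common(1)[0][0])
```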

So we now have two ways of scaling performance: at pre-training (fast) and at inference (slow). That suggests these models behave surprisingly similarly to how humans think. But more importantly for the industry, it means there is still a lot of room for these models to improve!


