OpenAI Answers DeepSeek!

OpenAI released Deep Research: yes, we’ve already got the NEXT “reasoning” model that spends time “thinking” before it responds to get better answers. Yet another model, another performance level reached. Except this isn’t really a new model, or at least not a new pre-trained model. They took a model they already had and instructed it not to answer right away, but to spend some time reasoning about the request first. And as we’ve seen, this approach of spending test-time compute significantly improves the quality and accuracy of the responses. This is almost certainly a response to DeepSeek R1 and all the buzz it created – OpenAI doesn’t like to be in 2nd place!
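To make the “think before answering” idea concrete, here is a minimal sketch of the prompting flavor of test-time compute using the OpenAI Python SDK. The model name and the wording of the reasoning instruction are placeholders of mine – this is an illustration of the general idea, not how Deep Research itself is built.

```python
# Minimal sketch of "spend time thinking before answering" via prompting.
# Assumptions: the OpenAI Python SDK is installed and OPENAI_API_KEY is set.
# The model name is a placeholder; this is NOT OpenAI's Deep Research recipe.
from openai import OpenAI

client = OpenAI()
question = "Why did our Q3 churn rate spike?"

# Baseline: ask for an immediate answer.
direct = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": question}],
)

# Test-time compute: instruct the model to reason step by step first,
# then give the final answer. Spending more output tokens on "thinking"
# generally improves quality on hard questions.
deliberate = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Before answering, think through the problem step by step "
                    "inside <thinking> tags, then give a concise final answer."},
        {"role": "user", "content": question},
    ],
)

print(direct.choices[0].message.content)
print(deliberate.choices[0].message.content)
```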

Speaking of R1, are distillation and post-training (in this case, supervised fine-tuning) really a path to super-cheap models? Apparently. Researchers at Stanford claimed to have distilled Gemini 2.0 Flash Thinking to create a model that rivals OpenAI’s o1 and DeepSeek’s R1… and they trained it for only $50!!!! The stock market didn’t move, so it’s either unaware or unconcerned. You may want to short NVIDIA (Disclaimer: I’m just making an observation, this is NOT INVESTMENT ADVICE!).
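For a sense of what that kind of distillation looks like mechanically, here is a minimal sketch: collect reasoning traces from a stronger “teacher” model, then run plain supervised fine-tuning on a small “student” model with Hugging Face transformers. The model name, data, and hyperparameters below are illustrative assumptions, not the Stanford recipe.

```python
# Minimal sketch of distillation-by-SFT: fine-tune a small "student" model on
# question -> reasoning-trace -> answer text produced by a stronger "teacher".
# Model name, data, and hyperparameters are illustrative, not the Stanford setup.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Teacher-generated traces (in practice, thousands of examples pulled via an API).
traces = [
    {"question": "What is 17 * 24?",
     "reasoning": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
     "answer": "408"},
]

def to_text(ex):
    # Fold each trace into a single training string the student learns to imitate.
    return {"text": f"Question: {ex['question']}\n"
                    f"Reasoning: {ex['reasoning']}\n"
                    f"Answer: {ex['answer']}"}

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for a small student model
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = (Dataset.from_list(traces)
           .map(to_text)
           .map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                remove_columns=["question", "reasoning", "answer", "text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student-sft", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # ordinary supervised fine-tuning on the teacher's traces
```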

Speaking of disclaimers, back to OpenAI’s Deep Research. Here is OpenAI’s disclaimer on hallucinations:

It can sometimes hallucinate facts in responses or make incorrect inferences…It may struggle with distinguishing authoritative information from rumors, and currently shows weakness in confidence calibration, often failing to convey uncertainty accurately.

This is the same old story: garbage in, garbage out. This is a really hard problem for LLMs, whether they’re working with information from the internet, from your company, or even from you (i.e., a document you upload).

Speaking of hallucinations – GenAI is still just pattern matching, and has no understanding of the real world. In (simplified) technical terms, GenAI is numerical computation, while rules are symbolic computation. In the “early but interesting” category, Amazon is using a symbolic technique it calls “automated reasoning” to reduce hallucinations – it requires customers to “set up a set of policies that serve as the absolute truth” in advance for the model to follow. This can’t always be done, so it’s limited, but it’s the first commercial effort I’ve seen to use symbolic methods to limit GenAI hallucinations.
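I don’t have visibility into Amazon’s implementation, but the general shape of “check the model’s output against policies declared as ground truth” can be sketched very simply. The policies and the regex check below are hypothetical toys of mine, not Amazon’s automated reasoning system, which uses formal methods rather than pattern matching.

```python
# Toy sketch of a symbolic policy check layered on top of a generative model.
# The policies and extraction logic are hypothetical; Amazon's "automated
# reasoning" checks are far more sophisticated (formal verification, not regex).
import re

# Policies the customer declares up front as ground truth.
POLICIES = {
    "max_refund_days": 30,        # refunds allowed only within 30 days
    "free_shipping_minimum": 50,  # free shipping only on orders of $50+
}

def violates_policies(llm_answer: str) -> list[str]:
    """Return a list of policy violations found in the model's answer."""
    violations = []
    # Example check: did the model promise a refund window longer than allowed?
    days = [int(d) for d in re.findall(r"refund.*?(\d+)\s*days", llm_answer, re.I)]
    if any(d > POLICIES["max_refund_days"] for d in days):
        violations.append("refund window exceeds policy")
    return violations

llm_answer = "Sure, you can return it for a full refund within 90 days!"
if violates_policies(llm_answer):
    print("Blocked: answer contradicts declared policy – regenerate or escalate.")
else:
    print(llm_answer)
```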


My take on why it matters, particularly for generative AI in the workplace


We can look at Deep Research (RAG on the internet) and get some insight into the equivalent in the enterprise (RAG on your internal content).

Deep Research searches the web to get the best information – but can “struggle with distinguishing authoritative information from rumors.” This is a particular problem for RAG on the internet, because the internet is full of rumors and misinformation, and expecting Deep Research to know the difference is asking a lot. On some topics it’s relatively easy to distinguish good sources of advice from bad (such as medical info – think mayoclinic.com vs. Instagram), but it’s a lot harder for news (that early-breaking information on X… is it accurate or misleading?).
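One blunt but common mitigation is to weight or filter retrieved web results by how much you trust their domains before the LLM ever sees them. The domains, trust scores, and results below are made-up examples, not anything Deep Research actually does.

```python
# Toy sketch: down-rank or drop web results from low-trust domains before
# handing them to the LLM. Domains, scores, and results are made-up examples.
from urllib.parse import urlparse

TRUST = {"mayoclinic.com": 0.95, "nih.gov": 0.9, "instagram.com": 0.3, "x.com": 0.3}

def rank_results(results, min_trust=0.5):
    """Keep results from sufficiently trusted domains, most trusted first."""
    scored = [(TRUST.get(urlparse(r["url"]).netloc, 0.2), r) for r in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for score, r in scored if score >= min_trust]

results = [
    {"url": "https://x.com/someuser/status/123", "snippet": "Breaking: miracle cure!"},
    {"url": "https://mayoclinic.com/symptoms", "snippet": "Clinically reviewed guidance."},
]
print(rank_results(results))  # only the trusted source survives
```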

The bigger challenge with using LLMs in the enterprise is the first part – “it can sometimes hallucinate facts…or make incorrect inferences.” This points to the same story you’ve heard from me before: the R is the most important – and the hardest – part of RAG. Getting retrieval right is the best way to minimize hallucinations because it ensures that the LLM starts from the right information. For the LLM not to hallucinate, that information has to be accurate, relevant, and focused. The only way to get that is with good RAG.
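To ground that claim, here is a minimal retrieval sketch using sentence-transformers and cosine similarity: embed the internal content, keep only the chunks that are actually relevant to the question, and build the prompt from those alone. The model name, documents, and similarity threshold are assumptions for illustration; a production system would add chunking, metadata filters, and reranking.

```python
# Minimal RAG retrieval sketch: embed internal documents, retrieve only the
# chunks most relevant to the question, and build the prompt from those alone.
# Model name, documents, and the similarity cutoff are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Expense reports must be submitted within 30 days of the purchase date.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Travel booked through the portal is reimbursed at 100%.",
]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

question = "How long do I have to file an expense report?"
query_embedding = embedder.encode(question, convert_to_tensor=True)

# Keep only chunks that are genuinely relevant: accurate, relevant, focused.
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
relevant = [c for c, s in zip(chunks, scores) if s > 0.4]  # threshold is a guess

prompt = ("Answer using ONLY the context below. If the context doesn't cover it, say so.\n\n"
          "Context:\n" + "\n".join(relevant) + f"\n\nQuestion: {question}")
print(prompt)  # this prompt then goes to the LLM
```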


