If you’re in the AI space, it’s easy to be overwhelmed by the news, developments, and new model announcements that come out every day. I spend a lot of time keeping up with the latest information, particularly as it relates to using LLMs in the enterprise. The best way to use LLMs in the enterprise is to combine generative AI with search (retrieval) in a pattern called RAG (retrieval-augmented generation). If you’re interested in using LLMs on your internal business content, or if you’re experimenting with your own RAG pilots, this blog should help you keep up with what’s going on without investing too much of your own time. So, welcome to the blog, I hope it helps, and please feel free to provide comments and feedback!
As we start 2025, the market is talking about ChatGPT o3 (which blows away all other models in terms of reasoning ability on hard problems, but at very high cost). OpenAI has promised o3-mini in January, but don’t expect GPT-5 anytime soon (there are rumors that, because of plateauing, we’ll only see a GPT-4.5 in 2025). And if you’ve wanted to use Grok but aren’t on X, you’re in luck…X.ai just released a phone app so you can use Grok without an X account. But what I found most significant this week was a blog post from Microsoft about RAG that confirms what we have seen and learned, and what the market is slowly starting to realize:
- despite the many “solutions” that rely solely on vector search (because it’s easier), vector-only retrieval performs poorly…especially at any real scale
- for good RAG, you need good search, which means hybrid search: keyword + vector
- for good RAG, you need good results blending: most companies use Reciprocal Rank Fusion because it’s cheap and easy, and it’s OK, but a semantic reranker is much better, especially a custom-trained, fine-tuned deep neural network: a Small Language Model (SLM). (A minimal RRF sketch follows this list.)
- for good RAG, use small, sentence-aware, overlapping chunks (see the chunking sketch below)
- for good RAG, incorporate your company vernacular (acronyms and terms with specialized meanings) into the search pipeline (see the glossary-expansion sketch below)
- but the biggest gain in RAG, more than ALL of the above points, comes from using an LLM to rewrite the request into multiple queries, sending those queries in parallel, and combining their results (sketched below). This delivers a MASSIVE improvement in recall, which, paired with a semantic reranker to provide the highest precision, gives the LLM much better information and produces much better responses.
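To make the hybrid-search and blending points concrete, here is a minimal sketch of Reciprocal Rank Fusion in Python. The ranked lists are hypothetical placeholders: in a real pipeline one would come from a keyword engine (e.g. BM25) and one from a vector index, and a semantic reranker would still run on the fused output. The k=60 constant is the commonly used default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Blend several ranked result lists, scoring each doc by the sum of 1/(k + rank)."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)  # best fused score first

keyword_hits = ["doc_7", "doc_2", "doc_9"]   # hypothetical BM25 results
vector_hits = ["doc_2", "doc_5", "doc_7"]    # hypothetical embedding-similarity results
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['doc_2', 'doc_7', 'doc_5', 'doc_9'] -- documents found by both retrievers rise to the top
```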
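For chunking, a small sentence-aware splitter might look like the sketch below. The sentence-boundary regex, window size, and overlap are assumptions; tune them to your own content.

```python
import re

def sentence_chunks(text, window=5, overlap=1):
    """Split on sentence boundaries and emit overlapping windows of whole sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    step = window - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        piece = sentences[start:start + window]
        if piece:
            chunks.append(" ".join(piece))
        if start + window >= len(sentences):
            break
    return chunks
```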
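For vernacular, one simple approach (an assumption on my part, not the only way to do it) is a glossary lookup that expands acronyms and special terms in the query before it hits the search engine. The glossary entries here are made up for illustration.

```python
import re

# Hypothetical glossary -- replace with your organization's acronyms and special terms.
GLOSSARY = {
    "pto": "paid time off",
    "fde": "field deployment engineer",
}

def expand_vernacular(query):
    """Append glossary expansions so keyword search matches both the acronym and its meaning."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    expansions = [GLOSSARY[t] for t in tokens if t in GLOSSARY]
    return f"{query} ({'; '.join(expansions)})" if expansions else query

print(expand_vernacular("How much PTO does an FDE accrue?"))
# How much PTO does an FDE accrue? (paid time off; field deployment engineer)
```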
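And for the multi-query step, the overall shape is roughly the sketch below. rewrite_queries and hybrid_search are stand-ins for your own LLM prompt and search backend (not real library calls), and the fusion step reuses the reciprocal_rank_fusion function from the RRF sketch above.

```python
from concurrent.futures import ThreadPoolExecutor

def rewrite_queries(user_request):
    """Stand-in: prompt an LLM to rewrite the request into several focused search queries."""
    return [user_request]  # placeholder -- a real implementation returns several rewrites

def hybrid_search(query):
    """Stand-in: run the keyword + vector hybrid search and return a ranked list of doc ids."""
    return []  # placeholder

def multi_query_retrieve(user_request):
    queries = rewrite_queries(user_request)
    with ThreadPoolExecutor() as pool:  # fan the rewritten queries out in parallel
        ranked_lists = list(pool.map(hybrid_search, queries))
    # Fuse the per-query result lists (reciprocal_rank_fusion from the sketch above),
    # then pass the top fused hits through the semantic reranker before prompting the LLM.
    return reciprocal_rank_fusion(ranked_lists)
```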
In summary: to successfully use an Assistant or Agent (built on RAG) in an enterprise, you need multiple hybrid search queries (to retrieve the broadest results) and a semantic reranker (to isolate the best results) to send to the LLM. Don’t waste your time changing chunking strategies or fine-tuning an LLM or using a different embedder – or even worrying about which LLM will give you the best results. Get a multi-query hybrid search RAG system set up first – that’s your real bang for the buck – and then worry about making incremental improvements by experimenting with the other things later.