We've spent the last year building Super RAG, a standalone ingestion and retrieval platform, and Ocho, the AI orchestration layer that consumes it. Along the way we hit every failure mode the research literature warns about — and a few it doesn't. These are the lessons we'd hand to any team building RAG for production.
1. Best-in-class RAG still gets 37% of questions wrong
Meta's CRAG benchmark tested 4,409 question-answer pairs across five domains. The results: frontier LLMs without RAG score below 34%. Adding straightforward RAG gets you to 44%. State-of-the-art industry RAG systems reach 63% — with a 17% hallucination rate. That 37-point gap between "best available" and "correct" is the entire engineering challenge of production RAG. If your system seems to work, it's probably because you haven't measured it on hard enough questions.
2. You can't retrieve what you didn't parse correctly
The most underinvested layer in any RAG system is parsing. A study across 100+ technical teams found that parsing quality is where most systems silently fail. Dense tables get garbled. Figures get dropped. Multi-column layouts collapse into nonsense. It doesn't matter how good your embedding model is — an MTEB-leading embedder can't fix mangled input. The fix isn't just better parsers; it's confidence-gated parsing — knowing when the parser failed and escalating to a vision-language model for the pages it can't handle.
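The control flow is simple to sketch. Here's a minimal, illustrative version of confidence gating: score each parsed page with a cheap text-quality heuristic and escalate only the failures. The `fast_parse` and `vlm_parse` callables are hypothetical stand-ins for real backends, and the heuristic and threshold are made up for illustration.

```python
# Sketch of confidence-gated parsing: score each parsed page with a cheap
# heuristic and escalate low-confidence pages to a (stubbed) VLM parser.
import re

def parse_confidence(text: str) -> float:
    """Crude quality score: penalize replacement chars and non-word noise."""
    if not text.strip():
        return 0.0
    garbage = text.count("\ufffd") + len(re.findall(r"[^\w\s.,;:()%/-]", text))
    return max(0.0, 1.0 - garbage / max(len(text), 1))

def gated_parse(pages, fast_parse, vlm_parse, threshold=0.85):
    out = []
    for page in pages:
        text = fast_parse(page)
        if parse_confidence(text) < threshold:
            text = vlm_parse(page)  # expensive path, only for failed pages
        out.append(text)
    return out
```

The point isn't the heuristic — a production signal might combine layout analysis, character statistics, and parser self-reporting — it's that the fallback decision is made per page, so you pay VLM costs only where the fast parser actually failed.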
3. One pipeline doesn't fit all corpora
Engineering standards, product docs, scanned field reports, and code repositories have nothing in common structurally. A pipeline optimized for one actively harms another. The Modular RAG framework formalized why this matters: production systems need independently replaceable sub-modules because the optimal configuration depends on the corpus. Our approach: a primitive registry where every operation (parser, chunker, enricher, embedder, extractor, reranker) is a composable building block, and each dataset gets its own strategy assembled from the catalog.
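A registry like this can be sketched in a few lines. The primitives below are toy placeholders (not Super RAG's actual catalog), but they show the shape: operations register under dotted names, and a per-dataset strategy is just an ordered list of names assembled into a pipeline.

```python
# Minimal sketch of a primitive registry: each pipeline stage is a named,
# composable callable; a dataset strategy is a list of primitive names.
REGISTRY = {}

def primitive(name):
    """Decorator that registers a pipeline stage under a dotted name."""
    def deco(fn):
        REGISTRY[name] = fn
        return fn
    return deco

@primitive("chunk.fixed")
def chunk_fixed(doc, size=20):
    return [doc[i:i + size] for i in range(0, len(doc), size)]

@primitive("enrich.upper")
def enrich_upper(chunks):
    return [c.upper() for c in chunks]

def build_pipeline(strategy):
    """Assemble the named primitives into a single callable."""
    steps = [REGISTRY[name] for name in strategy]
    def run(doc):
        out = doc
        for step in steps:
            out = step(out)
        return out
    return run
```

Swapping a chunker for one corpus then means editing a strategy list, not forking a pipeline.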
Deep dive: Why We Treat Retrieval as Infrastructure, Not a Feature
4. Hand-tuning pipelines doesn't scale — let AI configure AI
There are too many interacting decisions in a RAG pipeline — embedding model, chunk size, enrichment strategy, retrieval weights — for manual tuning to reliably find good configurations. Microsoft Research's AutoRAG-HP showed that automated RAG parameter selection achieves Recall@5 of ~0.8 using only 20% of the API calls a grid search requires. We take this further: an AI Strategy Agent analyzes a representative corpus sample and proposes a complete ingestion strategy — with written rationale for every model choice and the runner-up it considered. It runs once per dataset at $0.50–$2.00, not per query.
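To make the output shape concrete, here's an illustrative sketch of strategy selection from corpus statistics. A real Strategy Agent reasons over a corpus sample with an LLM; the heuristics and thresholds below are invented stand-ins that just show what the proposal (choice, rationale, runner-up) looks like.

```python
# Illustrative sketch of a strategy proposal built from corpus statistics.
# All primitive names and thresholds here are made up for illustration.
def propose_strategy(sample_docs):
    avg_len = sum(len(d) for d in sample_docs) / max(len(sample_docs), 1)
    table_density = sum(d.count("|") for d in sample_docs) / max(len(sample_docs), 1)
    if table_density > 5:
        chunker = "chunk.layout_aware"
        rationale = "high table density; layout-aware chunking preserves rows"
    else:
        chunker = "chunk.semantic"
        rationale = "mostly prose; semantic boundaries beat fixed windows"
    return {
        "chunker": chunker,
        "chunk_size": 512 if avg_len > 2000 else 256,
        "rationale": rationale,       # written justification for the choice
        "runner_up": "chunk.fixed",   # the alternative that was considered
    }
```

The rationale and runner-up fields matter as much as the choice: they make the configuration auditable when someone asks why a dataset is chunked the way it is.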
5. Contextual enrichment is the single biggest retrieval improvement most teams skip
Anthropic's contextual retrieval research is one of the most underappreciated findings in practical RAG. Adding a contextual prefix to each chunk — a short summary of where the chunk sits in the document — reduces retrieval failures by 49%. Combine that with BM25 hybrid search and reranking, and failures drop by 67%. This is a chunk-time enrichment, not a query-time trick. It's cheap to implement and compounds with everything else you do downstream.
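Mechanically, the enrichment is just a prefix applied before embedding. In Anthropic's setup the prefix is generated by an LLM that sees the whole document; the template version below is a simplified sketch that shows where the step sits in the pipeline, not how the prefix text is produced.

```python
# Sketch of chunk-time contextual enrichment: each chunk gets a short
# statement of where it sits in the document, prepended before embedding.
# A real system generates this prefix with an LLM over the full document.
def contextualize(doc_title: str, section: str, chunk: str) -> str:
    prefix = f"From '{doc_title}', section '{section}': "
    return prefix + chunk

def enrich_chunks(doc_title, sections):
    """sections: list of (section_name, [chunks]) pairs."""
    enriched = []
    for name, chunks in sections:
        enriched.extend(contextualize(doc_title, name, c) for c in chunks)
    return enriched
```

Because the prefix travels with the chunk into both the vector index and the BM25 index, it improves dense and lexical retrieval at once — which is part of why it compounds with hybrid search.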
6. Hybrid search compounds — it's not marginal
Vector-only search leaves a lot on the table. Weaviate's analysis of fusion algorithms documents a 20–30% accuracy boost from combining vector search with BM25 via reciprocal-rank fusion. Layer a reranker on top and you compound again. Add visual retrieval for figure-heavy corpora and you open a dimension that text search can't touch. These aren't incremental improvements — they're compounding multipliers on the same index.
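Reciprocal-rank fusion itself is a few lines: each document earns 1/(k + rank) from every result list it appears in, and the sums decide the final order. The k=60 default comes from the original RRF paper; this is a generic sketch, not Weaviate's implementation.

```python
# Reciprocal-rank fusion: merge ranked lists from vector search and BM25
# by summing 1/(k + rank) per list. Documents that rank well in *both*
# lists float to the top, which is the whole point of hybrid search.
def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:            # each list is ranked best-first
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "b" wins because it appears near the top of both lists,
# even though neither list ranks it first.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = rrf([vector_hits, bm25_hits])
```

Because RRF operates on ranks, not raw scores, it needs no score normalization between the dense and lexical retrievers — which is why it's the standard fusion choice.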
7. Single-shot retrieval systematically fails on the hardest questions
The questions your users care most about — comparisons, multi-step reasoning, cross-document synthesis — are exactly where standard RAG breaks down. The MultiHop-RAG benchmark was created because existing benchmarks were masking this: "existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries." Microsoft's GraphRAG showed that building a knowledge graph at ingestion time produces substantially more comprehensive answers on these query types. If you only do text-chunk retrieval, you have a ceiling — and your hardest queries are already hitting it.
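A toy example makes the failure mode visible: when the evidence chain is "A was tested in X, and X cites Y," no single chunk contains the A-to-Y link, but a graph built at ingestion can walk it. The triples below are illustrative; real GraphRAG extracts entities and relations with an LLM and adds community summarization on top.

```python
# Toy sketch of multi-hop traversal over relations captured at ingestion.
# Real GraphRAG builds this graph with LLM extraction; the triples here
# are hand-written to show why edges answer questions chunks can't.
from collections import defaultdict, deque

def build_graph(triples):
    graph = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
    return graph

def multi_hop(graph, start, max_hops=2):
    """Collect relation paths reachable within max_hops of start (BFS)."""
    found, frontier = [], deque([(start, [], 0)])
    while frontier:
        node, path, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for rel, tail in graph.get(node, []):
            step = path + [(node, rel, tail)]
            found.append(step)
            frontier.append((tail, step, depth + 1))
    return found
```

Single-shot chunk retrieval can only ever surface one link of that chain per query; the graph returns the whole path.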
8. When the answer is in a figure, text RAG is blind
An S-N fatigue curve, a financial trend chart, an architecture diagram — these carry information in their visual form that OCR can't extract. ColPali introduced a new approach: embed entire page rasters as multi-vector representations that capture spatial and visual information. The ViDoRe benchmark has tracked rapid progress — from ColPali's initial 81.3 nDCG@5 to current models exceeding 90. This isn't experimental anymore. If your corpus has meaningful figure density and your RAG system ignores it, you have a blind spot on every visual document.
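The scoring rule behind multi-vector page retrieval is late interaction (MaxSim): each query-token embedding is matched to its best page-patch embedding, and the maxima are summed. The sketch below uses plain lists in place of real model embeddings to show the mechanics only.

```python
# Sketch of ColPali-style late-interaction (MaxSim) scoring: every query
# token finds its best-matching page patch, and the maxima are summed.
# Vectors here are toy lists; a real system uses model embeddings.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, page_vecs):
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def rank_pages(query_vecs, pages):
    """pages: {page_id: [patch vectors]} -> page ids, best first."""
    return sorted(pages,
                  key=lambda pid: maxsim_score(query_vecs, pages[pid]),
                  reverse=True)
```

Because each query token matches patches independently, a query about "the knee of the S-N curve" can hit the exact region of the page raster where that feature appears, rather than averaging it away into one page vector.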
9. "It seems to work" is not a deployment strategy
Most teams ship RAG changes by testing a handful of queries and eyeballing the results. The RAGAS framework established that you can measure faithfulness, answer relevancy, context precision, and context recall without human annotators — validated at EACL 2024 against human judgments. Production teams are moving toward eval suites as CI/CD quality gates that block deployments below thresholds. Our approach: blue/green index deployment where the new index is built alongside the live one, automatically evaluated, and only promoted on pass. Rollback is a one-line database update.
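The promotion gate reduces to a threshold check and a pointer flip. Here's a minimal sketch, with the evaluator stubbed out and metric names borrowed from RAGAS; the thresholds are illustrative, not recommendations.

```python
# Sketch of an eval-gated blue/green promotion: the candidate index is
# evaluated, and a single pointer update promotes it only if every
# metric clears its floor. Thresholds below are illustrative.
THRESHOLDS = {"faithfulness": 0.90, "context_recall": 0.85}

def gate(metrics, thresholds=THRESHOLDS):
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in thresholds.items())

def promote(state, candidate_index, metrics):
    """state maps 'live' to an index name; promotion is one key update."""
    if gate(metrics):
        state["previous"] = state.get("live")  # kept for one-step rollback
        state["live"] = candidate_index
    return state
```

Because the live pointer is the only mutable piece, rollback is the same operation in reverse: restore `previous` to `live`, which is the "one-line database update" described above.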
Deep dive: Is Your RAG Good Enough? A Technical Framework
10. Not every query deserves the same retrieval effort
Running an expensive multi-step retrieval loop on "what's my company's address?" is wasteful. Running single-shot RAG on "compare the fatigue behavior of two alloys across cryogenic temperatures" leaves accuracy on the table. Adaptive-RAG (NAACL 2024) showed that routing queries by complexity is the highest-leverage optimization: a lightweight classifier determines whether a query needs no retrieval, single-shot RAG, or iterative multi-step retrieval. Simple questions stay fast. Hard questions get the resources they need. You don't pay the agentic tax on every request.
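The router's control flow can be sketched in a few lines. Adaptive-RAG trains a small classifier for this decision; the keyword heuristic below is a deliberately naive stand-in that only shows the three-tier dispatch.

```python
# Sketch of Adaptive-RAG-style routing: a cheap classifier assigns each
# query to one of three retrieval tiers. The keyword heuristic is a toy
# stand-in for a trained complexity classifier.
COMPARATIVE = ("compare", "versus", "vs", "difference between", "across")

def classify(query: str) -> str:
    q = query.lower()
    if any(tok in q for tok in COMPARATIVE):
        return "iterative"   # multi-step retrieval loop
    if len(q.split()) <= 4:
        return "none"        # answer from parametric knowledge
    return "single"          # one-shot retrieve-then-generate

def route(query, handlers):
    """handlers: {'none'|'single'|'iterative': callable(query)}"""
    return handlers[classify(query)](query)
```

The classifier's cost is what makes the economics work: a misroute wastes one cheap call, while routing everything through the agentic path wastes the expensive loop on every trivial question.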
11. Retrieval should be a tool, not a preprocessing step
The traditional RAG pattern stuffs retrieved chunks into the prompt before generation. The emerging pattern — validated by Anthropic's context engineering guidance and implemented in systems like Claude Code — flips this: retrieval becomes a tool the model calls on demand. The model reasons about what it needs, searches, observes the results, and searches again if needed. This is what the Agentic RAG survey calls the shift "from pre-inference retrieval to just-in-time context." It's more expensive per hard query, but cheaper overall because easy queries skip the heavy path entirely.
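The loop itself is small; the intelligence lives in the model. In this sketch, `decide` and `search` are hypothetical stand-ins for an LLM with tool access and a retriever, and the turn budget caps the cost of the agentic path.

```python
# Sketch of retrieval-as-a-tool: the model (stubbed as `decide`) chooses
# between issuing another search and answering, observing results between
# turns, instead of receiving pre-stuffed chunks.
def agent_loop(question, decide, search, max_turns=4):
    observations = []
    for _ in range(max_turns):
        # decide returns {"type": "search", "query": ...}
        #             or {"type": "answer", "text": ...}
        action = decide(question, observations)
        if action["type"] == "answer":
            return action["text"]
        observations.append(search(action["query"]))
    return f"retrieval budget exhausted after {max_turns} searches"
```

Contrast this with classic RAG, where `search(question)` runs exactly once before the model ever sees the question: here the model can notice a gap in its first results and reformulate, which is where the quality gains on hard queries come from.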
12. The ingestion side and the query side need intelligence independently
Most RAG discussions focus on query-time tricks — better prompts, smarter retrieval calls. But the biggest quality lever is often how the index was built. Contextual enrichment, graph extraction, visual embedding, VLM-fallback parsing — these are all ingestion-time decisions that determine what's even possible at query time. The inverse is also true: CRAG-style confidence scoring, Self-RAG-style self-critique, and HyDE query expansion are query-time improvements that can't fix a bad index but dramatically improve a good one. Production RAG needs agentic intelligence on both sides — in how you build the index and in how you query it.
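To make one of those query-time levers concrete, here's a minimal sketch of HyDE-style query expansion: embed an LLM-written hypothetical answer instead of the raw query, since a fake answer sits closer to real documents in embedding space than a question does. The `generate` and `embed` callables and the in-memory index are hypothetical stand-ins.

```python
# Sketch of HyDE query expansion: search with the embedding of a
# hypothetical answer rather than the raw question. `generate` and
# `embed` stand in for an LLM and an embedding model.
def hyde_search(query, generate, embed, index):
    """index: {doc_id: vector}; returns the best-matching doc_id."""
    hypothetical = generate(f"Write a short passage answering: {query}")
    qvec = embed(hypothetical)
    # nearest neighbour by dot product over a tiny in-memory index
    return max(index, key=lambda doc_id: sum(
        a * b for a, b in zip(qvec, index[doc_id])))
```

Note what this can and can't do, per the point above: it reshapes the query side of the match, but if the right chunk was never parsed, enriched, and indexed, no amount of query-time cleverness will surface it.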
The Thread That Ties These Together
Every lesson above points to the same conclusion: RAG is an infrastructure problem, not a feature. The gap between "works on easy questions" and "works on the questions that matter" is an engineering discipline — composable primitives, automated configuration, continuous evaluation, adaptive retrieval, and intelligence on both sides of the ingestion/query boundary.
That's what we built Super RAG and Ocho to solve. If any of these lessons resonated, we're happy to dig into the specifics on your use case.
Related Posts
- Super RAG: Why We Treat Retrieval as Infrastructure, Not a Feature — The full architecture walkthrough
- Is Your RAG Good Enough? A Technical Framework — Score your system across 10 dimensions
References
- Yang et al. — CRAG Benchmark (NeurIPS 2024)
- Gao et al. — Modular RAG (2024)
- Anthropic — Contextual Retrieval (2024)
- Anthropic — Effective Context Engineering for AI Agents
- Edge et al. — GraphRAG (Microsoft Research, 2024)
- Faysse et al. — ColPali (2024)
- Jeong et al. — Adaptive-RAG (NAACL 2024)
- Yan et al. — CRAG (2024)
- Asai et al. — Self-RAG (2023)
- Gao et al. — HyDE (2022)
- Tang & Yang — MultiHop-RAG (COLM 2024)
- Singh et al. — Agentic RAG Survey (2025)
- Es et al. — RAGAS (EACL 2024)
- Fu et al. — AutoRAG-HP (EMNLP 2024)
- kapa.ai — RAG Best Practices (2024)
- Dextralabs — Production RAG Eval & CI/CD (2025)
- MTEB Leaderboard
- ViDoRe Leaderboard
- Weaviate — Hybrid Search Fusion Algorithms
Ready to try it?
Map your first use case in 30 minutes.
A Fit Call is the whole commitment. No deck, no pitch — we map your stack and walk through a first automation you could ship.
Book a 30-min Fit Call