Is Your RAG Good Enough? A Technical Framework for Evaluating Your Retrieval Pipeline
You shipped a RAG system. It answers questions. Stakeholders are happy. But you have a nagging feeling: you tested it on a handful of queries you wrote yourself, eyeballed the results, and called it good.
That approach has a shelf life. Meta's CRAG benchmark found that even state-of-the-art industry RAG systems answer only 63% of questions correctly. The gap between "works in the demo" and "works in production" is where teams burn months of engineering time — or quietly accept wrong answers they never measure.
This post is a diagnostic framework. Ten dimensions, three maturity levels each. Score your current system honestly, and you'll know exactly where it's solid and where it's exposed.
How to Use This Framework
For each dimension, identify which level describes your current system. Level 1 is where most teams start. Level 2 is where production-grade systems live. Level 3 is where purpose-built RAG infrastructure operates. No system needs Level 3 on every dimension — the right investment depends on your data and your accuracy requirements.
Be honest. The point isn't to score high. It's to find the gaps before your users do.
1. Document Understanding
How well does your system handle the documents you actually have — not just clean text?
Level · Description
L1 — PDF-to-text extraction, maybe basic OCR. Tables come through garbled. Figures are ignored.
L2 — Layout-aware parsing (e.g., Docling, Textract) that preserves table structure and extracts figure captions. Handles HTML, Markdown, and PDF.
L3 — Primary parser with a VLM fallback for pages it struggles with. Confidence-gated: the system knows when it failed and escalates to a vision-language model for re-parsing. Extraction quality is measured, not assumed.
The test: Take your most complex document — the one with merged table cells, embedded figures, and multi-column layouts. Parse it. Read the chunks. Is the information intact, or did your parser silently mangle the data your users will ask about?
Why it matters: A production RAG study across 100+ technical teams found that parsing quality is the most underinvested layer. You can't retrieve what you didn't parse correctly.
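The Level 3 confidence gate can be sketched in a few lines. This is a minimal illustration, not a production heuristic: the quality score here is a crude character-level proxy, real systems would add richer signals (table cell counts, layout-model confidence), and the 0.95 threshold is an assumption for illustration.

```python
# Sketch: confidence-gated parsing. Pages whose extracted text looks
# garbled get escalated to a VLM re-parse instead of silently indexed.

def extraction_confidence(text: str) -> float:
    """Fraction of characters that are not replacement/control garbage."""
    if not text:
        return 0.0
    bad = sum(1 for c in text
              if c == "\ufffd" or (not c.isprintable() and c not in "\n\t"))
    return 1.0 - bad / len(text)

def parse_decision(parser_output: str, threshold: float = 0.95) -> str:
    """Return 'accept' to keep the parser output, 'escalate' to re-parse with a VLM."""
    return "accept" if extraction_confidence(parser_output) >= threshold else "escalate"
```

The point is the shape of the gate, not the scoring function: every page gets a measured quality score, and the system routes on it rather than assuming extraction succeeded.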
2. Chunking Strategy
Does your chunking respect the structure of your documents?
Level · Description
L1 — Fixed-size chunks (e.g., 500 tokens with overlap). No awareness of document structure.
L2 — Section-aware or semantic chunking. Tables and figure-caption pairs are kept intact. Headings and hierarchy inform chunk boundaries.
L3 — Per-corpus chunking strategy selected based on document characteristics. Parent-document retrieval for context expansion. Different corpora get different chunking — engineering standards vs. product docs vs. code.
The test: Find a chunk in your index that contains half a table. If your system produces chunks like this routinely, your retrieval ceiling is set by your chunker, not your embedder.
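A sketch of the Level 2 idea, assuming Markdown input where paragraphs and tables are separated by blank lines: chunks break at headings or at a size limit, but a table block is never split, even if it pushes a chunk over the limit.

```python
# Sketch: structure-aware chunking. Headings start new chunks; a table
# block stays whole rather than being cut mid-row.

def chunk_markdown(text: str, max_chars: int = 500) -> list[str]:
    chunks, current, size = [], [], 0
    for block in text.split("\n\n"):          # paragraph/table/heading blocks
        is_heading = block.startswith("#")
        if current and (is_heading or size + len(block) > max_chars):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)                 # an oversized table stays intact
        size += len(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```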
3. Embedding Model Selection
Did you choose your embedding model, or did you just use whatever the tutorial suggested?
Level · Description
L1 — A single general-purpose embedding model, chosen once and never re-evaluated.
L2 — Model selected from MTEB benchmarks with awareness of domain fit. Evaluated on your actual queries before deployment.
L3 — Model selection is per-dataset based on corpus characteristics. Domain-specific models for specialized content (e.g., Voyage Code 3 for code, Cohere Embed v4 for noisy OCR'd content). Model choices are justified against runners-up and validated by retrieval evals.
The test: Embed 50 representative queries from your users. Retrieve the top-10 results for each. How many times is the correct answer not in the top 10? That's your retrieval failure rate. Anthropic's contextual retrieval research showed that combining contextual embeddings with BM25 and reranking reduces failures by 67%. If you're running a single embedding model with no hybrid search, you're leaving most of that improvement on the table.
4. Retrieval Sophistication
How many retrieval strategies does your system actually use?
Level · Description
L1 — Single-vector similarity search. One query in, top-k chunks out.
L2 — Hybrid search (vector + BM25) with reciprocal-rank fusion. A reranker rescores results before they reach the LLM.
L3 — Hybrid search, reranking, plus modality-specific retrieval paths (visual search for figure-heavy corpora, graph traversal for entity-relationship queries). Retrieval paths are fused and the combination is tuned per dataset.
The test: Ask your system a question where the answer is in a chart, a table, or requires connecting information from two different documents. If it can't handle any of these, you've identified your retrieval ceiling.
Why it matters: Weaviate's analysis shows hybrid search can boost retrieval accuracy by 20-30% over vector-only search. Reranking adds another significant layer. These are compounding improvements, not marginal ones.
5. Multi-Hop and Relational Queries
Can your system answer questions that require combining evidence from multiple places?
Level · Description
L1 — Single-shot retrieval. Every question gets one pass at the index. No ability to follow threads across documents.
L2 — Query decomposition — complex questions are broken into sub-queries. Results from multiple retrievals are synthesized.
L3 — Entity-relationship graph built during ingestion. Typed relations (e.g., alloy → has_property → fatigue_strength) with provenance grounding. Graph traversal as a retrieval tool alongside vector search. Community detection for theme-level queries.
The test: Ask "Compare X and Y across these three dimensions." If your system can't do this reliably, you're exposed on the exact query type where users need the most help. The MultiHop-RAG benchmark was created specifically because existing RAG evaluation was masking this failure mode — "existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries."
Microsoft's GraphRAG validated this at scale: graph-based retrieval produces substantially more comprehensive and diverse answers on multi-document queries, with documented accuracy gains of up to 3.4x on entity-rich corpora.
6. Visual and Multimodal Content
What happens when the answer is in a figure, chart, or diagram?
Level · Description
L1 — Figures are ignored during ingestion. The system retrieves text only.
L2 — Figure captions are extracted and indexed as text. Basic image-to-text descriptions via a VLM.
L3 — Full-page rasters embedded using ColPali-family models — multi-vector visual embeddings that capture spatial and visual information. Visual results fused with text results at query time. The decision to enable visual retrieval is data-driven: the system detects figure density and recommends accordingly.
The test: Find a document where the answer to a common user question is in a chart, not in the text. Ask the question. If your system can't surface the right page, you have a blind spot on every figure-heavy document in your corpus.
The ViDoRe benchmark tracks visual document retrieval quality. The field has moved from ColPali's initial 81.3 nDCG@5 to current SOTA exceeding 90 — this is a rapidly maturing capability, not an experiment.
7. Pipeline Configurability
How hard is it to change your pipeline for a different type of document?
Level · Description
L1 — One pipeline for all documents. Changing any component requires code changes and redeployment.
L2 — Configuration-driven pipeline with swappable components. Different embedding models or chunk sizes per collection, configurable without code changes.
L3 — Full primitive registry with per-dataset strategies. Each dataset gets an optimal pipeline configuration — parser, chunker, enrichers, embedder, extractor, reranker — composed from a catalog. New models are added to the registry, not hardcoded. An AI agent can propose configurations based on corpus analysis.
The test: Your team needs to support a new document type (say, scanned engineering drawings alongside your existing product docs). How long does it take? If the answer is "we need to build a new pipeline," your system isn't composable — it's a monolith wearing a RAG costume.
The Modular RAG framework formalized why this matters: production RAG systems need independently replaceable sub-modules because the optimal configuration depends on the corpus, and corpora vary.
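The registry idea reduces to a small pattern: components live in a catalog keyed by name, and each dataset's pipeline is composed from a config rather than hardcoded. A toy sketch with placeholder component names and trivial stand-in implementations:

```python
# Sketch: primitive registry + per-dataset configs. Real components
# (parsers, chunkers, embedders) replace the toy lambdas below.

REGISTRY = {
    "parser":  {"pdf_layout": lambda doc: f"parsed:{doc}"},
    "chunker": {"section_aware": lambda text: text.split("\n\n")},
}

DATASET_CONFIGS = {
    "engineering_standards": {"parser": "pdf_layout", "chunker": "section_aware"},
}

def build_pipeline(dataset: str):
    """Compose a pipeline from the registry according to the dataset's config."""
    cfg = DATASET_CONFIGS[dataset]
    parser = REGISTRY["parser"][cfg["parser"]]
    chunker = REGISTRY["chunker"][cfg["chunker"]]
    return lambda doc: chunker(parser(doc))
```

Supporting a new document type then means registering components and adding a config entry, not writing a new pipeline.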
8. Evaluation and Quality Assurance
How do you know your RAG system is working correctly — right now, not when you launched it?
Level · Description
L1 — Manual spot-checking. Someone asks a few questions and eyeballs the answers. No systematic measurement.
L2 — A static test suite of golden queries with expected answers. Run periodically. Quality metrics like recall@k and faithfulness are tracked. RAGAS or equivalent evaluation framework.
L3 — Eval-gated deployments — no pipeline change goes live without passing automated quality checks. Golden query sets that grow organically from production traces. Blue/green index deployment with rollback. Continuous regression monitoring. Per-dataset thresholds tuned to domain requirements.
The test: When did you last measure your retrieval recall? If you don't have a number, you don't have a quality assurance process — you have hope.
Production teams are moving toward running eval suites as CI/CD quality gates, blocking deployments if faithfulness or recall falls below thresholds. The recommended baselines: Context Precision > 0.8, Faithfulness > 0.8, Answer Relevancy > 0.75.
9. Query Intelligence
Does your system treat every query the same, or does it adapt?
Level · Description
L1 — Every query goes through the same retrieval path. No awareness of query complexity or type.
L2 — Query rewriting or expansion before retrieval (HyDE, decomposition). Confidence scoring on retrieved results with fallback behavior.
L3 — Adaptive query routing — a classifier determines whether a query needs no retrieval, single-shot RAG, or iterative multi-step retrieval. The system dynamically adjusts retrieval depth based on query complexity. Retrieval exposed as a tool the model can call iteratively when needed, following Anthropic's just-in-time context engineering pattern. Corrective retrieval (CRAG) with decompose-and-re-retrieve on low-confidence results.
The test: Time your system on a simple factual query and a complex comparative query. If they take the same amount of time, your system is either wasting resources on easy queries or underserving hard ones. Adaptive-RAG (NAACL 2024) showed that routing queries by complexity improves both accuracy and efficiency — you don't run agentic retrieval on every query, only on the queries that actually need it.
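The router does not need to start as a trained model. A heuristic stand-in captures the structure, with the keyword cues and route names below as illustrative assumptions (Adaptive-RAG itself trains a classifier on question complexity):

```python
# Sketch: route queries to retrieval strategies by complexity.
# Keyword cues are illustrative; a production router is a trained classifier.

def route_query(query: str) -> str:
    q = query.lower()
    multi_hop_cues = ("compare", "versus", "difference between", "across")
    if any(cue in q for cue in multi_hop_cues):
        return "iterative"     # decompose, retrieve per sub-query, synthesize
    return "single_shot"       # one retrieval pass is usually enough
```

Even this crude split means the expensive iterative path only runs on queries that plausibly need it.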
10. Multi-Tenant and Multi-Dataset Support
Can your system serve multiple datasets with different characteristics under one roof?
Level · Description
L1 — Single dataset, single tenant. Adding a new dataset means deploying a new instance.
L2 — Multiple datasets in one deployment with per-dataset configuration. Basic tenant isolation.
L3 — True multi-tenant architecture with row-level isolation. Per-dataset strategies, each optimized for its corpus. Cross-dataset search with per-dataset retrieval strategies respected. Tenant-scoped APIs with consistent error contracts.
The test: Can you onboard a new customer's documents tomorrow without touching your infrastructure? If each new dataset requires manual pipeline configuration, you're spending engineering time that should be automated.
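Row-level isolation means every stored chunk carries a tenant id and every query is filtered by it before ranking, so a tenant can never surface another tenant's documents. A toy sketch with naive term-overlap scoring standing in for real retrieval:

```python
# Sketch: tenant-scoped search. The tenant filter is applied before
# scoring, so cross-tenant leakage is structurally impossible.

def search(index: list[dict], tenant_id: str,
           query_terms: set[str], k: int = 3) -> list[dict]:
    scoped = [row for row in index if row["tenant_id"] == tenant_id]
    scored = sorted(
        scoped,
        key=lambda r: len(query_terms & set(r["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]
```

In a real system the filter is pushed down into the vector store (a metadata predicate evaluated alongside the similarity search), not applied in application code.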
Scoring Your System
Count how many dimensions you're at each level:
Your Profile · What It Means
Mostly L1 — You have a working prototype. It handles easy questions on clean documents. You're exposed on everything else — and you likely don't have the instrumentation to know how often it fails. Start with evaluation (Dimension 8) so you can measure the gap before investing in fixes.
Mostly L2 — You have a production-grade system for straightforward use cases. You've invested in hybrid search, structured parsing, and basic evaluation. The gaps are in adaptability — handling heterogeneous corpora, complex multi-hop queries, visual content, and automated quality assurance at scale.
Mostly L3 — You have purpose-built retrieval infrastructure. Pipeline configuration is data-driven and per-dataset. Quality is continuously measured and gated. Retrieval adapts to query complexity. This is the level required for systems that serve diverse, high-stakes document collections where wrong answers have real consequences.
The honest version: most teams are L1-L2 on most dimensions. That's not a failure — it's the natural state of a RAG system built to solve an immediate problem. The question is whether your accuracy requirements have outgrown your infrastructure.
Where the Gaps Compound
These dimensions aren't independent. Gaps compound:
- Poor parsing + any embedding model = bad retrieval. You can't retrieve what you didn't parse correctly. An MTEB-leading embedder can't fix garbled table data.
- No evaluation + any pipeline change = blind deployment. Without measurement, you can't distinguish an improvement from a regression. Every change is a coin flip.
- Single retrieval path + complex queries = systematic failure. The MultiHop-RAG benchmark exists because single-shot retrieval consistently fails on multi-evidence questions — and standard benchmarks were hiding this.
- Fixed pipeline + diverse corpora = lowest-common-denominator quality. Engineering drawings and product docs have nothing in common structurally. A pipeline optimized for one actively harms the other.
The highest-leverage investment is usually in the dimension where you're weakest, not the one where you're already strong.
What to Do Next
1. Measure first. If you don't have retrieval recall numbers on representative queries, start there. Everything else is guessing.
2. Test your hard cases. Multi-hop queries, figure-dependent answers, tables, documents your parser visibly struggles with. These are where your users will find the failures.
3. Check your deployment safety net. Can you roll back an index change? Can you A/B test two strategies? If not, every improvement carries risk you can't manage.
4. Ask whether your pipeline fits your data. If you're running the same configuration on engineering standards and marketing collateral, one of them is getting the wrong treatment.
If you're finding gaps that matter — particularly around pipeline configurability, eval-gated deployment, or multi-modal retrieval — that's exactly what we built Super RAG to solve. We're happy to walk through the framework on your specific use case.
References
- Yang et al. — CRAG Benchmark: Comprehensive RAG Benchmark (NeurIPS 2024)
- Gao et al. — Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks (2024)
- Edge et al. — From Local to Global: A Graph RAG Approach (Microsoft Research, 2024)
- Faysse et al. — ColPali: Efficient Document Retrieval with Vision Language Models (2024)
- Anthropic — Contextual Retrieval (2024)
- Anthropic — Effective Context Engineering for AI Agents
- Jeong et al. — Adaptive-RAG: Learning to Adapt Retrieval-Augmented LLMs through Question Complexity (NAACL 2024)
- Yan et al. — Corrective Retrieval Augmented Generation (CRAG) (2024)
- Gao et al. — Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE) (2022)
- Asai et al. — Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (2023)
- Tang & Yang — MultiHop-RAG: Benchmarking RAG for Multi-Hop Queries (COLM 2024)
- Es et al. — RAGAS: Automated Evaluation of Retrieval Augmented Generation (EACL 2024)
- Singh et al. — Agentic Retrieval-Augmented Generation: A Survey (2025)
- Fu et al. — AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for RAG (EMNLP 2024)
- kapa.ai — RAG Best Practices: Lessons from 100+ Technical Teams (2024)
- Dextralabs — Production RAG: Evaluation Suites, CI/CD Quality Gates & Observability (2025)
- MTEB Leaderboard — Embedding model benchmarks
- ViDoRe Leaderboard — Visual document retrieval benchmarks
- Weaviate — Hybrid Search Fusion Algorithms
Ready to try it?
Map your first use case in 30 minutes.
A Fit Call is the whole commitment. No deck, no pitch — we map your stack and walk through a first automation you could ship.
Book a 30-min Fit Call