
Super RAG: Why We Treat Retrieval as Infrastructure, Not a Feature


Most teams bolt RAG onto their AI product the same way: pick an embedding model, chunk some documents, wire up a vector database, and call it done. It works well enough on demo day. Then real users arrive with messy PDFs, dense data tables, and questions that require synthesizing evidence from multiple documents — and the whole thing quietly falls apart.

The numbers bear this out. Meta's CRAG benchmark — 4,409 question-answer pairs across five domains — found that even state-of-the-art industry RAG systems answer only 63% of questions correctly, with a 17% hallucination rate. Frontier LLMs without RAG score below 34%. Straightforward RAG gets you to 44%. That 37-point gap between "best current RAG" and "correct" is where the hard engineering lives.

We learned this building The Build Bot. Our answer is Super RAG — a standalone ingestion and retrieval platform that treats RAG as a first-class infrastructure problem. This post walks through how it works, why we made the design choices we did, and how our AI orchestration layer, Ocho, turns that foundation into a system that knows when and how to search.


Lesson 1: Make Every Pipeline Decision a Composable Building Block

The first instinct with RAG is to build a fixed pipeline — one parser, one chunker, one embedder — and tune it by hand. This works for a single dataset. It breaks the moment you need to support engineering standards AND product docs AND scanned field reports, each with completely different structure.

The RAG research community has converged on this insight. Gao et al.'s Modular RAG framework formalized the shift from fixed pipelines to "LEGO-like reconfigurable frameworks" with independently replaceable sub-modules across three abstraction levels. Qdrant's recent $50M Series B was explicitly pitched around "composable vector search as core infrastructure" — the market is moving toward treating retrieval components as primitives you combine, not features hidden behind an opaque API.

Super RAG takes this further with a Primitive Registry: a Postgres-backed catalog of every discrete operation in the ingestion pipeline. Each primitive is a self-contained unit — a parser, chunker, enricher, embedder, extractor, or reranker — with a stable ID, a config schema, and benchmark scores.

The registry currently tracks primitives across ten kinds:

  • Parsers that turn source files into structured content (Docling, Textract, Mistral-OCR)
  • VLM-fallback parsers that re-parse pages the primary parser struggles with (Claude Opus for mission-critical accuracy, Sonnet for general use)
  • Chunkers that split content into retrieval units while respecting document structure
  • Enrichers that add context to chunks — following Anthropic's contextual retrieval approach, which demonstrated a 49% reduction in retrieval failures (67% when combined with reranking)
  • Extractors that pull structured fields — alloy designations, product names, version numbers
  • Relation extractors and entity resolvers that build knowledge graphs
  • Text embedders spanning the frontier — selections informed by the MTEB leaderboard (Gemini Embedding 001, Voyage 3 Large, Cohere Embed v4, and self-hosted options like NV-Embed v2 for data-residency requirements)
  • Visual embedders based on the ColPali architecture for figure-heavy documents
  • Rerankers that rescore retrieved results

Every primitive implements a common interface. Every model choice is a registry lookup, not a hardcoded constant. Adding a new frontier embedding model is a one-PR change: implement the adapter and insert a row.
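That common interface can be sketched in a few lines. This is illustrative only — `Primitive`, `FixedSizeChunker`, and the registry ID strings are invented for the example, not Super RAG's actual classes:

```python
from dataclasses import dataclass
from typing import Any, Protocol


class Primitive(Protocol):
    """Hypothetical common interface: every primitive is a batch transform."""
    id: str    # stable registry ID
    kind: str  # parser | chunker | enricher | embedder | ...

    def run(self, items: list[Any], config: dict) -> list[Any]: ...


@dataclass
class FixedSizeChunker:
    """Toy chunker satisfying the interface (illustration only)."""
    id: str = "chunker/fixed-size-v1"
    kind: str = "chunker"

    def run(self, items: list[str], config: dict) -> list[str]:
        size = config.get("chunk_chars", 512)
        return [doc[i:i + size] for doc in items for i in range(0, len(doc), size)]
```

Because every primitive exposes the same `run(items, config)` shape, a new embedder or parser really is just an adapter plus a registry row.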

The key insight: a strategy is a JSON document that references primitives by ID and specifies config overrides. At runtime, a generic executor loads the strategy, instantiates each primitive from the registry, and runs the pipeline. No dataset-specific code paths. No if-else branching on corpus type.
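A minimal sketch of the strategy-as-data idea, with made-up primitive IDs and toy one-line implementations standing in for the real registry:

```python
# A strategy is pure data: primitive IDs plus config overrides per slot.
STRATEGY = {
    "dataset": "engineering-standards",
    "pipeline": [
        {"primitive": "parser/example-v1",   "config": {}},
        {"primitive": "chunker/example-v1",  "config": {"max_chars": 1200}},
        {"primitive": "embedder/example-v1", "config": {"dim": 1024}},
    ],
}

REGISTRY = {}  # primitive ID -> callable; stands in for the Postgres catalog


def register(pid):
    def deco(fn):
        REGISTRY[pid] = fn
        return fn
    return deco


@register("parser/example-v1")
def parse(items, config):
    return [s.strip() for s in items]


@register("chunker/example-v1")
def chunk(items, config):
    n = config["max_chars"]
    return [s[i:i + n] for s in items for i in range(0, len(s), n)]


@register("embedder/example-v1")
def embed(items, config):
    return [(s, [0.0] * config["dim"]) for s in items]  # dummy vectors


def run_strategy(strategy, items):
    """Generic executor: look up each primitive by ID and run in order."""
    for step in strategy["pipeline"]:
        items = REGISTRY[step["primitive"]](items, step["config"])
    return items
```

The executor never branches on corpus type — swapping the embedder means editing one JSON entry, not a code path.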

[DIAGRAM: Super RAG Primitive Registry & Strategy Composition] Show the primitive registry as a catalog (rows grouped by kind: parsers, chunkers, enrichers, embedders, etc.), with arrows flowing into a "Strategy Document" that selects one primitive per slot. The strategy feeds into a "Generic Executor" that runs the pipeline in order: parse → extract → chunk → enrich → embed → index. A parallel branch shows the visual path: rasterize → visual embed → multi-vector index. Label the strategy document with "per-dataset, agent-proposed, human-approved."


Lesson 2: Let AI Configure the Pipeline, Not Just Run It

If every pipeline decision is data, then the question becomes: who writes the data?

Hand-tuning RAG pipelines is an expert task. Microsoft Research's AutoRAG-HP demonstrated that framing RAG parameter selection as an online optimization problem achieves Recall@5 of ~0.8 using only 20% of the API calls required by grid search. The insight: there are too many interacting decisions — embedding model, chunk size, enrichment strategy, retrieval weights — for manual tuning to find good configurations reliably.

Our answer is the Strategy Agent — Claude Opus running once per dataset setup. The agent scans a representative sample of the corpus (stratified by file type, size, and structure), reviews the full primitive registry with benchmark scores, and proposes a complete ingestion strategy with rationale for every decision.

This isn't the agent running your queries. It runs once when you onboard a dataset, analyzes what your documents actually look like, and writes the recipe.

Here's what the agent considers for each dataset:

  • Figure density. A corpus where 23% of pages contain data charts (like engineering standards) needs visual retrieval. A ServiceNow knowledge base at 4% figure density doesn't.
  • Table complexity. Dense merged-cell tables that a standard parser mangles need a VLM-fallback tier. The agent picks the tier — Opus for mission-critical accuracy, Sonnet for general use — and sets confidence thresholds for when fallback triggers.
  • Entity structure. A materials database with thousands of alloy/temper/property relationships benefits from graph extraction. An internal wiki probably doesn't.
  • Embedding model fit. The agent justifies its embedding choice against the runner-up, referencing MTEB benchmarks. Sometimes Gemini's leading score wins; sometimes Voyage 3 Large's 32k context window matters more for long documents; sometimes Cohere's noise resilience is the deciding factor for OCR'd content.

The agent doesn't start from scratch. It picks the closest match from a recipe catalog — curated compositions covering common corpus types (engineering standards, product docs, research papers, scanned documents, code repositories) — then overrides primitives where the sample warrants it. Every override requires a written rationale.

The output isn't just a config file. The agent also produces:

  • Opt-in capability cards for advanced features (graph extraction, visual retrieval), each with pros, cons, cost estimates, and evidence from the actual corpus sample
  • A golden query set — 20-40 queries spanning lookup, comparison, procedural, and edge-case categories — used for automated evaluation
  • Confidence flags on decisions it wants a human to double-check
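Put together, the agent's output might look something like the following sketch — every field name and value here is illustrative, not Super RAG's actual schema:

```json
{
  "recipe_base": "engineering-standards",
  "overrides": [
    {
      "slot": "embedder",
      "primitive": "embedder/voyage-3-large",
      "rationale": "32k context window fits long clause documents"
    }
  ],
  "capability_cards": [
    {
      "feature": "visual_retrieval",
      "recommended": true,
      "evidence": "23% of sampled pages contain data charts"
    }
  ],
  "golden_queries": [
    {"category": "lookup", "query": "yield strength of 7075-T6 sheet"},
    {"category": "comparison", "query": "compare fatigue behavior of 2024-T3 vs 7075-T6"}
  ],
  "confidence_flags": ["verify table-parser fallback tier on sampled pages"]
}
```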

A subject-matter expert reviews the strategy, adjusts if needed, and approves. The agent costs $0.50-$2.00 per dataset — a one-time cost amortized over the life of the index.

The principle: hand-tuning a RAG pipeline is an expert task that doesn't scale across dozens of datasets with different characteristics. An AI strategist that can reason about the corpus and justify its decisions against benchmarks does.


Lesson 3: Go Beyond Text Chunks — Graph and Visual Retrieval

Standard RAG retrieves text chunks. That's fine when the answer is in the prose. It breaks in two common scenarios.

When the answer is in a relationship. The MultiHop-RAG benchmark demonstrated that standard single-shot RAG "fails systematically on queries that require combining evidence from multiple documents." Questions like "What properties does 7075-T6 have?" require traversing a web of alloy-to-temper-to-property relationships scattered across hundreds of pages. Text search returns fragments. Graph retrieval returns structure.

Super RAG's graph extraction pipeline uses LLM-powered relation extractors to emit typed triples (e.g., 7075 → has_temper → T6, T6 → has_property → yield_strength), then runs entity resolution to collapse surface-form variations into canonical nodes. Community detection (Louvain/Leiden) groups related entities into clusters. The result is a queryable knowledge graph grounded in source chunks — every entity links back to the exact text that produced it. No hallucinated nodes.
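The mechanics can be sketched in a few lines of plain Python. Here connected components stand in for the Louvain/Leiden community detection the pipeline actually runs, and the triples and chunk IDs are invented for illustration:

```python
from collections import defaultdict

# Typed triples as a relation extractor might emit them, each tagged
# with the source chunk that produced it (provenance).
triples = [
    ("7075",   "has_temper",   "T6",             "chunk-012"),
    ("T6",     "has_property", "yield_strength", "chunk-014"),
    ("AA7075", "has_temper",   "T6",             "chunk-031"),  # surface-form duplicate
    ("2024",   "has_temper",   "T3",             "chunk-044"),
]

# Entity resolution: collapse surface forms onto canonical node names.
CANONICAL = {"AA7075": "7075"}

adjacency = defaultdict(set)   # node -> neighbor nodes
provenance = defaultdict(set)  # node -> source chunk IDs
edges = {}                     # (subject, object) -> relation type

for subj, rel, obj, chunk in triples:
    s, o = CANONICAL.get(subj, subj), CANONICAL.get(obj, obj)
    adjacency[s].add(o)
    adjacency[o].add(s)
    edges[(s, o)] = rel
    provenance[s].add(chunk)
    provenance[o].add(chunk)


def clusters(adj):
    """Connected components as a stand-in for Louvain/Leiden communities."""
    seen, out = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        out.append(comp)
    return out
```

Note that every node keeps the chunk IDs that produced it — that provenance map is what makes "no hallucinated nodes" checkable.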

This follows the approach validated by Microsoft's GraphRAG, which showed "substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers" on datasets in the million-token range. The ACM TOIS survey on GraphRAG categorizes these approaches into graph-based indexing, graph-guided retrieval, and graph-enhanced generation — Super RAG implements all three.

[DIAGRAM: Graph Extraction Pipeline] Show documents flowing through the relation extractor, producing entity nodes (circles labeled "2024-T3", "7075-T6", "fatigue", "yield strength") connected by typed edges ("has_property", "has_temper"). An entity resolver merges duplicate nodes. Community detection draws dotted boundaries around clusters. Arrows from each entity point back to source chunks (provenance). The graph feeds into an API labeled "rag.graph_neighborhood" that a consumer can call.

When the answer is in a figure. An S-N fatigue curve carries information in its shape that OCR cannot extract. A financial chart's trend line tells you more than its caption. Visual retrieval handles this by embedding entire page rasters using ColPali-family models — multi-vector embeddings that capture spatial and visual information through late-interaction (ColBERT-style) matching. At query time, visual results fuse with text results via reciprocal-rank fusion.
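Reciprocal-rank fusion itself is only a few lines. This sketch assumes each retriever returns an ordered list of result IDs; the IDs are invented, and k=60 is the conventional default:

```python
def rrf(rankings, k=60):
    """Reciprocal-rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Text and visual retrievers return different result types; RRF only
# needs their ranks, so fusing them is uniform.
text_hits   = ["chunk-a", "chunk-b", "chunk-c"]
visual_hits = ["page-7", "chunk-b", "page-2"]
fused = rrf([text_hits, visual_hits])  # "chunk-b" rises: ranked by both lists
```

Because RRF uses only ranks, not raw scores, it never has to reconcile the incomparable similarity scales of dense, sparse, and late-interaction retrievers.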

The ViDoRe benchmark tracks this rapidly advancing field. The original ColPali scored 81.3 nDCG@5; current state-of-the-art models exceed 90. Super RAG's primitive registry tracks the leading visual embedders — including Nemotron ColEmbed V2 (ViDoRe V3 leader at 63.42 nDCG@10) and ColQwen2 — so the Strategy Agent can select the right model for each corpus.

The Strategy Agent enables visual retrieval when it detects meaningful figure density in the corpus sample — and recommends the right model from the registry. For engineering standards with 23% figure density, it's a clear win. For text-heavy product docs, it correctly skips it.

Both capabilities are opt-in, gated behind the strategy review, and only add operational cost when a dataset actually benefits.


Lesson 4: You Don't Ship Without Proof

The Strategy Agent proposes. Eval gates verify.

Before a strategy can go live, Super RAG runs the golden query set against the newly built index and measures recall, ranking quality (nDCG), citation accuracy (can the retrieved chunks actually answer the question?), latency, and cost. For datasets with visual retrieval, it measures figure-query recall. For datasets with graph extraction, it measures graph neighborhood precision.

This approach aligns with what's emerging as best practice. The RAGAS framework established reference-free evaluation across faithfulness, answer relevancy, context precision, and context recall — validated at EACL 2024 showing predictions "closely aligned with human judgments." Production teams are increasingly running releases against curated golden QA sets as automated CI/CD quality gates, blocking deployments if quality metrics deviate beyond thresholds.

Each strategy defines its own pass/fail thresholds. An engineering-standards dataset might require 90% recall@10 and 92% citation accuracy. A general knowledge base might accept 80%.

This is the blue/green deployment pattern applied to search indexes. When a strategy changes — new embedder, different chunking approach, additional enrichers — Super RAG builds the new index alongside the live one. The old index keeps serving production traffic. Post-ingest evals run against the new index. On pass, a single transaction flips the active version. On fail, the new index is retained for inspection but never serves traffic. Rollback is a one-line database update.
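The gate itself is simple to sketch. The threshold values mirror the engineering-standards example above, and the two callbacks are placeholders for the real version-flip transaction and retention step:

```python
# Per-strategy thresholds (illustrative values for an engineering-standards tier).
THRESHOLDS = {"recall_at_10": 0.90, "citation_accuracy": 0.92}


def gate(metrics: dict, thresholds: dict) -> bool:
    """Pass only if every gated metric meets or beats its floor."""
    return all(metrics.get(name, 0.0) >= floor for name, floor in thresholds.items())


def deploy(metrics, thresholds, flip_active_version, retain_for_inspection):
    """Blue/green flip: one atomic switch on pass, keep-but-don't-serve on fail."""
    if gate(metrics, thresholds):
        flip_active_version()       # e.g. a single UPDATE on the active-version row
        return "flipped"
    retain_for_inspection()         # new index kept for debugging, never serves
    return "retained"
```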

The golden query set grows organically. Real queries flagged as "low confidence" in production are sampled and surfaced to admins, who can promote them to golden queries with one click. The eval set tracks the kinds of questions users actually ask, not just the ones you imagined at setup time.

The principle: every strategy change is a hypothesis. Eval gates turn it into a tested deployment.


The Payoff: How Ocho Turns Infrastructure Into Intelligent Search

Everything above is about building the foundation — making sure the right content is parsed, enriched, embedded, and indexed correctly for each dataset. That's the ingestion side. The query side is where it gets interesting.

Ocho is the AI orchestration layer that consumes Super RAG's retrieval APIs. Its job: decide when to search, how to search, and whether the results are good enough — all without the end user knowing any of this is happening.

The Agentic RAG survey identifies four distinct patterns of increasing sophistication: query rewriting (A1), corrective/reflective RAG (A2), true iterative retrieval (A3), and multi-agent retrieval (A4). The agentic patterns in our system live on both sides of the Super RAG / Ocho boundary:

On the Super RAG side (ingestion-time intelligence):

  • HyDE — a single LLM call rewrites the query as a hypothetical answer before embedding, significantly improving recall on abstract questions in zero-shot settings
  • CRAG confidence scoring — every retrieval result gets a confidence grade (correct / ambiguous / incorrect). Low-confidence results trigger corrective action
  • Hybrid search with reranking — vector search, BM25, and (when enabled) visual retrieval fused via reciprocal-rank fusion, then rescored by a reranker. Hybrid search alone can boost retrieval accuracy by 20-30%
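The HyDE step from the first item above fits in one function. In this sketch, `fake_llm` and `fake_embed` are stubs standing in for real provider calls:

```python
def hyde_embed(query, llm, embed):
    """HyDE: embed a hypothetical answer instead of the raw query, so the
    query vector lands near answer-shaped passages in embedding space."""
    hypothetical = llm(f"Write a short passage that answers: {query}")
    return embed(hypothetical)


# Stubs for illustration; real calls go to your LLM and embedding providers.
def fake_llm(prompt):
    return "7075-T6 offers a yield strength near 500 MPa."

def fake_embed(text):
    return [float(len(w)) for w in text.split()[:4]]  # toy 4-dim "embedding"


vec = hyde_embed("what is the yield strength of 7075-T6?", fake_llm, fake_embed)
```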

On the Ocho side (query-time intelligence):

  • Query routing — following the Adaptive-RAG pattern (NAACL 2024), a lightweight classifier determines whether a query needs retrieval at all, standard single-shot RAG, or iterative multi-step retrieval. Simple factual questions skip the expensive path. Complex comparative questions get the full treatment. You don't run agentic retrieval on every query — you run it only on the queries that need it.
  • Retrieval as a tool — instead of pre-loading all retrieved context, Ocho can expose Super RAG as a callable tool that the generation model invokes on demand. As Anthropic's engineering team described this shift: "retrieval stops being a preprocessing step and becomes a tool the model calls." The model reasons about what it needs, searches, observes, and searches again — just-in-time context rather than pre-assembled context.
  • Retrieval reflection — building on the Self-RAG principle of self-critique, when retrieval comes back low-confidence, Ocho decomposes the query into sub-queries, re-retrieves, and merges results. If confidence is still low, it instructs the model to ask a clarifying question rather than hallucinate.
  • Turn-scoped deduplication — a semantic cache prevents redundant retrieval calls within a conversation, whether from repeated user questions or the model's own iterative searches. Production semantic caching can achieve 60-68% cache hit rates on repetitive workloads.
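A turn-scoped cache of that kind might look like the following sketch; the `embed` callable, the 0.95 threshold, and the toy usage are all illustrative:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


class TurnCache:
    """Turn-scoped dedup: reuse a prior retrieval when a near-identical
    query has already been answered within this conversation turn."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_results)

    def lookup(self, query):
        vec = self.embed(query)
        for cached_vec, results in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                return results  # cache hit: skip the retrieval call
        return None

    def store(self, query, results):
        self.entries.append((self.embed(query), results))
```

Scoping the cache to the turn keeps it cheap to reason about: it clears naturally, and the model's own iterative searches are deduplicated the same way as repeated user questions.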

[DIAGRAM: Full System Flow — Super RAG + Ocho] Left side: "Super RAG" box containing the ingestion pipeline (primitives → strategy → index) and retrieval tools (rag.search, rag.hybrid_search, rag.graph_neighborhood). Right side: "Ocho" box containing the query router, context assembly (history, memory, RAG, tools), retrieval reflection loop, and generation model. Arrows show: user query enters Ocho → router classifies complexity → simple queries get single-shot RAG → complex queries get iterative tool-based retrieval with reflection. Super RAG's retrieval tools are called by Ocho as needed. A feedback arrow shows low-confidence production queries flowing back to the golden query set.

The key design constraint: every agentic capability is configured per skill, not globally. An engineering-standards lookup skill gets query routing and iterative retrieval. A simple FAQ skill skips straight to single-shot search. Admins tune the knobs in the same interface they already use for everything else. The end user sends a message and gets an answer — the middleware decides how much work to do.


What This Means in Practice

RAG isn't a feature you bolt on. It's an infrastructure layer that determines whether your AI product works on easy questions or on the hard ones too.

The pattern we've landed on:

1. Composable primitives over fixed pipelines — because every corpus is different
2. An AI strategist that reasons about your data and proposes the right configuration — because hand-tuning doesn't scale
3. Eval-gated deployments that prove quality before going live — because "it seems to work" isn't a deployment strategy
4. Agentic retrieval on both sides — intelligence in how you build the index and intelligence in how you query it

Super RAG handles the foundation. Ocho turns it into a system that searches the way an expert would — knowing when to look, where to look, and when to admit it doesn't know.

If you're building RAG systems that need to work beyond the demo, we'd love to talk. Reach out at thebuildbot.ai.



Ready to try it?

Map your first use case in 30 minutes.

A Fit Call is the whole commitment. No deck, no pitch — we map your stack and walk through a first automation you could ship.

Book a 30-min Fit Call