The Build Bot

Turn Your Technical Documents into Searchable Knowledge

A five-minute, plain-language tour of how Super-RAG turns piles of engineering documents into a searchable, citation-backed knowledge base — written so anyone, engineer or not, can follow along.

Why Super-RAG exists

Large language models (LLMs) are great at sounding smart, but they only know what they were trained on. Ask one about a particular spec document or a recent standard, and it will either guess or make something up.

RAG stands for Retrieval-Augmented Generation. The trick is simple: before the LLM answers, we let it look things up first.

Think of it like giving the LLM a library card. Instead of guessing, it walks into the library, finds the exact page that answers your question, reads it, then tells you the answer — and points at the page so you can check.

Super-RAG is that library, built specifically for materials engineering documents (MMPDS, MIL-HDBK-5, AMS specs, and similar standards). It is retrieval-only: it finds the right pages, pulls out the right images, and hands them off. A separate chat product turns those pages into a final answer.
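
To make the "look it up first, then point at the page" idea concrete, here is a minimal sketch of retrieval with citations. It is not Super-RAG's implementation: the names Chunk, tokenize, retrieve, and library are hypothetical, the file names and texts are made up, and simple word overlap stands in for real semantic search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc: str   # source file name (made up for this example)
    page: int  # page the text came from, so the answer can be checked
    text: str


def tokenize(text: str) -> set:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?()").lower() for w in text.split()}


def retrieve(question: str, chunks: list, top_k: int = 2) -> list:
    """Rank chunks by word overlap with the question.

    Word overlap is an assumption used as a stand-in for semantic search.
    """
    q = tokenize(question)
    scored = [(len(q & tokenize(c.text)), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]  # drop unrelated chunks
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:top_k]]


library = [
    Chunk("spec_a.pdf", 12, "Tensile strength of the annealed alloy is listed in Table 3."),
    Chunk("spec_a.pdf", 40, "Heat treatment schedule for the annealed condition."),
    Chunk("report_b.pdf", 2, "Meeting notes on procurement timelines."),
]

hits = retrieve("What is the tensile strength of the annealed alloy?", library)
for c in hits:
    # Each hit carries its own citation: document name plus page number.
    print(f"{c.doc} p.{c.page}: {c.text}")
```

The retriever returns evidence plus where it came from, and nothing more. Writing the final answer is left to whatever product consumes the hits.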


The data model: documents, transcripts, and datasets

Three words to know.

Document — a single file. Usually a PDF or Word document. A spec sheet, a standard, a report.

Transcript — text that came from audio or video (e.g. a recorded meeting, transcribed). It lives alongside documents because, once it's text, you can search it the same way.

Dataset — a folder that groups documents and transcripts together so they can be searched as one collection.

Documents are books on a shelf. Transcripts are recordings that got typed up and put on the same shelf. A dataset is a reading list that points at a chosen set of those books.

The link between datasets and documents is many-to-many. One document can appear in several datasets. One dataset can contain many documents and transcripts. Deleting a dataset does not delete the documents — the books stay on the shelf, only the reading list goes away.

This is important: you can rearrange, regroup, and re-curate datasets without ever risking the underlying files.
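
The many-to-many link can be sketched in a few lines. This is an illustration under assumptions, not the real schema: the Library class, its method names, and the item ids are all hypothetical.

```python
class Library:
    """Toy model: items live on the shelf, datasets only point at them."""

    def __init__(self):
        self.items = {}     # item_id -> title (documents and transcripts)
        self.datasets = {}  # dataset name -> set of item_ids

    def add_item(self, item_id, title):
        self.items[item_id] = title

    def create_dataset(self, name, item_ids):
        missing = set(item_ids) - set(self.items)
        if missing:
            raise KeyError(f"unknown items: {sorted(missing)}")
        self.datasets[name] = set(item_ids)

    def delete_dataset(self, name):
        # Only the reading list goes away; self.items is untouched.
        del self.datasets[name]

    def datasets_containing(self, item_id):
        return sorted(n for n, ids in self.datasets.items() if item_id in ids)


lib = Library()
lib.add_item("doc-1", "Alloy spec")
lib.add_item("doc-2", "Test report")
lib.add_item("tr-1", "Design review transcript")

lib.create_dataset("titanium", ["doc-1", "doc-2"])
lib.create_dataset("reviews", ["doc-2", "tr-1"])  # doc-2 is in both datasets

lib.delete_dataset("reviews")
print(sorted(lib.items))                 # all three items are still on the shelf
print(lib.datasets_containing("doc-2"))  # doc-2 still belongs to "titanium"
```

Because datasets hold references rather than copies, regrouping is cheap and deleting a dataset cannot lose a file.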


What "building a dataset" actually means

Creating a dataset and adding documents to it doesn't, by itself, make anything searchable. Until it is built, the dataset just sits there as a draft.

"Build" is the action that turns a folder of documents into something the system can actually search and answer questions over. A build runs four stages:

1. Validate — every attached document is fully parsed into small text snippets called chunks. If anything is half-parsed, the build won't start.

2. Propose strategy — an AI agent reads through the documents and proposes a set of extraction rules tailored to this dataset's content (for example: "these docs are dense tables — pull material names and properties as structured fields").

3. Graph extraction — see the next section.

4. Evaluate — the system runs a set of known good questions against the freshly built dataset to measure quality before declaring it ready.
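
The four stages above can be sketched as a simple pipeline. Everything here is a placeholder under assumptions: the function names, the "parsed" flag, and the return shapes are hypothetical, and stages 2 to 4 are stubs standing in for the AI agent, the graph extraction, and the quality evaluation. The point is only the ordering and the hard gate at stage 1.

```python
def validate(docs):
    """Stage 1: refuse to start unless every document is fully parsed."""
    unparsed = [d["name"] for d in docs if not d.get("parsed")]
    if unparsed:
        raise ValueError(f"not fully parsed: {unparsed}")


def propose_strategy(docs):
    """Stage 2 stub: the real step is an AI agent; this returns fixed rules."""
    return {"entity_types": ["material", "property"]}


def extract_graph(docs, strategy):
    """Stage 3 stub: returns an empty graph shaped by the strategy."""
    return {"entity_types": strategy["entity_types"], "entities": []}


def evaluate(graph, questions):
    """Stage 4 stub: only records how many known-good questions were run."""
    return {"questions_run": len(questions)}


def build(docs, questions):
    validate(docs)                         # the build stops here on failure
    strategy = propose_strategy(docs)
    graph = extract_graph(docs, strategy)
    report = evaluate(graph, questions)
    return {"strategy": strategy, "graph": graph, "report": report}


result = build([{"name": "spec_a.pdf", "parsed": True}], ["q1", "q2"])
print(result["report"])
```

Passing a document with "parsed" set to False raises ValueError before any later stage runs, which mirrors "if anything is half-parsed, the build won't start."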

During its life, a dataset moves through these states: draft → proposing → approved → building → live. From any of these, a dataset can instead land in cancelled, failed, or archived. A live dataset is what end users actually query.
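
One way to picture the lifecycle is as a small transition check. This is a sketch, not the product's actual rules: the names HAPPY_PATH, OFF_RAMPS, and can_transition are hypothetical, and treating cancelled, failed, and archived as terminal is an assumption the text does not confirm.

```python
HAPPY_PATH = ["draft", "proposing", "approved", "building", "live"]
OFF_RAMPS = {"cancelled", "failed", "archived"}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from current to target is allowed in this sketch."""
    if current not in HAPPY_PATH and current not in OFF_RAMPS:
        raise ValueError(f"unknown state: {current}")
    if current in OFF_RAMPS:
        return False  # assumption: off-ramp states are terminal
    if target in OFF_RAMPS:
        return True   # any in-progress or live dataset can leave the path
    nxt = HAPPY_PATH.index(current) + 1
    # Otherwise only the next step on the happy path is allowed.
    return nxt < len(HAPPY_PATH) and HAPPY_PATH[nxt] == target


print(can_transition("draft", "proposing"))  # next step on the path
print(can_transition("draft", "live"))       # skipping steps is rejected
print(can_transition("building", "failed"))  # off-ramp from mid-build
```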


What the graph is, and how we build it

A dataset's graph is a map of ideas. Imagine a giant corkboard. Each sticky note is an entity — a thing the documents talk about, like a material (Ti-6Al-4V), a property (tensile strength), a process (heat treatment), or a standard (AMS 4928). Each string between sticky notes is a relationship like HAS_PROPERTY, USED_IN, or SPECIFIED_BY. Each pin is a mention — a record that says "this entity was found in this exact chunk of text."
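
The corkboard maps onto three small record types. The sketch below is one possible shape, not the actual storage model: the class and field names are hypothetical and the chunk ids are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:        # a sticky note
    name: str
    kind: str        # e.g. "material", "property", "standard"


@dataclass(frozen=True)
class Relationship:  # a string between two sticky notes
    source: str
    kind: str        # e.g. "HAS_PROPERTY", "SPECIFIED_BY"
    target: str


@dataclass(frozen=True)
class Mention:       # a pin: this entity was found in this chunk
    entity: str
    chunk_id: str


@dataclass
class Graph:
    entities: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)
    mentions: list = field(default_factory=list)

    def add_entity(self, name, kind):
        self.entities[name] = Entity(name, kind)

    def relate(self, source, kind, target):
        for name in (source, target):
            if name not in self.entities:
                raise KeyError(f"unknown entity: {name}")
        self.relationships.append(Relationship(source, kind, target))

    def mention(self, entity, chunk_id):
        self.mentions.append(Mention(entity, chunk_id))

    def chunks_for(self, entity):
        return [m.chunk_id for m in self.mentions if m.entity == entity]


g = Graph()
g.add_entity("Ti-6Al-4V", "material")
g.add_entity("tensile strength", "property")
g.add_entity("AMS 4928", "standard")
g.relate("Ti-6Al-4V", "HAS_PROPERTY", "tensile strength")
g.relate("Ti-6Al-4V", "SPECIFIED_BY", "AMS 4928")
g.mention("Ti-6Al-4V", "chunk-17")  # hypothetical chunk ids
g.mention("Ti-6Al-4V", "chunk-42")
print(g.chunks_for("Ti-6Al-4V"))
```

Mentions are what tie the graph back to the text: from any entity you can get to the exact chunks that talk about it.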

The graph is built in passes, like a careful re-read of every document.

Pass 1 — an LLM reads chunks in small batches and lists every entity and relationship it sees.

Pass 2a — find any chunks where Pass 1 came up empty and re-read them with neighboring context to fill the gaps.

Canonicalize — the same idea can show up many ways (Ti-6Al-4V, Ti6-4, Titanium 6Al-4V). The system uses embeddings + clustering to figure out which surface forms are the same thing, then picks one canonical name. A second model double-checks the tricky middle cases.

Pass 2b — for any cluster where the system wasn't confident, re-read those chunks more carefully.

Re-canonicalize, communities, mention counts — group related entities into communities, find the bridges between them, and count how often each one appears.
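
The canonicalize step is the least intuitive of the passes, so here is a toy version of it. It rests on several assumptions: the two-dimensional vectors are hand-written stand-ins for real embeddings, the mention counts are invented, greedy threshold clustering stands in for whatever clustering the system uses, and "pick the most-mentioned form" is just one plausible way to choose the canonical name. The second-model check for borderline cases is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def cluster(vectors, threshold=0.9):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for name, vec in vectors.items():
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])  # no close cluster found, start a new one
    return clusters


def canonical_names(clusters, counts):
    """Map every surface form to the most-mentioned form in its cluster."""
    mapping = {}
    for members in clusters:
        canonical = max(members, key=lambda n: counts.get(n, 0))
        for name in members:
            mapping[name] = canonical
    return mapping


# Hand-written toy vectors and counts, for illustration only.
vectors = {
    "Ti-6Al-4V": (1.0, 0.0),
    "Ti6-4": (0.9, 0.1),
    "Titanium 6Al-4V": (0.95, 0.05),
    "AMS 4928": (0.0, 1.0),
}
counts = {"Ti-6Al-4V": 12, "Ti6-4": 3, "Titanium 6Al-4V": 5, "AMS 4928": 4}

mapping = canonical_names(cluster(vectors), counts)
print(mapping["Ti6-4"])     # Ti-6Al-4V
print(mapping["AMS 4928"])  # AMS 4928
```

After this step, mentions of all three titanium spellings can be counted against one entity instead of three.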

Why bother with a graph?

Plain semantic search (looking up similar text) is good at "find me chunks that sound like this question." It struggles with questions like "show me everything related to Ti-6Al-4V," "what properties define this material?" or "which standards specify this process?"

The graph answers those. It lets the product surface entity neighborhoods, show related-chunks sidebars, and power typeahead that knows the difference between a material name and a document title. Search finds what's semantically close; the graph reveals what's conceptually connected. Together they're far stronger than either one alone.
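
A question like "show me everything related to Ti-6Al-4V" becomes a simple lookup once relationships are stored as triples. The sketch below assumes a plain list of (source, relationship, target) tuples; the neighborhood function and the "inverse:" prefix are hypothetical, and the triples are illustrative rather than extracted from any real document.

```python
from collections import defaultdict

# Illustrative triples only.
triples = [
    ("Ti-6Al-4V", "HAS_PROPERTY", "tensile strength"),
    ("Ti-6Al-4V", "SPECIFIED_BY", "AMS 4928"),
    ("heat treatment", "USED_IN", "Ti-6Al-4V"),
]


def neighborhood(triples, entity):
    """Group everything directly connected to an entity by relationship type."""
    out = defaultdict(list)
    for source, rel, target in triples:
        if source == entity:
            out[rel].append(target)
        elif target == entity:
            out[f"inverse:{rel}"].append(source)  # incoming edge
    return dict(out)


hood = neighborhood(triples, "Ti-6Al-4V")
print(hood["HAS_PROPERTY"])  # what properties define this material?
print(hood["SPECIFIED_BY"])  # which standards specify it?
```

No amount of "find text that sounds like the question" produces this grouping directly, which is why the graph sits alongside semantic search rather than replacing it.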


Quick reference

Document — a single file (PDF, Word, etc.).

Transcript — audio or video that's been turned into text.

Dataset — a folder grouping documents and transcripts.

Build — the action that turns a dataset into something searchable.

Strategy — the AI-generated rules for how to extract information.

Chunk — a small text snippet from a document; the unit we search over.

Entity — a thing the documents talk about (a material, a property, a standard).

Relationship — a connection between two entities.

Graph — the full map of entities and relationships in a dataset.

RAG — "look it up before you answer."


TL;DR

Super-RAG turns piles of technical documents into a searchable, navigable knowledge base. It does this by chunking documents, building a graph of the concepts inside them, and offering both semantic search and graph navigation to anything that asks. It never writes the final answer itself — it just hands the right evidence to the product that does.