
How We Build a Knowledge Graph From Your Documents

06/20/2026 · 7 min read

A behind-the-scenes guide for technically-curious readers.

What this is, in one paragraph

When you load a set of documents into the platform, we don't just chop them up for keyword and semantic search. We also read through the material and build a knowledge graph — a structured map of the things your documents talk about (materials, specifications, properties, products, sections) and the relationships between them. That graph lets you ask entity-centric questions (“what is connected to alloy 2024-T3?”), explore neighborhoods of related concepts, and get answers that are grounded in — and traceable back to — the exact passages they came from.

This post explains, in plain terms, how that graph gets built and how it's used.

1. Why a knowledge graph (and not just search)?

Traditional search is good at finding passages that look like your question. It's weak at questions about how things relate:

“Which tempers and properties are associated with this alloy?”
“What standards supersede or reference this one?”
“Show me everything in this corpus that’s closely connected to this concept.”

A knowledge graph answers those directly, because it stores the entities and the relationships as first-class data — not as text you have to re-read every time. Think of search as a great index, and the graph as a map.

Two things make our graph trustworthy for engineering-grade use:

Every entity is grounded in a source. Nodes are not invented by a language model in the abstract — each one is tied to the specific passage(s) it was found in. If it isn't in your documents, it isn't in the graph.
It's versioned. Rebuilding the graph never disrupts what's live (more on this in §5).

2. The lifecycle: from upload to a live graph

Building the graph is a deliberate, opt-in step — not something that silently happens on every file upload. The journey has three stages:

  UPLOAD                 BUILD                      LIVE
┌─────────┐   ┌──────────────────────────┐   ┌────────────────┐
│ Add your│   │ Read · extract · clean · │   │ Query the graph│
│documents│ → │ cluster · label · connect│ → │ + chat answers │
└─────────┘   └──────────────────────────┘   └────────────────┘
  (indexed)        (graph generation)            (graphed)

Upload / index — Documents are parsed and split into passages (“chunks”) for retrieval. At this point you have a searchable dataset, but no graph yet.
Build — When you choose to generate a knowledge graph for a dataset, the build pipeline (§3) runs over those chunks. Because this is the most compute-intensive step, it is optional per dataset and you're shown an estimate of time and cost before it runs, so you're never surprised.
Live — Once the build completes, the graph is queryable (§4) and powers richer answers.

3. Inside the build: how the graph is generated

The build runs as an ordered pipeline. We've grouped it into five plain-language stages below; under the hood there are a few extra repair and refinement passes that make the result more complete and reliable.

Stage 1 — Read and extract

We pass your document chunks, in small batches, through a fast and cost-efficient language model (Claude Haiku). For each batch it identifies the entities present (e.g., a material, a temper, a property, a specification, a section) and the relationships between them (e.g., has-property, measured-in, referenced-in, supersedes). Every extracted item records where it came from.

Built-in safety net: chunks that come back empty or low-confidence get a second, closer look (we call these repair passes) so important entities aren't missed on a first scan. The pipeline also saves its work incrementally — if a build is interrupted, it resumes where it left off rather than starting over.
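
To make the batch, repair, and resume behavior concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the production pipeline: `run_extraction`, `toy_extract_batch`, the JSON checkpoint file, and the batch size are all hypothetical, and the toy extractor stands in for the language-model call.

```python
import json
import os

def toy_extract_batch(batch):
    # Hypothetical stand-in for the model call: any token containing a digit
    # is treated as an "entity". The real extractor is a language model.
    return {
        chunk["id"]: [tok for tok in chunk["text"].split()
                      if any(ch.isdigit() for ch in tok)]
        for chunk in batch
    }

def run_extraction(chunks, extract_batch, checkpoint_path, batch_size=2):
    """Process chunks in small batches, saving progress after every batch."""
    # Resume: load whatever an earlier (possibly interrupted) run already saved.
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    pending = [c for c in chunks if c["id"] not in done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        results = extract_batch(batch)
        for chunk in batch:
            entities = results.get(chunk["id"], [])
            if not entities:
                # Repair pass (assumed form): look at the empty chunk again, alone.
                entities = extract_batch([chunk]).get(chunk["id"], [])
            # Every extracted item records the chunk it came from.
            done[chunk["id"]] = [
                {"name": name, "source_chunk": chunk["id"]} for name in entities
            ]
        # Incremental save: an interruption loses at most the current batch.
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return done
```

Calling `run_extraction` a second time with the same checkpoint path finds nothing pending and makes no further extractor calls, which is the resume-rather-than-restart property described above.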

Stage 2 — Clean up duplicates (canonicalization)

The same real-world thing shows up under many surface forms: “2024-T3”, “Al 2024-T3”, “aluminum 2024-T3”. This stage clusters the text embeddings of entity names by similarity and merges close variants into a single canonical entity, while being careful not to merge things that only look similar (e.g., two genuinely different alloy tempers). The merged variants are kept as searchable aliases.
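
The merging idea can be sketched in a few lines of Python. This is a sketch only: the three-dimensional vectors are made-up toy embeddings, the 0.95 cosine threshold is an assumption, and union-find merging is one simple way to cluster by similarity. The post does not specify the embedding model, the threshold, or the clustering method actually used.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def canonicalize(embeddings, threshold=0.95):
    """Merge surface forms whose embeddings are at least `threshold` similar."""
    names = list(embeddings)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # The threshold is the guard against merging look-alikes.
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # First-seen surface form becomes the canonical name (an arbitrary choice
    # for this sketch); the remaining forms are kept as aliases.
    return {members[0]: members[1:] for members in groups.values()}

# Toy vectors, invented for illustration only.
toy_embeddings = {
    "2024-T3": [1.0, 0.0, 0.1],
    "Al 2024-T3": [0.98, 0.05, 0.1],
    "aluminum 2024-T3": [0.97, 0.02, 0.12],
    "2024-T4": [0.7, 0.7, 0.1],
}
canonical = canonicalize(toy_embeddings)
# canonical == {"2024-T3": ["Al 2024-T3", "aluminum 2024-T3"], "2024-T4": []}
```

In this toy data the different temper sits well below the threshold, so it stays separate. With real embeddings, near-identical names for different tempers can score very close, which is why a production system needs more care than a single cutoff.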

Stage 3 — Find communities

With clean entities and relationships in place, a graph-clustering algorithm groups densely-connected entities into communities — neighborhoods of concepts that belong together. This is what lets you see the forest, not just the trees: instead of thousands of nodes, you get a navigable set of themed clusters.
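
The post does not name the clustering algorithm, so the Python sketch below uses label propagation purely as a stand-in to show what "grouping densely-connected entities" means. Each node repeatedly adopts the label most common among its neighbors. The tie-break rule (largest label wins) is an arbitrary assumption made only to keep the sketch deterministic, and a different rule can produce different communities.

```python
from collections import Counter

def label_propagation(edges, max_rounds=20):
    """Group nodes by repeatedly adopting the most common neighbor label."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = {node: node for node in neighbors}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbors):  # fixed order keeps runs repeatable
            counts = Counter(labels[nb] for nb in neighbors[node])
            top = max(counts.values())
            # Tie-break: largest label among the most frequent ones.
            best = max(label for label, c in counts.items() if c == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break

    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return sorted(sorted(members) for members in communities.values())

# Two triangles joined by a single edge (C-D).
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F"),
             ("C", "D")]
toy_communities = label_propagation(toy_edges)
# toy_communities == [["A", "B", "C"], ["D", "E", "F"]]
```

The two triangles come out as separate communities, and the single C-D edge between them is exactly the kind of link Stage 5 later treats as a bridge.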

Stage 4 — Label the communities

A language model writes a short, human-readable label and summary for each community, and we identify the most central (“representative”) entities in each one — so an overview view is meaningful at a glance.
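
How "most central" is measured is not stated in this post. As one plausible reading, the Python sketch below ranks a community's members by how many connections they have inside that community (degree centrality). The function name, the `top_k` parameter, and the example data are hypothetical.

```python
def representatives(community, edges, top_k=2):
    """Rank a community's members by their number of in-community links."""
    members = set(community)
    degree = {node: 0 for node in members}
    for a, b in edges:
        # Only count edges that stay inside the community.
        if a in members and b in members:
            degree[a] += 1
            degree[b] += 1
    # Highest degree first; names break ties so the result is stable.
    return sorted(members, key=lambda node: (-degree[node], node))[:top_k]

toy_community = ["2024-T3", "tensile strength", "yield strength", "AMS 4037"]
toy_edges = [
    ("2024-T3", "tensile strength"),
    ("2024-T3", "yield strength"),
    ("2024-T3", "AMS 4037"),
    ("tensile strength", "yield strength"),
    ("2024-T3", "corrosion resistance"),  # leaves the community, ignored
]
top = representatives(toy_community, toy_edges)
# top == ["2024-T3", "tensile strength"]
```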

Stage 5 — Connect and rank

Finally, the pipeline:

Identifies bridge connections — the relationships that link otherwise-separate communities, which are often the most interesting “how does X relate to Y across topics?” links.
Computes mention counts (how often each entity appears), which serve as an importance signal for ranking and for sizing nodes in the visual explorer.
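
Both steps are simple once communities exist. In the Python sketch below, a bridge is taken to mean what this post means by it, an edge whose endpoints sit in different communities (not the graph-theory sense of an edge whose removal disconnects the graph). The data shapes and names are assumptions for illustration.

```python
from collections import Counter

def find_bridges(edges, community_of):
    """Return edges whose two endpoints belong to different communities."""
    return [(a, b) for a, b in edges if community_of[a] != community_of[b]]

def mention_counts(mentions):
    """Count how many (entity, chunk) mentions each entity has."""
    return Counter(entity for entity, _chunk_id in mentions)

toy_community_of = {
    "2024-T3": "alloys", "tensile strength": "alloys",
    "AMS 4037": "standards", "Section 5": "standards",
}
toy_edges = [
    ("2024-T3", "tensile strength"),
    ("AMS 4037", "Section 5"),
    ("2024-T3", "AMS 4037"),  # crosses from "alloys" to "standards"
]
toy_mentions = [
    ("2024-T3", "c1"), ("2024-T3", "c4"), ("tensile strength", "c1"),
]

bridges = find_bridges(toy_edges, toy_community_of)
# bridges == [("2024-T3", "AMS 4037")]
counts = mention_counts(toy_mentions)
# counts["2024-T3"] == 2 and counts["tensile strength"] == 1
```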

The result is a graph of canonical entities, typed relationships, themed communities, and bridges, every piece of which can be traced back to a source passage.

4. How you use the graph

Once a dataset is “graphed,” the knowledge graph is available through a small set of purpose-built query tools. These power both direct API use and the chat experience:

Entity search — “Find the entities that match this description” (e.g., alloys with high tensile strength). Returns canonical entities with their properties and how often they appear.
Graph neighborhood — “Show me what's connected to this entity,” one or two hops out — with the source passages behind each connection. This is what the visual graph explorer calls when you click a node.
Community overviews — “Give me the themed neighborhoods in this dataset” — the labeled communities and their most central entities, for orientation and drill-down.
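
The neighborhood query is the easiest of the three to picture in code. The Python sketch below is a plain breadth-first walk limited to a number of hops, returning each edge together with the chunk it came from. It is not the platform's API: the function name, the tuple layout, and the example edges are all hypothetical.

```python
from collections import deque

def neighborhood(edges, start, hops=1):
    """Walk outward from `start` up to `hops` steps.

    edges: list of (source, relation, target, chunk_id) tuples.
    Returns (depth per reached entity, traversed edges with their sources).
    """
    adjacency = {}
    for edge in edges:
        src, _relation, dst, _chunk_id = edge
        adjacency.setdefault(src, []).append((dst, edge))
        adjacency.setdefault(dst, []).append((src, edge))

    depth = {start: 0}
    found = []  # traversed edges, in discovery order
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == hops:
            continue  # hop limit reached, do not expand further
        for other, edge in adjacency.get(node, []):
            if edge not in found:
                found.append(edge)
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return depth, found

toy_edges = [
    ("2024-T3", "has-property", "tensile strength", "c1"),
    ("tensile strength", "measured-in", "MPa", "c2"),
    ("2024-T3", "referenced-in", "AMS 4037", "c3"),
    ("MPa", "referenced-in", "Section 5", "c4"),
]
one_hop, one_hop_edges = neighborhood(toy_edges, "2024-T3", hops=1)
# one_hop reaches "tensile strength" and "AMS 4037"; source chunks are c1 and c3
two_hop, two_hop_edges = neighborhood(toy_edges, "2024-T3", hops=2)
# two_hop additionally reaches "MPa" (via chunk c2) but not "Section 5"
```

Because every edge tuple carries its chunk id, the caller can show the supporting passage next to each connection, which is what makes the answers citable.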

In a chat setting, the assistant can decide a question is entity-centric, pull the relevant neighborhood from the graph first, and then elaborate with passage-level search — producing answers that are both connected and citable.

5. How we keep it trustworthy, fast, and under your control

Grounded, not hallucinated. Every node and edge is tied to the specific source passage(s) it was extracted from; anything that can’t be traced back to the text is dropped rather than kept. No source, no entity.
Tenant-isolated. All graph data is scoped to your organization and dataset; queries are filtered on those boundaries at every layer.
Versioned, blue/green rebuilds. When a graph is rebuilt (e.g., after you change strategy or add documents), the new version is constructed alongside the existing one and the system atomically switches over when it’s ready. Live queries never see a half-built graph.
Opt-in and cost-aware. Graph generation is the most resource-intensive step, so it’s an explicit, per-dataset choice with an up-front time-and-cost estimate.
Resilient builds. Work is saved incrementally batch-by-batch, so an interrupted build resumes rather than restarting — important for large corpora.
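
The versioned rebuild and tenant scoping described in the list above can be pictured with a small Python sketch. It assumes an in-memory store and a lock-protected pointer swap, and the class and method names are hypothetical. A production system would more likely do the equivalent switch in a database.

```python
import threading

class GraphStore:
    """Toy blue/green store keyed by organization and dataset."""

    def __init__(self):
        self._graphs = {}  # (org_id, dataset_id, version) -> graph
        self._live = {}    # (org_id, dataset_id) -> live version
        self._lock = threading.Lock()

    def rebuild(self, org_id, dataset_id, version, builder):
        # Build alongside the live version; readers are unaffected meanwhile.
        # If builder() raises, the live pointer is never touched.
        graph = builder()
        with self._lock:
            self._graphs[(org_id, dataset_id, version)] = graph
            self._live[(org_id, dataset_id)] = version  # the switch-over

    def query(self, org_id, dataset_id):
        # Lookups are always scoped to one organization and dataset.
        with self._lock:
            version = self._live.get((org_id, dataset_id))
            if version is None:
                return None
            return self._graphs[(org_id, dataset_id, version)]

store = GraphStore()
store.rebuild("acme", "alloys", "v1", lambda: {"nodes": 2})
live_graph = store.query("acme", "alloys")
# live_graph == {"nodes": 2}; any other organization's query returns None
```

The design point is that the expensive work happens before the lock is taken, so the only thing readers can ever observe is the old complete version or the new complete version.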

6. Where this is heading

The graph engine is actively evolving. On the roadmap (in rough priority order):

Confidence scoring on relationships — an entailment check that lets the interface flag or dim lower-confidence connections, valuable for regulated and high-assurance use.
Denser, sharper communities — a more advanced clustering algorithm that splits broad neighborhoods into more meaningful, topically-distinct ones.
Incremental updates — adding a single document updates only the affected part of the graph instead of rebuilding the whole thing — faster and cheaper as your corpus grows.
Performance caching — pre-computing the most-traveled neighborhoods so the interactive explorer stays instant at scale.

Mini-glossary

Entity — a distinct thing the documents talk about (a material, temper, property, spec, section).
Relationship (edge) — a typed link between two entities (e.g., has-property).
Mention — a specific place in a specific document where an entity appears; the evidence behind a node.
Canonicalization — merging different names for the same thing into one canonical entity.
Community — a cluster of closely-related entities; a “neighborhood” of the graph.
Bridge — a connection that links two otherwise-separate communities.
Chunk — a passage a document is split into for search and extraction.
Blue/green — building a new version alongside the live one and switching over atomically, with zero disruption.

This post describes the production behavior of the Graph Generation v2 pipeline as of June 2026. Specific models, thresholds, and stage details may evolve as the engine improves; the principles above — grounded entities, versioned rebuilds, opt-in and cost-aware generation — are stable.
