does the knowledge graph earn its keep?

benchmarking graph-augmented retrieval against pure vector RAG for the "what should this student see next?" problem, and auditing my own numbers until they held.

RAG, knowledge graphs, retrieval, evaluation

Kestrel is a fictitious name. the real system, its name, and some specifics are covered by IP and an NDA, so they’re redacted here. the architecture and the experiment are as they were.

Kestrel makes a bet i’ve defended in writing: a knowledge graph beats pure vector retrieval for education, because prerequisite structure is the whole point. you can’t recommend what comes next if you don’t model what came before. defending a decision in prose is cheap, so i built the benchmark that could prove me wrong.

the task is narrow and measurable. given a student’s state (what they’ve mastered and where they’re stuck), recommend the correct next concept. ground truth comes from the curriculum’s prerequisite DAG plus expert labels on a frozen set of student traces. four retrieval strategies, one question: does the graph actually earn the complexity it adds?

the short version: it does, but not for the reason i assumed, and two of my first-pass numbers were wrong in ways that would have shipped.

the four strategies

the four retrieval strategies under test. hybrid is Kestrel's approach: the graph gates the candidate set, the vector layer ranks within it.

method

every strategy runs against the same 1,200 frozen student traces, each labelled by two curriculum experts for “correct next concept.” i hold the rest of the stack at its best verified configuration and vary only the retrieval strategy. four metrics:

metric | what it measures
next-concept accuracy | top recommendation respects prerequisites and matches the expert label
grounding faithfulness | tutoring response cites only retrieved material
latency p95 | cold-cache tail latency, end to end
cost / 1k sessions | embedding + rerank + generation spend
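
to make the first and third metrics concrete, here is a minimal sketch of how they could be computed over a set of traces. it is not Kestrel's harness: the `Trace` fields, the `prereqs` mapping (concept to its direct prerequisites), and the nearest-rank percentile are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical record shape; the real trace schema is not shown in the post.
    recommended: str      # the strategy's top recommendation
    expert_label: str     # expert-labelled correct next concept
    mastered: frozenset   # concepts the student has already mastered
    latency_ms: float     # cold-cache, end-to-end latency for this turn


def next_concept_accuracy(traces, prereqs):
    """Share of traces whose top recommendation matches the expert label
    AND has every direct prerequisite already mastered."""
    hits = 0
    for t in traces:
        prereqs_met = prereqs.get(t.recommended, set()) <= t.mastered
        if prereqs_met and t.recommended == t.expert_label:
            hits += 1
    return hits / len(traces)


def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]
```

the accuracy check is deliberately strict: a recommendation that matches the label but skips a prerequisite still counts as a miss, which mirrors the wording of the metric above.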

the one rule that mattered: no number counts until its config is proven from logs and it passes an anchor check: does it match physics, or a value i already trust?
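
one way to encode that rule so it can't be skipped is a small gate in front of the results table. this is a sketch under assumptions: the `Measurement` shape, the `config_verified_from_logs` flag, and the 10% relative tolerance are hypothetical, not how the original audit was implemented.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    name: str
    value: float
    # Hypothetical flag: set to True only after the run's config has been
    # confirmed by reading the logs, not by trusting the launch script.
    config_verified_from_logs: bool


def passes_anchor(value, anchor, rel_tol):
    """True when value lies within rel_tol of an already-trusted anchor."""
    return abs(value - anchor) <= rel_tol * abs(anchor)


def counts(measurement, anchor, rel_tol=0.10):
    """A number counts only if its config is proven AND it matches the anchor."""
    return measurement.config_verified_from_logs and passes_anchor(
        measurement.value, anchor, rel_tol
    )
```

the tolerance is a judgment call per metric; the point is only that both conditions are checked in code rather than remembered.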

results

[bar chart: next-concept accuracy across strategies. pure vector 61%, vector + rerank 68%, graph-only 74%, hybrid 83%.] hybrid leads; pure vector is the floor.

the full table:

strategy | next-concept acc. | grounding | latency p95 | cost / 1k sessions
pure vector | 61% | 88% | 240 ms | $1.9
vector + rerank | 68% | 90% | 520 ms | $3.4
graph-only | 74% | 95% | 90 ms | $0.7
hybrid | 83% | 96% | 310 ms | $2.1

and the request path the winning strategy actually takes through the orchestrator:

[PlantUML diagram: hybrid retrieval inside a tutoring turn.] the graph answers 'what is valid next'; the vector layer answers 'what did they mean'.

three things that surprised me

pure vector fails in the specific way the graph was built to prevent. ~30% of its recommendations are plausible but out of order: semantically close concepts the student isn’t ready for. cosine similarity has no notion of “before.”
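
"out of order" can be checked mechanically against the prerequisite DAG. a minimal sketch, assuming `prereqs` maps each concept to its direct prerequisites and each recommendation is paired with the student's mastered set (both shapes are hypothetical):

```python
def is_out_of_order(concept, mastered, prereqs):
    """True when at least one direct prerequisite of `concept` is not mastered."""
    return not (prereqs.get(concept, set()) <= mastered)


def out_of_order_rate(recommendations, prereqs):
    """recommendations: list of (recommended_concept, mastered_set) pairs."""
    violations = sum(
        is_out_of_order(concept, mastered, prereqs)
        for concept, mastered in recommendations
    )
    return violations / len(recommendations)
```

this only looks at direct prerequisites; it assumes a mastered concept implies its own prerequisites were mastered earlier.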

graph-only is the efficiency winner, not the accuracy winner. fastest and cheapest by a wide margin, and it never violates ordering, but it stumbles when a student’s question is phrased in ways the edges don’t encode. it knows what’s valid next; it doesn’t always know what they meant.

hybrid’s win comes from gating, not blending. i expected the lift to come from averaging two signals. it doesn’t. it comes from the graph shrinking the candidate set to prerequisite-valid concepts before the vector search runs. constrain first, rank second.
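
constrain first, rank second can be sketched in a few lines. this is an illustration, not Kestrel's implementation: the function names, the in-memory `embeddings` dict, and plain cosine similarity as the ranker are all assumptions, and `prereqs` is assumed to list every concept as a key (roots map to an empty set).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def valid_next(prereqs, mastered):
    """Graph gate: unmastered concepts whose direct prerequisites are all mastered."""
    return {
        concept
        for concept, needs in prereqs.items()
        if concept not in mastered and needs <= mastered
    }


def hybrid_recommend(query_vec, embeddings, prereqs, mastered, k=1):
    """Gate with the graph, then rank only the survivors by similarity."""
    candidates = valid_next(prereqs, mastered)
    ranked = sorted(
        candidates,
        # Highest similarity first; concept name breaks ties deterministically.
        key=lambda c: (-cosine(query_vec, embeddings[c]), c),
    )
    return ranked[:k]
```

in a toy curriculum where "decimals" is the closest embedding to the query but requires "fractions", a pure similarity search would return "decimals"; the gate removes it before ranking, so the best prerequisite-valid concept wins instead. no scores are averaged anywhere.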

the part i actually care about: not trusting my own numbers

left to a single pass, this benchmark would have shipped three wrong headline numbers. all plausible. all wrong.

“it ran and produced a number” is not the same as “the number is true.” every assumption you offload to the model, or to your own eval harness, is one you’ve quietly said you’re fine not verifying.

the graph earns its keep. but the more durable lesson is the second one: the system around the model has to be correct in ways the model, and your first draft of the measurement, cannot be.