does the knowledge graph earn its keep?

benchmarking graph-augmented retrieval against pure vector RAG for the "what should this student see next?" problem, and auditing my own numbers until they held.

Kestrel is a fictitious name. the real system, its name, and some specifics are covered by IP and an NDA, so they’re redacted here. the architecture and the experiment are as they were.

Kestrel makes a bet i’ve defended in writing: a knowledge graph beats pure vector retrieval for education, because prerequisite structure is the whole point. you can’t recommend what comes next if you don’t model what came before. defending a decision in prose is cheap, so i built the benchmark that could prove me wrong.

the task is narrow and measurable. given a student’s state (what they’ve mastered and where they’re stuck), recommend the correct next concept. ground truth comes from the curriculum’s prerequisite DAG plus expert labels on a frozen set of student traces. four retrieval strategies, one question: does the graph actually earn the complexity it adds?

the short version: it does, but not for the reason i assumed, and three of my first-pass numbers were wrong in ways that would have shipped.

the four strategies

the four retrieval strategies under test. hybrid is Kestrel's approach: the graph gates the candidate set, the vector layer ranks within it.

pure vector: cosine similarity over concept embeddings. the RAG default.
vector + rerank: the same, with a cross-encoder reranking the top candidates.
graph-only: traverse the prerequisite DAG from the student’s frontier. no embeddings.
hybrid: the graph restricts candidates to prerequisite-valid concepts, then the vector layer ranks them by fit to what the student actually asked.
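to make "prerequisite-valid" concrete, here is a minimal sketch of the candidate set the graph produces. it is not Kestrel's code: the toy curriculum, the `PREREQS` mapping, and the `valid_next` helper are all hypothetical, and it assumes mastery is a plain set of concept names with only direct prerequisites checked.

```python
from typing import Dict, Set

# Hypothetical toy curriculum: concept -> set of direct prerequisites.
PREREQS: Dict[str, Set[str]] = {
    "counting": set(),
    "addition": {"counting"},
    "subtraction": {"counting"},
    "multiplication": {"addition"},
    "division": {"multiplication", "subtraction"},
}


def valid_next(prereqs: Dict[str, Set[str]], mastered: Set[str]) -> Set[str]:
    """Concepts not yet mastered whose direct prerequisites are all mastered."""
    return {
        concept
        for concept, needs in prereqs.items()
        if concept not in mastered and needs <= mastered
    }


frontier = valid_next(PREREQS, mastered={"counting", "addition"})
print(sorted(frontier))  # ['multiplication', 'subtraction']
```

graph-only would pick from this set directly; hybrid hands the same set to the vector layer for ranking.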

method

every strategy runs against the same 1,200 frozen student traces, each labelled by two curriculum experts for “correct next concept.” i hold the rest of the stack at its best verified configuration and vary only the retrieval strategy. four metrics:

metric	what it measures
next-concept accuracy	top recommendation respects prerequisites and matches the expert label
grounding faithfulness	tutoring response cites only retrieved material
latency p95	cold-cache tail latency, end to end
cost / 1k sessions	embedding + rerank + generation spend
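two of these metrics are easy to get subtly wrong, so here is a rough sketch of how they could be computed. the function names and the input shapes are assumptions for illustration. in particular, the post doesn't say how the two expert labels are reconciled, so this assumes one adjudicated label per trace, and it uses the nearest-rank definition of p95.

```python
from typing import Dict, Sequence, Set


def next_concept_accuracy(
    recommended: Sequence[str],
    expert_label: Sequence[str],
    mastered: Sequence[Set[str]],
    prereqs: Dict[str, Set[str]],
) -> float:
    """Share of traces where the top pick is prerequisite-valid AND equals the label."""
    if not recommended:
        raise ValueError("no traces to score")
    hits = 0
    for rec, label, known in zip(recommended, expert_label, mastered):
        respects_order = prereqs.get(rec, set()) <= known
        hits += int(respects_order and rec == label)
    return hits / len(recommended)


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: smallest sample with at least 95% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


# Hypothetical three-trace example: the second pick matches its label
# but skips a prerequisite, so it does not count as a hit.
toy_prereqs = {"a": set(), "b": {"a"}, "c": {"b"}}
acc = next_concept_accuracy(
    recommended=["b", "c", "b"],
    expert_label=["b", "c", "c"],
    mastered=[{"a"}, {"a"}, {"a", "b"}],
    prereqs=toy_prereqs,
)
print(round(acc, 3))  # 0.333
```

requiring both conditions is the point of the metric: a pick that matches the label but jumps the ordering should not score.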

the one rule that mattered: no number counts until its config is proven from logs and it passes an anchor check. does it match physics, or a value i already trust?

results

figure: next-concept accuracy across strategies. pure vector 61%, vector + rerank 68%, graph-only 74%, hybrid 83%. hybrid leads; pure vector is the floor.

the full table:

strategy	next-concept accuracy	grounding faithfulness	latency p95 (cold)	cost / 1k sessions
pure vector	61%	88%	240 ms	$1.9
vector + rerank	68%	90%	520 ms	$3.4
graph-only	74%	95%	90 ms	$0.7
hybrid	83%	96%	310 ms	$2.1

and the request path the winning strategy actually takes through the orchestrator:

PlantUML diagram: hybrid retrieval inside a tutoring turn. the graph answers 'what is valid next'; the vector layer answers 'what did they mean'.

three things that surprised me

pure vector fails in the specific way the graph was built to prevent. ~30% of its recommendations are plausible but out of order, semantically close concepts the student isn’t ready for. cosine similarity has no notion of “before.”
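a number like that comes from splitting misses by kind rather than lumping them together. a minimal sketch of such a breakdown, with a hypothetical trace shape of `(recommended, expert_label, mastered)` and made-up concept names; it checks direct prerequisites only.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical trace shape: (recommended, expert_label, mastered concepts).
Trace = Tuple[str, str, Set[str]]


def error_breakdown(traces: Iterable[Trace], prereqs: Dict[str, Set[str]]) -> Counter:
    """Count recommendations as correct, out of order, or valid but wrong."""
    counts: Counter = Counter()
    for rec, label, mastered in traces:
        if not prereqs.get(rec, set()) <= mastered:
            counts["out_of_order"] += 1  # student lacks a prerequisite
        elif rec == label:
            counts["correct"] += 1
        else:
            counts["valid_but_wrong"] += 1  # ordering respected, wrong concept
    return counts


toy_prereqs = {"fractions": set(), "ratios": {"fractions"}, "proportions": {"ratios"}}
toy_traces = [
    ("ratios", "ratios", {"fractions"}),
    ("proportions", "ratios", {"fractions"}),  # close in meaning, too early
    ("fractions", "ratios", {"fractions"}),
]
breakdown = error_breakdown(toy_traces, toy_prereqs)
print(breakdown["out_of_order"] / len(toy_traces))
```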

graph-only is the efficiency winner, not the accuracy winner. fastest and cheapest by a wide margin, and it never violates ordering, but it stumbles when a student’s question is phrased in ways the edges don’t encode. it knows what’s valid next; it doesn’t always know what they meant.

hybrid’s win comes from gating, not blending. i expected the lift to come from averaging two signals. it doesn’t. it comes from the graph shrinking the candidate set to prerequisite-valid concepts before the vector search runs. constrain first, rank second.
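"constrain first, rank second" fits in a few lines. this is a toy sketch under loud assumptions: the 2-d embeddings, the concept names, and the `rank` and `hybrid` helpers are invented for illustration, and a real system would query a vector store rather than sort in memory.

```python
import math
from typing import Dict, Iterable, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: Sequence[float], emb: Dict[str, Sequence[float]],
         candidates: Iterable[str]) -> List[str]:
    """Order candidates by cosine similarity to the query, best first."""
    return sorted(candidates, key=lambda c: cosine(query, emb[c]), reverse=True)


def hybrid(query: Sequence[float], emb: Dict[str, Sequence[float]],
           prereqs: Dict[str, Set[str]], mastered: Set[str]) -> List[str]:
    """Gate to prerequisite-valid concepts first, then rank only those."""
    gated = [c for c in emb if c not in mastered and prereqs[c] <= mastered]
    return rank(query, emb, gated)


# Hypothetical data: the query sits closest to "proportions",
# which the student is not ready for.
EMB = {
    "fractions": [1.0, 0.0],
    "ratios": [0.8, 0.6],
    "proportions": [0.6, 0.8],
    "decimals": [0.0, 1.0],
}
PRE = {
    "fractions": set(),
    "ratios": {"fractions"},
    "proportions": {"ratios"},
    "decimals": {"fractions"},
}
query = [0.5, 0.85]
mastered = {"fractions"}

print(rank(query, EMB, list(EMB))[0])        # proportions (out of order)
print(hybrid(query, EMB, PRE, mastered)[0])  # ratios
```

no score is averaged anywhere: the graph never votes on the ranking, it only decides what is allowed to be ranked.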

the part i actually care about: not trusting my own numbers

left to a single pass, this benchmark would have shipped three wrong headline numbers. all plausible. all wrong.

graph-only scored 91% first. too good. the eval was drawing candidates from the same prerequisite edges it used to score them: label leakage. i held out the scoring edges and it dropped to 74%.
latency was measured against a warm Qdrant cache. cold p95 was ~3× higher. re-ran cold; the table above is cold.
two strategies scored identically on grounding. a rubric-scorer timeout was silently defaulting to “pass.” fixed the default to “fail-and-flag,” and the tie disappeared.
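the third fix is the easiest to show in code. a minimal sketch of "fail-and-flag", assuming the rubric scorer is a callable that raises `TimeoutError` when it runs out of time; the `Verdict` type and both function names are hypothetical, and how the real harness detects a timeout isn't stated here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Verdict:
    passed: bool
    flagged: bool  # True means "no real score; re-run or inspect by hand"


def score_grounding(scorer: Callable[[str], bool], response: str) -> Verdict:
    """Run the rubric scorer; a timeout is a flagged failure, never a pass."""
    try:
        return Verdict(passed=bool(scorer(response)), flagged=False)
    except TimeoutError:
        # The first-pass bug was equivalent to returning a pass here.
        return Verdict(passed=False, flagged=True)


def summarize(verdicts: Iterable[Verdict]) -> Tuple[float, int]:
    """Return (pass rate over all responses, number of flagged responses)."""
    items = list(verdicts)
    if not items:
        raise ValueError("no verdicts to summarize")
    passed = sum(v.passed for v in items)
    flagged = sum(v.flagged for v in items)
    return passed / len(items), flagged


def timing_out_scorer(_response: str) -> bool:
    raise TimeoutError("rubric scorer exceeded its budget")


verdicts = [
    score_grounding(lambda r: True, "cites only retrieved material"),
    score_grounding(timing_out_scorer, "never actually scored"),
]
print(summarize(verdicts))  # (0.5, 1)
```

reporting the flagged count next to the rate is the design choice that matters: a nonzero count says the rate itself isn't trustworthy yet.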

“it ran and produced a number” is not the same as “the number is true.” every assumption you offload to the model, or to your own eval harness, is one you’ve quietly said you’re fine not verifying.

the graph earns its keep. but the more durable lesson is the second one: the system around the model has to be correct in ways the model, and your first draft of the measurement, cannot be.