Kestrel is a fictitious name. the real system, its name, and some specifics are covered by IP and an NDA, so they’re redacted here. the architecture and the experiment are as they were.
Kestrel makes a bet i’ve defended in writing: a knowledge graph beats pure vector retrieval for education, because prerequisite structure is the whole point. you can’t recommend what comes next if you don’t model what came before. defending a decision in prose is cheap, so i built the benchmark that could prove me wrong.
the task is narrow and measurable. given a student’s state (what they’ve mastered and where they’re stuck), recommend the correct next concept. ground truth comes from the curriculum’s prerequisite DAG plus expert labels on a frozen set of student traces. four retrieval strategies, one question: does the graph actually earn the complexity it adds?
the short version: it does, but not for the reason i assumed, and three of my first-pass numbers were wrong in ways that would have shipped.
the four strategies
- pure vector: cosine similarity over concept embeddings. the RAG default.
- vector + rerank: the same retrieval, with a cross-encoder reranking the top candidates.
- graph-only: traverse the prerequisite DAG from the student’s frontier. no embeddings.
- hybrid: the graph restricts candidates to prerequisite-valid concepts, then the vector layer ranks them by fit to what the student actually asked.
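a minimal sketch of what “the student’s frontier” could mean, assuming the DAG is stored as a map from each concept to its direct prerequisites. the toy curriculum and the `frontier` helper are made up for illustration; they are not Kestrel’s code.

```python
from typing import Dict, Set

# hypothetical toy curriculum: concept -> set of direct prerequisites
PREREQS: Dict[str, Set[str]] = {
    "counting": set(),
    "addition": {"counting"},
    "subtraction": {"counting"},
    "multiplication": {"addition"},
    "division": {"multiplication", "subtraction"},
}


def frontier(prereqs: Dict[str, Set[str]], mastered: Set[str]) -> Set[str]:
    """concepts not yet mastered whose direct prerequisites are all mastered."""
    return {
        concept
        for concept, needs in prereqs.items()
        if concept not in mastered and needs <= mastered
    }


print(sorted(frontier(PREREQS, {"counting", "addition"})))
# ['multiplication', 'subtraction']
```

division stays off the frontier because one of its prerequisites is still unmastered. that is the ordering guarantee in its simplest form.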
method
every strategy runs against the same 1,200 frozen student traces, each labelled by two curriculum experts for “correct next concept.” i hold the rest of the stack at its best verified configuration and vary only the retrieval strategy. four metrics:
| metric | what it measures |
|---|---|
| next-concept accuracy | top recommendation respects prerequisites and matches the expert label |
| grounding faithfulness | tutoring response cites only retrieved material |
| latency p95 | cold-cache tail latency, end to end |
| cost / 1k sessions | embedding + rerank + generation spend |
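the first metric is a conjunction, which is easy to get subtly wrong. a sketch of how it could be computed, assuming each trace carries a single adjudicated expert label (how the two experts’ labels are reconciled isn’t covered here) and that the `Trace` fields are hypothetical names:

```python
from typing import Dict, List, NamedTuple, Set


class Trace(NamedTuple):
    mastered: Set[str]   # concepts the student has already mastered
    expert_label: str    # assumed: one adjudicated "correct next concept"
    top_pick: str        # the strategy's top recommendation


def next_concept_accuracy(traces: List[Trace], prereqs: Dict[str, Set[str]]) -> float:
    """share of traces where the top pick is prerequisite-valid AND matches the label."""
    if not traces:
        raise ValueError("no traces to score")
    hits = 0
    for t in traces:
        # an unknown concept is treated as invalid rather than trivially valid
        ordered = t.top_pick in prereqs and prereqs[t.top_pick] <= t.mastered
        if ordered and t.top_pick == t.expert_label:
            hits += 1
    return hits / len(traces)


# made-up data, for illustration only
TOY_PREREQS = {"a": set(), "b": {"a"}, "c": {"b"}}
TOY_TRACES = [
    Trace({"a"}, "b", "b"),        # valid and matches: hit
    Trace({"a"}, "b", "c"),        # skips a prerequisite: miss
    Trace(set(), "a", "a"),        # valid and matches: hit
    Trace({"a", "b"}, "c", "b"),   # prerequisite-valid but wrong concept: miss
]
print(next_concept_accuracy(TOY_TRACES, TOY_PREREQS))  # 0.5
```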
the one rule that mattered: no number counts until its config is proven from logs and it passes an anchor check: does it match physics, or a value i already trust?
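one way an anchor check could look in code, as a sketch only. `anchor_check`, its parameters, and every number in the usage lines are invented for illustration. the idea is two gates: a hard range the value cannot physically leave, and an optional tolerance around a value that is already trusted.

```python
from typing import List, Optional


def anchor_check(
    name: str,
    value: float,
    lo: float,
    hi: float,
    trusted: Optional[float] = None,
    tol: float = 0.0,
) -> List[str]:
    """return reasons to distrust `value`; an empty list means it passed."""
    problems = []
    if not lo <= value <= hi:
        problems.append(f"{name}={value} is outside the possible range [{lo}, {hi}]")
    if trusted is not None and abs(value - trusted) > tol:
        problems.append(f"{name}={value} is more than {tol} away from trusted value {trusted}")
    return problems


# made-up numbers: an accuracy above 1.0 cannot be real
print(anchor_check("accuracy", 1.07, 0.0, 1.0))
# made-up numbers: within tolerance of a previously verified measurement
print(anchor_check("latency_ms", 125.0, 0.0, float("inf"), trusted=120.0, tol=10.0))  # []
```

returning reasons instead of a bare boolean keeps the failure readable in logs, which matters if the rule is that numbers have to be proven from logs.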
results
the full table:
| strategy | next-concept acc. | grounding | latency p95 | cost / 1k |
|---|---|---|---|---|
| pure vector | 61% | 88% | 240 ms | $1.9 |
| vector + rerank | 68% | 90% | 520 ms | $3.4 |
| graph-only | 74% | 95% | 90 ms | $0.7 |
| hybrid | 83% | 96% | 310 ms | $2.1 |
and the request path the winning strategy actually takes through the orchestrator:
three things that surprised me
pure vector fails in the specific way the graph was built to prevent. ~30% of its recommendations are plausible but out of order: semantically close concepts the student isn’t ready for. cosine similarity has no notion of “before.”
graph-only is the efficiency winner, not the accuracy winner. fastest and cheapest by a wide margin, and it never violates ordering, but it stumbles when a student’s question is phrased in ways the edges don’t encode. it knows what’s valid next; it doesn’t always know what they meant.
hybrid’s win comes from gating, not blending. i expected the lift to come from averaging two signals. it doesn’t. it comes from the graph shrinking the candidate set to prerequisite-valid concepts before the vector search runs. constrain first, rank second.
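“constrain first, rank second” can be sketched in a few lines. this is a toy under stated assumptions, not the production path: the 2-d embeddings, the concept names, and the `rank` and `hybrid_recommend` helpers are all made up, and a plain-python cosine stands in for a real vector store.

```python
import math
from typing import Dict, Iterable, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank(query: Sequence[float], candidates: Iterable[str],
         emb: Dict[str, Sequence[float]]) -> List[str]:
    """order candidates by cosine similarity to the query, best first."""
    return sorted(candidates, key=lambda c: cosine(query, emb[c]), reverse=True)


def hybrid_recommend(query: Sequence[float], emb: Dict[str, Sequence[float]],
                     prereqs: Dict[str, Set[str]], mastered: Set[str]) -> List[str]:
    # gate: keep only unmastered concepts whose prerequisites are all mastered
    valid = [c for c, needs in prereqs.items()
             if c not in mastered and needs <= mastered]
    # rank: similarity only ever sees the gated set
    return rank(query, valid, emb)


# hypothetical toy data
PREREQS = {"fractions": set(), "ratios": {"fractions"},
           "decimals": {"fractions"}, "proportions": {"ratios"}}
EMB = {"fractions": [1.0, 0.0], "ratios": [0.6, 0.8],
       "decimals": [0.8, 0.6], "proportions": [0.0, 1.0]}
query = [0.0, 1.0]  # the student's question sits closest to "proportions"

print(rank(query, PREREQS, EMB)[0])                            # proportions
print(hybrid_recommend(query, EMB, PREREQS, {"fractions"}))    # ['ratios', 'decimals']
```

the ungated ranking picks the concept the student isn’t ready for. the gated one never scores it, so no weighting of the two signals is needed to keep it out.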
the part i actually care about: not trusting my own numbers
left to a single pass, this benchmark would have shipped three wrong headline numbers. all plausible. all wrong.
- graph-only scored 91% first. too good. the eval was drawing candidates from the same prerequisite edges it used to score them: label leakage. held out the scoring edges; it dropped to 74%.
- latency was measured against a warm Qdrant cache. cold p95 was ~3× higher. re-ran cold; the table above is cold.
- two strategies scored identically on grounding. a rubric-scorer timeout was silently defaulting to “pass.” fixed the default to “fail-and-flag,” and the tie disappeared.
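the third fix is the easiest to show in code. a sketch of a “fail-and-flag” default, assuming the rubric scorer signals a timeout by raising `TimeoutError`. the `Verdict` type, the toy scorer, and the sample responses are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    passed: bool
    flagged: bool = False  # True when the scorer never produced a real verdict


def score_fail_and_flag(scorer: Callable[[str], bool], response: str) -> Verdict:
    try:
        return Verdict(passed=scorer(response))
    except TimeoutError:
        # the silent-default bug would amount to Verdict(passed=True) here
        return Verdict(passed=False, flagged=True)


def grounding_summary(verdicts: List[Verdict]) -> Tuple[float, int]:
    """pass rate with flagged items counted as failures, plus the flagged count."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    passed = sum(v.passed for v in verdicts)
    flagged = sum(v.flagged for v in verdicts)
    return passed / len(verdicts), flagged


def toy_scorer(response: str) -> bool:
    """stand-in scorer: times out on 'slow' inputs, passes anything marked [cited]."""
    if "slow" in response:
        raise TimeoutError("rubric scorer timed out")
    return "[cited]" in response


responses = ["answer [cited]", "slow answer", "uncited answer", "answer [cited]"]
verdicts = [score_fail_and_flag(toy_scorer, r) for r in responses]
print(grounding_summary(verdicts))  # (0.5, 1)
```

with a default of “pass” the same four responses would have scored 0.75, and nothing in the output would have said a timeout happened. reporting the flagged count next to the rate is what makes the gap visible.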
“it ran and produced a number” is not the same as “the number is true.” every assumption you offload to the model, or to your own eval harness, is one you’ve quietly said you’re fine not verifying.
the graph earns its keep. but the more durable lesson is the second one: the system around the model has to be correct in ways the model, and your first draft of the measurement, cannot be.