Real evaluation results across models — same prompts, same pipeline, scores you can audit in the TraceDog UI.
+18%
Grounding improvement
vs prompt-only baseline
−26%
Hallucination rate
with hybrid scoring
40% faster
Time to diagnose
trace-first workflow
All metrics below are computed from real TraceDog evaluation runs and trace pipelines — not mocked dashboard data.
Identical prompts · SQuAD v2 · 5 models · hybrid reliability scoring
Full run history, filters, and per-trace drill-down live in the product — this is a quick snapshot.
Takeaway: The same prompts split models cleanly on grounding versus latency: smaller open-weight models can win on speed, while hosted mini models trade a bit of latency for higher scores. TraceDog surfaces the spread per run, so you are not guessing from a single aggregate table.
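To make "identical prompts across models" concrete, here is a minimal, illustrative harness: every model answers the same prompt list, and per-model grounding and latency are aggregated side by side. The names and signatures (compare_models, grounding_fn) are assumptions for illustration, not TraceDog's pipeline API.

```python
import time
from statistics import mean
from typing import Callable, Dict, List

def compare_models(
    prompts: List[str],
    models: Dict[str, Callable[[str], str]],    # model name -> callable that returns the model's answer
    grounding_fn: Callable[[str, str], float],  # (answer, prompt) -> grounding score in [0, 1]
) -> Dict[str, Dict[str, float]]:
    """Run the same prompts through every model; aggregate grounding and latency per model."""
    results: Dict[str, Dict[str, float]] = {}
    for name, generate in models.items():
        scores, latencies = [], []
        for prompt in prompts:
            start = time.perf_counter()
            answer = generate(prompt)
            latencies.append(time.perf_counter() - start)
            scores.append(grounding_fn(answer, prompt))
        results[name] = {"grounding": mean(scores), "latency_s": mean(latencies)}
    return results
```

In the product, each (model, prompt) pair also produces a trace you can drill into; this sketch only shows the aggregate comparison shape.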
Evaluated on real LLM outputs using SQuAD v2 and repeatable trace captures.
Compare grounding, reliability, and latency across identical prompts in one view.
Hybrid claim-level scoring: sentence + keyword alignment against retrieved context.
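As a rough sketch of what claim-level hybrid scoring means: the answer is split into sentence-level claims, each claim gets a keyword-overlap score and a best-sentence alignment score against the retrieved context, and the two are blended. The weights (0.6 / 0.4) and similarity functions below are illustrative assumptions, not TraceDog's internal scorer; the docs describe the real implementation.

```python
import re
from difflib import SequenceMatcher

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are", "was", "were"}

def keyword_overlap(claim: str, context: str) -> float:
    """Fraction of the claim's content words that appear anywhere in the retrieved context."""
    claim_words = {w for w in re.findall(r"[a-z0-9]+", claim.lower()) if w not in STOPWORDS}
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    return len(claim_words & context_words) / len(claim_words) if claim_words else 0.0

def sentence_alignment(claim: str, context: str) -> float:
    """Best similarity between the claim and any single sentence of the retrieved context."""
    sentences = re.split(r"(?<=[.!?])\s+", context)
    return max((SequenceMatcher(None, claim.lower(), s.lower()).ratio() for s in sentences), default=0.0)

def hybrid_claim_scores(answer: str, context: str, w_sentence: float = 0.6, w_keyword: float = 0.4):
    """Split the answer into sentence-level claims and score each against the context."""
    claims = [c for c in re.split(r"(?<=[.!?])\s+", answer.strip()) if c]
    return [
        {"claim": c, "score": w_sentence * sentence_alignment(c, context) + w_keyword * keyword_overlap(c, context)}
        for c in claims
    ]

if __name__ == "__main__":
    context = "The Eiffel Tower was completed in 1889 and stands in Paris, France."
    answer = "The Eiffel Tower was finished in 1889. It is located in Berlin."
    for row in hybrid_claim_scores(answer, context):
        print(f"{row['score']:.2f}  {row['claim']}")
```

A low blended score on a single claim (like the second sentence above) is what gets flagged as a potential hallucination, rather than failing the whole answer at once.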
TraceDog is validated on public benchmarks (e.g. SQuAD-style QA) and multi-model runs. Reproducibility notes and full write-ups will be added here as we publish them.
For scoring internals, see the docs and the repository.