Measured on real runs, not demos

Real evaluation results across models — same prompts, same pipeline, scores you can audit in the TraceDog UI.

+18%
Grounding improvement vs prompt-only baseline

−26%
Hallucination rate with hybrid scoring

40% faster
Time to diagnose with a trace-first workflow

All metrics on this page are computed from real TraceDog evaluation runs and trace pipelines, not mocked dashboard data.

Model comparison

Identical prompts · SQuAD v2 · 5 models · hybrid reliability

Full run history, filters, and per-trace drill-down live in the product — this is a quick snapshot.

Hybrid reliability score per model:
  • GPT-4o-mini · 0.66
  • Claude · 0.70
  • Gemini · 0.68
  • Llama · 0.61
  • Mistral · 0.63

Top grounding: Claude · 78% · Lowest P95 latency: Llama · 0.82s · Top reliability: Claude · 0.70

Takeaway: The same prompts split models cleanly on grounding vs latency: smaller open-weight models can win on speed, while hosted minis trade a bit of latency for higher scores. TraceDog surfaces the spread per run, so you are not guessing from a single aggregate table.

  • SQuAD v2
  • Real LLM outputs
  • Public benchmark
  • Hybrid scoring

Benchmark runs

Evaluated on real LLM outputs using SQuAD v2 and repeatable trace captures.

Model comparison

Compare grounding, reliability, and latency across identical prompts in one view.

Scoring engine

Hybrid claim-level scoring: sentence + keyword alignment against retrieved context.
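To make "sentence + keyword alignment" concrete, here is a minimal sketch in plain Python. It is not TraceDog's implementation: the naive sentence splitter, stopword list, and 50/50 weighting are illustrative assumptions; see the docs for the real scoring internals.

```python
# A minimal, illustrative sketch of hybrid claim-level scoring: each answer
# sentence ("claim") gets a blend of sentence-level overlap and keyword
# alignment against the retrieved context. The splitter, stopword list, and
# 50/50 weighting are assumptions for illustration, not TraceDog's internals.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "was", "were"}


def sentences(text: str) -> list[str]:
    """Naive sentence split on ., !, ? boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keywords(text: str) -> set[str]:
    """Lowercased content words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}


def sentence_alignment(claim: str, context: str) -> float:
    """Best keyword overlap (Jaccard) between the claim and any single context sentence."""
    claim_kw = keywords(claim)
    if not claim_kw:
        return 0.0
    best = 0.0
    for ctx_sent in sentences(context):
        ctx_kw = keywords(ctx_sent)
        if ctx_kw:
            best = max(best, len(claim_kw & ctx_kw) / len(claim_kw | ctx_kw))
    return best


def keyword_alignment(claim: str, context: str) -> float:
    """Fraction of the claim's keywords that appear anywhere in the retrieved context."""
    claim_kw = keywords(claim)
    return len(claim_kw & keywords(context)) / len(claim_kw) if claim_kw else 0.0


def hybrid_score(answer: str, context: str, w_sentence: float = 0.5) -> float:
    """Average per-claim blend of sentence-level and keyword-level grounding."""
    claims = sentences(answer)
    if not claims:
        return 0.0
    per_claim = [
        w_sentence * sentence_alignment(c, context)
        + (1 - w_sentence) * keyword_alignment(c, context)
        for c in claims
    ]
    return sum(per_claim) / len(per_claim)


if __name__ == "__main__":
    ctx = "Paris is the capital of France. It sits on the Seine."
    print(round(hybrid_score("The capital of France is Paris.", ctx), 2))  # -> 1.0
```

In practice the overlap step is where scorers tend to add sophistication (embeddings or entailment instead of Jaccard), but the per-claim averaging is what makes the score claim-level rather than answer-level.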

Methodology & notes

TraceDog is validated on public benchmarks (e.g. SQuAD-style QA) and multi-model runs. Reproducibility notes and full write-ups will expand here as we publish them.
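For readers who want to reproduce a SQuAD-style setup themselves, the sketch below shows one way to pin a repeatable evaluation slice using the Hugging Face datasets library. The sample size, seed, and prompt template are assumptions for illustration, not the pipeline TraceDog actually runs.

```python
# A purely illustrative, repeatable SQuAD v2 pull using the Hugging Face
# `datasets` library. The sample size, seed, and prompt template are
# assumptions for this sketch, not TraceDog's actual pipeline.
from datasets import load_dataset


def load_eval_set(n: int = 200, seed: int = 42) -> list[dict]:
    """Sample a fixed, reproducible slice of the SQuAD v2 validation split."""
    ds = load_dataset("squad_v2", split="validation")
    ds = ds.shuffle(seed=seed).select(range(n))  # same seed -> same prompts every run
    return [
        {
            "id": ex["id"],
            "prompt": f"Context: {ex['context']}\n\nQuestion: {ex['question']}",
            "reference_answers": ex["answers"]["text"],  # empty list = unanswerable
        }
        for ex in ds
    ]


if __name__ == "__main__":
    eval_set = load_eval_set(n=5)
    print(eval_set[0]["prompt"][:200])
```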

For scoring internals, see the docs and the repository.