Confident but wrong outputs
AI responses can look correct even when evidence is weak or missing — the failure is silent until someone audits.
Observability and reliability for AI agents
TraceDog helps engineers inspect AI runs, measure grounding, detect weak retrieval, and compare model behavior with explainable reliability scores.
Hybrid score blends sentence and keyword signals; evidence aligns with the answer and no contradiction was detected against retrieved passages.
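A minimal sketch of what such a blend could look like, assuming a TF-IDF keyword signal and a precomputed sentence-similarity signal; the function names and the 0.6/0.4 weighting are illustrative assumptions, not TraceDog internals:

```python
# Illustrative hybrid grounding score: a weighted blend of a sentence-level
# similarity signal and a keyword-overlap signal. Names and the 0.6/0.4
# weighting are assumptions for this sketch, not TraceDog's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def keyword_signal(answer: str, passages: list[str]) -> float:
    """TF-IDF cosine between the answer and its best-matching passage."""
    matrix = TfidfVectorizer().fit_transform([answer, *passages])
    sims = cosine_similarity(matrix[0], matrix[1:])
    return float(sims.max())


def hybrid_score(answer: str, passages: list[str],
                 sentence_signal: float, alpha: float = 0.6) -> float:
    """Blend a precomputed sentence-embedding similarity with keyword overlap.

    `sentence_signal` would come from e.g. a sentence-transformers cosine
    similarity; it is passed in here to keep the sketch dependency-light.
    """
    return alpha * sentence_signal + (1 - alpha) * keyword_signal(answer, passages)


passages = ["The Eiffel Tower is 330 metres tall.",
            "It was completed in 1889 in Paris."]
print(round(hybrid_score("The tower is 330 metres tall.", passages, 0.81), 2))
```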
Production AI fails in ways logs don’t show: weak retrieval, confident hallucinations, and inconsistent evaluation across models.
Most tools stop at prompt and output. TraceDog captures what happened inside: retrieval, spans, and scoring.
Comparing runs across models and contexts is manual and noisy — you need one scoreline engineers can trust.
TraceDog turns model runs into reviewable evidence and reliability signals.
Execution story
TraceDog walks the same path your RAG stack takes: prompt, retrieval, generation, checks, grounding, verdict — so engineers see where it broke, not just the final string.
Autoplay cycles Success → Failure → Ticket — same eight stages, different outcomes.
Retrieval matches the question, claims align with passages, and the verdict stays grounded before anything ships.
User question ingested
3 passages · high overlap
Draft answer generated
No contradiction vs evidence
Hybrid score 0.72
Grounded — ship
Within policy limits · no escalation flags
No ticket created
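One hypothetical way to represent a run like this is an ordered rail of stage records; the `Stage` schema and field names below are assumptions for illustration, not TraceDog's trace format:

```python
# Hypothetical trace as an ordered rail of stage records, mirroring the
# eight stages above. Schema and field names are illustrative assumptions.
from dataclasses import dataclass, field
import time


@dataclass
class Stage:
    name: str                 # e.g. "retrieval", "generation", "verdict"
    detail: str               # human-readable summary shown on the rail
    ok: bool                  # did this stage pass its checks?
    ts: float = field(default_factory=time.time)


trace = [
    Stage("ingest",     "User question ingested",            True),
    Stage("retrieval",  "3 passages · high overlap",         True),
    Stage("generation", "Draft answer generated",            True),
    Stage("checks",     "No contradiction vs evidence",      True),
    Stage("grounding",  "Hybrid score 0.72",                 True),
    Stage("verdict",    "Grounded — ship",                   True),
    Stage("policy",     "Within policy limits · no flags",   True),
    Stage("ticket",     "No ticket created",                 True),
]

# The first failing stage tells you *where* it broke, not just that it did.
first_failure = next((s for s in trace if not s.ok), None)
print(first_failure or "all eight stages passed")
```

Scanning for the first failing stage is what turns a trace into a diagnosis rather than a log dump.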
Trace execution, evidence alignment, and failure state — at a glance.
Trace runs. Score evidence. Explain failures.
See prompt → retrieval → generation as a single execution rail — not scattered logs.
Inspect which sentences tie to which passages before you trust a score.
Unsupported, weak retrieval, or supported — with the claim in front of you.
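A sketch of how those claim-level verdicts could fall out of simple similarity thresholds; the 0.5/0.25 cutoffs and the `align_claims` helper are hypothetical, not TraceDog's actual scorer:

```python
# Illustrative claim-to-passage alignment: each answer sentence gets the
# best-matching retrieved passage and a verdict from a similarity threshold.
# Thresholds (0.5 / 0.25) and labels are assumptions, not TraceDog's values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def align_claims(claims: list[str], passages: list[str]):
    vec = TfidfVectorizer().fit(claims + passages)
    sims = cosine_similarity(vec.transform(claims), vec.transform(passages))
    for claim, row in zip(claims, sims):
        best = row.argmax()
        score = float(row[best])
        verdict = ("supported" if score >= 0.5
                   else "weak retrieval" if score >= 0.25
                   else "unsupported")
        yield claim, passages[best], round(score, 2), verdict


claims = ["The tower opened in 1889.", "It is painted blue."]
passages = ["The Eiffel Tower opened in 1889 in Paris."]
for result in align_claims(claims, passages):
    print(result)
```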
Real evaluation results across models — same prompts, same pipeline, scores you can audit in the TraceDog UI.
+18%
Grounding improvement
vs prompt-only baseline
−26%
Hallucination rate
with hybrid scoring
40% faster
Time to diagnose
with trace-first workflow
All metrics below are computed from real TraceDog evaluation runs and trace pipelines — not mocked dashboard data.
Identical prompts · SQuAD v2 · 5 models · hybrid reliability
Full run history, filters, and per-trace drill-down live in the product — this is a quick snapshot.
Takeaway: Same prompts split models cleanly on grounding vs latency: smaller open-weight models can win on speed, while hosted minis trade a little latency for higher grounding scores. TraceDog surfaces the spread per run so you are not guessing from a single aggregate table.
Evaluated on real LLM outputs using SQuAD v2 and repeatable trace captures.
Compare grounding, reliability, and latency across identical prompts in one view.
Hybrid claim-level scoring: sentence + keyword alignment against retrieved context.
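As a sketch, the comparison harness can be a loop that holds prompts and scoring fixed while swapping models; `MODELS` and `grounding_score` below are hypothetical stand-ins, not TraceDog's API:

```python
# Hypothetical evaluation-runner sketch: identical prompts through the same
# pipeline for each model, then one auditable scoreline per model. `MODELS`
# and `grounding_score` are stand-ins, not TraceDog's actual interfaces.
import statistics
import time

MODELS = {
    # model name -> callable(prompt) -> answer; plug real clients in here
    "model-a": lambda p: f"stub answer from model-a to: {p}",
    "model-b": lambda p: f"stub answer from model-b to: {p}",
}


def grounding_score(answer: str, passages: list[str]) -> float:
    return 0.5  # placeholder for the hybrid scorer sketched earlier


def run_eval(prompts, passages_by_prompt):
    rows = []
    for name, generate in MODELS.items():
        latencies, scores = [], []
        for prompt in prompts:
            t0 = time.perf_counter()
            answer = generate(prompt)
            latencies.append(time.perf_counter() - t0)
            scores.append(grounding_score(answer, passages_by_prompt[prompt]))
        rows.append((name, statistics.mean(scores), statistics.mean(latencies)))
    return rows  # one comparable line per model: grounding vs latency


for name, score, latency in run_eval(["Who built it?"], {"Who built it?": []}):
    print(f"{name}: grounding={score:.2f} latency={latency * 1000:.1f}ms")
```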
Early-stage, developer-first, and built in the open — trace inspection, grounding, and explainable scoring today; deeper evaluation tooling next.
Trace inspection, hybrid grounding, explainable scores, evaluation runner.
Comparison dashboards, benchmark reports, richer failure typing in the UI.
Deeper reliability analytics, open-source SDKs, community integrations.
Dense entry points — same information architecture as serious devtools.
FastAPI backend · PostgreSQL storage · Next.js dashboard · TF-IDF & optional sentence-transformers · Docker-based local setup
Building TraceDog in the open — feedback, collaboration, and early users welcome.