Observability and reliability for AI agents

Trace, evaluate, and debug AI systems with evidence.

TraceDog helps engineers inspect AI runs, measure grounding, detect weak retrieval, and compare model behavior with explainable reliability scores.

Hallucination risk
38% → 12%
Traced throughput
1K+ runs/day
P95 pipeline latency
1.2s
tracedog · trace detail
Looks grounded
Evaluation
Reliability 0.71
Hallucination risk 0.31
Grounding 0.68
Grounding spectrum
Why TraceDog scored it this way

Hybrid score blends sentence and keyword signals; evidence aligns with the answer and no contradiction was detected against retrieved passages.
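As a rough illustration, that blend could look like the sketch below: a weighted mix of TF-IDF sentence similarity and keyword overlap against the retrieved passages. The weights, helper names, and example strings are assumptions for illustration, not TraceDog's actual implementation.

```python
# Illustrative sketch only: TraceDog's real signals and weights may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def keyword_overlap(answer: str, passage: str) -> float:
    """Fraction of answer tokens that also appear in the passage."""
    answer_tokens = set(answer.lower().split())
    passage_tokens = set(passage.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & passage_tokens) / len(answer_tokens)


def hybrid_grounding(answer: str, passages: list[str], alpha: float = 0.6) -> float:
    """Blend TF-IDF sentence similarity with keyword overlap.

    `alpha` is a hypothetical blend weight, not a documented TraceDog default.
    """
    vectorizer = TfidfVectorizer().fit(passages + [answer])
    answer_vec = vectorizer.transform([answer])
    passage_vecs = vectorizer.transform(passages)
    sentence_signal = float(cosine_similarity(answer_vec, passage_vecs).max())
    keyword_signal = max(keyword_overlap(answer, p) for p in passages)
    return alpha * sentence_signal + (1 - alpha) * keyword_signal


score = hybrid_grounding(
    "The warranty covers battery replacement for two years.",
    ["Battery replacement is covered for 24 months under the standard warranty."],
)
print(round(score, 2))
```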

  • Tested on real LLM outputs
  • Built with real evaluation traces
  • Supports RAG and agent workflows
  • Open-source, developer-first

Why TraceDog exists

Production AI fails in ways logs don’t show: weak retrieval, confident hallucinations, and inconsistent evaluation across models.

Confident but wrong outputs

AI responses can look correct even when evidence is weak or missing — the failure is silent until someone audits.

No visibility into retrieval + reasoning

Most tools stop at prompt and output. TraceDog captures what happened inside: retrieval, spans, and scoring.

No consistent reliability measure

Comparing runs across models and contexts is manual and noisy — you need a single reliability score engineers can trust.

TraceDog turns model runs into reviewable evidence and reliability signals.

Real LLM runs · Public benchmark tested · Hybrid grounding · Open-source direction

Execution story

From run to verdict — one vertical trace

TraceDog walks the same path your RAG stack takes: prompt, retrieval, generation, checks, grounding, verdict — so engineers see where it broke, not just the final string.

Autoplay cycles Success → Failure → Ticket — same eight stages, different outcomes.

Retrieval matches the question, claims align with passages, and the verdict stays grounded before anything ships.

Success path

  • Prompt: User question ingested
  • Retrieval: 3 passages · high overlap
  • Model: Draft answer generated
  • Claim checks: No contradiction vs evidence
  • Grounding: Hybrid score 0.72
  • Verdict: Grounded — ship
  • Threshold: Within policy limits · no escalation flags
  • Ticket: No ticket created
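A minimal sketch of how a run like the one above could be kept as reviewable evidence; the `StageEvent` and `TraceRun` names and fields are assumptions for illustration, not the actual TraceDog schema.

```python
# Illustrative trace record for the success path above; field names are
# assumptions, not TraceDog's actual data model.
from dataclasses import dataclass, field


@dataclass
class StageEvent:
    stage: str          # one of the eight stages: prompt, retrieval, model, ...
    detail: str         # human-readable summary shown in the trace UI
    passed: bool = True


@dataclass
class TraceRun:
    trace_id: str
    stages: list[StageEvent] = field(default_factory=list)

    def verdict(self) -> str:
        # Hypothetical rule: every stage must pass for the run to stay grounded.
        return "grounded" if all(e.passed for e in self.stages) else "escalate"


run = TraceRun(
    trace_id="run-001",
    stages=[
        StageEvent("prompt", "User question ingested"),
        StageEvent("retrieval", "3 passages · high overlap"),
        StageEvent("model", "Draft answer generated"),
        StageEvent("claim_checks", "No contradiction vs evidence"),
        StageEvent("grounding", "Hybrid score 0.72"),
        StageEvent("verdict", "Grounded"),
        StageEvent("threshold", "Within policy limits"),
        StageEvent("ticket", "No ticket created"),
    ],
)
print(run.verdict())  # "grounded"
```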

What TraceDog catches

Trace execution, evidence alignment, and failure state — at a glance.

Trace runs. Score evidence. Explain failures.

Trace runs

See prompt → retrieval → generation as a single execution rail — not scattered logs.

Map claims to evidence

Inspect which sentences tie to which passages before you trust a score.

Debug the verdict

Unsupported, weak retrieval, or supported — with the claim in front of you.
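To make the claim-to-evidence mapping above concrete, here is a rough sketch of a per-claim report using the three verdict labels; field names and thresholds are illustrative assumptions, not TraceDog's output format.

```python
# Illustrative claim-to-evidence report; labels and cutoffs are assumptions,
# not the actual TraceDog output.
from dataclasses import dataclass


@dataclass
class ClaimEvidence:
    claim: str                      # one sentence from the model answer
    supporting_passages: list[int]  # indices into the retrieved passages
    score: float                    # per-claim grounding score

    @property
    def label(self) -> str:
        # Thresholds are illustrative; TraceDog's real cutoffs may differ.
        if not self.supporting_passages:
            return "unsupported"
        if self.score < 0.5:
            return "weak retrieval"
        return "supported"


claims = [
    ClaimEvidence("Battery replacement is covered for two years.", [0], 0.81),
    ClaimEvidence("Water damage is also covered.", [], 0.12),
]
for c in claims:
    print(f"{c.label:>14}  {c.claim}")
```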

Measured on real runs, not demos

Real evaluation results across models — same prompts, same pipeline, scores you can audit in the TraceDog UI.

  • +18% grounding improvement vs prompt-only baseline
  • −26% hallucination rate with hybrid scoring
  • 40% faster time to diagnose with a trace-first workflow

All metrics below are computed from real TraceDog evaluation runs and trace pipelines — not mocked dashboard data.

Model comparison

Identical prompts · SQuAD v2 · 5 models · hybrid reliability

Full run history, filters, and per-trace drill-down live in the product — this is a quick snapshot.

  • GPT-4o-mini · 0.66
  • Claude · 0.70
  • Gemini · 0.68
  • Llama · 0.61
  • Mistral · 0.63

Top grounding: Claude (78%) · Lowest P95: Llama (0.82s) · Top reliability: Claude (0.70)

Takeaway: The same prompts split models cleanly on grounding vs latency — smaller open-weight models can win on speed, while hosted minis trade a bit of latency for higher scores. TraceDog surfaces the spread per run, so you are not guessing from a single aggregate table.

  • SQuAD v2
  • Real LLM outputs
  • Public benchmark
  • Hybrid scoring

Benchmark runs

Evaluated on real LLM outputs using SQuAD v2 and repeatable trace captures.

Model comparison

Compare grounding, reliability, and latency across identical prompts in one view.

Scoring engine

Hybrid claim-level scoring: sentence + keyword alignment against retrieved context.
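As a hedged sketch of what a repeatable benchmark pass over SQuAD v2 could look like: the runner signature below is an assumption, and `generate_answer` / `score_answer` stand in for your model client and the scoring step rather than TraceDog's actual evaluation runner.

```python
# Illustrative benchmark loop; not TraceDog's evaluation runner API.
from statistics import mean
from typing import Callable

from datasets import load_dataset  # Hugging Face datasets; pip install datasets


def run_benchmark(
    model: str,
    generate_answer: Callable[[str, str, str], str],  # (model, question, context) -> answer
    score_answer: Callable[[str, list[str]], float],  # e.g. a hybrid grounding score
    n_samples: int = 100,
) -> float:
    """Score one model on SQuAD v2 with the same prompts and the same scoring step."""
    squad = load_dataset("squad_v2", split=f"validation[:{n_samples}]")
    scores = []
    for row in squad:
        answer = generate_answer(model, row["question"], row["context"])
        # Treat the SQuAD context as the retrieved passage for grounding.
        scores.append(score_answer(answer, [row["context"]]))
    return mean(scores)
```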

Open source, early, and moving fast

Early-stage, developer-first, and built in the open — trace inspection, grounding, and explainable scoring today; deeper evaluation tooling next.

V1 · now

Trace inspection, hybrid grounding, explainable scores, evaluation runner.

In progress

Comparison dashboards, benchmark reports, richer failure typing in the UI.

Upcoming

Deeper reliability analytics, open-source SDKs, community integrations.

Built with a practical, developer-first stack

FastAPI backend · PostgreSQL storage · Next.js dashboard · TF-IDF & optional sentence-transformers · Docker-based local setup
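For orientation, a minimal sketch of what one backend piece of that stack could look like, assuming a FastAPI route that returns trace-level scores; the route path and response fields are illustrative, not TraceDog's documented API.

```python
# Illustrative FastAPI endpoint; route and fields are assumptions, not
# TraceDog's documented API. In the real service this would query PostgreSQL.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="tracedog-api")


class TraceSummary(BaseModel):
    trace_id: str
    reliability: float
    hallucination_risk: float
    grounding: float


@app.get("/traces/{trace_id}", response_model=TraceSummary)
def get_trace(trace_id: str) -> TraceSummary:
    # Hard-coded values for the sketch, echoing the trace-detail card above.
    return TraceSummary(
        trace_id=trace_id,
        reliability=0.71,
        hallucination_risk=0.31,
        grounding=0.68,
    )
```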

Follow the project or get in touch

Building TraceDog in the open — feedback, collaboration, and early users welcome.