Confident but wrong outputs
AI responses can look correct even when evidence is weak or missing — the failure is silent until someone audits.
Observability and reliability for AI agents
TraceDog helps engineers inspect AI runs, measure grounding, detect weak retrieval, and compare model behavior with explainable reliability scores.
Hybrid score blends sentence and keyword signals; evidence aligns with the answer and no contradiction was detected against retrieved passages.
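A minimal sketch of what such a blend could look like, assuming a TF-IDF keyword signal and a precomputed sentence-similarity signal; the function names and the 0.6/0.4 weighting are illustrative assumptions, not TraceDog internals:

```python
# Illustrative hybrid grounding score: a weighted blend of a sentence-level
# similarity signal and a keyword-overlap signal. Names and the 0.6/0.4
# weighting are assumptions for this sketch, not TraceDog's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def keyword_signal(answer: str, passages: list[str]) -> float:
    """TF-IDF cosine between the answer and its best-matching passage."""
    matrix = TfidfVectorizer().fit_transform([answer, *passages])
    sims = cosine_similarity(matrix[0], matrix[1:])
    return float(sims.max())


def hybrid_score(answer: str, passages: list[str],
                 sentence_signal: float, alpha: float = 0.6) -> float:
    """Blend a precomputed sentence-embedding similarity with keyword overlap.

    `sentence_signal` would come from e.g. a sentence-transformers cosine
    similarity; it is passed in here to keep the sketch dependency-light.
    """
    return alpha * sentence_signal + (1 - alpha) * keyword_signal(answer, passages)


passages = ["The Eiffel Tower is 330 metres tall.",
            "It was completed in 1889 in Paris."]
print(round(hybrid_score("The tower is 330 metres tall.", passages, 0.81), 2))
```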
Production AI fails in ways logs don’t show: weak retrieval, confident hallucinations, and inconsistent evaluation across models.
Most tools stop at prompt and output. TraceDog captures what happened inside: retrieval, spans, and scoring.
Comparing runs across models and contexts is manual and noisy — you need one scoreline engineers can trust.
TraceDog turns model runs into reviewable evidence and reliability signals.
Execution story
TraceDog walks the same path your RAG stack takes: prompt, retrieval, generation, checks, grounding, verdict — so engineers see where it broke, not just the final string.
Autoplay cycles Success → Failure → Ticket — same eight stages, different outcomes.
Retrieval matches the question, claims align with passages, and the verdict stays grounded before anything ships.
User question ingested
3 passages · high overlap
Draft answer generated
No contradiction vs evidence
Hybrid score 0.72
Grounded — ship
Within policy limits · no escalation flags
No ticket created
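One hypothetical way to represent a run like this is an ordered rail of stage records; the `Stage` schema and field names below are assumptions for illustration, not TraceDog's trace format:

```python
# Hypothetical trace as an ordered rail of stage records, mirroring the
# eight stages above. Schema and field names are illustrative assumptions.
from dataclasses import dataclass, field
import time


@dataclass
class Stage:
    name: str                 # e.g. "retrieval", "generation", "verdict"
    detail: str               # human-readable summary shown on the rail
    ok: bool                  # did this stage pass its checks?
    ts: float = field(default_factory=time.time)


trace = [
    Stage("ingest",     "User question ingested",            True),
    Stage("retrieval",  "3 passages · high overlap",         True),
    Stage("generation", "Draft answer generated",            True),
    Stage("checks",     "No contradiction vs evidence",      True),
    Stage("grounding",  "Hybrid score 0.72",                 True),
    Stage("verdict",    "Grounded — ship",                   True),
    Stage("policy",     "Within policy limits · no flags",   True),
    Stage("ticket",     "No ticket created",                 True),
]

# The first failing stage tells you *where* it broke, not just that it did.
first_failure = next((s for s in trace if not s.ok), None)
print(first_failure or "all eight stages passed")
```

Scanning for the first failing stage is what turns a trace into a diagnosis rather than a log dump.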
Trace execution, evidence alignment, and failure state — at a glance.
Trace runs. Score evidence. Explain failures.
See prompt → retrieval → generation as a single execution rail — not scattered logs.
Inspect which sentences tie to which passages before you trust a score.
Unsupported, weak retrieval, or supported — with the claim in front of you.
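A sketch of how those claim-level verdicts could fall out of simple similarity thresholds; the 0.5/0.25 cutoffs and the `align_claims` helper are hypothetical, not TraceDog's actual scorer:

```python
# Illustrative claim-to-passage alignment: each answer sentence gets the
# best-matching retrieved passage and a verdict from a similarity threshold.
# Thresholds (0.5 / 0.25) and labels are assumptions, not TraceDog's values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def align_claims(claims: list[str], passages: list[str]):
    vec = TfidfVectorizer().fit(claims + passages)
    sims = cosine_similarity(vec.transform(claims), vec.transform(passages))
    for claim, row in zip(claims, sims):
        best = row.argmax()
        score = float(row[best])
        verdict = ("supported" if score >= 0.5
                   else "weak retrieval" if score >= 0.25
                   else "unsupported")
        yield claim, passages[best], round(score, 2), verdict


claims = ["The tower opened in 1889.", "It is painted blue."]
passages = ["The Eiffel Tower opened in 1889 in Paris."]
for result in align_claims(claims, passages):
    print(result)
```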
Real evaluation results across models — same prompts, same pipeline, scores you can audit in the TraceDog UI.
+18%
Grounding improvement
vs prompt-only baseline
−26%
Hallucination rate
with hybrid scoring
40% faster
Time to diagnose
with trace-first workflow
All metrics below are computed from real TraceDog evaluation runs and trace pipelines — not mocked dashboard data.
Identical prompts · SQuAD v2 · 5 models · hybrid reliability
Full run history, filters, and per-trace drill-down live in the product — this is a quick snapshot.
Takeaway: Same prompts split models cleanly on grounding vs latency: smaller open-weight models can win on speed, while hosted minis trade a little latency for higher grounding scores. TraceDog surfaces the spread per run so you are not guessing from a single aggregate table.
Evaluated on real LLM outputs using SQuAD v2 and repeatable trace captures.
Compare grounding, reliability, and latency across identical prompts in one view.
Hybrid claim-level scoring: sentence + keyword alignment against retrieved context.
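As a sketch, the comparison harness can be a loop that holds prompts and scoring fixed while swapping models; `MODELS` and `grounding_score` below are hypothetical stand-ins, not TraceDog's API:

```python
# Hypothetical evaluation-runner sketch: identical prompts through the same
# pipeline for each model, then one auditable scoreline per model. `MODELS`
# and `grounding_score` are stand-ins, not TraceDog's actual interfaces.
import statistics
import time

MODELS = {
    # model name -> callable(prompt) -> answer; plug real clients in here
    "model-a": lambda p: f"stub answer from model-a to: {p}",
    "model-b": lambda p: f"stub answer from model-b to: {p}",
}


def grounding_score(answer: str, passages: list[str]) -> float:
    return 0.5  # placeholder for the hybrid scorer sketched earlier


def run_eval(prompts, passages_by_prompt):
    rows = []
    for name, generate in MODELS.items():
        latencies, scores = [], []
        for prompt in prompts:
            t0 = time.perf_counter()
            answer = generate(prompt)
            latencies.append(time.perf_counter() - t0)
            scores.append(grounding_score(answer, passages_by_prompt[prompt]))
        rows.append((name, statistics.mean(scores), statistics.mean(latencies)))
    return rows  # one comparable line per model: grounding vs latency


for name, score, latency in run_eval(["Who built it?"], {"Who built it?": []}):
    print(f"{name}: grounding={score:.2f} latency={latency * 1000:.1f}ms")
```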
Early-stage, developer-first, and built in the open — trace inspection, grounding, and explainable scoring today; deeper evaluation tooling next.
Trace inspection, hybrid grounding, explainable scores, evaluation runner.
Comparison dashboards, benchmark reports, richer failure typing in the UI.
Deeper reliability analytics, open-source SDKs, community integrations.
Dense entry points — same information architecture as serious devtools.
FastAPI backend · PostgreSQL storage · Next.js dashboard · TF-IDF & optional sentence-transformers · Docker-based local setup
Building TraceDog in the open — feedback, collaboration, and early users welcome.