TraceDog

Trace decision

Short answer: sentence scores measured against long passages can look low, so treat the score as a signal.
Needs review

Grounding is borderline and should be reviewed.

Grounding: 0.41 (Review)
Risk: 0.59 (High)
Reliability: 0.50 (Medium)
Recommended action: Review retrieved evidence before shipping
gpt-4o-mini · 1702ms total · Normans · 2026-03-26T04:47:15.015926+00:00
Why it scored this way
Why this needs review

The response is not strongly supported by retrieved evidence (best hybrid 0.41; strong threshold 0.52).

  • Best grounding score: 0.41
  • Sentence match: 0.18
  • Keyword overlap: 0.60
Next steps

  • Inspect the retrieved docs below; scores blend best-sentence similarity with keyword overlap.
  • Tighten retrieval or add citations in the agent prompt.

What we measured

  • Hybrid grounding (per chunk): best 0.41 (strong ≥ 0.52)
  • Mean hybrid across chunks: 0.41
  • Best raw sentence match: 0.18 (short answers often score low vs. whole paragraphs)
  • Lexical overlap with sources: 0.60
  • Blend: short-answer blend (45% sentence + 55% keyword)
  • Weak / strong cutoffs: 0.35 / 0.52
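
The figures above can be reproduced with a short sketch. It assumes the hybrid score is a plain weighted sum of the sentence match and the keyword overlap, and that the bands are split at the weak and strong cutoffs; the function and constant names are hypothetical and are not taken from TraceDog's code.

```python
# Weights and cutoffs copied from this trace; names are illustrative only.
SENTENCE_WEIGHT = 0.45
KEYWORD_WEIGHT = 0.55
WEAK_CUTOFF = 0.35
STRONG_CUTOFF = 0.52


def hybrid_grounding(sentence_match: float, keyword_overlap: float) -> float:
    """Blend sentence similarity and keyword overlap (assumed weighted sum)."""
    return SENTENCE_WEIGHT * sentence_match + KEYWORD_WEIGHT * keyword_overlap


def grounding_band(score: float) -> str:
    """Map a hybrid score to the weak / review / strong bands."""
    if score < WEAK_CUTOFF:
        return "weak"
    if score < STRONG_CUTOFF:
        return "review"
    return "strong"


score = hybrid_grounding(sentence_match=0.18, keyword_overlap=0.60)
print(f"{score:.2f} {grounding_band(score)}")  # prints "0.41 review"
```

Under these assumptions, 0.45 × 0.18 + 0.55 × 0.60 = 0.411, which matches the reported 0.41 and falls in the review band between 0.35 and 0.52.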
Evidence
Prompt & response
Grounding vs thresholds: 0.41
weak < 0.35 · review 0.35–0.52 · strong ≥ 0.52
Blend contribution
Sentence 0.18 (45%) · Keyword 0.60 (55%)
Confidence trend

Series padded from this trace’s score; the batch view shows drift.

Failure mix (this trace)

One bar only; fleet-wide percentages need aggregate metrics from the API.

  • Hallucination risk 0%
  • Low grounding 0%
  • Failures 100%
  • Healthy 0%
Execution runtime

Total 1702ms

  • ✓ Retrieval: 425ms
  • ✓ LLM: 1277ms
Debug