
Agents

Per-agent hallucination risk and reliability — last 7 days. Each agent is tracked independently so regressions are immediately attributable.
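As a rough illustration of how a per-agent, 7-day view like this could be computed, the sketch below groups trace records by agent and by calendar day, then averages a risk score per day. The record fields (`agent`, `timestamp`, `hallucination_risk`) and the function name `daily_averages` are hypothetical and chosen for this example only; they are not taken from the dashboard's actual schema or API.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean


def daily_averages(traces, now, window_days=7):
    """Return {agent: {date: mean hallucination risk}} for the trailing window.

    `traces` is an iterable of dicts with hypothetical keys:
    "agent" (str), "timestamp" (aware datetime), "hallucination_risk" (float).
    """
    cutoff = now - timedelta(days=window_days)
    # agent -> day -> list of scores recorded on that day
    buckets = defaultdict(lambda: defaultdict(list))
    for trace in traces:
        if trace["timestamp"] >= cutoff:
            day = trace["timestamp"].date()
            buckets[trace["agent"]][day].append(trace["hallucination_risk"])
    # Keeping agents separate means a spike shows up under one agent only.
    return {
        agent: {day: mean(scores) for day, scores in sorted(days.items())}
        for agent, days in buckets.items()
    }
```

Under these assumptions, an agent with no traces inside the window does not appear in the result at all, which corresponds to the empty "No trend data in this window" state shown for each agent below.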

2 agents

hotpot-eval-runner

evaluation · 7d window · 0 traces

Avg reliability
Avg hallucination risk
Avg grounding
Traces: 0

Hallucination risk over time

Daily average — lower is better. Spikes indicate degraded retrieval or synthesis.

No trend data in this window.

Reliability over time

Daily average reliability score — higher is better.

No reliability data yet.

squad-eval-runner

evaluation · 7d window · 0 traces

Avg reliability
Avg hallucination risk
Avg grounding
Traces: 0

Hallucination risk over time

Daily average — lower is better. Spikes indicate degraded retrieval or synthesis.

No trend data in this window.

Reliability over time

Daily average reliability score — higher is better.

No reliability data yet.