Is the system healthy right now? Scope filters are on the right; open Traces to search and slice the list.
Runs a small deterministic scoring task on the server as an async job, polled from the browser. The key must match ADMIN_API_KEY on the API; share it only with operators you trust, since it grants access to admin routes.
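The start-then-poll flow could be sketched as below. The `JobStatus` shape, the 2 s default interval, and the idea of passing a `check` callback are all assumptions for illustration, not the dashboard's actual API; in the real page, `check` would wrap a `fetch` to the job-status route with the admin key header.

```typescript
// Hypothetical job-status shape; the real API's fields may differ.
type JobStatus = { status: "pending" | "done" | "failed"; result?: unknown };

// Poll check() until the async job reaches a terminal state.
// In the dashboard, check() would fetch the job-status endpoint.
async function pollJob(
  check: () => Promise<JobStatus>,
  intervalMs = 2000,
): Promise<unknown> {
  while (true) {
    const job = await check();
    if (job.status === "done") return job.result;
    if (job.status === "failed") throw new Error("scoring job failed");
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

Injecting `check` keeps the polling loop testable without a network.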
Data pipeline: fetch latency from eval traces and run tests from the dashboard.
Drift and regressions in the selected window.
Count of traces by hallucination risk band.
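The banding behind that count could look like the sketch below. The three band names and the 0.33/0.66 cut points are assumptions for illustration; the dashboard's actual thresholds are not stated here.

```typescript
// Hypothetical risk bands; cut points are illustrative only.
type RiskBand = "low" | "medium" | "high";

function riskBand(risk: number): RiskBand {
  if (risk < 0.33) return "low";
  if (risk < 0.66) return "medium";
  return "high";
}

// Count traces per band, e.g. to feed the histogram.
function bandCounts(risks: number[]): Record<RiskBand, number> {
  const counts: Record<RiskBand, number> = { low: 0, medium: 0, high: 0 };
  for (const r of risks) counts[riskBand(r)] += 1;
  return counts;
}
```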
GOOD+RISKY vs FAIL classification.
No high-severity anomalies in the last 5 minutes.
Compares models on the same evaluation slice: grounding, risk, latency, and consistency. Each model keeps one color across charts. Adjust filters in the panel on the right.
gpt-4o-mini (top score by the balanced metric)
Weighted by the metric selector in the filter panel.
gpt-4o-mini (1293 ms)
Lowest average latency in this slice.
gpt-4o-mini (risk 0.54)
Lowest mean hallucination risk.
gpt-4o (σ rel 0.015)
Lowest reliability σ (needs 2+ traces).
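Assuming σ here means the sample standard deviation of per-trace reliability scores (an interpretation, not stated in the source), it is undefined below two traces, which would explain the "needs 2+ traces" rule:

```typescript
// Sample standard deviation of per-trace reliability scores.
// Returns null for fewer than 2 traces (n - 1 divisor is undefined),
// matching the "needs 2+ traces" requirement.
function reliabilitySigma(scores: number[]): number | null {
  const n = scores.length;
  if (n < 2) return null;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance =
    scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / (n - 1);
  return Math.sqrt(variance);
}
```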
Each line is one model’s reliability over time.
Upper-left is best — high reliability, low latency.
| Model | n | Avg grounding | Avg reliability | Avg risk | Avg latency (ms) | σ reliability | Tradeoff |
|---|---|---|---|---|---|---|---|
| claude-sonnet-4-6 | 15 | 0.325 | 0.388 | 0.675 | 3311 | 0.0466 | 0.117 |
| gpt-4o-mini | 10 | 0.462 | 0.543 | 0.538 | 1293 | 0.1929 | 0.420 |
| claude-opus-4-6 | 5 | 0.317 | 0.380 | 0.683 | 3346 | 0.0721 | 0.114 |
| gpt-5.4 | 5 | 0.355 | 0.383 | 0.645 | 1351 | 0.2136 | 0.284 |
| gpt-4o | 5 | 0.191 | 0.286 | 0.809 | 1705 | 0.0146 | 0.168 |
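The Tradeoff column is consistent with average reliability divided by average latency in seconds (e.g. gpt-4o-mini: 0.543 / 1.293 ≈ 0.420). Treat that formula as inferred from the table values, not as a documented definition:

```typescript
// Inferred from the table: Tradeoff = avg reliability / (avg latency in s).
// Reproduces the printed values, but this is an inference, not a spec.
function tradeoff(avgRel: number, avgMs: number): number {
  return avgRel / (avgMs / 1000);
}
```

Under this reading, higher is better: more reliability per second of latency.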
Open two trace detail pages in separate tabs to compare prompt, evidence, grounding, and verdict. A future release could link traces that share the same prompt hash or experiment case id.
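The future prompt-hash link could key on a SHA-256 digest of the normalized prompt. This is a sketch of one possible scheme; the whitespace normalization rule is an assumption, not part of the current design.

```typescript
import { createHash } from "node:crypto";

// Hypothetical linking key: SHA-256 of the whitespace-normalized prompt.
// Two traces with the same key would have answered the same prompt.
function promptHash(prompt: string): string {
  const normalized = prompt.trim().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized).digest("hex");
}
```

Normalizing before hashing keeps cosmetically different copies of a prompt (trailing newline, double spaces) in the same comparison group.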