In progress — we're building and shipping updates often. More features and insights for LLM evaluation and AI agent observability are on the way.

Trace health

Is the system healthy right now? Use the scope filters on the right, and open Traces to search and slice the full list.

40 loaded

Admin: scoring smoke job

Runs a small deterministic scoring task on the server as an async job that the browser polls until completion. Use the same secret as ADMIN_API_KEY on the API. Share it only with operators you trust; it grants access to admin routes.
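A minimal sketch of the browser-side polling loop, assuming the job exposes a status endpoint; the route and header names in the comment are illustrative, not the dashboard's actual API:

```python
import time

def poll_job(fetch_status, interval_s=1.0, timeout_s=60.0):
    """Poll an async job until it finishes or the timeout elapses.

    `fetch_status` is any callable returning a dict like
    {"state": "queued" | "running" | "done" | "error", ...}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("state") in ("done", "error"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("smoke job did not finish in time")

# In the real dashboard, fetch_status would hit an admin route with the
# ADMIN_API_KEY secret in a header (exact route/header names are assumptions), e.g.:
#   requests.get(f"{API}/admin/jobs/{job_id}", headers={"X-Admin-Key": key}).json()
```

Injecting `fetch_status` as a callable keeps the loop testable without a live server.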

Total traces: 40
Avg reliability: 0.41
Avg risk: 0.65
Failure rate: 8%
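The headline cards above could be computed from raw trace records roughly like this; the field names (`reliability`, `risk`, `verdict`) are assumptions, not the actual schema:

```python
def summarize(traces):
    """Aggregate trace records into the dashboard's headline cards.

    Each trace is assumed to carry `reliability`, `risk`, and a `verdict`
    of GOOD, RISKY, or FAIL (field names are illustrative).
    """
    n = len(traces)
    if n == 0:
        return {"total": 0, "avg_reliability": None,
                "avg_risk": None, "failure_rate": None}
    return {
        "total": n,
        "avg_reliability": round(sum(t["reliability"] for t in traces) / n, 2),
        "avg_risk": round(sum(t["risk"] for t in traces) / n, 2),
        # Percentage of traces whose verdict is FAIL.
        "failure_rate": round(100 * sum(t["verdict"] == "FAIL" for t in traces) / n),
    }
```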

Reliability over time

Drift and regressions in the selected window.

Risk distribution

Count of traces by hallucination risk band.
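A sketch of how the banding might work; the 0.33/0.66 cutoffs are assumptions, not the dashboard's documented thresholds:

```python
from collections import Counter

def risk_band(risk):
    """Bucket a hallucination risk score into a band.

    Thresholds are illustrative; the dashboard's real cutoffs may differ.
    """
    if risk < 0.33:
        return "low"
    if risk < 0.66:
        return "medium"
    return "high"

def risk_distribution(traces):
    """Count traces per band, as in the risk-distribution chart."""
    return Counter(risk_band(t["risk"]) for t in traces)
```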

Success vs failure

GOOD+RISKY vs FAIL classification.
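The split could be derived like this, using the GOOD/RISKY/FAIL verdicts named above; the rounding mode for the healthy percentage is an assumption and may differ from the dashboard's:

```python
def health_split(traces):
    """Split traces into OK (GOOD or RISKY) vs FAIL and report a healthy %."""
    ok = sum(t["verdict"] in ("GOOD", "RISKY") for t in traces)
    fail = sum(t["verdict"] == "FAIL" for t in traces)
    total = ok + fail
    healthy_pct = round(100 * ok / total) if total else None
    return {"ok": ok, "fail": fail, "healthy_pct": healthy_pct}
```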

93% healthy
OK: 37 · Fail: 3

No high-severity anomalies in the last 5 minutes.

By agent

hotpot-eval-runner
30 traces · High risk: 29 · Low rel: 29 · Fails: 3
squad-eval-runner
10 traces · High risk: 5 · Low rel: 3 · Fails: 0

Model comparison

The same evaluation slice, compared across grounding, risk, latency, and consistency. Each model keeps one color across all charts. Adjust filters in the panel on the right.

Best overall

gpt-4o-mini (scored by the balanced metric)

Weighted by the metric selector in the filter panel.
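The "balanced" formula itself is not documented here; one plausible composite, with entirely assumed weights and latency normalization, might look like:

```python
def balanced_score(grounding, reliability, risk, latency_ms, max_latency_ms=5000):
    """A possible 'balanced' composite: reward grounding and reliability,
    penalize risk and (normalized) latency.

    The weights and the 5000 ms normalization cap are assumptions,
    not the dashboard's actual formula.
    """
    latency_norm = min(latency_ms / max_latency_ms, 1.0)
    return round(
        0.3 * grounding
        + 0.3 * reliability
        + 0.3 * (1 - risk)
        + 0.1 * (1 - latency_norm),
        3,
    )
```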

Fastest

gpt-4o-mini (1293 ms)

Lowest average latency in this slice.

Lowest risk

gpt-4o-mini (risk 0.54)

Lowest mean hallucination risk.

Most consistent

gpt-4o (σ rel 0.015)

Lowest reliability σ (needs 2+ traces).
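A sketch of the consistency metric with the 2-trace guard; whether the dashboard uses sample or population σ is not stated, so sample σ is assumed here:

```python
from statistics import stdev

def consistency(reliabilities):
    """Sample standard deviation of per-trace reliability.

    Undefined (None) with fewer than 2 traces, matching the
    'needs 2+ traces' rule. Sample vs population σ is an assumption.
    """
    if len(reliabilities) < 2:
        return None
    return round(stdev(reliabilities), 4)
```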

Reliability trend by model

Each line is one model’s reliability over time.

Models: claude-sonnet-4-6, gpt-4o-mini, claude-opus-4-6, gpt-5.4, gpt-4o

Quality vs latency

Upper-left is best — high reliability, low latency.

claude-sonnet-4-6: reliability 0.39, latency 3310.9 ms
gpt-4o-mini: reliability 0.54, latency 1292.9 ms
claude-opus-4-6: reliability 0.38, latency 3345.8 ms
gpt-5.4: reliability 0.38, latency 1350.6 ms
gpt-4o: reliability 0.29, latency 1704.6 ms

Average metrics by model

Normalized bars — hover for raw values.

[Per-model bar chart: normalized grounding, reliability, risk, and latency for claude-sonnet-4-6 (n = 15), gpt-4o-mini (n = 10), claude-opus-4-6 (n = 5), gpt-5.4 (n = 5), and gpt-4o (n = 5).]
Model             | n  | Avg G | Avg rel | Avg risk | Avg ms | σ rel  | Tradeoff
claude-sonnet-4-6 | 15 | 0.325 | 0.388   | 0.675    | 3311   | 0.0466 | 0.117
gpt-4o-mini       | 10 | 0.462 | 0.543   | 0.538    | 1293   | 0.1929 | 0.420
claude-opus-4-6   | 5  | 0.317 | 0.380   | 0.683    | 3346   | 0.0721 | 0.114
gpt-5.4           | 5  | 0.355 | 0.383   | 0.645    | 1351   | 0.2136 | 0.284
gpt-4o            | 5  | 0.191 | 0.286   | 0.809    | 1705   | 0.0146 | 0.168
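The Tradeoff column is not defined here, but its values are consistent with average reliability divided by average latency in seconds. That is an inference from the numbers, checked to three decimals against most rows (small discrepancies can come from rounding of the inputs):

```python
def tradeoff(avg_rel, avg_ms):
    """Reliability per second of latency: higher is better.

    Inferred from the table values, not documented behavior.
    """
    return round(avg_rel / (avg_ms / 1000), 3)
```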
Advanced: side-by-side trace diff

Open two trace detail pages in separate tabs to compare prompt, evidence, grounding, and verdict side by side. A future release could automatically link traces that share the same prompt hash or experiment case ID.
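A sketch of the linking idea, keying traces by a normalized prompt hash or an experiment case ID. The whitespace/case normalization rule (deciding what counts as the "same" prompt) is an assumption:

```python
import hashlib

def prompt_hash(prompt, experiment_case_id=None):
    """Stable key for linking traces that share a prompt or case.

    An explicit experiment case ID wins; otherwise the prompt is
    normalized (collapsed whitespace, lowercased -- an assumption)
    and hashed so equivalent prompts map to the same key.
    """
    if experiment_case_id is not None:
        return f"case:{experiment_case_id}"
    normalized = " ".join(prompt.split()).lower()
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
```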