In progress — we're building and shipping updates often. More features and insights for LLM evaluation and AI agent observability are on the way.

Trace health

Is the system healthy right now? Use the scope filters on the right, and open Traces to search and slice the full list.

40 loaded

Admin: scoring smoke job

Runs a small deterministic scoring task on the server as an async job that the browser polls until completion. Use the same secret as ADMIN_API_KEY on the API. Share it only with operators you trust; it grants access to admin routes.
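A minimal sketch of the browser-side polling loop, assuming the job exposes a status endpoint; the route and header names in the comment are illustrative, not the dashboard's actual API:

```python
import time

def poll_job(fetch_status, interval_s=1.0, timeout_s=60.0):
    """Poll an async job until it finishes or the timeout elapses.

    `fetch_status` is any callable returning a dict like
    {"state": "queued" | "running" | "done" | "error", ...}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("state") in ("done", "error"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("smoke job did not finish in time")

# In the real dashboard, fetch_status would hit an admin route with the
# ADMIN_API_KEY secret in a header (exact route/header names are assumptions), e.g.:
#   requests.get(f"{API}/admin/jobs/{job_id}", headers={"X-Admin-Key": key}).json()
```

Injecting `fetch_status` as a callable keeps the loop testable without a live server.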

Total traces: 40
Avg reliability: 0.41
Avg risk: 0.65
Failure rate: 8%
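The headline cards above could be computed from raw trace records roughly like this; the field names (`reliability`, `risk`, `verdict`) are assumptions, not the actual schema:

```python
def summarize(traces):
    """Aggregate trace records into the dashboard's headline cards.

    Each trace is assumed to carry `reliability`, `risk`, and a `verdict`
    of GOOD, RISKY, or FAIL (field names are illustrative).
    """
    n = len(traces)
    if n == 0:
        return {"total": 0, "avg_reliability": None,
                "avg_risk": None, "failure_rate": None}
    return {
        "total": n,
        "avg_reliability": round(sum(t["reliability"] for t in traces) / n, 2),
        "avg_risk": round(sum(t["risk"] for t in traces) / n, 2),
        # Percentage of traces whose verdict is FAIL.
        "failure_rate": round(100 * sum(t["verdict"] == "FAIL" for t in traces) / n),
    }
```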

Reliability over time

Drift and regressions in the selected window.

Risk distribution

Count of traces by hallucination risk band.
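A sketch of how the banding might work; the 0.33/0.66 cutoffs are assumptions, not the dashboard's documented thresholds:

```python
from collections import Counter

def risk_band(risk):
    """Bucket a hallucination risk score into a band.

    Thresholds are illustrative; the dashboard's real cutoffs may differ.
    """
    if risk < 0.33:
        return "low"
    if risk < 0.66:
        return "medium"
    return "high"

def risk_distribution(traces):
    """Count traces per band, as in the risk-distribution chart."""
    return Counter(risk_band(t["risk"]) for t in traces)
```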

Success vs failure

GOOD+RISKY vs FAIL classification.
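The split could be derived like this, using the GOOD/RISKY/FAIL verdicts named above; the rounding mode for the healthy percentage is an assumption and may differ from the dashboard's:

```python
def health_split(traces):
    """Split traces into OK (GOOD or RISKY) vs FAIL and report a healthy %."""
    ok = sum(t["verdict"] in ("GOOD", "RISKY") for t in traces)
    fail = sum(t["verdict"] == "FAIL" for t in traces)
    total = ok + fail
    healthy_pct = round(100 * ok / total) if total else None
    return {"ok": ok, "fail": fail, "healthy_pct": healthy_pct}
```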

93% healthy
OK: 37 · Fail: 3

No high-severity anomalies in the last 5 minutes.

By agent

hotpot-eval-runner
30 traces · High risk: 29 · Low rel: 29 · Fails: 3
squad-eval-runner
10 traces · High risk: 5 · Low rel: 3 · Fails: 0

Model comparison

The same evaluation slice, compared across grounding, risk, latency, and consistency. Each model keeps one color across all charts. Adjust filters in the panel on the right.

Best overall

gpt-4o-mini (scored by the balanced metric)

Weighted by the metric selector in the filter panel.
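The "balanced" formula itself is not documented here; one plausible composite, with entirely assumed weights and latency normalization, might look like:

```python
def balanced_score(grounding, reliability, risk, latency_ms, max_latency_ms=5000):
    """A possible 'balanced' composite: reward grounding and reliability,
    penalize risk and (normalized) latency.

    The weights and the 5000 ms normalization cap are assumptions,
    not the dashboard's actual formula.
    """
    latency_norm = min(latency_ms / max_latency_ms, 1.0)
    return round(
        0.3 * grounding
        + 0.3 * reliability
        + 0.3 * (1 - risk)
        + 0.1 * (1 - latency_norm),
        3,
    )
```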

Fastest

gpt-4o-mini (1293 ms)

Lowest average latency in this slice.

Lowest risk

gpt-4o-mini (risk 0.54)

Lowest mean hallucination risk.

Most consistent

gpt-4o (σ rel 0.015)

Lowest reliability σ (needs 2+ traces).
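A sketch of the consistency metric with the 2-trace guard; whether the dashboard uses sample or population σ is not stated, so sample σ is assumed here:

```python
from statistics import stdev

def consistency(reliabilities):
    """Sample standard deviation of per-trace reliability.

    Undefined (None) with fewer than 2 traces, matching the
    'needs 2+ traces' rule. Sample vs population σ is an assumption.
    """
    if len(reliabilities) < 2:
        return None
    return round(stdev(reliabilities), 4)
```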

Reliability trend by model

Each line is one model’s reliability over time.

Models: claude-sonnet-4-6, gpt-4o-mini, claude-opus-4-6, gpt-5.4, gpt-4o

Quality vs latency

Upper-left is best — high reliability, low latency.

claude-sonnet-4-6: reliability 0.39, latency 3310.9 ms
gpt-4o-mini: reliability 0.54, latency 1292.9 ms
claude-opus-4-6: reliability 0.38, latency 3345.8 ms
gpt-5.4: reliability 0.38, latency 1350.6 ms
gpt-4o: reliability 0.29, latency 1704.6 ms

Average metrics by model

Normalized bars — hover for raw values.

[Per-model bar chart: normalized grounding, reliability, risk, and latency for claude-sonnet-4-6 (n = 15), gpt-4o-mini (n = 10), claude-opus-4-6 (n = 5), gpt-5.4 (n = 5), and gpt-4o (n = 5).]
Model             | n  | Avg G | Avg rel | Avg risk | Avg ms | σ rel  | Tradeoff
claude-sonnet-4-6 | 15 | 0.325 | 0.388   | 0.675    | 3311   | 0.0466 | 0.117
gpt-4o-mini       | 10 | 0.462 | 0.543   | 0.538    | 1293   | 0.1929 | 0.420
claude-opus-4-6   | 5  | 0.317 | 0.380   | 0.683    | 3346   | 0.0721 | 0.114
gpt-5.4           | 5  | 0.355 | 0.383   | 0.645    | 1351   | 0.2136 | 0.284
gpt-4o            | 5  | 0.191 | 0.286   | 0.809    | 1705   | 0.0146 | 0.168
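The Tradeoff column is not defined here, but its values are consistent with average reliability divided by average latency in seconds. That is an inference from the numbers, checked to three decimals against most rows (small discrepancies can come from rounding of the inputs):

```python
def tradeoff(avg_rel, avg_ms):
    """Reliability per second of latency: higher is better.

    Inferred from the table values, not documented behavior.
    """
    return round(avg_rel / (avg_ms / 1000), 3)
```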
Advanced: side-by-side trace diff

Open two trace detail pages in separate tabs to compare prompt, evidence, grounding, and verdict side by side. A future release could automatically link traces that share the same prompt hash or experiment case ID.
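A sketch of the linking idea, keying traces by a normalized prompt hash or an experiment case ID. The whitespace/case normalization rule (deciding what counts as the "same" prompt) is an assumption:

```python
import hashlib

def prompt_hash(prompt, experiment_case_id=None):
    """Stable key for linking traces that share a prompt or case.

    An explicit experiment case ID wins; otherwise the prompt is
    normalized (collapsed whitespace, lowercased -- an assumption)
    and hashed so equivalent prompts map to the same key.
    """
    if experiment_case_id is not None:
        return f"case:{experiment_case_id}"
    normalized = " ".join(prompt.split()).lower()
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
```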