In progress — we're building and shipping updates often. More features and insights for LLM evaluation and AI agent observability are on the way.

Experiments

Evaluation runs and reliability algorithms: pick a test and inspect example inputs and outputs.

Registry → fetch → validate → provenance → runner → POST traces. This is the same pipeline shown in the Data page diagram; the sketch after the slice request below ties the stages together.

Slice request (conceptual)

{
  "source_id": "squad_v2",
  "split": "validation",
  "offset": 0,
  "limit": 10,
  "use_cache": true
}
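
To make the flow concrete, here is a minimal sketch tying the slice request fields to the pipeline stages above. The fetch/validate/score helpers are illustrative stubs, not the runner's real internals, and the POST path is an assumption based on the traces endpoint used elsewhere on this page.

# Illustrative sketch only: slice request fields -> pipeline stages -> POST traces.
# fetch_slice, validate_row, and score_row are hypothetical stand-ins for the real runner code.
import time
import requests

API_URL = "http://localhost:8000"  # assumption: wherever this API is served

def fetch_slice(req: dict) -> list[dict]:
    # Stand-in for the registry fetch; returns `limit` placeholder rows starting at `offset`.
    return [{"question": f"q{i}", "context": "..."} for i in range(req["offset"], req["offset"] + req["limit"])]

def validate_row(row: dict) -> bool:
    # Stand-in schema check.
    return bool(row.get("question"))

def score_row(row: dict) -> dict:
    # Stand-in for the model call plus scoring.
    return {"input": row["question"], "output": "...", "grounding_score": 0.0}

def run_slice(slice_req: dict) -> None:
    start = time.monotonic()
    rows = [r for r in fetch_slice(slice_req) if validate_row(r)]  # registry -> fetch -> validate
    stats = {
        "fetch_ms": int((time.monotonic() - start) * 1000),
        "rows_returned": len(rows),
        "cache_hit": False,  # stub: a real run reports whether the registry cache was hit
    }
    for row in rows:
        trace = score_row(row)                                     # runner
        trace["ingest_metadata"] = {                               # provenance attached to each trace
            "eval_lineage": {
                "descriptor": {"dataset_name": slice_req["source_id"], "split": slice_req["split"]},
                "pipeline_stats": stats,
            }
        }
        requests.post(f"{API_URL}/api/v1/traces", json=trace)      # POST traces

# run_slice({"source_id": "squad_v2", "split": "validation", "offset": 0, "limit": 10, "use_cache": True})

The ingested rows come back with exactly this lineage attached, as shown next.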

Trace rows + pipeline_stats

{
  "ingest_metadata": {
    "eval_lineage": {
      "descriptor": { "dataset_name": "squad_v2", ... },
      "pipeline_stats": {
        "fetch_ms": 842,
        "rows_returned": 10,
        "cache_hit": false
      }
    }
  },
  "grounding_score": 0.62,
  "claim_grounding": { "claims": [...], ... }
}

Command

PYTHONPATH=. python -m evaluation.runners.run_squad_eval --limit 10 --summary --experiment smoke
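
After a run like this, the ingested rows can be read back and summarised. A minimal sketch, assuming GET /api/v1/traces (the endpoint used under Compare models below) returns a JSON list of trace rows shaped like the example above; authentication is omitted.

# Read back recent traces and print the lineage stats and scores shown above.
import requests

API_URL = "http://localhost:8000"  # assumption: this API's base URL

for trace in requests.get(f"{API_URL}/api/v1/traces").json():
    lineage = trace.get("ingest_metadata", {}).get("eval_lineage", {})
    stats = lineage.get("pipeline_stats", {})
    print(
        lineage.get("descriptor", {}).get("dataset_name"),
        f"fetch_ms={stats.get('fetch_ms')}",
        f"cache_hit={stats.get('cache_hit')}",
        f"grounding_score={trace.get('grounding_score')}",
    )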

Run experiments (API)

Uses ADMIN_API_KEY. Presets and custom runs execute on the API machine; on that host, set PIPELINE_TESTS_REPO_ROOT to your repo root and EVAL_LAB_TRACEDOG_URL to this API's URL so live ingests land here. LLM keys live in evaluation/.env on the same host.
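
The pattern for triggering a run outside the dashboard is an authenticated POST with the admin key. This is only a hedged sketch: the endpoint path, header name, and payload are assumptions, not the documented API; only ADMIN_API_KEY and the environment variables above come from this page.

# Hypothetical example of kicking off a preset run with the admin key.
# The path, header name, and payload are illustrative assumptions.
import os
import requests

API_URL = "http://localhost:8000"  # assumption: this API's base URL

resp = requests.post(
    f"{API_URL}/api/v1/experiments/run",                       # hypothetical path
    headers={"X-Admin-Api-Key": os.environ["ADMIN_API_KEY"]},  # hypothetical header name
    json={"preset": "squad_smoke", "limit": 10},               # hypothetical payload
)
resp.raise_for_status()
print(resp.json())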

Live LLM → trace → repair (dashboard)

Same flow as pytest tests/integration/test_repair_llm_live.py (policy preset). HotpotQA uses one bundled multi-hop row (same shape as run_hotpot_eval). OpenAI calls run on the API server, so OPENAI_API_KEY must be set on that host (e.g. in .env in TraceDog-backend), never in the browser. Uses ADMIN_API_KEY here. Creates a real trace_id in the API database and may incur OpenAI cost.
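
Conceptually the live path is: call the model on the API server, then ingest the answer as a trace that the repair flow can act on. A rough sketch of that server-side step, assuming the openai Python client and the traces POST used earlier; the row values, model name, and trace field names are illustrative, and the repair trigger itself is not shown.

# Rough sketch of the live LLM -> trace step (runs on the API host, where OPENAI_API_KEY is set).
import os
import requests
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

row = {  # illustrative multi-hop style row, not the bundled one
    "question": "Which country was the director of Film X born in?",
    "context": "Film X was directed by Director Y. Director Y was born in Country Z.",
}

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; the preset may use a different one
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"{row['context']}\n\nQuestion: {row['question']}"},
    ],
)
answer = completion.choices[0].message.content

trace = {  # field names are illustrative; the real ingest payload may differ
    "model_name": "gpt-4o-mini",
    "input": row["question"],
    "output": answer,
}
api_url = os.environ.get("EVAL_LAB_TRACEDOG_URL", "http://localhost:8000")
requests.post(f"{api_url}/api/v1/traces", json=trace)  # creates a real trace_id in the API database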

Compare models (from traces)

Loads recent traces via GET /api/v1/traces and groups by model_name. Filter by experiment_tag substring (use the same tag for each model run — e.g. lab-hotpot-compare) to see how metrics shift between models.
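
The same grouping can be reproduced client-side. A minimal sketch, assuming GET /api/v1/traces returns a JSON list of rows carrying model_name, experiment_tag, and grounding_score as in the example above (the exact response envelope may differ).

# Group recent traces by model and compare mean grounding_score for one experiment tag.
from collections import defaultdict
from statistics import mean
import requests

API_URL = "http://localhost:8000"  # assumption: this API's base URL
TAG = "lab-hotpot-compare"         # the shared tag used for every model run

traces = requests.get(f"{API_URL}/api/v1/traces").json()

by_model = defaultdict(list)
for t in traces:
    if TAG in (t.get("experiment_tag") or ""):  # substring filter, as in the dashboard
        by_model[t.get("model_name", "unknown")].append(t.get("grounding_score") or 0.0)

for model, scores in sorted(by_model.items()):
    print(f"{model}: n={len(scores)} mean grounding_score={mean(scores):.3f}")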