In progress — we're building and shipping updates often. More features and insights for LLM evaluation and AI agent observability are on the way.

Experiments

Evaluation runs and reliability algorithms: pick a test and inspect example inputs and outputs.

Registry → fetch → validate → provenance → runner → POST traces. This is the same pipeline shown in the Data page diagram; the sketch after the slice request below ties the stages together.

Slice request (conceptual)

{
  "source_id": "squad_v2",
  "split": "validation",
  "offset": 0,
  "limit": 10,
  "use_cache": true
}
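
To make the flow concrete, here is a minimal sketch tying the slice request fields to the pipeline stages above. The fetch/validate/score helpers are illustrative stubs, not the runner's real internals, and the POST path is an assumption based on the traces endpoint used elsewhere on this page.

# Illustrative sketch only: slice request fields -> pipeline stages -> POST traces.
# fetch_slice, validate_row, and score_row are hypothetical stand-ins for the real runner code.
import time
import requests

API_URL = "http://localhost:8000"  # assumption: wherever this API is served

def fetch_slice(req: dict) -> list[dict]:
    # Stand-in for the registry fetch; returns `limit` placeholder rows starting at `offset`.
    return [{"question": f"q{i}", "context": "..."} for i in range(req["offset"], req["offset"] + req["limit"])]

def validate_row(row: dict) -> bool:
    # Stand-in schema check.
    return bool(row.get("question"))

def score_row(row: dict) -> dict:
    # Stand-in for the model call plus scoring.
    return {"input": row["question"], "output": "...", "grounding_score": 0.0}

def run_slice(slice_req: dict) -> None:
    start = time.monotonic()
    rows = [r for r in fetch_slice(slice_req) if validate_row(r)]  # registry -> fetch -> validate
    stats = {
        "fetch_ms": int((time.monotonic() - start) * 1000),
        "rows_returned": len(rows),
        "cache_hit": False,  # stub: a real run reports whether the registry cache was hit
    }
    for row in rows:
        trace = score_row(row)                                     # runner
        trace["ingest_metadata"] = {                               # provenance attached to each trace
            "eval_lineage": {
                "descriptor": {"dataset_name": slice_req["source_id"], "split": slice_req["split"]},
                "pipeline_stats": stats,
            }
        }
        requests.post(f"{API_URL}/api/v1/traces", json=trace)      # POST traces

# run_slice({"source_id": "squad_v2", "split": "validation", "offset": 0, "limit": 10, "use_cache": True})

The ingested rows come back with exactly this lineage attached, as shown next.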

Trace rows + pipeline_stats

{
  "ingest_metadata": {
    "eval_lineage": {
      "descriptor": { "dataset_name": "squad_v2", ... },
      "pipeline_stats": {
        "fetch_ms": 842,
        "rows_returned": 10,
        "cache_hit": false
      }
    }
  },
  "grounding_score": 0.62,
  "claim_grounding": { "claims": [...], ... }
}

Command

PYTHONPATH=. python -m evaluation.runners.run_squad_eval --limit 10 --summary --experiment smoke
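
After a run like this, the ingested rows can be read back and summarised. A minimal sketch, assuming GET /api/v1/traces (the endpoint used under Compare models below) returns a JSON list of trace rows shaped like the example above; authentication is omitted.

# Read back recent traces and print the lineage stats and scores shown above.
import requests

API_URL = "http://localhost:8000"  # assumption: this API's base URL

for trace in requests.get(f"{API_URL}/api/v1/traces").json():
    lineage = trace.get("ingest_metadata", {}).get("eval_lineage", {})
    stats = lineage.get("pipeline_stats", {})
    print(
        lineage.get("descriptor", {}).get("dataset_name"),
        f"fetch_ms={stats.get('fetch_ms')}",
        f"cache_hit={stats.get('cache_hit')}",
        f"grounding_score={trace.get('grounding_score')}",
    )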

Run experiments (API)

Uses ADMIN_API_KEY. Presets and custom runs execute on the API machine; on that host, set PIPELINE_TESTS_REPO_ROOT to your repo root and EVAL_LAB_TRACEDOG_URL to this API's URL so live ingests land here. LLM keys live in evaluation/.env on the same host.
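
The pattern for triggering a run outside the dashboard is an authenticated POST with the admin key. This is only a hedged sketch: the endpoint path, header name, and payload are assumptions, not the documented API; only ADMIN_API_KEY and the environment variables above come from this page.

# Hypothetical example of kicking off a preset run with the admin key.
# The path, header name, and payload are illustrative assumptions.
import os
import requests

API_URL = "http://localhost:8000"  # assumption: this API's base URL

resp = requests.post(
    f"{API_URL}/api/v1/experiments/run",                       # hypothetical path
    headers={"X-Admin-Api-Key": os.environ["ADMIN_API_KEY"]},  # hypothetical header name
    json={"preset": "squad_smoke", "limit": 10},               # hypothetical payload
)
resp.raise_for_status()
print(resp.json())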

Live LLM → trace → repair (dashboard)

Same flow as pytest tests/integration/test_repair_llm_live.py (policy preset). HotpotQA uses one bundled multi-hop row (same shape as run_hotpot_eval). OpenAI calls run on the API server, so OPENAI_API_KEY must be set on that host (e.g. in .env in TraceDog-backend), never in the browser. Uses ADMIN_API_KEY here. Creates a real trace_id in the API database and may incur OpenAI cost.
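
Conceptually the live path is: call the model on the API server, then ingest the answer as a trace that the repair flow can act on. A rough sketch of that server-side step, assuming the openai Python client and the traces POST used earlier; the row values, model name, and trace field names are illustrative, and the repair trigger itself is not shown.

# Rough sketch of the live LLM -> trace step (runs on the API host, where OPENAI_API_KEY is set).
import os
import requests
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

row = {  # illustrative multi-hop style row, not the bundled one
    "question": "Which country was the director of Film X born in?",
    "context": "Film X was directed by Director Y. Director Y was born in Country Z.",
}

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; the preset may use a different one
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"{row['context']}\n\nQuestion: {row['question']}"},
    ],
)
answer = completion.choices[0].message.content

trace = {  # field names are illustrative; the real ingest payload may differ
    "model_name": "gpt-4o-mini",
    "input": row["question"],
    "output": answer,
}
api_url = os.environ.get("EVAL_LAB_TRACEDOG_URL", "http://localhost:8000")
requests.post(f"{api_url}/api/v1/traces", json=trace)  # creates a real trace_id in the API database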

Compare models (from traces)

Loads recent traces via GET /api/v1/traces and groups by model_name. Filter by experiment_tag substring (use the same tag for each model run — e.g. lab-hotpot-compare) to see how metrics shift between models.
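
The same grouping can be reproduced client-side. A minimal sketch, assuming GET /api/v1/traces returns a JSON list of rows carrying model_name, experiment_tag, and grounding_score as in the example above (the exact response envelope may differ).

# Group recent traces by model and compare mean grounding_score for one experiment tag.
from collections import defaultdict
from statistics import mean
import requests

API_URL = "http://localhost:8000"  # assumption: this API's base URL
TAG = "lab-hotpot-compare"         # the shared tag used for every model run

traces = requests.get(f"{API_URL}/api/v1/traces").json()

by_model = defaultdict(list)
for t in traces:
    if TAG in (t.get("experiment_tag") or ""):  # substring filter, as in the dashboard
        by_model[t.get("model_name", "unknown")].append(t.get("grounding_score") or 0.0)

for model, scores in sorted(by_model.items()):
    print(f"{model}: n={len(scores)} mean grounding_score={mean(scores):.3f}")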