Evaluation runs and reliability algorithms: pick a test and inspect example inputs and outputs.

Example dataset request:
{
"source_id": "squad_v2",
"split": "validation",
"offset": 0,
"limit": 10,
"use_cache": true
}

Example ingest response (abridged):

{
"ingest_metadata": {
"eval_lineage": {
"descriptor": { "dataset_name": "squad_v2", ... },
"pipeline_stats": {
"fetch_ms": 842,
"rows_returned": 10,
"cache_hit": false
}
}
},
"grounding_score": 0.62,
"claim_grounding": { "claims": [...], ... }
}

Local CLI run:

PYTHONPATH=. python -m evaluation.runners.run_squad_eval --limit 10 --summary --experiment smoke

Uses ADMIN_API_KEY. Presets and custom runs execute on the API machine; set PIPELINE_TESTS_REPO_ROOT to your repo root and EVAL_LAB_TRACEDOG_URL so live ingests hit this API. LLM keys live in evaluation/.env on that host.
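For scripted runs, here is a minimal sketch of the same call over HTTP. The payload mirrors the example request above and EVAL_LAB_TRACEDOG_URL and ADMIN_API_KEY come from the notes above, but the endpoint path (/api/v1/lab/ingest), the X-Admin-Key header name, and the response shape are assumptions, not documented API surface:

```python
# Hedged sketch: trigger the squad_v2 preset over HTTP.
# ASSUMED: endpoint path and X-Admin-Key header; not a documented API.
import os

import requests

BASE_URL = os.environ.get("EVAL_LAB_TRACEDOG_URL", "http://localhost:8000")

payload = {
    "source_id": "squad_v2",
    "split": "validation",
    "offset": 0,
    "limit": 10,
    "use_cache": True,
}

resp = requests.post(
    f"{BASE_URL}/api/v1/lab/ingest",  # ASSUMED path
    json=payload,
    headers={"X-Admin-Key": os.environ["ADMIN_API_KEY"]},  # ASSUMED header name
    timeout=120,
)
resp.raise_for_status()
body = resp.json()

# Field names below match the example response above.
stats = body.get("ingest_metadata", {}).get("eval_lineage", {}).get("pipeline_stats", {})
print("grounding_score:", body.get("grounding_score"))
print("fetch_ms:", stats.get("fetch_ms"), "cache_hit:", stats.get("cache_hit"))
```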
Same flow as pytest tests/integration/test_repair_llm_live.py (policy preset). HotpotQA uses one bundled multi-hop row (same shape as run_hotpot_eval). OpenAI calls execute on the API server, so OPENAI_API_KEY must be set on that host (e.g. in .env in TraceDog-backend), never in the browser. This preset also uses ADMIN_API_KEY, creates a real trace_id in the API database, and may incur OpenAI cost.
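A small preflight sketch, assuming you run it from the repo root on the API host: it checks the env vars named above before invoking the live test, since the flow makes real OpenAI calls. The wrapper itself is illustrative, not part of the repo:

```python
# Hedged sketch: preflight for the live policy preset. Env var names and the
# pytest path come from the docs above; the wrapper itself is an ASSUMPTION.
import os
import subprocess
import sys

REQUIRED = ["OPENAI_API_KEY", "ADMIN_API_KEY"]  # must be set on the API host

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"refusing live run, missing env: {', '.join(missing)}")

# Creates a real trace_id in the API database and may incur OpenAI cost.
subprocess.run(
    ["pytest", "tests/integration/test_repair_llm_live.py", "-q"],
    check=True,
)
```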
Loads recent traces via GET /api/v1/traces and groups them by model_name. Filter by experiment_tag substring (use the same tag for each model run, e.g. lab-hotpot-compare) to see how metrics shift between models.
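A minimal sketch of that comparison: GET /api/v1/traces and the model_name, experiment_tag, and grounding_score fields come from this page, while the base URL fallback, the X-Admin-Key header, and the "traces" response envelope are assumptions:

```python
# Hedged sketch of the model-comparison view: list recent traces, keep those
# whose experiment_tag contains the shared run tag, group by model_name, and
# summarize grounding_score. ASSUMED: auth header and response envelope.
from collections import defaultdict
import os
import statistics

import requests

BASE_URL = os.environ.get("EVAL_LAB_TRACEDOG_URL", "http://localhost:8000")
TAG = "lab-hotpot-compare"  # use the same tag for every model run

resp = requests.get(
    f"{BASE_URL}/api/v1/traces",
    headers={"X-Admin-Key": os.environ["ADMIN_API_KEY"]},  # ASSUMED header name
    timeout=30,
)
resp.raise_for_status()
traces = resp.json().get("traces", [])  # ASSUMED envelope key

by_model = defaultdict(list)
for trace in traces:
    if TAG in (trace.get("experiment_tag") or ""):
        by_model[trace.get("model_name", "unknown")].append(trace)

for model, rows in sorted(by_model.items()):
    scores = [t["grounding_score"] for t in rows if "grounding_score" in t]
    mean = statistics.mean(scores) if scores else float("nan")
    print(f"{model}: n={len(rows)} mean grounding_score={mean:.3f}")
```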