docs: document MLflow trace API, span inspection, and no-agent diagnosis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 11:23:13 +00:00
parent a0a069c525
commit d4b40e2590
1 changed files with 70 additions and 4 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -101,17 +101,83 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i
 All `httpx` calls in `ml/` must use `trust_env=False` to bypass the system proxy — same rule as `bw` and curl. Pattern: `httpx.Client(trust_env=False, timeout=N)`.
-MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient` — MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. Auth credential is `MLFLOW_ADMIN_PASSWORD`. MLflow REST API lives at the origin root (`/api/2.0/mlflow`), not under the `/mlflow` UI prefix.
+MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient` — MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. Auth credential is `MLFLOW_ADMIN_PASSWORD`. MLflow REST API lives at the origin root, not under the `/mlflow` UI prefix.
 ### MLflow API versions — runs vs traces
 MLflow uses **two API versions** — use the right one or you'll get 405:
 | What | API prefix | Example |
 |------|-----------|---------|
 | Runs, experiments, metrics | `/api/2.0/mlflow/` | `runs/search`, `experiments/search` |
 | Traces (LLM observability) | `/api/3.0/mlflow/traces/` | `traces/{trace_id}` |
 **Experiment IDs:** `3` = oO/serving. Artifacts stored as run tags prefixed `artifact:<path>`.
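The prefix table above can be encoded as a small lookup so callers cannot mix the two API versions by accident. This is a minimal sketch: `API_PREFIXES` and `mlflow_url` are hypothetical names that do not exist in the repo, and the default origin is the host-shell address used in the curl examples.

```python
# Hypothetical helper: maps a resource family to the API prefix from the table.
API_PREFIXES = {
    "runs": "/api/2.0/mlflow/",
    "experiments": "/api/2.0/mlflow/",
    "metrics": "/api/2.0/mlflow/",
    "traces": "/api/3.0/mlflow/",
}


def mlflow_url(resource: str, path: str, origin: str = "http://localhost:5000") -> str:
    """Build a full REST URL, choosing the API version from the resource family."""
    try:
        prefix = API_PREFIXES[resource]
    except KeyError:
        raise ValueError(f"unknown MLflow resource family: {resource!r}") from None
    return origin.rstrip("/") + prefix + path.lstrip("/")


runs_url = mlflow_url("runs", "runs/search")
trace_url = mlflow_url("traces", "traces/tr-abc")
```

Routing every call through one helper keeps the 2.0/3.0 split in a single place instead of scattering hard-coded prefixes.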
 ### Querying from the host shell
 Always strip the proxy variables and send `Host: localhost` with no port, because `localhost:5000` fails the DNS-rebinding check. Plain curl is enough; no script is needed:
 ```bash
 # Search recent runs (experiment 3)
 env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
  -X POST http://localhost:5000/api/2.0/mlflow/runs/search \
  -H "Content-Type: application/json" \
-  -d '{"experiment_ids":["3"],"max_results":1,"order_by":["start_time DESC"]}'
+  -d '{"experiment_ids":["3"],"max_results":5,"order_by":["start_time DESC"]}'
 # Get a trace by ID (note: /api/3.0/, not /api/2.0/)
 env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
  http://localhost:5000/api/3.0/mlflow/traces/tr-<trace_id> | python3 -m json.tool
 ```
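If the same queries are needed from a Python script on the host, the curl flags above translate roughly as follows. This is a sketch using only the standard library, not code from the repo: `build_request` and `send` are hypothetical names, and an empty `ProxyHandler` is used here as the stdlib equivalent of unsetting the proxy variables.

```python
import base64
import json
import urllib.request


def build_request(url, password, body=None, user="admin"):
    """Build a request with the Host override and basic auth used by the curl calls."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Host": "localhost",  # no port, to pass the DNS-rebinding check
        "Authorization": f"Basic {token}",
    }
    data = None
    if body is not None:
        data = json.dumps(body).encode()
        headers["Content-Type"] = "application/json"
    method = "POST" if data is not None else "GET"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request, timeout=10.0):
    """Send the request with proxies disabled and decode the JSON response."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as resp:
        return json.load(resp)


# Mirrors the runs/search curl call; nothing is sent until send() is called.
search_request = build_request(
    "http://localhost:5000/api/2.0/mlflow/runs/search",
    password="example-password",  # placeholder; read MLFLOW_ADMIN_PASSWORD in real use
    body={"experiment_ids": ["3"], "max_results": 5, "order_by": ["start_time DESC"]},
)
```

Calling `send(search_request)` would perform the actual request; it is left out of the sketch so the block can be run without a live MLflow server.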
 The trace response includes `mlflow.traceInputs`, `mlflow.traceOutputs`, and `mlflow.trace.sizeStats` (which carries `num_spans`) under `trace_metadata`, plus `mlflow.traceName` under `tags`.
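The fields named above can be pulled out of the JSON response with a few lines of Python. This is a sketch under assumptions: the outer nesting of the response is not specified here, so the hypothetical `find_trace_info` searches for the first object that holds `trace_metadata`, and `sizeStats` is accepted either as a dict or as a JSON-encoded string. The sample payload is made up for illustration.

```python
import json
from typing import Any, Optional


def find_trace_info(payload: Any) -> Optional[dict]:
    """Return the first nested dict that contains a 'trace_metadata' key."""
    if isinstance(payload, dict):
        if "trace_metadata" in payload:
            return payload
        for value in payload.values():
            found = find_trace_info(value)
            if found is not None:
                return found
    return None


def summarize_trace(payload: dict) -> dict:
    """Collect the trace name, inputs, outputs, and span count in one dict."""
    info = find_trace_info(payload) or {}
    meta = info.get("trace_metadata", {})
    stats = meta.get("mlflow.trace.sizeStats", {})
    if isinstance(stats, str):  # assumed: may arrive as a JSON string
        try:
            stats = json.loads(stats)
        except json.JSONDecodeError:
            stats = {}
    if not isinstance(stats, dict):
        stats = {}
    return {
        "name": info.get("tags", {}).get("mlflow.traceName"),
        "inputs": meta.get("mlflow.traceInputs"),
        "outputs": meta.get("mlflow.traceOutputs"),
        "num_spans": stats.get("num_spans"),
    }


# Illustrative payload only; the real response carries more fields.
sample = {
    "trace": {
        "trace_info": {
            "trace_metadata": {
                "mlflow.traceInputs": "{\"agent_ids\": []}",
                "mlflow.traceOutputs": "{\"tip\": \"...\"}",
                "mlflow.trace.sizeStats": "{\"num_spans\": 3}",
            },
            "tags": {"mlflow.traceName": "recommend"},
        }
    }
}
summary = summarize_trace(sample)
```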
 ### Getting spans (Python client from inside the container)
 The REST API has **no endpoint for spans** — `/api/3.0/mlflow/traces/{id}/spans` returns 404. Use the Python client inside `oo-ml-serving-1`:
 ```bash
 docker exec oo-ml-serving-1 python3 -c "
 import mlflow, json, os
 mlflow.set_tracking_uri('http://mlflow:5000')
 os.environ['MLFLOW_TRACKING_USERNAME'] = 'admin'
 os.environ['MLFLOW_TRACKING_PASSWORD'] = os.environ.get('MLFLOW_ADMIN_PASSWORD', '')
 client = mlflow.tracking.MlflowClient()
 trace = client.get_trace('tr-<trace_id>')
 for span in trace.data.spans:
    print(span.name, '| parent:', span.parent_id, '| status:', span.status)
    print('  inputs:', json.dumps(span.inputs, default=str)[:200])
    print('  outputs:', json.dumps(span.outputs, default=str)[:200])
    print('  attrs:', span.attributes)
 "
 ```
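The flat listing printed above can be hard to read for deeper traces. One option is to render the spans as an indented tree. This is a sketch, not repo code: `render_span_tree` is a hypothetical helper that only relies on the `name`, `span_id`, and `parent_id` attributes, and the sample spans are stand-ins built with `SimpleNamespace` rather than real MLflow objects.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import List


def render_span_tree(spans) -> List[str]:
    """Return one line per span, indented by its depth below the root."""
    known_ids = {s.span_id for s in spans}
    children = defaultdict(list)
    for s in spans:
        # Treat a span whose parent is missing from the trace as a root.
        parent = s.parent_id if s.parent_id in known_ids else None
        children[parent].append(s)

    lines: List[str] = []

    def walk(parent_id, depth):
        for s in children.get(parent_id, []):
            lines.append("  " * depth + s.name)
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Stand-in spans for illustration only.
sample_spans = [
    SimpleNamespace(name="recommend", span_id="a", parent_id=None),
    SimpleNamespace(name="build_context", span_id="b", parent_id="a"),
    SimpleNamespace(name="llm_orchestrator", span_id="c", parent_id="a"),
]
tree = render_span_tree(sample_spans)
```

Inside the container, the same function could be applied to `trace.data.spans` from the snippet above.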
 ### Span structure for a tip generation trace
 A healthy `recommend` trace has 3 spans:
 | Span | Type | Parent | Key attributes |
 |------|------|--------|---------------|
 | `recommend` | CHAIN | (root) | `agent_count`, `latency_ms`; inputs include `agent_ids` list |
 | `build_context` | TOOL | recommend | `agent_count`, `task_count`, `science_destiny` |
 | `llm_orchestrator` | LLM | recommend | `prompt_tokens`, `completion_tokens`, `model`, `attempts` |
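
The expected shape in the table above can be checked mechanically. This is a sketch: `EXPECTED_PARENTS` and `check_recommend_trace` are hypothetical names, the check covers only span names and parent links (not types or attributes), and the sample spans are stand-ins rather than real MLflow objects.

```python
from types import SimpleNamespace
from typing import List

# Span name -> expected parent span name, taken from the table above.
EXPECTED_PARENTS = {
    "recommend": None,
    "build_context": "recommend",
    "llm_orchestrator": "recommend",
}


def check_recommend_trace(spans) -> List[str]:
    """Return a list of structural problems; an empty list means the shape matches."""
    by_id = {s.span_id: s for s in spans}
    by_name = {s.name: s for s in spans}
    problems: List[str] = []
    for name, expected_parent in EXPECTED_PARENTS.items():
        span = by_name.get(name)
        if span is None:
            problems.append(f"missing span: {name}")
            continue
        parent = by_id.get(span.parent_id)
        actual_parent = parent.name if parent is not None else None
        if actual_parent != expected_parent:
            problems.append(
                f"{name}: parent is {actual_parent!r}, expected {expected_parent!r}"
            )
    extra = sorted(set(by_name) - set(EXPECTED_PARENTS))
    if extra:
        problems.append("unexpected spans: " + ", ".join(extra))
    return problems


# Stand-in spans for illustration only.
healthy = [
    SimpleNamespace(name="recommend", span_id="a", parent_id=None),
    SimpleNamespace(name="build_context", span_id="b", parent_id="a"),
    SimpleNamespace(name="llm_orchestrator", span_id="c", parent_id="a"),
]
healthy_problems = check_recommend_trace(healthy)
truncated_problems = check_recommend_trace(healthy[:2])
```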
 ### Diagnosing "no agents in trace"
 If the trace shows `agent_ids: []` and `agent_count: 0` in the root span, and the orchestrator prompt says *"No pre-computed agent context available"*, it means the recommender found zero eligible snippets at request time. Causes:
 1. **Agent compute hasn't run** — no `agent_outputs` rows for this user yet
 2. **Snippets expired** — TTL elapsed since last compute
 3. **Eligibility filter dropped all agents** — none passed the manifest-driven check
 Diagnose with:
 ```bash
 docker exec oo-api-1 psql "$DATABASE_URL" -c \
  "SELECT agent_id, computed_at, expires_at FROM agent_outputs WHERE user_id='<uid>' ORDER BY computed_at DESC LIMIT 10;"
 ```
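The rows returned by that query map onto the three causes listed above. This is a sketch of that mapping, assuming timezone-aware timestamps and non-null `expires_at` values: `classify_no_agents` is a hypothetical helper, and the third cause is reached by elimination, so it still needs confirming against the eligibility check itself.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

# (agent_id, computed_at, expires_at), matching the SELECT column order above.
Row = Tuple[str, datetime, datetime]


def classify_no_agents(rows: Iterable[Row], now: Optional[datetime] = None) -> str:
    """Pick the most likely cause of an empty agent list from agent_outputs rows."""
    now = now or datetime.now(timezone.utc)
    rows = list(rows)
    if not rows:
        return "agent compute has not run"
    if all(expires_at <= now for _, _, expires_at in rows):
        return "snippets expired"
    # Live snippets exist, so the remaining candidate is the eligibility filter.
    return "eligibility filter dropped all agents"


# Illustrative rows only; the agent id is made up.
example_now = datetime(2026, 5, 11, 12, 0, tzinfo=timezone.utc)
expired_rows = [
    ("example-agent", example_now - timedelta(hours=5), example_now - timedelta(hours=1)),
]
live_rows = [
    ("example-agent", example_now - timedelta(hours=1), example_now + timedelta(hours=3)),
]
```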
 **Multi-agent tip generation pipeline (ADR-0013):**
 1. Pre-compute agents (`ml/agents/<id>/`) run on a schedule, each emitting a snippet into `agent_outputs` with a per-agent TTL