From d4b40e2590fb2966ee13edf187d6c97fca91d50d Mon Sep 17 00:00:00 2001
From: alvis <allogn@gmail.com>
Date: Mon, 11 May 2026 11:23:13 +0000
Subject: [PATCH] docs: document MLflow trace API, span inspection, and
 no-agent diagnosis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 CLAUDE.md | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 70 insertions(+), 4 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
index 4fd7d50..eac1f68 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -101,17 +101,83 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i
 
 All `httpx` calls in `ml/` must use `trust_env=False` to bypass the system proxy — same rule as `bw` and curl. Pattern: `httpx.Client(trust_env=False, timeout=N)`.
 
-MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient` — MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. Auth credential is `MLFLOW_ADMIN_PASSWORD`. MLflow REST API lives at the origin root (`/api/2.0/mlflow`), not under the `/mlflow` UI prefix.
+MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient` — MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. Auth credential is `MLFLOW_ADMIN_PASSWORD`. MLflow REST API lives at the origin root, not under the `/mlflow` UI prefix.
+
+### MLflow API versions — runs vs traces
+
+MLflow serves runs and traces under **two different API versions**. Calling an endpoint under the wrong version returns 405:
+
+| What | API prefix | Example |
+|------|-----------|---------|
+| Runs, experiments, metrics | `/api/2.0/mlflow/` | `runs/search`, `experiments/search` |
+| Traces (LLM observability) | `/api/3.0/mlflow/` | `traces/{trace_id}` |
+
+**Experiment IDs:** `3` = oO/serving. Artifacts are stored as run tags prefixed `artifact:<path>`.
+
+### Querying from the host shell
+
+Always unset the proxy environment variables and pass `Host: localhost` with no port: `localhost:5000` fails the DNS-rebinding check.
 
-MLflow from the host shell — query with curl, no script needed:
 ```bash
+# Search recent runs (experiment 3)
 env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
   curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
   -X POST http://localhost:5000/api/2.0/mlflow/runs/search \
   -H "Content-Type: application/json" \
-  -d '{"experiment_ids":["3"],"max_results":1,"order_by":["start_time DESC"]}'
+  -d '{"experiment_ids":["3"],"max_results":5,"order_by":["start_time DESC"]}'
+
+# Get a trace by ID (note: /api/3.0/, not /api/2.0/)
+env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
+  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
+  http://localhost:5000/api/3.0/mlflow/traces/tr-<trace_id> | python3 -m json.tool
+```
+
+The trace response includes `trace_metadata.mlflow.traceInputs` and `trace_metadata.mlflow.traceOutputs`, `trace_metadata.mlflow.trace.sizeStats` (which carries `num_spans`), and `tags.mlflow.traceName`.
+
+### Getting spans (Python client from inside the container)
+
+The REST API has **no endpoint for spans** — `/api/3.0/mlflow/traces/{id}/spans` returns 404. Use the Python client inside `oo-ml-serving-1`:
+
+```bash
+docker exec oo-ml-serving-1 python3 -c "
+import mlflow, json, os
+mlflow.set_tracking_uri('http://mlflow:5000')
+os.environ['MLFLOW_TRACKING_USERNAME'] = 'admin'
+os.environ['MLFLOW_TRACKING_PASSWORD'] = os.environ.get('MLFLOW_ADMIN_PASSWORD', '')
+
+client = mlflow.tracking.MlflowClient()
+trace = client.get_trace('tr-<trace_id>')
+for span in trace.data.spans:
+    print(span.name, '| parent:', span.parent_id, '| status:', span.status)
+    print('  inputs:', json.dumps(span.inputs, default=str)[:200])
+    print('  outputs:', json.dumps(span.outputs, default=str)[:200])
+    print('  attrs:', span.attributes)
+"
+```
+
+### Span structure for a tip generation trace
+
+A healthy `recommend` trace has 3 spans:
+
+| Span | Type | Parent | Key attributes |
+|------|------|--------|---------------|
+| `recommend` | CHAIN | (root) | `agent_count`, `latency_ms`; inputs include `agent_ids` list |
+| `build_context` | TOOL | recommend | `agent_count`, `task_count`, `science_destiny` |
+| `llm_orchestrator` | LLM | recommend | `prompt_tokens`, `completion_tokens`, `model`, `attempts` |
+
+### Diagnosing "no agents in trace"
+
+If the root span shows `agent_ids: []` and `agent_count: 0`, and the orchestrator prompt says *"No pre-computed agent context available"*, the recommender found zero eligible snippets at request time. Possible causes:
+
+1. **Agent compute hasn't run** — no `agent_outputs` rows for this user yet
+2. **Snippets expired** — TTL elapsed since last compute
+3. **Eligibility filter dropped all agents** — none passed the manifest-driven check
+
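+To tell these apart, inspect the user's latest `agent_outputs` rows: no rows points to cause 1, `expires_at` values in the past point to cause 2, and unexpired rows point to cause 3: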
+```bash
+docker exec oo-api-1 psql "$DATABASE_URL" -c \
+  "SELECT agent_id, computed_at, expires_at FROM agent_outputs WHERE user_id='<uid>' ORDER BY computed_at DESC LIMIT 10;"
 ```
-`Host: localhost` required (no port) — `localhost:5000` fails the DNS-rebinding check. Experiment IDs: `3`=oO/serving. Artifacts stored as run tags prefixed `artifact:<path>`.
 
 **Multi-agent tip generation pipeline (ADR-0013):**
 1. Pre-compute agents (`ml/agents/<id>/`) run on a schedule, each emitting a snippet into `agent_outputs` with a per-agent TTL