From d4b40e2590fb2966ee13edf187d6c97fca91d50d Mon Sep 17 00:00:00 2001
From: alvis
Date: Mon, 11 May 2026 11:23:13 +0000
Subject: [PATCH] docs: document MLflow trace API, span inspection, and no-agent diagnosis

Co-Authored-By: Claude Sonnet 4.6
---
 CLAUDE.md | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 70 insertions(+), 4 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 4fd7d50..eac1f68 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -101,17 +101,83 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i
 
 All `httpx` calls in `ml/` must use `trust_env=False` to bypass the system proxy — same rule as `bw` and curl. Pattern: `httpx.Client(trust_env=False, timeout=N)`.
 
-MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient` — MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. Auth credential is `MLFLOW_ADMIN_PASSWORD`. MLflow REST API lives at the origin root (`/api/2.0/mlflow`), not under the `/mlflow` UI prefix.
+MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient` — MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. Auth credential is `MLFLOW_ADMIN_PASSWORD`. MLflow REST API lives at the origin root, not under the `/mlflow` UI prefix.
+
+### MLflow API versions — runs vs traces
+
+MLflow uses **two API versions** — use the right one or you'll get 405:
+
+| What | API prefix | Example |
+|------|-----------|---------|
+| Runs, experiments, metrics | `/api/2.0/mlflow/` | `runs/search`, `experiments/list` |
+| Traces (LLM observability) | `/api/3.0/mlflow/traces/` | `traces/{trace_id}` |
+
+**Experiment IDs:** `3` = oO/serving. Artifacts stored as run tags prefixed `artifact:`.
+
+### Querying from the host shell
+
+Always strip the proxy and pass `Host: localhost` (no port — `localhost:5000` fails the DNS-rebinding check).
 
-MLflow from the host shell — query with curl, no script needed:
 ```bash
+# Search recent runs (experiment 3)
 env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
   curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
   -X POST http://localhost:5000/api/2.0/mlflow/runs/search \
   -H "Content-Type: application/json" \
-  -d '{"experiment_ids":["3"],"max_results":1,"order_by":["start_time DESC"]}'
+  -d '{"experiment_ids":["3"],"max_results":5,"order_by":["start_time DESC"]}'
+
+# Get a trace by ID (note: /api/3.0/, not /api/2.0/)
+env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
+  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
+  http://localhost:5000/api/3.0/mlflow/traces/tr- | python3 -m json.tool
+```
+
+The trace response includes `trace_metadata.mlflow.traceInputs/Outputs`, `trace_metadata.mlflow.trace.sizeStats` (num_spans), and `tags.mlflow.traceName`.
+
+### Getting spans (Python client from inside the container)
+
+The REST API has **no endpoint for spans** — `/api/3.0/mlflow/traces/{id}/spans` returns 404. Use the Python client inside `oo-ml-serving-1`:
+
+```bash
+docker exec oo-ml-serving-1 python3 -c "
+import mlflow, json, os
+mlflow.set_tracking_uri('http://mlflow:5000')
+os.environ['MLFLOW_TRACKING_USERNAME'] = 'admin'
+os.environ['MLFLOW_TRACKING_PASSWORD'] = os.environ.get('MLFLOW_ADMIN_PASSWORD', '')
+
+client = mlflow.tracking.MlflowClient()
+trace = client.get_trace('tr-')
+for span in trace.data.spans:
+    print(span.name, '| parent:', span.parent_id, '| status:', span.status)
+    print(' inputs:', json.dumps(span.inputs)[:200])
+    print(' outputs:', json.dumps(span.outputs)[:200])
+    print(' attrs:', span.attributes)
+"
+```
+
+### Span structure for a tip generation trace
+
+A healthy `recommend` trace has 3 spans:
+
+| Span | Type | Parent | Key attributes |
+|------|------|--------|---------------|
+| `recommend` | CHAIN | (root) | `agent_count`, `latency_ms`; inputs include `agent_ids` list |
+| `build_context` | TOOL | recommend | `agent_count`, `task_count`, `science_destiny` |
+| `llm_orchestrator` | LLM | recommend | `prompt_tokens`, `completion_tokens`, `model`, `attempts` |
+
+### Diagnosing "no agents in trace"
+
+If the trace shows `agent_ids: []` and `agent_count: 0` in the root span, and the orchestrator prompt says *"No pre-computed agent context available"*, it means the recommender found zero eligible snippets at request time. Causes:
+
+1. **Agent compute hasn't run** — no `agent_outputs` rows for this user yet
+2. **Snippets expired** — TTL elapsed since last compute
+3. **Eligibility filter dropped all agents** — none passed the manifest-driven check
+
+Diagnose with:
+```bash
+docker exec oo-api-1 psql "$DATABASE_URL" -c \
+  "SELECT agent_id, computed_at, expires_at FROM agent_outputs WHERE user_id='' ORDER BY computed_at DESC LIMIT 10;"
 ```
-`Host: localhost` required (no port) — `localhost:5000` fails the DNS-rebinding check. Experiment IDs: `3`=oO/serving. Artifacts stored as run tags prefixed `artifact:`.
 
 **Multi-agent tip generation pipeline (ADR-0013):**
 1. Pre-compute agents (`ml/agents//`) run on a schedule, each emitting a snippet into `agent_outputs` with a per-agent TTL