docs: document MLflow trace API, span inspection, and no-agent diagnosis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 11:23:13 +00:00
parent a0a069c525
commit d4b40e2590


@@ -101,17 +101,83 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i
All `httpx` calls in `ml/` must use `trust_env=False` to bypass the system proxy (the same rule as for `bw` and curl). Pattern: `httpx.Client(trust_env=False, timeout=N)`.
MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient`, because MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. Auth credential is `MLFLOW_ADMIN_PASSWORD`. MLflow REST API lives at the origin root, not under the `/mlflow` UI prefix.
### MLflow API versions — runs vs traces
MLflow uses **two API versions** — use the right one or you'll get 405:
| What | API prefix | Example |
|------|-----------|---------|
| Runs, experiments, metrics | `/api/2.0/mlflow/` | `runs/search`, `experiments/list` |
| Traces (LLM observability) | `/api/3.0/mlflow/traces/` | `traces/{trace_id}` |
**Experiment IDs:** `3` = oO/serving. Artifacts stored as run tags prefixed `artifact:<path>`.
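As a minimal sketch of the routing rule in the table above, the helper below picks the API prefix from the resource path. `mlflow_url` and the two constants are hypothetical names introduced here for illustration; they are not part of MLflow or of this repo.

```python
# Hypothetical helper: chooses the API prefix according to the table above.
RUNS_PREFIX = "/api/2.0/mlflow/"            # runs, experiments, metrics
TRACES_PREFIX = "/api/3.0/mlflow/traces/"   # traces (LLM observability)


def mlflow_url(base: str, resource: str) -> str:
    """Return the full REST URL for a resource such as 'runs/search' or 'traces/tr-abc'."""
    base = base.rstrip("/")
    if resource.startswith("traces/"):
        # Trace resources live under the 3.0 prefix; drop the leading 'traces/'
        # because the prefix already ends with it.
        return base + TRACES_PREFIX + resource[len("traces/"):]
    return base + RUNS_PREFIX + resource


print(mlflow_url("http://localhost:5000", "runs/search"))
print(mlflow_url("http://localhost:5000", "traces/tr-abc"))
```

Keeping the prefix choice in one place makes it harder to send a trace request to `/api/2.0/` by accident.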
### Querying from the host shell
Always strip the proxy and pass `Host: localhost` (no port — `localhost:5000` fails the DNS-rebinding check).
MLflow from the host shell — query with curl, no script needed:
```bash
# Search recent runs (experiment 3)
env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
-X POST http://localhost:5000/api/2.0/mlflow/runs/search \
-H "Content-Type: application/json" \
-d '{"experiment_ids":["3"],"max_results":5,"order_by":["start_time DESC"]}'
# Get a trace by ID (note: /api/3.0/, not /api/2.0/)
env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
"http://localhost:5000/api/3.0/mlflow/traces/tr-<trace_id>" | python3 -m json.tool
```
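If a script is more convenient than curl, the same two requests can be sketched with the Python standard library. This is an illustration under assumptions, not the repo's client: `build_request` and `fetch` are hypothetical names, and `urllib` is used here instead of `httpx` only so the example has no third-party dependency. An empty `ProxyHandler` plays the role of `env -u ...` by ignoring proxy environment variables.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:5000"

# An empty ProxyHandler disables proxy auto-detection from the environment.
NO_PROXY_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def build_request(path, password, payload=None):
    """Build a request with Host: localhost (no port) and basic auth for 'admin'.

    A payload makes it a JSON POST (runs/search); no payload makes it a GET (trace by ID).
    """
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    token = base64.b64encode(f"admin:{password}".encode("utf-8")).decode("ascii")
    headers = {"Host": "localhost", "Authorization": f"Basic {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    method = "GET" if data is None else "POST"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def fetch(request, timeout=10):
    """Send the request without any proxy and decode the JSON response."""
    with NO_PROXY_OPENER.open(request, timeout=timeout) as resp:
        return json.load(resp)


# Requests equivalent to the two curl commands above (nothing is sent until fetch() is called).
runs_request = build_request(
    "/api/2.0/mlflow/runs/search",
    password="<MLFLOW_ADMIN_PASSWORD>",
    payload={"experiment_ids": ["3"], "max_results": 5, "order_by": ["start_time DESC"]},
)
trace_request = build_request("/api/3.0/mlflow/traces/tr-<trace_id>", password="<MLFLOW_ADMIN_PASSWORD>")
```

Calling `fetch(runs_request)` sends the search; building the request separately from sending it keeps the header logic easy to inspect.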
The trace response includes `trace_metadata.mlflow.traceInputs/Outputs`, `trace_metadata.mlflow.trace.sizeStats` (num_spans), and `tags.mlflow.traceName`.
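The fields named above can be pulled into a compact summary. The sketch below makes several assumptions that should be checked against a real response: that the caller passes the object holding `trace_metadata` and `tags`, that the metadata keys are flat dotted strings such as `mlflow.traceInputs`, and that `sizeStats` may arrive either as an object or as a JSON-encoded string. `summarize_trace` and the sample payload are hypothetical.

```python
import json


def summarize_trace(trace_info: dict) -> dict:
    """Extract name, inputs, outputs, and span count from a trace-info object.

    Assumes flat dotted keys under 'trace_metadata' and 'tags' (an assumption,
    verify against your MLflow version).
    """
    meta = trace_info.get("trace_metadata", {})
    tags = trace_info.get("tags", {})

    size_stats = meta.get("mlflow.trace.sizeStats", {})
    if isinstance(size_stats, str):
        # Tolerate a JSON-encoded string as well as an already-decoded object.
        try:
            size_stats = json.loads(size_stats)
        except ValueError:
            size_stats = {}
    if not isinstance(size_stats, dict):
        size_stats = {}

    return {
        "name": tags.get("mlflow.traceName"),
        "inputs": meta.get("mlflow.traceInputs"),
        "outputs": meta.get("mlflow.traceOutputs"),
        "num_spans": size_stats.get("num_spans"),
    }


# Hypothetical payload, shaped only to exercise the function.
sample = {
    "trace_metadata": {
        "mlflow.traceInputs": '{"agent_ids": []}',
        "mlflow.traceOutputs": '{"tip": "..."}',
        "mlflow.trace.sizeStats": '{"num_spans": 3}',
    },
    "tags": {"mlflow.traceName": "recommend"},
}
print(summarize_trace(sample))
```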
### Getting spans (Python client from inside the container)
The REST API has **no endpoint for spans**: `/api/3.0/mlflow/traces/{id}/spans` returns 404. Use the Python client inside `oo-ml-serving-1`:
```bash
docker exec oo-ml-serving-1 python3 -c "
import mlflow, json, os
mlflow.set_tracking_uri('http://mlflow:5000')
os.environ['MLFLOW_TRACKING_USERNAME'] = 'admin'
os.environ['MLFLOW_TRACKING_PASSWORD'] = os.environ.get('MLFLOW_ADMIN_PASSWORD', '')
client = mlflow.tracking.MlflowClient()
trace = client.get_trace('tr-<trace_id>')
for span in trace.data.spans:
    print(span.name, '| parent:', span.parent_id, '| status:', span.status)
    print(' inputs:', json.dumps(span.inputs)[:200])
    print(' outputs:', json.dumps(span.outputs)[:200])
    print(' attrs:', span.attributes)
"
```
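The flat listing printed above shows each span's parent ID but not the nesting. A small sketch for rendering the hierarchy follows; it relies only on the `name`, `span_id`, and `parent_id` attributes of each span, and `render_span_tree` is a hypothetical helper. The stand-in objects exist only so the example runs without an MLflow server.

```python
from collections import defaultdict
from types import SimpleNamespace


def render_span_tree(spans):
    """Return one line per span, indented two spaces per nesting level."""
    known_ids = {s.span_id for s in spans}
    children = defaultdict(list)
    for s in spans:
        # Treat spans whose parent is absent from the trace as roots.
        parent = s.parent_id if s.parent_id in known_ids else None
        children[parent].append(s)

    lines = []

    def walk(parent_id, depth):
        for s in children.get(parent_id, []):
            lines.append("  " * depth + s.name)
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Stand-in spans (hypothetical IDs); with a real trace, pass trace.data.spans instead.
demo_spans = [
    SimpleNamespace(name="recommend", span_id="a", parent_id=None),
    SimpleNamespace(name="build_context", span_id="b", parent_id="a"),
    SimpleNamespace(name="llm_orchestrator", span_id="c", parent_id="a"),
]
print("\n".join(render_span_tree(demo_spans)))
```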
### Span structure for a tip generation trace
A healthy `recommend` trace has 3 spans:
| Span | Type | Parent | Key attributes |
|------|------|--------|---------------|
| `recommend` | CHAIN | (root) | `agent_count`, `latency_ms`; inputs include `agent_ids` list |
| `build_context` | TOOL | recommend | `agent_count`, `task_count`, `science_destiny` |
| `llm_orchestrator` | LLM | recommend | `prompt_tokens`, `completion_tokens`, `model`, `attempts` |
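The expected shape in the table can be turned into a quick check. This is a sketch, assuming span objects expose `name`, `span_id`, and `parent_id`; `EXPECTED_PARENTS` and `check_recommend_trace` are hypothetical names, and the check covers only span names and parents, not types or attributes.

```python
from types import SimpleNamespace

# Span name -> expected parent span name (None marks the root), per the table above.
EXPECTED_PARENTS = {
    "recommend": None,
    "build_context": "recommend",
    "llm_orchestrator": "recommend",
}


def check_recommend_trace(spans):
    """Return a list of problems; an empty list means the trace has the expected shape."""
    by_id = {s.span_id: s for s in spans}
    by_name = {s.name: s for s in spans}
    problems = []
    for name, expected_parent in EXPECTED_PARENTS.items():
        span = by_name.get(name)
        if span is None:
            problems.append(f"missing span: {name}")
            continue
        parent = by_id.get(span.parent_id)
        actual_parent = parent.name if parent is not None else None
        if actual_parent != expected_parent:
            problems.append(f"{name}: parent is {actual_parent!r}, expected {expected_parent!r}")
    return problems


# Stand-in spans with hypothetical IDs; pass trace.data.spans for a real trace.
healthy = [
    SimpleNamespace(name="recommend", span_id="a", parent_id=None),
    SimpleNamespace(name="build_context", span_id="b", parent_id="a"),
    SimpleNamespace(name="llm_orchestrator", span_id="c", parent_id="a"),
]
print(check_recommend_trace(healthy))
```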
### Diagnosing "no agents in trace"
If the trace shows `agent_ids: []` and `agent_count: 0` in the root span, and the orchestrator prompt says *"No pre-computed agent context available"*, it means the recommender found zero eligible snippets at request time. Causes:
1. **Agent compute hasn't run** — no `agent_outputs` rows for this user yet
2. **Snippets expired** — TTL elapsed since last compute
3. **Eligibility filter dropped all agents** — none passed the manifest-driven check
Diagnose with:
```bash
docker exec oo-api-1 psql "$DATABASE_URL" -c \
"SELECT agent_id, computed_at, expires_at FROM agent_outputs WHERE user_id='<uid>' ORDER BY computed_at DESC LIMIT 10;"
```
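The query result maps onto the three causes roughly as sketched below. The function is hypothetical and makes assumptions: rows are `(agent_id, computed_at, expires_at)` tuples with timezone-aware, non-null timestamps, and the third cause is inferred by elimination, so it is a hint rather than proof.

```python
from datetime import datetime, timedelta, timezone


def classify_no_agents(rows, now=None):
    """Guess why a trace has no agents from agent_outputs rows for one user."""
    now = now or datetime.now(timezone.utc)
    if not rows:
        # Cause 1: nothing has ever been computed for this user.
        return "agent compute has not run"
    if all(expires_at <= now for _, _, expires_at in rows):
        # Cause 2: every returned snippet is past its TTL.
        return "snippets expired"
    # Cause 3 (by elimination): live snippets exist but none were selected.
    return "eligibility filter likely dropped all agents"


# Hypothetical rows with a fixed clock so the example is deterministic.
now = datetime(2026, 5, 11, 12, 0, tzinfo=timezone.utc)
expired = [("sleep", now - timedelta(hours=30), now - timedelta(hours=6))]
live = [("sleep", now - timedelta(hours=1), now + timedelta(hours=23))]

print(classify_no_agents([], now))
print(classify_no_agents(expired, now))
print(classify_no_agents(live, now))
```

Because the SQL query is limited to the 10 most recent rows, the "expired" verdict only reflects those rows.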
**Multi-agent tip generation pipeline (ADR-0013):**
1. Pre-compute agents (`ml/agents/<id>/`) run on a schedule, each emitting a snippet into `agent_outputs` with a per-agent TTL