docs: document MLflow trace API, span inspection, and no-agent diagnosis
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
74
CLAUDE.md
@@ -101,17 +101,83 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i
All `httpx` calls in `ml/` must use `trust_env=False` to bypass the system proxy, the same rule as for `bw` and curl. Pattern: `httpx.Client(trust_env=False, timeout=N)`.

MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient`, because MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. The auth credential is `MLFLOW_ADMIN_PASSWORD`. The MLflow REST API lives at the origin root, not under the `/mlflow` UI prefix.

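
The header and proxy rules above can be sketched with the standard library alone. This is an illustrative analogue, not the project's `httpx` code: `build_mlflow_request` and `open_without_proxy` are hypothetical names, and the default base URL assumes the container DNS name `mlflow` on port 5000.

```python
import base64
import json
import urllib.request

# Assumption: container-internal base URL. From the host shell it would be
# http://localhost:5000 instead.
MLFLOW_BASE_URL = "http://mlflow:5000"


def build_mlflow_request(path, password, body=None, base_url=MLFLOW_BASE_URL):
    """Build a request for an MLflow REST path such as '/api/2.0/mlflow/runs/search'."""
    token = base64.b64encode(f"admin:{password}".encode()).decode()
    headers = {
        "Host": "localhost",  # no port; `Host: mlflow` is rejected with 403
        "Authorization": f"Basic {token}",
    }
    data = None
    if body is not None:
        # A request with a body is sent as POST by urllib.
        data = json.dumps(body).encode()
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url + path, data=data, headers=headers)


def open_without_proxy(request, timeout=10):
    """Send the request while ignoring proxy environment variables."""
    # An empty ProxyHandler mapping disables proxy auto-detection,
    # which plays the same role as trust_env=False in httpx.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    return opener.open(request, timeout=timeout)
```

Building the request separately from sending it keeps the header logic checkable without a running MLflow instance.
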
### MLflow API versions: runs vs traces

MLflow uses **two API versions**. Use the right one or you'll get a 405:

| What | API prefix | Example |
|------|-----------|---------|
| Runs, experiments, metrics | `/api/2.0/mlflow/` | `runs/search`, `experiments/search` |
| Traces (LLM observability) | `/api/3.0/mlflow/` | `traces/{trace_id}` |

**Experiment IDs:** `3` = oO/serving. Artifacts are stored as run tags prefixed `artifact:<path>`.

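
One way to avoid mixing up the two versions is to derive the prefix from the resource name. A minimal sketch: `API_PREFIXES` and `mlflow_endpoint` are hypothetical names, and the default base URL assumes the host-shell address `http://localhost:5000`.

```python
# Hypothetical lookup: first path segment of the endpoint -> API prefix.
API_PREFIXES = {
    "runs": "/api/2.0/mlflow/",
    "experiments": "/api/2.0/mlflow/",
    "metrics": "/api/2.0/mlflow/",
    "traces": "/api/3.0/mlflow/",
}


def mlflow_endpoint(endpoint, base_url="http://localhost:5000"):
    """Return the full URL for an endpoint such as 'runs/search' or 'traces/tr-abc'."""
    resource = endpoint.split("/", 1)[0]
    if resource not in API_PREFIXES:
        raise ValueError(f"unknown MLflow resource: {resource!r}")
    return base_url.rstrip("/") + API_PREFIXES[resource] + endpoint
```

For example, `mlflow_endpoint("traces/tr-abc")` yields a `/api/3.0/` URL while `mlflow_endpoint("runs/search")` stays on `/api/2.0/`.
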
### Querying from the host shell

Always strip the proxy and pass `Host: localhost` (no port; `localhost:5000` fails the DNS-rebinding check).

```bash
# Search recent runs (experiment 3)
env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
  -X POST http://localhost:5000/api/2.0/mlflow/runs/search \
  -H "Content-Type: application/json" \
  -d '{"experiment_ids":["3"],"max_results":5,"order_by":["start_time DESC"]}'

# Get a trace by ID (note: /api/3.0/, not /api/2.0/)
# The URL is quoted so the <trace_id> placeholder is not parsed as a shell redirect.
env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
  "http://localhost:5000/api/3.0/mlflow/traces/tr-<trace_id>" | python3 -m json.tool
```

The trace response includes `trace_metadata.mlflow.traceInputs/Outputs`, `trace_metadata.mlflow.trace.sizeStats` (num_spans), and `tags.mlflow.traceName`.
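
To pull those fields out of a saved response, something like the following may help. It is a sketch under assumptions: `summarize_trace_info` is a hypothetical helper, it expects the object that directly holds `trace_metadata` and `tags` (where that object sits in the response envelope is not shown here), it treats the dotted names as literal keys, and it accepts `sizeStats` either as a JSON string or as an object.

```python
import json


def summarize_trace_info(trace_info):
    """Extract name, span count, and inputs/outputs from a trace-info dict."""
    metadata = trace_info.get("trace_metadata", {})
    tags = trace_info.get("tags", {})

    # Assumption: sizeStats may arrive JSON-encoded; decode it if so.
    size_stats = metadata.get("mlflow.trace.sizeStats", {})
    if isinstance(size_stats, str):
        size_stats = json.loads(size_stats)

    return {
        "name": tags.get("mlflow.traceName"),
        "num_spans": size_stats.get("num_spans"),
        "inputs": metadata.get("mlflow.traceInputs"),
        "outputs": metadata.get("mlflow.traceOutputs"),
    }
```

Missing keys come back as `None` rather than raising, which is convenient when comparing several traces side by side.
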
### Getting spans (Python client from inside the container)

The REST API has **no endpoint for spans**: `/api/3.0/mlflow/traces/{id}/spans` returns 404. Use the Python client inside `oo-ml-serving-1`:

```bash
docker exec oo-ml-serving-1 python3 -c "
import mlflow, json, os
mlflow.set_tracking_uri('http://mlflow:5000')
os.environ['MLFLOW_TRACKING_USERNAME'] = 'admin'
os.environ['MLFLOW_TRACKING_PASSWORD'] = os.environ.get('MLFLOW_ADMIN_PASSWORD', '')

client = mlflow.tracking.MlflowClient()
trace = client.get_trace('tr-<trace_id>')
for span in trace.data.spans:
    print(span.name, '| parent:', span.parent_id, '| status:', span.status)
    # default=str keeps json.dumps from failing on non-serializable values
    print('  inputs:', json.dumps(span.inputs, default=str)[:200])
    print('  outputs:', json.dumps(span.outputs, default=str)[:200])
    print('  attrs:', span.attributes)
"
```
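
When a trace has more than a handful of spans, an indented tree is easier to read than a flat list. A minimal sketch: `format_span_tree` is a hypothetical helper that only needs objects with `name`, `span_id`, and `parent_id` attributes, and `fake_spans` stands in for `trace.data.spans` so the snippet runs without MLflow.

```python
from collections import defaultdict
from types import SimpleNamespace


def format_span_tree(spans):
    """Return one line per span, indented two spaces per nesting level."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


# Stand-in data for illustration only; real spans come from client.get_trace(...).
fake_spans = [
    SimpleNamespace(name="recommend", span_id="a", parent_id=None),
    SimpleNamespace(name="build_context", span_id="b", parent_id="a"),
    SimpleNamespace(name="llm_orchestrator", span_id="c", parent_id="a"),
]
print("\n".join(format_span_tree(fake_spans)))
```
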
### Span structure for a tip generation trace

A healthy `recommend` trace has 3 spans:

| Span | Type | Parent | Key attributes |
|------|------|--------|---------------|
| `recommend` | CHAIN | (root) | `agent_count`, `latency_ms`; inputs include `agent_ids` list |
| `build_context` | TOOL | recommend | `agent_count`, `task_count`, `science_destiny` |
| `llm_orchestrator` | LLM | recommend | `prompt_tokens`, `completion_tokens`, `model`, `attempts` |
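
The expected shape can be turned into a quick check. This is a sketch, not an existing project utility: `EXPECTED_SPANS` and `check_recommend_trace` are hypothetical names, and spans are passed as plain `(name, span_type, parent_name)` tuples to keep the example independent of MLflow.

```python
# Expected span name -> (type, parent name); None marks the root span.
EXPECTED_SPANS = {
    "recommend": ("CHAIN", None),
    "build_context": ("TOOL", "recommend"),
    "llm_orchestrator": ("LLM", "recommend"),
}


def check_recommend_trace(spans):
    """Return a list of problems; an empty list means the trace looks healthy."""
    seen = {name: (span_type, parent) for name, span_type, parent in spans}
    problems = []
    for name, expected in EXPECTED_SPANS.items():
        if name not in seen:
            problems.append(f"missing span: {name}")
        elif seen[name] != expected:
            problems.append(f"{name}: expected {expected}, got {seen[name]}")
    for name in seen:
        if name not in EXPECTED_SPANS:
            problems.append(f"unexpected span: {name}")
    return problems
```

A trace missing `llm_orchestrator`, for instance, would report `missing span: llm_orchestrator`.
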
### Diagnosing "no agents in trace"

If the root span shows `agent_ids: []` and `agent_count: 0`, and the orchestrator prompt says *"No pre-computed agent context available"*, the recommender found zero eligible snippets at request time. Possible causes:

1. **Agent compute hasn't run**: no `agent_outputs` rows for this user yet
2. **Snippets expired**: TTL elapsed since last compute
3. **Eligibility filter dropped all agents**: none passed the manifest-driven check

Diagnose with:

```bash
docker exec oo-api-1 psql "$DATABASE_URL" -c \
  "SELECT agent_id, computed_at, expires_at FROM agent_outputs WHERE user_id='<uid>' ORDER BY computed_at DESC LIMIT 10;"
```
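
The query result maps onto the three causes roughly as follows. This is a sketch by elimination, not project code: `classify_no_agents` is a hypothetical helper, rows are assumed to be `(agent_id, computed_at, expires_at)` tuples with timezone-aware, non-null timestamps, and the third cause is inferred only because unexpired rows exist.

```python
from datetime import datetime, timedelta, timezone


def classify_no_agents(rows, now):
    """Guess why a trace had no agents, given the user's agent_outputs rows."""
    if not rows:
        return "agent compute hasn't run"
    if all(expires_at <= now for _, _, expires_at in rows):
        return "snippets expired"
    # At least one live snippet exists, so it was likely filtered out later.
    return "eligibility filter dropped all agents"


# Illustrative data only; real rows come from the psql query.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
stale_rows = [("agent_a", now - timedelta(hours=5), now - timedelta(hours=1))]
print(classify_no_agents(stale_rows, now))
```

Because the query is limited to the 10 most recent rows, treat the result as a first hint and confirm against the request time of the trace rather than the current time.
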

**Multi-agent tip generation pipeline (ADR-0013):**

1. Pre-compute agents (`ml/agents/<id>/`) run on a schedule, each emitting a snippet into `agent_outputs` with a per-agent TTL