alvis/oO

Files

alvis d4b40e2590 docs: document MLflow trace API, span inspection, and no-agent diagnosis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-11 11:23:13 +00:00

19 KiB

Raw Blame History

oO — Project Instructions

What this is

oO is a recommendation system for personal tips. It collects signals across a user's life (tasks, habits, calendar, mood, context) to build a rich profile and deliver one perfectly-timed tip — an advice or a todo — that feels like magic.

The magic is the product. Precision + timing + minimalism. The UI shows a single black page with one tip. The complexity lives behind it.

Prime directives

Modular by package, deployable by stage. Contracts live at package boundaries from day one so extraction to a service is cheap. Deploy topology evolves with real pressure (team size, scaling hotspots, language boundaries), not with wishful architecture. Phase 0 = modular monolith + Python ML sidecar. See ADR-0003.
Recommendation engine is the core. Every other module feeds it or renders its output. Design schemas, event contracts, and APIs with that in mind.
Python owns ML. Training, features, online scoring are Python (FastAPI + PyTorch/scikit + MLflow/Feast). Application code is TypeScript (Node, Next.js) unless there's a reason.
OAuth-first for identity and integrations. Never ask users for passwords or raw API keys when a delegated-auth flow exists. Store provider tokens encrypted, refresh transparently.
Privacy is a feature, not a phase. Consent capture, token revocation, and account deletion exist from the first real user. Data minimization: store the token + derivatives we need, not the raw feed.
Feel-of-magic over feature count. When in doubt, ship fewer things, polished. The tip page is a watch face.

Architecture (high level)

The tree below is logical module structure. Directory layout is stable; how many processes you deploy is a stage decision (ADR-0003).

apps/              user-facing clients
  web/             Next.js PWA — the first shipped client
  mobile-ios/      Swift/SwiftUI (Phase 3)
  mobile-android/  Kotlin/Compose (Phase 3)

services/          backend modules — each owns a contract; may share a deployable
  gateway/         BFF for clients; auth check; fan-out
  auth/            OAuth (Google, Apple, ...), sessions, JWT issuance
  profile/         user profile, preferences, consents
  integrations/    third-party connectors + token vault (Todoist first)
  recommender/     orchestration: candidates → policy → tip; feedback sink
  events/          event bus ingress + durable signal store
  notifier/        push/email/web delivery (web push from Phase 1)

packages/          shared libraries (importable across services + apps)
  shared-types/    HTTP types via OpenAPI; event types via protobuf (ADR-0005)
  sdk-js/          client SDK used by web + mobile webviews
  ui/              shared React components + design tokens

ml/                Python — separate deployable from day one
  serving/         online scorer (FastAPI), called by recommender
  features/        feature definitions + store adapter
  pipelines/       batch feature + training scripts
  registry/        MLflow model registry integration
  experiments/     assignment + A/B + bandit policies
  notebooks/       research only; never imported by production code

infra/             docker-compose (Phase 0), k3s/k8s (later), terraform, CI
docs/              architecture notes, ADRs, API specs

Phase 0 deployables: one Node process (services/* bundled via modular monolith) + one Python process (ml/serving, stubbed until M1) + Postgres + NATS. Services extract to their own process when a real reason appears: language boundary, scaling hotspot, team ownership, or SLA divergence. See ADR-0003.

Contracts between modules

HTTP (OpenAPI, in packages/shared-types/http/) — synchronous request/response. In-process today; over the network once extracted. Signatures are identical.
Events (Protocol Buffers, in packages/shared-types/events/) — durable signals + feedback. Today: in-process Bus with a onPublish bridge to NATS JetStream when NATS_URL is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (ml/serving, future feature pipelines) tail. Proto schemas (ADR-0005) live in packages/shared-types/events/oo/events/v1/; buf lint + buf breaking run in CI on every PR touching those files (.gitea/workflows/buf-check.yaml).
Do not redefine types per module. Regenerate from shared-types.

Conventions

Each module ships a README.md describing its contract, its /health story, and its extraction criteria (when it should become its own process).
One PR = one concern. Conventional-commit prefixes (feat:, fix:, chore:, docs:, refactor:).
ADRs go in docs/adr/NNNN-title.md for any decision that constrains future work.
No secrets in repo. Local dev via .env.local (gitignored), prod via the server's secret store (Vaultwarden now; k8s secrets later).
Compose profiles: core (api + web + admin), full (adds ml-serving + nats), mlops (adds MLflow), ai (adds Ollama + LiteLLM). Mix as needed. Always pass --profile <name> to build/up — without a profile, no services are selected and builds silently do nothing.
Docker rebuild: use --force-recreate on up when only env vars changed (no image rebuild needed); new env vars in .env.local are not picked up by a running container until it is recreated.
Docker rebuild gotchas:
- Never run two docker compose up --build at once — both grab the same --mount=type=cache,id=pnpm and deadlock on the API's pnpm --prod deploy step. Symptom: build sits silent for hours on [api builder 8/8]. Before starting any build, check ps aux | grep "docker compose" and kill any prior up --build (kill -9 <pid> — the wrapper bash and the docker compose binary are separate PIDs; kill the docker compose one).
- Don't add --offline to pnpm --prod deploy — pnpm's metadata cache (/root/.cache/pnpm/) is not in the /pnpm/store cache mount, so --offline fails with ERR_PNPM_NO_OFFLINE_META for transitive devDeps (e.g. vite via vitest). Leave the deploy step network-on; it works.
- All TS Dockerfiles need python3 make g++ in the base stage — better-sqlite3 rebuilds natively on install. Missing from Dockerfile.admin historically caused gyp ERR! find Python failures.
- A clean build of --profile core takes ~3 min total when the buildx cache is warm. If it's been silent for >10 min, check for the parallel-build deadlock above before assuming "still going".
Run Python agent tests: python3 -m pytest ml/agents/tests/ -x -q (tests add repo root to sys.path themselves).
Run Python feature tests: python3 -m pytest ml/features/ -x -q
ml/features/ files are Python mirrors of TS registries — TS is source of truth. Tests parse registry.ts with regex to detect drift; follow the same pattern whenever a new field is added to ProfileFeature.

Definition of done (per feature)

Code + tests merged.
Module's README.md updated.
If it changes a contract → shared-types regenerated + consumers updated.
If it changes architecture → ADR added.
Deployable via docker compose up locally.
If it touches user data → a deletion path exists and is tested.

AI stack

oO generates tips through a multi-agent pipeline (ADR-0013): pre-compute agents emit prompt snippets, an orchestrator LLM assembles them into one tip. All LLM calls route through LiteLLM at llm.alogins.net using model aliases — swapping models is a config change, not a code change.

Alias	Model	Used by
`tip-generator`	qwen2.5:1.5b (default)	`ml/serving` tip generation
`embedder`	nomic-embed-text	task clustering, dedup
`judge`	claude-haiku-4-5 (cloud, eval only)	offline sim

Env vars: LITELLM_URL (prod https://llm.alogins.net), OLLAMA_URL (Agap host, http://host.docker.internal:11434 from containers).

Ollama and LiteLLM are shared Agap services, not oO services — they live in agap_git/openai/docker-compose.yml along with langfuse (observability). oO never starts them; ml-serving just calls the alias.

All httpx calls in ml/ must use trust_env=False to bypass the system proxy — same rule as bw and curl. Pattern: httpx.Client(trust_env=False, timeout=N).

MLflow container-to-container calls: always pass host_header="localhost" to MLflowClient — MLflow's --allowed-hosts rejects Host: mlflow (the container DNS name) with 403. Auth credential is MLFLOW_ADMIN_PASSWORD. MLflow REST API lives at the origin root, not under the /mlflow UI prefix.

MLflow API versions — runs vs traces

MLflow uses two API versions — use the right one or you'll get 405:

What	API prefix	Example
Runs, experiments, metrics	`/api/2.0/mlflow/`	`runs/search`, `experiments/list`
Traces (LLM observability)	`/api/3.0/mlflow/traces/`	`traces/{trace_id}`

Experiment IDs: 3 = oO/serving. Artifacts stored as run tags prefixed artifact:<path>.

Querying from the host shell

Always strip the proxy and pass Host: localhost (no port — localhost:5000 fails the DNS-rebinding check).

# Search recent runs (experiment 3)
env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
  -X POST http://localhost:5000/api/2.0/mlflow/runs/search \
  -H "Content-Type: application/json" \
  -d '{"experiment_ids":["3"],"max_results":5,"order_by":["start_time DESC"]}'

# Get a trace by ID (note: /api/3.0/, not /api/2.0/)
env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
  http://localhost:5000/api/3.0/mlflow/traces/tr-<trace_id> | python3 -m json.tool

The trace response includes trace_metadata.mlflow.traceInputs/Outputs, trace_metadata.mlflow.trace.sizeStats (num_spans), and tags.mlflow.traceName.

Getting spans (Python client from inside the container)

The REST API has no endpoint for spans — /api/3.0/mlflow/traces/{id}/spans returns 404. Use the Python client inside oo-ml-serving-1:

docker exec oo-ml-serving-1 python3 -c "
import mlflow, json, os
mlflow.set_tracking_uri('http://mlflow:5000')
os.environ['MLFLOW_TRACKING_USERNAME'] = 'admin'
os.environ['MLFLOW_TRACKING_PASSWORD'] = os.environ.get('MLFLOW_ADMIN_PASSWORD', '')

client = mlflow.tracking.MlflowClient()
trace = client.get_trace('tr-<trace_id>')
for span in trace.data.spans:
    print(span.name, '| parent:', span.parent_id, '| status:', span.status)
    print('  inputs:', json.dumps(span.inputs)[:200])
    print('  outputs:', json.dumps(span.outputs)[:200])
    print('  attrs:', span.attributes)
"

Span structure for a tip generation trace

A healthy recommend trace has 3 spans:

Span	Type	Parent	Key attributes
`recommend`	CHAIN	(root)	`agent_count`, `latency_ms`; inputs include `agent_ids` list
`build_context`	TOOL	recommend	`agent_count`, `task_count`, `science_destiny`
`llm_orchestrator`	LLM	recommend	`prompt_tokens`, `completion_tokens`, `model`, `attempts`

Diagnosing "no agents in trace"

If the trace shows agent_ids: [] and agent_count: 0 in the root span, and the orchestrator prompt says "No pre-computed agent context available", it means the recommender found zero eligible snippets at request time. Causes:

Agent compute hasn't run — no agent_outputs rows for this user yet
Snippets expired — TTL elapsed since last compute
Eligibility filter dropped all agents — none passed the manifest-driven check

Diagnose with:

docker exec oo-api-1 psql "$DATABASE_URL" -c \
  "SELECT agent_id, computed_at, expires_at FROM agent_outputs WHERE user_id='<uid>' ORDER BY computed_at DESC LIMIT 10;"

Multi-agent tip generation pipeline (ADR-0013):

Pre-compute agents (ml/agents/<id>/) run on a schedule, each emitting a snippet into agent_outputs with a per-agent TTL
On request, recommender (TS) loads the eligible agent set (registry-driven, ADR-0014) and pulls the freshest non-expired snippets
POST /recommend in ml/serving assembles the orchestrator prompt (v4-orchestrator) and calls LiteLLM via the tip-generator alias
Returned tip is logged in tip_scores with the contributing agent set; reaction is logged for observability (no bandit reward loop)

Current phase

M1 shipped (core + admin). M2 (AI tips) in progress. See README.md for the phase roadmap and docs/architecture/ for diagrams. Work is tracked as Gitea milestones + issues on alvis/oO.

Recent completions:

ADR-0013 — multi-agent recommendation: pre-computed agent snippets + orchestrator LLM (replaces ε-greedy bandit) — 2026-05-01
LLM context assembler + tip generation scaffold (#79, #88)
Model benchmarking for tip generation (#93, #95)
Admin UX refinements: feedback consolidation, settings placement (#100–102)
ADR-0012 — ε-greedy v2 (D=12) — 2026-04-26 (now superseded by ADR-0013)
ADR-0014 complete: unified Profile schema + backfill, manifest plumbing, /api/profile read-through, registry-driven eligibility filter, inference framework + per-agent inference, legacy consent column drop — 2026-05-05
Rich per-agent inference for all four active agents (#112, #114, #115, #116) — 2026-05-06: quiet/peak hours (time-of-day), z-score baseline (momentum), p50 lateness + project realness (overdue-task), adaptive lookback + weekly/daily cycles (recent-patterns)
Semantic task clustering via nomic-embed-text + focus-area preferred_areas inference (#97, #113) — 2026-05-06: ml/agents/clustering.py, focus-area v2.0.0
Per-user feature freshness SLAs (#61) — 2026-05-06: invalidated_by mirrored into ProfileFeature; drift-detection test added
MLflow tracing added to ml/serving for all agent calls — 2026-05-06: ml/serving/mlflow_client.py; activated by MLFLOW_TRACKING_URI=http://mlflow:5000 (default in compose full profile); requires --profile mlops for the MLflow container. Issue #118 (M4) tracks removal from production critical path.

Active work (M2): (all M2 items complete — see README for M3 planning)

ADR-0014 endpoint map (as of step 6)

Endpoint	Purpose
`GET /api/profile`	Read-through: user globals + prefs (by scope) + consents + contexts
`PATCH /api/profile/prefs/:scope`	Upsert user_preferences rows (source='user')
`PATCH /api/profile/consents`	Grant / revoke consent keys
`PATCH /api/profile/contexts`	Create / activate / deactivate named contexts
`GET /api/agents/registry`	Manifest list (proxy to ml/serving; 60 s cache)
`POST /api/agents/:agentId/compute`	Internal: run agent compute for (user, agent)
`POST /agents/{agent_id}/infer` (ml/serving)	Run inference framework → `{inferred_prefs}`

Inference framework (ADR-0014 §3)

Lives in ml/agents/inference/. run_inference(manifest, history) evaluates all InferredParam entries in the manifest and returns {key: value}. Rules:

Below min_history → emit cold_start_default
infer() error → emit cold_start_default (never crashes)
Results written to user_preferences with source='inferred'; keys with source='user' are never overwritten

All five agents are at v1.2.0. Per-agent inferred params (all live in ml/agents/<name>.py):

Agent	Inferred params	Notes
`time-of-day`	`preferred_hour`, `quiet_start`, `quiet_end`, `peak_hours`, `tz`	Quiet window = longest below-baseline hour run; peak = top-quartile done hours; tz cold-start only (from auth provider)
`momentum`	`engagement_trend`, `baseline_completions_per_day`, `stdev`	Baseline = 28d rolling mean done/day; snippet uses z-score language
`overdue-task`	`lateness_tolerance_days`, `project_realness`	Tolerance = p50 lateness from TaskCompletion history; realness = project median vs global median
`recent-patterns`	`lookback_days`, `weekly_cycle`, `daily_cycle`	Lookback sized to ≥30 done events; cycles use peak-to-mean ratio; snippet hints when strength > 0.5
`focus-area`	`preferred_areas`	Top-2 project IDs by task completion count; semantic clustering via `ml/agents/clustering.py` in compute()

UserHistory carries both events: list[FeedbackEvent] and task_completions: list[TaskCompletion]. AgentInferRequest (ml/serving) accepts task_completions: list[dict] alongside feedback_history.

min_history is checked against len(history.events) (feedback events), not task_completions. Agents that infer from completions should set min_history=0 and guard inside infer().

What NOT to do

Don't copy Todoist's data into our DB. Store the OAuth token + computed features/derivatives we need, fetch raw on demand.
Don't implement auth by hand. Auth.js behind an OIDC-shaped boundary (ADR-0004); swap to a dedicated OIDC provider only when mobile ships.
Don't hardwire a recommender. The contract is POST /recommend → {tip}. Swap internals (multi-agent orchestrator today, future LLM/hybrid variants), keep contract.
Don't hardcode the agent list. The orchestrator is registry-driven (ADR-0014); adding/removing an agent is a manifest change in ml/agents/<id>/, never a recommender edit.
Don't replace a policy in one step. New policies deploy shadow-first; promoted only after offline + online agreement with the incumbent (ADR-0002).
Don't over-split processes. Extract a service when pressure demands it, not in anticipation (ADR-0003).
Don't call LLMs directly from application code. All LLM calls go through ml/serving (Python) via LITELLM_URL. The TS recommender never holds a model name.
Don't embed MLflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to o.alogins.net/mlflow, ai.alogins.net.
Don't nats.publish() directly from feature code. All publishes go through the in-process Bus (services/api/src/events/bus.ts); the NATS adapter (events/nats.ts) bridges every publish to JetStream when NATS_URL is set. This keeps subscribers, the ring-buffer tail used by the admin event viewer, and JetStream all in lockstep.

Admin app

apps/admin rewrites /api/* → $NEXT_PUBLIC_API_URL/api/* via next.config.ts. So apiFetch('/admin/stats') in apps/admin/src/lib/api.ts hits the Express backend, not a Next.js route.

Running tsc --noEmit -p apps/admin/tsconfig.json always reports Cannot find module 'next' errors — expected outside the Next.js build context; use next build for real type errors.

Auth / session pattern

Sessions use an sid cookie. Admin routes stack requireAuth (sets req.userId) then requireAdmin (checks role = 'admin' in DB). Token-based admin auth: POST /api/auth/token with { token } matching ADMIN_TOKEN env var sets the sid cookie — used by Playwright and CI.

19 KiB Raw Blame History Unescape Escape