Update docs: add benchmarks/ section, fix complex tier description

- CLAUDE.md: add benchmark commands (run_benchmark.py flags, dry-run, categories, voice benchmark)
- README.md: add benchmarks/ to Files tree; fix incorrect claim that complex tier requires /think prefix (it is auto-classified via regex and embedding similarity); fix "Complex agent (/think prefix)" heading

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:13:14 +00:00
parent bd951f943f
commit 54cb940279
2 changed files with 225 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,208 @@
+# Adolf
+
+Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────┐
+│                 CHANNEL ADAPTERS                    │
+│                                                     │
+│  [Telegram/Grammy]   [CLI]   [Voice — future]       │
+│       ↕                ↕            ↕               │
+│       └────────────────┴────────────┘               │
+│                        ↕                            │
+│          ┌─────────────────────────┐                │
+│          │   GATEWAY  (agent.py)   │                │
+│          │   FastAPI  :8000        │                │
+│          │                         │                │
+│          │  POST /message          │  ← all inbound │
+│          │  POST /chat  (legacy)   │                │
+│          │  GET  /stream/{id} SSE  │  ← token stream│
+│          │  GET  /reply/{id}  SSE  │  ← legacy poll │
+│          │  GET  /health           │                │
+│          │                         │                │
+│          │  channels.py registry   │                │
+│          │  conversation buffers   │                │
+│          └──────────┬──────────────┘                │
+│                     ↓                               │
+│          ┌──────────────────────┐                   │
+│          │    AGENT CORE        │                   │
+│          │  three-tier routing  │                   │
+│          │  VRAM management     │                   │
+│          └──────────────────────┘                   │
+│                     ↓                               │
+│          channels.deliver(session_id, channel, text)│
+│               ↓                    ↓                │
+│    telegram → POST grammy/send   cli → SSE queue    │
+└─────────────────────────────────────────────────────┘
+```
+
+## Channel Adapters
+
+| Channel | session_id | Inbound | Outbound |
+|---------|-----------|---------|---------|
+| Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
+| CLI | `cli-<user>` | POST /message directly | GET /stream/{id} SSE — Rich Live streaming |
+| Voice | `voice-<device>` | (future) | (future) |
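
Reviewer sketch: the outbound column of the table above implies a per-channel dispatch. A minimal illustration of such a registry follows. The names `register`, `DeliverFn`, and `_registry` are assumptions for illustration; only `deliver(session_id, channel, text)` is taken from the README, and the real `channels.py` may be organised differently.

```python
import asyncio
from typing import Awaitable, Callable

# A delivery callback receives (session_id, text); hypothetical signature.
DeliverFn = Callable[[str, str], Awaitable[None]]
_registry: dict[str, DeliverFn] = {}

def register(channel: str, fn: DeliverFn) -> None:
    """Associate a channel name with its outbound callback."""
    _registry[channel] = fn

async def deliver(session_id: str, channel: str, text: str) -> bool:
    """Send text through the channel's callback; False if none is registered."""
    fn = _registry.get(channel)
    if fn is None:
        return False
    await fn(session_id, text)
    return True
```

With this shape, adding a channel only requires registering one more callback; the agent core never needs to know which transport is in use.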
+
+## Unified Message Flow
+
+```
+1. Channel adapter receives message
+2. POST /message {text, session_id, channel, user_id}
+3. 202 Accepted immediately
+4. Background: run_agent_task(message, session_id, channel)
+5. Parallel IO (asyncio.gather):
+   a. _fetch_urls_from_message()       — Crawl4AI fetches any URLs in message
+   b. _retrieve_memories()             — openmemory semantic search for context
+   c. _fast_tool_runner.run_matching() — FastTools (weather, commute) if pattern matches
+6. router.route() with enriched history (url_context + fast_context + memories)
+   - fast tool match → force medium (real-time data, no point routing to light)
+   - if URL content fetched and tier=light → upgrade to medium
+7. Invoke agent for tier with url_context + memories in system prompt
+8. Token streaming:
+   - medium: astream() pushes per-token chunks to _stream_queues[session_id]; <think> blocks filtered in real time
+   - light/complex: full reply pushed as single chunk after completion
+   - _end_stream() sends [DONE] sentinel
+9. channels.deliver(session_id, channel, reply_text) — Telegram callback
+10. _store_memory() background task — stores turn in openmemory
+11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown
+```
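
Reviewer sketch: steps 5 and 6 of the flow above can be illustrated roughly as follows. The three coroutines are stub stand-ins, not the real `_fetch_urls_from_message()`, `_retrieve_memories()`, or `_fast_tool_runner.run_matching()`; `adjust_tier` and `preflight` are hypothetical names, and only the two override rules are taken from the README.

```python
import asyncio

# Hypothetical stubs standing in for the three pre-flight helpers in step 5.
async def fetch_urls(message: str) -> str:
    return "page text" if "http" in message else ""

async def retrieve_memories(message: str) -> str:
    return ""

async def run_fast_tools(message: str) -> str:
    return "weather: 3 C" if "weather" in message else ""

def adjust_tier(tier: str, url_context: str, fast_context: str) -> str:
    """Apply the step 6 overrides to the router's initial tier."""
    if fast_context:
        return "medium"   # fast tool match forces medium
    if url_context and tier == "light":
        return "medium"   # fetched URL content upgrades light to medium
    return tier

async def preflight(message: str, routed_tier: str) -> str:
    # All three IO tasks run concurrently, as in step 5.
    url_ctx, _memories, fast_ctx = await asyncio.gather(
        fetch_urls(message), retrieve_memories(message), run_fast_tools(message)
    )
    return adjust_tier(routed_tier, url_ctx, fast_ctx)
```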
+
+## Tool Handling
+
+Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime.
+
+**Complex agent:** `web_search` and `fetch_url` are defined as `langchain_core.tools.Tool` objects and passed to `create_deep_agent()`. The deepagents library runs an agentic loop (LangGraph `create_react_agent` under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch.
+
+**Medium agent (default):** `_DirectModel` makes a single `model.ainvoke(messages)` call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — `qwen3:4b` behaves unreliably when a tool array is present.
+
+**Memory tools (out-of-loop):** `add_memory` and `search_memory` are LangChain MCP tool objects (via `langchain_mcp_adapters`) but are excluded from both agents' tool lists. They are called directly — `await _memory_add_tool.ainvoke(...)` — outside the agent loop, before and after each turn.
+
+## Three-Tier Model Routing
+
+| Tier | Model | Agent | Trigger | Latency |
+|------|-------|-------|---------|---------|
+| Light | `qwen2.5:1.5b` (router answers directly) | — | Regex pre-match or 3-way embedding classifies "light" | ~2–4s |
+| Medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | `_DirectModel`: single LLM call, no tools | Default; forced on a fast tool match; light is upgraded to medium when URL content was fetched | ~10–20s |
+| Complex | `deepseek/deepseek-r1:free` via LiteLLM (`DEEPAGENTS_COMPLEX_MODEL`) | `create_deep_agent` — agentic loop with tools | Auto-classified by embedding similarity | ~30–90s |
+
+Routing is fully automatic: a regex pre-classifier runs first, then 3-way cosine similarity over pre-embedded utterance centroids (light / medium / complex). Deep research queries such as `исследуй` ("research"), `изучи все` ("study everything"), or `напиши подробный` ("write a detailed") reach the complex tier without any prefix. Use the `adolf-deep` model name to force the complex tier via the API.
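
Reviewer sketch: the embedding step of the routing described above amounts to a nearest-centroid lookup under cosine similarity. The example below uses toy 3-dimensional vectors; in practice the centroids would come from an embedding model, and the names `cosine`, `classify`, and `CENTROIDS` are assumptions rather than the contents of `router.py`.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(query_vec: list[float], centroids: dict[str, list[float]]) -> str:
    """Return the tier whose centroid is most similar to the query vector."""
    return max(centroids, key=lambda tier: cosine(query_vec, centroids[tier]))

# Toy centroids for illustration only.
CENTROIDS = {
    "light": [1.0, 0.0, 0.0],
    "medium": [0.0, 1.0, 0.0],
    "complex": [0.0, 0.0, 1.0],
}
```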
+
+## Fast Tools (`fast_tools.py`)
+
+Pre-flight tools that run concurrently with URL fetch and memory retrieval before any LLM call. Each tool has two methods:
+- `matches(message) → bool` — regex classifier; also used by `Router` to force medium tier
+- `run(message) → str` — async fetch returning a context block injected into system prompt
+
+`FastToolRunner` holds all tools. `any_matches()` is called by the Router at step 0a; `run_matching()` is called in the pre-flight `asyncio.gather` in `run_agent_task()`.
+
+| Tool | Pattern | Source | Context returned |
+|------|---------|--------|-----------------|
+| `WeatherTool` | weather/forecast/temperature/snow/rain | SearXNG `"погода Балашиха сейчас"` ("weather Balashikha now") | Current conditions in °C from Russian weather sites |
+| `CommuteTool` | commute/traffic/arrival/пробки ("traffic jams") | `routecheck:8090/api/route` (Yandex Routing API) | Drive time with/without traffic, Balashikha→Moscow |
+
+**To add a new fast tool:** subclass `FastTool` in `fast_tools.py`, implement `name`/`matches`/`run`, add an instance to `_fast_tool_runner` in `agent.py`.
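
Reviewer sketch: a new fast tool following the `name`/`matches`/`run` contract above might look like this. The `FastTool` base here is a stand-in written for the example, since the real class in `fast_tools.py` is not shown, and `ClockTool` is a purely hypothetical tool.

```python
import asyncio
import re
from abc import ABC, abstractmethod
from datetime import datetime, timezone

class FastTool(ABC):
    """Stand-in for the base class in fast_tools.py; the real one may differ."""
    name: str

    @abstractmethod
    def matches(self, message: str) -> bool: ...

    @abstractmethod
    async def run(self, message: str) -> str: ...

class ClockTool(FastTool):
    """Hypothetical tool that answers 'what time is it' style messages."""
    name = "clock"
    _pattern = re.compile(r"\b(time|clock)\b", re.IGNORECASE)

    def matches(self, message: str) -> bool:
        # Regex classifier, as the README describes for matches().
        return bool(self._pattern.search(message))

    async def run(self, message: str) -> str:
        # Returns a context block meant for injection into the system prompt.
        now = datetime.now(timezone.utc)
        return f"[clock] Current UTC time: {now:%Y-%m-%d %H:%M}"
```

An instance would then be added to `_fast_tool_runner` in `agent.py`, per the instruction above.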
+
+## routecheck Service (`routecheck/`)
+
+Local web service on port 8090. It exists because the Yandex Routing API free tier requires a web UI that uses the API.
+
+**Web UI** (`http://localhost:8090`): PIL-generated arithmetic captcha → lat/lon form → travel time result.
+
+**Internal API**: `GET /api/route?from=lat,lon&to=lat,lon&token=ROUTECHECK_TOKEN` — bypasses captcha, used by `CommuteTool`. The `ROUTECHECK_TOKEN` shared secret is set in `.env` and passed to both `routecheck` and `deepagents` containers.
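
Reviewer sketch: building a request URL for the internal API described above. `build_route_url` is a hypothetical helper, and the base URL and coordinates in the usage note are illustrative values, not taken from the repository.

```python
from urllib.parse import urlencode

def build_route_url(base: str, origin: tuple[float, float],
                    dest: tuple[float, float], token: str) -> str:
    """Compose GET /api/route?from=lat,lon&to=lat,lon&token=... for routecheck."""
    params = {
        "from": f"{origin[0]},{origin[1]}",
        "to": f"{dest[0]},{dest[1]}",
        "token": token,
    }
    # safe="," keeps the lat,lon comma unescaped, matching the documented shape.
    return f"{base}/api/route?{urlencode(params, safe=',')}"
```

For example, `build_route_url("http://routecheck:8090", (55.8, 37.95), (55.75, 37.62), token)` yields a URL in the documented `from=lat,lon&to=lat,lon&token=...` form.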
+
+Yandex API calls are routed through the host HTTPS proxy (`host.docker.internal:56928`) since the container has no direct external internet access.
+
+**Requires** `.env`: `YANDEX_ROUTING_KEY` (free from `developer.tech.yandex.ru`) + `ROUTECHECK_TOKEN`.
+
+## Crawl4AI Integration
+
+Crawl4AI runs as a Docker service (`crawl4ai:11235`) providing JS-rendered, bot-bypass page fetching.
+
+**Pre-routing fetch (all tiers):**
+- `_URL_RE` detects `https?://` URLs in any incoming message
+- `_crawl4ai_fetch_async()` uses `httpx.AsyncClient` to POST `{urls: [...]}` to `/crawl`
+- Up to 3 URLs fetched concurrently via `asyncio.gather`
+- Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts
+- If fetch succeeds and router returns light → tier upgraded to medium
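
Reviewer sketch: the URL detection and truncation limits in the list above could be expressed as below. `URL_RE` is an assumed pattern (the real `_URL_RE` is not shown), and the `[url]` header format of the context block is invented for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # assumed pattern; the real _URL_RE may differ
MAX_URLS = 3      # up to 3 URLs fetched per message
MAX_CHARS = 3000  # up to 3000 chars kept per URL

def extract_urls(message: str) -> list[str]:
    """Return at most MAX_URLS URLs found in the message, in order."""
    return URL_RE.findall(message)[:MAX_URLS]

def build_url_context(pages: dict[str, str]) -> str:
    """Join truncated page texts into one context block, skipping empty fetches."""
    blocks = [f"[{url}]\n{text[:MAX_CHARS]}" for url, text in pages.items() if text]
    return "\n\n".join(blocks)
```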
+
+**Complex agent tools:**
+- `web_search`: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page text
+- `fetch_url`: Crawl4AI single-URL fetch for any specific URL
+
+## Memory Pipeline
+
+openmemory runs as a FastMCP server (`openmemory:8765`) backed by mem0 + Qdrant + nomic-embed-text.
+
+**Retrieval (before routing):** `_retrieve_memories()` calls `search_memory` MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit.
+
+**Storage (after reply):** `_store_memory()` runs as an asyncio background task, calling `add_memory` with `"User: ...\nAssistant: ..."`. The extraction LLM (`qwen2.5:1.5b` on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt.
+
+Memory tools (`add_memory`, `search_memory`, `get_all_memories`) are excluded from agent tool lists — memory management happens outside the agent loop.
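
Reviewer sketch: the retrieval threshold and the stored turn format described above, in isolation. The result shape (dicts with `memory` and `score` keys) is an assumption about what `search_memory` returns, and both function names are hypothetical.

```python
def filter_memories(results: list[dict], threshold: float = 0.5) -> list[str]:
    """Keep only memories whose similarity score meets the threshold."""
    return [r["memory"] for r in results if r.get("score", 0.0) >= threshold]

def format_turn(user_text: str, reply: str) -> str:
    """Build the 'User: ...\\nAssistant: ...' string passed to add_memory."""
    return f"User: {user_text}\nAssistant: {reply}"
```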
+
+## VRAM Management
+
+GTX 1070, 8 GB. If CUDA init fails, models load on CPU and Ollama must be restarted.
+
+1. Flush explicitly (`keep_alive=0`) before loading qwen3:8b
+2. Verify eviction by polling `/api/ps` (15s timeout) before proceeding
+3. Fallback: if the poll times out, run the medium agent instead
+4. Post-complex: flush the 8b model, then pre-warm medium + router
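
Reviewer sketch: steps 2 and 3 above describe a poll-with-timeout followed by a fallback. The version below takes the "which models are loaded" query as an injected callable, so it makes no claim about the actual `/api/ps` response format; `wait_for_eviction`, `list_loaded`, and `pick_agent` are hypothetical names, not the `VRAMManager` API.

```python
import time
from typing import Callable

def wait_for_eviction(list_loaded: Callable[[], list[str]], model: str,
                      timeout: float = 15.0, interval: float = 0.5) -> bool:
    """Poll until `model` is no longer loaded; False if the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if model not in list_loaded():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

def pick_agent(evicted: bool) -> str:
    """Step 3 fallback: run the medium agent if eviction was not confirmed."""
    return "complex" if evicted else "medium"
```

Injecting the query function keeps the polling logic testable without a running Ollama instance.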
+
+## Session ID Convention
+
+- Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
+- CLI: `cli-<username>` (e.g. `cli-alvis`)
+
+Conversation history is keyed by session_id (5-turn buffer).
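
Reviewer sketch: the session ID convention and the 5-turn buffer can be illustrated with a bounded deque per session. All names here (`session_id`, `remember_turn`, `history`, `_buffers`) are assumptions; the README only states the ID format and the buffer size.

```python
from collections import defaultdict, deque

MAX_TURNS = 5  # 5-turn buffer per session
_buffers: dict[str, deque] = defaultdict(lambda: deque(maxlen=MAX_TURNS))

def session_id(channel: str, ident: str) -> str:
    """Build 'tg-<chat_id>' or 'cli-<username>' from a channel and identifier."""
    prefix = {"telegram": "tg", "cli": "cli"}[channel]
    return f"{prefix}-{ident}"

def remember_turn(sid: str, user_text: str, reply: str) -> None:
    """Append a turn; the deque silently drops the oldest beyond MAX_TURNS."""
    _buffers[sid].append((user_text, reply))

def history(sid: str) -> list[tuple[str, str]]:
    """Return the buffered turns for a session, oldest first."""
    return list(_buffers[sid])
```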
+
+## Files
+
+```
+adolf/
+├── docker-compose.yml      Services: deepagents, openmemory, grammy, crawl4ai, routecheck, cli
+├── Dockerfile              deepagents container (Python 3.12)
+├── Dockerfile.cli          CLI container (python:3.12-slim + rich)
+├── agent.py                FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, fast tools, memory pipeline
+├── fast_tools.py           FastTool base, FastToolRunner, WeatherTool, CommuteTool
+├── channels.py             Channel registry + deliver() + pending_replies
+├── router.py               Router class: regex + embedding tier classification, FastToolRunner integration
+├── vram_manager.py         VRAMManager — flush/prewarm/poll Ollama VRAM
+├── agent_factory.py        _DirectModel (medium) / create_deep_agent (complex)
+├── cli.py                  Interactive CLI REPL — Rich Live streaming + Markdown render
+├── wiki_research.py        Batch wiki research pipeline (uses /message + SSE)
+├── benchmarks/
+│   ├── run_benchmark.py    Routing accuracy benchmark — 120 queries across 3 tiers
+│   ├── run_voice_benchmark.py  Voice path benchmark
+│   ├── benchmark.json      Query dataset (gitignored)
+│   └── results_latest.json Last run results (gitignored)
+├── .env                    TELEGRAM_BOT_TOKEN, ROUTECHECK_TOKEN, YANDEX_ROUTING_KEY (not committed)
+├── routecheck/
+│   ├── app.py              FastAPI: image captcha + /api/route Yandex proxy
+│   └── Dockerfile
+├── tests/
+│   ├── integration/        Standalone integration test scripts (common.py + test_*.py)
+│   └── use_cases/          Claude Code skill markdown files — Claude acts as user + evaluator
+├── openmemory/
+│   ├── server.py           FastMCP + mem0: add_memory, search_memory, get_all_memories
+│   └── Dockerfile
+└── grammy/
+    ├── bot.mjs             grammY Telegram bot + POST /send HTTP endpoint
+    ├── package.json
+    └── Dockerfile
+```
+
+## External Services (host ports, from openai/ stack)
+
+| Service | Host Port | Role |
+|---------|-----------|------|
+| LiteLLM | 4000 | LLM proxy — all inference goes through here (`LITELLM_URL` env var) |
+| Ollama GPU | 11436 | GPU inference backend + VRAM management (direct) + memory extraction |
+| Ollama CPU | 11435 | nomic-embed-text embeddings for openmemory |
+| Langfuse | 3200 | LLM observability — traces all requests via LiteLLM callbacks |
+| Qdrant | 6333 | Vector store for memories |
+| SearXNG | 11437 | Web search (used by `web_search` tool) |