From 54cb940279fa25c34ceb1cf5ba58b0202f567e9e Mon Sep 17 00:00:00 2001 From: Alvis Date: Tue, 24 Mar 2026 02:13:14 +0000 Subject: [PATCH] Update docs: add benchmarks/ section, fix complex tier description MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - CLAUDE.md: add benchmark commands (run_benchmark.py flags, dry-run, categories, voice benchmark) - README.md: add benchmarks/ to Files tree; fix incorrect claim that complex tier requires /think prefix — it is auto-classified via regex and embedding similarity; fix "Complex agent (/think prefix)" heading Co-Authored-By: Claude Sonnet 4.6 --- CLAUDE.md | 18 ++++- README.md | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 225 insertions(+), 1 deletion(-) create mode 100644 README.md diff --git a/CLAUDE.md b/CLAUDE.md index e0b51d7..f43ee4a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -18,8 +18,24 @@ python3 test_routing.py [--easy-only|--medium-only|--hard-only] # Use case tests — read the .md file and follow its steps as Claude Code # example: read tests/use_cases/weather_now.md and execute it + +# Routing benchmark — measures tier classification accuracy across 120 queries +# Run from benchmarks/ — Adolf must be running. DO NOT run during active use (holds GPU). +cd benchmarks +python3 run_benchmark.py # full run (120 queries) +python3 run_benchmark.py --tier light # light tier only (30 queries) +python3 run_benchmark.py --tier medium # medium tier only (50 queries) +python3 run_benchmark.py --tier complex --dry-run # complex tier, medium model (no API cost) +python3 run_benchmark.py --category smart_home_control +python3 run_benchmark.py --ids 1,2,3 +python3 run_benchmark.py --list-categories + +# Voice benchmark +python3 run_voice_benchmark.py + +# benchmark.json (dataset) and results_latest.json are gitignored — not committed ``` ## Architecture -@ARCHITECTURE.md +@README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..93b54e1 --- /dev/null +++ b/README.md @@ -0,0 +1,208 @@ +# Adolf + +Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management. + +## Architecture + +``` +┌─────────────────────────────────────────────────────┐ +│ CHANNEL ADAPTERS │ +│ │ +│ [Telegram/Grammy] [CLI] [Voice — future] │ +│ ↕ ↕ ↕ │ +│ └────────────────┴────────────┘ │ +│ ↕ │ +│ ┌─────────────────────────┐ │ +│ │ GATEWAY (agent.py) │ │ +│ │ FastAPI :8000 │ │ +│ │ │ │ +│ │ POST /message │ ← all inbound │ +│ │ POST /chat (legacy) │ │ +│ │ GET /stream/{id} SSE │ ← token stream│ +│ │ GET /reply/{id} SSE │ ← legacy poll │ +│ │ GET /health │ │ +│ │ │ │ +│ │ channels.py registry │ │ +│ │ conversation buffers │ │ +│ └──────────┬──────────────┘ │ +│ ↓ │ +│ ┌──────────────────────┐ │ +│ │ AGENT CORE │ │ +│ │ three-tier routing │ │ +│ │ VRAM management │ │ +│ └──────────────────────┘ │ +│ ↓ │ +│ channels.deliver(session_id, channel, text)│ +│ ↓ ↓ │ +│ telegram → POST grammy/send cli → SSE queue │ +└─────────────────────────────────────────────────────┘ +``` + +## Channel Adapters + +| Channel | session_id | Inbound | Outbound | +|---------|-----------|---------|---------| +| Telegram | `tg-` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send | +| CLI | `cli-` | POST /message directly | GET /stream/{id} SSE — Rich Live streaming | +| Voice | `voice-` | (future) | (future) | + +## Unified Message Flow + +``` +1. Channel adapter receives message +2. POST /message {text, session_id, channel, user_id} +3. 202 Accepted immediately +4. Background: run_agent_task(message, session_id, channel) +5. Parallel IO (asyncio.gather): + a. _fetch_urls_from_message() — Crawl4AI fetches any URLs in message + b. _retrieve_memories() — openmemory semantic search for context + c. _fast_tool_runner.run_matching() — FastTools (weather, commute) if pattern matches +6. router.route() with enriched history (url_context + fast_context + memories) + - fast tool match → force medium (real-time data, no point routing to light) + - if URL content fetched and tier=light → upgrade to medium +7. Invoke agent for tier with url_context + memories in system prompt +8. Token streaming: + - medium: astream() pushes per-token chunks to _stream_queues[session_id]; blocks filtered in real time + - light/complex: full reply pushed as single chunk after completion + - _end_stream() sends [DONE] sentinel +9. channels.deliver(session_id, channel, reply_text) — Telegram callback +10. _store_memory() background task — stores turn in openmemory +11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown +``` + +## Tool Handling + +Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime. + +**Complex agent:** `web_search` and `fetch_url` are defined as `langchain_core.tools.Tool` objects and passed to `create_deep_agent()`. The deepagents library runs an agentic loop (LangGraph `create_react_agent` under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch. + +**Medium agent (default):** `_DirectModel` makes a single `model.ainvoke(messages)` call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — `qwen3:4b` behaves unreliably when a tool array is present. + +**Memory tools (out-of-loop):** `add_memory` and `search_memory` are LangChain MCP tool objects (via `langchain_mcp_adapters`) but are excluded from both agents' tool lists. They are called directly — `await _memory_add_tool.ainvoke(...)` — outside the agent loop, before and after each turn. + +## Three-Tier Model Routing + +| Tier | Model | Agent | Trigger | Latency | +|------|-------|-------|---------|---------| +| Light | `qwen2.5:1.5b` (router answers directly) | — | Regex pre-match or 3-way embedding classifies "light" | ~2–4s | +| Medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | `_DirectModel` — single LLM call, no tools | Default; also forced when message contains URLs | ~10–20s | +| Complex | `deepseek/deepseek-r1:free` via LiteLLM (`DEEPAGENTS_COMPLEX_MODEL`) | `create_deep_agent` — agentic loop with tools | Auto-classified by embedding similarity | ~30–90s | + +Routing is fully automatic via 3-way cosine similarity over pre-embedded utterance centroids (light / medium / complex). No prefix required. Use `adolf-deep` model name to force complex tier via API. + +Complex tier is reached automatically for deep research queries — `исследуй`, `изучи все`, `напиши подробный`, etc. — via regex pre-classifier and embedding similarity. No prefix required. Use `adolf-deep` model name to force it via API. + +## Fast Tools (`fast_tools.py`) + +Pre-flight tools that run concurrently with URL fetch and memory retrieval before any LLM call. Each tool has two methods: +- `matches(message) → bool` — regex classifier; also used by `Router` to force medium tier +- `run(message) → str` — async fetch returning a context block injected into system prompt + +`FastToolRunner` holds all tools. `any_matches()` is called by the Router at step 0a; `run_matching()` is called in the pre-flight `asyncio.gather` in `run_agent_task()`. + +| Tool | Pattern | Source | Context returned | +|------|---------|--------|-----------------| +| `WeatherTool` | weather/forecast/temperature/snow/rain | SearXNG `"погода Балашиха сейчас"` | Current conditions in °C from Russian weather sites | +| `CommuteTool` | commute/traffic/arrival/пробки | `routecheck:8090/api/route` (Yandex Routing API) | Drive time with/without traffic, Balashikha→Moscow | + +**To add a new fast tool:** subclass `FastTool` in `fast_tools.py`, implement `name`/`matches`/`run`, add an instance to `_fast_tool_runner` in `agent.py`. + +## routecheck Service (`routecheck/`) + +Local web service on port 8090. Exists because Yandex Routing API free tier requires a web UI that uses the API. + +**Web UI** (`http://localhost:8090`): PIL-generated arithmetic captcha → lat/lon form → travel time result. + +**Internal API**: `GET /api/route?from=lat,lon&to=lat,lon&token=ROUTECHECK_TOKEN` — bypasses captcha, used by `CommuteTool`. The `ROUTECHECK_TOKEN` shared secret is set in `.env` and passed to both `routecheck` and `deepagents` containers. + +Yandex API calls are routed through the host HTTPS proxy (`host.docker.internal:56928`) since the container has no direct external internet access. + +**Requires** `.env`: `YANDEX_ROUTING_KEY` (free from `developer.tech.yandex.ru`) + `ROUTECHECK_TOKEN`. + +## Crawl4AI Integration + +Crawl4AI runs as a Docker service (`crawl4ai:11235`) providing JS-rendered, bot-bypass page fetching. + +**Pre-routing fetch (all tiers):** +- `_URL_RE` detects `https?://` URLs in any incoming message +- `_crawl4ai_fetch_async()` uses `httpx.AsyncClient` to POST `{urls: [...]}` to `/crawl` +- Up to 3 URLs fetched concurrently via `asyncio.gather` +- Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts +- If fetch succeeds and router returns light → tier upgraded to medium + +**Complex agent tools:** +- `web_search`: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page text +- `fetch_url`: Crawl4AI single-URL fetch for any specific URL + +## Memory Pipeline + +openmemory runs as a FastMCP server (`openmemory:8765`) backed by mem0 + Qdrant + nomic-embed-text. + +**Retrieval (before routing):** `_retrieve_memories()` calls `search_memory` MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit. + +**Storage (after reply):** `_store_memory()` runs as an asyncio background task, calling `add_memory` with `"User: ...\nAssistant: ..."`. The extraction LLM (`qwen2.5:1.5b` on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt. + +Memory tools (`add_memory`, `search_memory`, `get_all_memories`) are excluded from agent tool lists — memory management happens outside the agent loop. + +## VRAM Management + +GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU). + +1. Flush explicitly before loading qwen3:8b (`keep_alive=0`) +2. Verify eviction via `/api/ps` poll (15s timeout) before proceeding +3. Fallback: timeout → run medium agent instead +4. Post-complex: flush 8b, pre-warm medium + router + +## Session ID Convention + +- Telegram: `tg-` (e.g. `tg-346967270`) +- CLI: `cli-` (e.g. `cli-alvis`) + +Conversation history is keyed by session_id (5-turn buffer). + +## Files + +``` +adolf/ +├── docker-compose.yml Services: deepagents, openmemory, grammy, crawl4ai, routecheck, cli +├── Dockerfile deepagents container (Python 3.12) +├── Dockerfile.cli CLI container (python:3.12-slim + rich) +├── agent.py FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, fast tools, memory pipeline +├── fast_tools.py FastTool base, FastToolRunner, WeatherTool, CommuteTool +├── channels.py Channel registry + deliver() + pending_replies +├── router.py Router class — regex + LLM tier classification, FastToolRunner integration +├── vram_manager.py VRAMManager — flush/prewarm/poll Ollama VRAM +├── agent_factory.py _DirectModel (medium) / create_deep_agent (complex) +├── cli.py Interactive CLI REPL — Rich Live streaming + Markdown render +├── wiki_research.py Batch wiki research pipeline (uses /message + SSE) +├── benchmarks/ +│ ├── run_benchmark.py Routing accuracy benchmark — 120 queries across 3 tiers +│ ├── run_voice_benchmark.py Voice path benchmark +│ ├── benchmark.json Query dataset (gitignored) +│ └── results_latest.json Last run results (gitignored) +├── .env TELEGRAM_BOT_TOKEN, ROUTECHECK_TOKEN, YANDEX_ROUTING_KEY (not committed) +├── routecheck/ +│ ├── app.py FastAPI: image captcha + /api/route Yandex proxy +│ └── Dockerfile +├── tests/ +│ ├── integration/ Standalone integration test scripts (common.py + test_*.py) +│ └── use_cases/ Claude Code skill markdown files — Claude acts as user + evaluator +├── openmemory/ +│ ├── server.py FastMCP + mem0: add_memory, search_memory, get_all_memories +│ └── Dockerfile +└── grammy/ + ├── bot.mjs grammY Telegram bot + POST /send HTTP endpoint + ├── package.json + └── Dockerfile +``` + +## External Services (host ports, from openai/ stack) + +| Service | Host Port | Role | +|---------|-----------|------| +| LiteLLM | 4000 | LLM proxy — all inference goes through here (`LITELLM_URL` env var) | +| Ollama GPU | 11436 | GPU inference backend + VRAM management (direct) + memory extraction | +| Ollama CPU | 11435 | nomic-embed-text embeddings for openmemory | +| Langfuse | 3200 | LLM observability — traces all requests via LiteLLM callbacks | +| Qdrant | 6333 | Vector store for memories | +| SearXNG | 11437 | Web search (used by `web_search` tool) |