Files
adolf/README.md
Alvis 54cb940279 Update docs: add benchmarks/ section, fix complex tier description
- CLAUDE.md: add benchmark commands (run_benchmark.py flags, dry-run,
  categories, voice benchmark)
- README.md: add benchmarks/ to Files tree; fix incorrect claim that
  complex tier requires /think prefix — it is auto-classified via regex
  and embedding similarity; fix "Complex agent (/think prefix)" heading

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:13:14 +00:00

209 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Adolf
Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
## Architecture
```
┌─────────────────────────────────────────────────────┐
│ CHANNEL ADAPTERS │
│ │
│ [Telegram/Grammy] [CLI] [Voice — future] │
│ ↕ ↕ ↕ │
│ └────────────────┴────────────┘ │
│ ↕ │
│ ┌─────────────────────────┐ │
│ │ GATEWAY (agent.py) │ │
│ │ FastAPI :8000 │ │
│ │ │ │
│ │ POST /message │ ← all inbound │
│ │ POST /chat (legacy) │ │
│ │ GET /stream/{id} SSE │ ← token stream│
│ │ GET /reply/{id} SSE │ ← legacy poll │
│ │ GET /health │ │
│ │ │ │
│ │ channels.py registry │ │
│ │ conversation buffers │ │
│ └──────────┬──────────────┘ │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ AGENT CORE │ │
│ │ three-tier routing │ │
│ │ VRAM management │ │
│ └──────────────────────┘ │
│ ↓ │
│ channels.deliver(session_id, channel, text)│
│ ↓ ↓ │
│ telegram → POST grammy/send cli → SSE queue │
└─────────────────────────────────────────────────────┘
```
## Channel Adapters
| Channel | session_id | Inbound | Outbound |
|---------|-----------|---------|---------|
| Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
| CLI | `cli-<user>` | POST /message directly | GET /stream/{id} SSE — Rich Live streaming |
| Voice | `voice-<device>` | (future) | (future) |
## Unified Message Flow
```
1. Channel adapter receives message
2. POST /message {text, session_id, channel, user_id}
3. 202 Accepted immediately
4. Background: run_agent_task(message, session_id, channel)
5. Parallel IO (asyncio.gather):
a. _fetch_urls_from_message() — Crawl4AI fetches any URLs in message
b. _retrieve_memories() — openmemory semantic search for context
c. _fast_tool_runner.run_matching() — FastTools (weather, commute) if pattern matches
6. router.route() with enriched history (url_context + fast_context + memories)
- fast tool match → force medium (real-time data, no point routing to light)
- if URL content fetched and tier=light → upgrade to medium
7. Invoke agent for tier with url_context + memories in system prompt
8. Token streaming:
- medium: astream() pushes per-token chunks to _stream_queues[session_id]; <think> blocks filtered in real time
- light/complex: full reply pushed as single chunk after completion
- _end_stream() sends [DONE] sentinel
9. channels.deliver(session_id, channel, reply_text) — Telegram callback
10. _store_memory() background task — stores turn in openmemory
11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown
```
## Tool Handling
Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime.
**Complex agent:** `web_search` and `fetch_url` are defined as `langchain_core.tools.Tool` objects and passed to `create_deep_agent()`. The deepagents library runs an agentic loop (LangGraph `create_react_agent` under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch.
**Medium agent (default):** `_DirectModel` makes a single `model.ainvoke(messages)` call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — `qwen3:4b` behaves unreliably when a tool array is present.
**Memory tools (out-of-loop):** `add_memory` and `search_memory` are LangChain MCP tool objects (via `langchain_mcp_adapters`) but are excluded from both agents' tool lists. They are called directly — `await _memory_add_tool.ainvoke(...)` — outside the agent loop, before and after each turn.
## Three-Tier Model Routing
| Tier | Model | Agent | Trigger | Latency |
|------|-------|-------|---------|---------|
| Light | `qwen2.5:1.5b` (router answers directly) | — | Regex pre-match or 3-way embedding classifies "light" | ~24s |
| Medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | `_DirectModel` — single LLM call, no tools | Default; also forced when message contains URLs | ~1020s |
| Complex | `deepseek/deepseek-r1:free` via LiteLLM (`DEEPAGENTS_COMPLEX_MODEL`) | `create_deep_agent` — agentic loop with tools | Auto-classified by embedding similarity | ~3090s |
Routing is fully automatic via 3-way cosine similarity over pre-embedded utterance centroids (light / medium / complex). No prefix required. Use `adolf-deep` model name to force complex tier via API.
Complex tier is reached automatically for deep research queries — `исследуй`, `изучи все`, `напиши подробный`, etc. — via regex pre-classifier and embedding similarity. No prefix required. Use `adolf-deep` model name to force it via API.
## Fast Tools (`fast_tools.py`)
Pre-flight tools that run concurrently with URL fetch and memory retrieval before any LLM call. Each tool has two methods:
- `matches(message) → bool` — regex classifier; also used by `Router` to force medium tier
- `run(message) → str` — async fetch returning a context block injected into system prompt
`FastToolRunner` holds all tools. `any_matches()` is called by the Router at step 0a; `run_matching()` is called in the pre-flight `asyncio.gather` in `run_agent_task()`.
| Tool | Pattern | Source | Context returned |
|------|---------|--------|-----------------|
| `WeatherTool` | weather/forecast/temperature/snow/rain | SearXNG `"погода Балашиха сейчас"` | Current conditions in °C from Russian weather sites |
| `CommuteTool` | commute/traffic/arrival/пробки | `routecheck:8090/api/route` (Yandex Routing API) | Drive time with/without traffic, Balashikha→Moscow |
**To add a new fast tool:** subclass `FastTool` in `fast_tools.py`, implement `name`/`matches`/`run`, add an instance to `_fast_tool_runner` in `agent.py`.
## routecheck Service (`routecheck/`)
Local web service on port 8090. Exists because Yandex Routing API free tier requires a web UI that uses the API.
**Web UI** (`http://localhost:8090`): PIL-generated arithmetic captcha → lat/lon form → travel time result.
**Internal API**: `GET /api/route?from=lat,lon&to=lat,lon&token=ROUTECHECK_TOKEN` — bypasses captcha, used by `CommuteTool`. The `ROUTECHECK_TOKEN` shared secret is set in `.env` and passed to both `routecheck` and `deepagents` containers.
Yandex API calls are routed through the host HTTPS proxy (`host.docker.internal:56928`) since the container has no direct external internet access.
**Requires** `.env`: `YANDEX_ROUTING_KEY` (free from `developer.tech.yandex.ru`) + `ROUTECHECK_TOKEN`.
## Crawl4AI Integration
Crawl4AI runs as a Docker service (`crawl4ai:11235`) providing JS-rendered, bot-bypass page fetching.
**Pre-routing fetch (all tiers):**
- `_URL_RE` detects `https?://` URLs in any incoming message
- `_crawl4ai_fetch_async()` uses `httpx.AsyncClient` to POST `{urls: [...]}` to `/crawl`
- Up to 3 URLs fetched concurrently via `asyncio.gather`
- Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts
- If fetch succeeds and router returns light → tier upgraded to medium
**Complex agent tools:**
- `web_search`: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page text
- `fetch_url`: Crawl4AI single-URL fetch for any specific URL
## Memory Pipeline
openmemory runs as a FastMCP server (`openmemory:8765`) backed by mem0 + Qdrant + nomic-embed-text.
**Retrieval (before routing):** `_retrieve_memories()` calls `search_memory` MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit.
**Storage (after reply):** `_store_memory()` runs as an asyncio background task, calling `add_memory` with `"User: ...\nAssistant: ..."`. The extraction LLM (`qwen2.5:1.5b` on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt.
Memory tools (`add_memory`, `search_memory`, `get_all_memories`) are excluded from agent tool lists — memory management happens outside the agent loop.
## VRAM Management
GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU).
1. Flush explicitly before loading qwen3:8b (`keep_alive=0`)
2. Verify eviction via `/api/ps` poll (15s timeout) before proceeding
3. Fallback: timeout → run medium agent instead
4. Post-complex: flush 8b, pre-warm medium + router
## Session ID Convention
- Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
- CLI: `cli-<username>` (e.g. `cli-alvis`)
Conversation history is keyed by session_id (5-turn buffer).
## Files
```
adolf/
├── docker-compose.yml Services: deepagents, openmemory, grammy, crawl4ai, routecheck, cli
├── Dockerfile deepagents container (Python 3.12)
├── Dockerfile.cli CLI container (python:3.12-slim + rich)
├── agent.py FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, fast tools, memory pipeline
├── fast_tools.py FastTool base, FastToolRunner, WeatherTool, CommuteTool
├── channels.py Channel registry + deliver() + pending_replies
├── router.py Router class — regex + LLM tier classification, FastToolRunner integration
├── vram_manager.py VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py _DirectModel (medium) / create_deep_agent (complex)
├── cli.py Interactive CLI REPL — Rich Live streaming + Markdown render
├── wiki_research.py Batch wiki research pipeline (uses /message + SSE)
├── benchmarks/
│ ├── run_benchmark.py Routing accuracy benchmark — 120 queries across 3 tiers
│ ├── run_voice_benchmark.py Voice path benchmark
│ ├── benchmark.json Query dataset (gitignored)
│ └── results_latest.json Last run results (gitignored)
├── .env TELEGRAM_BOT_TOKEN, ROUTECHECK_TOKEN, YANDEX_ROUTING_KEY (not committed)
├── routecheck/
│ ├── app.py FastAPI: image captcha + /api/route Yandex proxy
│ └── Dockerfile
├── tests/
│ ├── integration/ Standalone integration test scripts (common.py + test_*.py)
│ └── use_cases/ Claude Code skill markdown files — Claude acts as user + evaluator
├── openmemory/
│ ├── server.py FastMCP + mem0: add_memory, search_memory, get_all_memories
│ └── Dockerfile
└── grammy/
├── bot.mjs grammY Telegram bot + POST /send HTTP endpoint
├── package.json
└── Dockerfile
```
## External Services (host ports, from openai/ stack)
| Service | Host Port | Role |
|---------|-----------|------|
| LiteLLM | 4000 | LLM proxy — all inference goes through here (`LITELLM_URL` env var) |
| Ollama GPU | 11436 | GPU inference backend + VRAM management (direct) + memory extraction |
| Ollama CPU | 11435 | nomic-embed-text embeddings for openmemory |
| Langfuse | 3200 | LLM observability — traces all requests via LiteLLM callbacks |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search (used by `web_search` tool) |