In no_inference mode only the routing decision matters — fetching memories and URLs adds latency without affecting the classification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adolf
Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
Architecture
┌─────────────────────────────────────────────────────┐
│ CHANNEL ADAPTERS │
│ │
│ [Telegram/Grammy] [CLI] [Voice — future] │
│ ↕ ↕ ↕ │
│ └────────────────┴────────────┘ │
│ ↕ │
│ ┌─────────────────────────┐ │
│ │ GATEWAY (agent.py) │ │
│ │ FastAPI :8000 │ │
│ │ │ │
│ │ POST /message │ ← all inbound │
│ │ POST /chat (legacy) │ │
│ │ GET /stream/{id} SSE │ ← token stream│
│ │ GET /reply/{id} SSE │ ← legacy poll │
│ │ GET /health │ │
│ │ │ │
│ │ channels.py registry │ │
│ │ conversation buffers │ │
│ └──────────┬──────────────┘ │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ AGENT CORE │ │
│ │ three-tier routing │ │
│ │ VRAM management │ │
│ └──────────────────────┘ │
│ ↓ │
│ channels.deliver(session_id, channel, text)│
│ ↓ ↓ │
│ telegram → POST grammy/send cli → SSE queue │
└─────────────────────────────────────────────────────┘
Channel Adapters
| Channel | session_id | Inbound | Outbound |
|---|---|---|---|
| Telegram | tg-<chat_id> |
Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
| CLI | cli-<user> |
POST /message directly | GET /stream/{id} SSE — Rich Live streaming |
| Voice | voice-<device> |
(future) | (future) |
Unified Message Flow
1. Channel adapter receives message
2. POST /message {text, session_id, channel, user_id}
3. 202 Accepted immediately
4. Background: run_agent_task(message, session_id, channel)
5. Parallel IO (asyncio.gather):
a. _fetch_urls_from_message() — Crawl4AI fetches any URLs in message
b. _retrieve_memories() — openmemory semantic search for context
c. _fast_tool_runner.run_matching() — FastTools (weather, commute) if pattern matches
6. router.route() with enriched history (url_context + fast_context + memories)
- fast tool match → force medium (real-time data, no point routing to light)
- if URL content fetched and tier=light → upgrade to medium
7. Invoke agent for tier with url_context + memories in system prompt
8. Token streaming:
- medium: astream() pushes per-token chunks to _stream_queues[session_id]; <think> blocks filtered in real time
- light/complex: full reply pushed as single chunk after completion
- _end_stream() sends [DONE] sentinel
9. channels.deliver(session_id, channel, reply_text) — Telegram callback
10. _store_memory() background task — stores turn in openmemory
11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown
Tool Handling
Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime.
Complex agent: web_search and fetch_url are defined as langchain_core.tools.Tool objects and passed to create_deep_agent(). The deepagents library runs an agentic loop (LangGraph create_react_agent under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch.
Medium agent (default): _DirectModel makes a single model.ainvoke(messages) call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — qwen3:4b behaves unreliably when a tool array is present.
Memory tools (out-of-loop): add_memory and search_memory are LangChain MCP tool objects (via langchain_mcp_adapters) but are excluded from both agents' tool lists. They are called directly — await _memory_add_tool.ainvoke(...) — outside the agent loop, before and after each turn.
Three-Tier Model Routing
| Tier | Model | Agent | Trigger | Latency |
|---|---|---|---|---|
| Light | qwen2.5:1.5b (router answers directly) |
— | Regex pre-match or 3-way embedding classifies "light" | ~2–4s |
| Medium | qwen3:4b (DEEPAGENTS_MODEL) |
_DirectModel — single LLM call, no tools |
Default; also forced when message contains URLs | ~10–20s |
| Complex | deepseek/deepseek-r1:free via LiteLLM (DEEPAGENTS_COMPLEX_MODEL) |
create_deep_agent — agentic loop with tools |
Auto-classified by embedding similarity | ~30–90s |
Routing is fully automatic via 3-way cosine similarity over pre-embedded utterance centroids (light / medium / complex). No prefix required. Use adolf-deep model name to force complex tier via API.
Complex tier is reached automatically for deep research queries — исследуй, изучи все, напиши подробный, etc. — via regex pre-classifier and embedding similarity. No prefix required. Use adolf-deep model name to force it via API.
Fast Tools (fast_tools.py)
Pre-flight tools that run concurrently with URL fetch and memory retrieval before any LLM call. Each tool has two methods:
matches(message) → bool— regex classifier; also used byRouterto force medium tierrun(message) → str— async fetch returning a context block injected into system prompt
FastToolRunner holds all tools. any_matches() is called by the Router at step 0a; run_matching() is called in the pre-flight asyncio.gather in run_agent_task().
| Tool | Pattern | Source | Context returned |
|---|---|---|---|
WeatherTool |
weather/forecast/temperature/snow/rain | SearXNG "погода Балашиха сейчас" |
Current conditions in °C from Russian weather sites |
CommuteTool |
commute/traffic/arrival/пробки | routecheck:8090/api/route (Yandex Routing API) |
Drive time with/without traffic, Balashikha→Moscow |
To add a new fast tool: subclass FastTool in fast_tools.py, implement name/matches/run, add an instance to _fast_tool_runner in agent.py.
routecheck Service (routecheck/)
Local web service on port 8090. Exists because Yandex Routing API free tier requires a web UI that uses the API.
Web UI (http://localhost:8090): PIL-generated arithmetic captcha → lat/lon form → travel time result.
Internal API: GET /api/route?from=lat,lon&to=lat,lon&token=ROUTECHECK_TOKEN — bypasses captcha, used by CommuteTool. The ROUTECHECK_TOKEN shared secret is set in .env and passed to both routecheck and deepagents containers.
Yandex API calls are routed through the host HTTPS proxy (host.docker.internal:56928) since the container has no direct external internet access.
Requires .env: YANDEX_ROUTING_KEY (free from developer.tech.yandex.ru) + ROUTECHECK_TOKEN.
Crawl4AI Integration
Crawl4AI runs as a Docker service (crawl4ai:11235) providing JS-rendered, bot-bypass page fetching.
Pre-routing fetch (all tiers):
_URL_REdetectshttps?://URLs in any incoming message_crawl4ai_fetch_async()useshttpx.AsyncClientto POST{urls: [...]}to/crawl- Up to 3 URLs fetched concurrently via
asyncio.gather - Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts
- If fetch succeeds and router returns light → tier upgraded to medium
Complex agent tools:
web_search: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page textfetch_url: Crawl4AI single-URL fetch for any specific URL
Memory Pipeline
openmemory runs as a FastMCP server (openmemory:8765) backed by mem0 + Qdrant + nomic-embed-text.
Retrieval (before routing): _retrieve_memories() calls search_memory MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit.
Storage (after reply): _store_memory() runs as an asyncio background task, calling add_memory with "User: ...\nAssistant: ...". The extraction LLM (qwen2.5:1.5b on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt.
Memory tools (add_memory, search_memory, get_all_memories) are excluded from agent tool lists — memory management happens outside the agent loop.
VRAM Management
GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU).
- Flush explicitly before loading qwen3:8b (
keep_alive=0) - Verify eviction via
/api/pspoll (15s timeout) before proceeding - Fallback: timeout → run medium agent instead
- Post-complex: flush 8b, pre-warm medium + router
Session ID Convention
- Telegram:
tg-<chat_id>(e.g.tg-346967270) - CLI:
cli-<username>(e.g.cli-alvis)
Conversation history is keyed by session_id (5-turn buffer).
Files
adolf/
├── docker-compose.yml Services: deepagents, openmemory, grammy, crawl4ai, routecheck, cli
├── Dockerfile deepagents container (Python 3.12)
├── Dockerfile.cli CLI container (python:3.12-slim + rich)
├── agent.py FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, fast tools, memory pipeline
├── fast_tools.py FastTool base, FastToolRunner, WeatherTool, CommuteTool
├── channels.py Channel registry + deliver() + pending_replies
├── router.py Router class — regex + LLM tier classification, FastToolRunner integration
├── vram_manager.py VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py _DirectModel (medium) / create_deep_agent (complex)
├── cli.py Interactive CLI REPL — Rich Live streaming + Markdown render
├── wiki_research.py Batch wiki research pipeline (uses /message + SSE)
├── benchmarks/
│ ├── run_benchmark.py Routing accuracy benchmark — 120 queries across 3 tiers
│ ├── run_voice_benchmark.py Voice path benchmark
│ ├── benchmark.json Query dataset (gitignored)
│ └── results_latest.json Last run results (gitignored)
├── .env TELEGRAM_BOT_TOKEN, ROUTECHECK_TOKEN, YANDEX_ROUTING_KEY (not committed)
├── routecheck/
│ ├── app.py FastAPI: image captcha + /api/route Yandex proxy
│ └── Dockerfile
├── tests/
│ ├── integration/ Standalone integration test scripts (common.py + test_*.py)
│ └── use_cases/ Claude Code skill markdown files — Claude acts as user + evaluator
├── openmemory/
│ ├── server.py FastMCP + mem0: add_memory, search_memory, get_all_memories
│ └── Dockerfile
└── grammy/
├── bot.mjs grammY Telegram bot + POST /send HTTP endpoint
├── package.json
└── Dockerfile
External Services (host ports, from openai/ stack)
| Service | Host Port | Role |
|---|---|---|
| LiteLLM | 4000 | LLM proxy — all inference goes through here (LITELLM_URL env var) |
| Ollama GPU | 11436 | GPU inference backend + VRAM management (direct) + memory extraction |
| Ollama CPU | 11435 | nomic-embed-text embeddings for openmemory |
| Langfuse | 3200 | LLM observability — traces all requests via LiteLLM callbacks |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search (used by web_search tool) |