From 3770d9d7825fa841f4cb0489185353b06c54f0dc Mon Sep 17 00:00:00 2001
From: alvis
Date: Sat, 28 Feb 2026 17:56:16 +0000
Subject: [PATCH] Update Adolf: three-tier routing, VRAM management, deepagents

---
 Adolf.md | 105 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 67 insertions(+), 38 deletions(-)

diff --git a/Adolf.md b/Adolf.md
index 43f31cc..78eef57 100644
--- a/Adolf.md
+++ b/Adolf.md
@@ -1,6 +1,6 @@
 # Adolf
 
-Persistent AI assistant reachable via Telegram. GPU-accelerated inference with long-term memory and web search.
+Persistent AI assistant reachable via Telegram. Three-tier model routing with GPU VRAM management and long-term memory.
 
 ## Architecture
 
@@ -10,51 +10,83 @@ Telegram user
 [grammy] Node.js — port 3001
   - grammY bot polls Telegram
   - on message: fire-and-forget POST /chat to deepagents
-  - exposes MCP SSE server: tool send_telegram_message(chat_id, text)
-  ↕ fire-and-forget HTTP   ↕ MCP SSE tool call
+  - exposes MCP SSE: send_telegram_message(chat_id, text)
+  ↓ POST /chat → 202 Accepted immediately
 [deepagents] Python FastAPI — port 8000
-  - POST /chat → 202 Accepted immediately
-  - background task: run LangGraph react agent
-  - LLM: qwen3:8b via Ollama GPU (host port 11436)
-  - tools: search_memory, get_all_memories, web_search
-  - after reply: async fire-and-forget → store memory on CPU
-  ↕ MCP SSE   ↕ HTTP (SearXNG)
-[openmemory] Python + mem0 — port 8765   [SearXNG — port 11437]
+  ↓
+Pre-check: /think prefix?
+  → force_complex=True, strip prefix
+  ↓
+Router (qwen2.5:1.5b, temp=0, ~2–4s)
+  - light: simple/conversational → router answers directly
+  - medium: needs memory/web search → qwen3:4b + tools
+  - complex: multi-step research, planning → qwen3:8b + subagents
+  ↓
+  ├── light ─────────── router reply used directly
+  ├── medium ────────── qwen3:4b + TodoList + tools (~20–100s)
+  └── complex ───────── VRAM flush → qwen3:8b + subagents (~60–180s)
+        └→ background: flush 8b, prewarm 4b+router
+  ↓
+send_telegram_message via Grammy MCP (auto-split if >4000 chars)
+  ↓
+asyncio.create_task(store_memory_async) — spin-wait GPU idle → add_memory
+  ↕ MCP SSE   ↕ HTTP
+[openmemory] Python + mem0 — port 8765   [SearXNG — port 11437]
   - MCP tools: add_memory, search_memory, get_all_memories
-  - mem0 backend: Qdrant (port 6333) + CPU Ollama (port 11435)
-  - embedder: nomic-embed-text (768 dims)
-  - extractor: gemma3:1b
-  - collection: adolf_memories
+  - extractor: qwen2.5:1.5b on GPU Ollama (11436) — 2–5s
+  - embedder: nomic-embed-text on CPU Ollama (11435) — 50–150ms
+  - vector store: Qdrant (port 6333), 768 dims
 ```
 
-## Queuing and Concurrency
+## Three-Tier Model Routing
 
-Two semaphores prevent resource contention:
+| Tier | Model | VRAM | Trigger | Latency |
+|------|-------|------|---------|---------|
+| Light | qwen2.5:1.5b (router) | ~1.2 GB (shared with extraction) | Router classifies as light | ~2–4s |
+| Medium | qwen3:4b | ~2.5 GB | Default | ~20–100s |
+| Complex | qwen3:8b | ~5.5 GB | `/think` prefix | ~60–180s |
 
-| Semaphore | Guards | Notes |
-|-----------|--------|-------|
-| `_reply_semaphore(1)` | GPU Ollama (qwen3:8b) | One LLM inference at a time |
-| `_memory_semaphore(1)` | CPU Ollama (gemma3:1b) | One memory store at a time |
+**Normal VRAM**: router/extraction (1.2 GB, shared) + medium (2.5 GB) = ~3.7 GB
+**Complex VRAM**: 8b alone = ~5.5 GB — flushes others first
 
-**Reply-first pipeline:**
-1.
-  User message arrives via Telegram → Grammy forwards to deepagents (fire-and-forget)
-2. Deepagents queues behind `_reply_semaphore`, runs agent, sends reply via Grammy MCP tool
-3. After reply is sent, `asyncio.create_task` fires `store_memory_async` in background
-4. Memory task queues behind `_memory_semaphore`, calls `add_memory` on openmemory
-5. openmemory uses CPU Ollama: embedding (~0.3s) + extraction (~1.6s) → stored in Qdrant
+The router first applies a regex pre-classifier (greetings and other simple patterns), then falls back to raw-text LLM classification. The complex tier requires the `/think` prefix.
 
-Reply latency: ~10–18s (GPU qwen3:8b inference + tool calls).
-Memory latency: ~5–16s (runs async, never blocks replies).
+## VRAM Management
+
+GTX 1070 (8 GB). An explicit flush before loading qwen3:8b prevents Ollama's CPU-spill bug:
+
+1. Flush qwen3:4b and qwen2.5:1.5b (`keep_alive=0`)
+2. Poll `/api/ps` until evicted (15s timeout)
+3. Fall back to the medium agent on timeout
+4. After a complex reply: flush 8b, pre-warm 4b + router
+
+## Agents
+
+**Medium agent**: `create_deep_agent` (deepagents) + TodoListMiddleware
+Tools: `search_memory`, `get_all_memories`, `web_search`
+
+**Complex agent**: `create_deep_agent` + TodoListMiddleware + SubAgentMiddleware
+Subagents: `research` (web_search), `memory` (search_memory + get_all_memories)
+
+## Concurrency
+
+| Semaphore | Guards |
+|-----------|--------|
+| `_reply_semaphore(1)` | GPU Ollama — one LLM inference at a time |
+| `_memory_semaphore(1)` | GPU Ollama — one memory extraction at a time |
+
+Memory extraction spin-waits until `_reply_semaphore` is free (60s timeout).
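The two-semaphore scheme above can be sketched as follows. The semaphore names mirror the doc; `run_reply`, `store_memory_async`'s signature, and the poll interval are illustrative assumptions, not the actual implementation:

```python
import asyncio
import time

_reply_semaphore = asyncio.Semaphore(1)   # GPU Ollama: one LLM inference at a time
_memory_semaphore = asyncio.Semaphore(1)  # GPU Ollama: one memory extraction at a time

async def run_reply(generate):
    """Reply path: hold the GPU semaphore for exactly one inference."""
    async with _reply_semaphore:
        return await generate()

async def store_memory_async(extract, timeout: float = 60.0, poll: float = 0.5):
    """Background memory task: spin-wait until no reply holds the GPU
    (capped at 60s per the doc), then extract under its own semaphore."""
    deadline = time.monotonic() + timeout
    while _reply_semaphore.locked() and time.monotonic() < deadline:
        await asyncio.sleep(poll)  # GPU busy with a reply; check again shortly
    async with _memory_semaphore:  # one extraction at a time
        return await extract()
```

Note the inherent race in this design: a new reply can acquire the GPU between the `locked()` check and the extraction call; with `OLLAMA_NUM_PARALLEL=1` the Ollama server serializes the requests anyway, so the spin-wait is a latency optimization rather than a correctness guarantee.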
 ## External Services (from openai/ stack)
 
 | Service | Host Port | Role |
 |---------|-----------|------|
-| Ollama GPU | 11436 | Main LLM (qwen3:8b) |
-| Ollama CPU | 11435 | Memory embedding + extraction |
+| Ollama GPU | 11436 | Reply inference + extraction (qwen2.5:1.5b) |
+| Ollama CPU | 11435 | Memory embedding (nomic-embed-text) |
 | Qdrant | 6333 | Vector store for memories |
 | SearXNG | 11437 | Web search |
 
+GPU Ollama config: `OLLAMA_MAX_LOADED_MODELS=2`, `OLLAMA_NUM_PARALLEL=1`.
+
 ## Compose Stack
 
 Config: `agap_git/adolf/docker-compose.yml`
 
@@ -66,20 +98,17 @@ docker compose up -d
 
 Requires `TELEGRAM_BOT_TOKEN` in `adolf/.env`.
 
-## Memory
-
-- Stored per `chat_id` (Telegram user ID) as `user_id` in mem0
-- Semantic search via Qdrant (cosine similarity, 768-dim nomic-embed-text vectors)
-- mem0 uses gemma3:1b to extract structured facts before embedding
-- Collection: `adolf_memories` in Qdrant
-
 ## Files
 
 ```
 adolf/
 ├── docker-compose.yml     Services: deepagents, openmemory, grammy
 ├── Dockerfile             deepagents container (Python 3.12)
-├── agent.py               FastAPI + LangGraph react agent
+├── agent.py               FastAPI + three-tier routing + run_agent_task
+├── router.py              Router — regex + qwen2.5:1.5b classification
+├── vram_manager.py        VRAMManager — flush/poll/prewarm Ollama VRAM
+├── agent_factory.py       build_medium_agent / build_complex_agent
+├── test_pipeline.py       Integration tests + benchmark (easy/medium/hard)
 ├── .env                   TELEGRAM_BOT_TOKEN (not committed)
 ├── openmemory/
 │   ├── server.py          FastMCP + mem0 MCP tools