Update Adolf: three-tier routing, VRAM management, deepagents

2026-02-28 17:56:16 +00:00
parent a9c6f697a4
commit 3770d9d782

Adolf.md

@@ -1,6 +1,6 @@
# Adolf
Persistent AI assistant reachable via Telegram. Three-tier model routing with GPU VRAM management and long-term memory.
## Architecture
@@ -10,51 +10,83 @@
```
Telegram user
[grammy] Node.js — port 3001
- grammY bot polls Telegram
- on message: fire-and-forget POST /chat to deepagents
- exposes MCP SSE: send_telegram_message(chat_id, text)
↓ POST /chat → 202 Accepted immediately
[deepagents] Python FastAPI — port 8000
Pre-check: /think prefix? → force_complex=True, strip prefix
Router (qwen2.5:1.5b, temp=0, ~2–4s)
- light: simple/conversational → router answers directly
- medium: needs memory/web search → qwen3:4b + tools
- complex: multi-step research, planning → qwen3:8b + subagents
├── light ─────────── router reply used directly
├── medium ────────── qwen3:4b + TodoList + tools (~20–100s)
└── complex ───────── VRAM flush → qwen3:8b + subagents (~60–180s)
└→ background: flush 8b, prewarm 4b+router
send_telegram_message via grammY MCP (auto-split if >4000 chars)
asyncio.create_task(store_memory_async) — spin-wait GPU idle → add_memory
↕ MCP SSE ↕ HTTP
[openmemory] Python + mem0 — port 8765 [SearXNG — port 11437]
- MCP tools: add_memory, search_memory, get_all_memories
- collection: adolf_memories
- extractor: qwen2.5:1.5b on GPU Ollama (11436) — 2–5s
- embedder: nomic-embed-text on CPU Ollama (11435) — 50–150ms
- vector store: Qdrant (port 6333), 768 dims
```
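The fire-and-forget handoff in the diagram can be sketched with plain asyncio. Function and semaphore names follow the diagram and the Concurrency section; the HTTP, MCP, and agent layers are stubbed out with a `log` list, so this is a shape sketch, not the real service:

```python
import asyncio

_reply_semaphore = asyncio.Semaphore(1)   # one GPU inference at a time
_memory_semaphore = asyncio.Semaphore(1)  # one memory store at a time
log: list[str] = []                       # stands in for real side effects

async def store_memory_async(chat_id: int, text: str) -> None:
    # real service: wait for GPU idle, then call openmemory's add_memory
    async with _memory_semaphore:
        log.append(f"memory:{chat_id}")

async def run_agent_task(chat_id: int, text: str) -> None:
    # reply first: run the agent and send the Telegram reply via grammY MCP
    async with _reply_semaphore:
        log.append(f"reply:{chat_id}")
    # then fire-and-forget the memory write, never blocking the reply path
    asyncio.create_task(store_memory_async(chat_id, text))

async def chat_endpoint(chat_id: int, text: str) -> int:
    # POST /chat: schedule the work in the background, return 202 immediately
    asyncio.create_task(run_agent_task(chat_id, text))
    return 202
```

The reply is always logged before the memory write, since the memory task is only spawned after the reply semaphore is released.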
## Three-Tier Model Routing
| Tier | Model | VRAM | Trigger | Latency |
|------|-------|------|---------|---------|
| Light | qwen2.5:1.5b (router) | ~1.2 GB (shared with extraction) | Router classifies as light | ~2–4s |
| Medium | qwen3:4b | ~2.5 GB | Default | ~20–100s |
| Complex | qwen3:8b | ~5.5 GB | `/think` prefix | ~60–180s |
**Normal VRAM**: router/extraction (1.2 GB, shared) + medium (2.5 GB) = ~3.7 GB
**Complex VRAM**: 8b alone = ~5.5 GB — flushes others first
Router uses regex pre-classifier (greetings, simple patterns) then raw-text LLM classification. Complex tier requires `/think` prefix.
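A minimal sketch of the pre-check plus pre-classifier. The regex patterns and the medium fallback are illustrative assumptions; the real router calls qwen2.5:1.5b at temperature 0 whenever no cheap rule fires:

```python
import re

# illustrative fast-path patterns; the real pre-classifier's rules may differ
_LIGHT_RE = re.compile(r"^\s*(hi|hello|hey|thanks|ok)\b", re.I)

def classify(message: str) -> tuple[str, str]:
    """Return (tier, text) where tier is light, medium, or complex."""
    if message.startswith("/think"):      # explicit override → complex tier
        return "complex", message.removeprefix("/think").strip()
    if _LIGHT_RE.match(message):          # regex pre-classifier fast path
        return "light", message
    # fallthrough: the real router asks qwen2.5:1.5b; default to medium here
    return "medium", message
```

The `/think` check runs before any model call, so complex requests skip straight to the VRAM flush.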
## VRAM Management
GTX 1070 (8 GB). Explicit flush before loading qwen3:8b prevents Ollama's CPU-spill bug:
1. Flush qwen3:4b and qwen2.5:1.5b (`keep_alive=0`)
2. Poll `/api/ps` until evicted (15s timeout)
3. Fallback to medium agent if timeout
4. After complex reply: flush 8b, pre-warm 4b + router
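The flush-and-poll steps can be sketched against Ollama's HTTP API: a `/api/generate` request with `keep_alive=0` unloads a model, and `/api/ps` lists what is still resident. The injectable `session` parameter is a testing convenience, not from the source, and pre-warming is omitted:

```python
import time

def flush_models(base_url: str, models: list[str],
                 timeout: float = 15.0, session=None) -> bool:
    """Evict `models` from VRAM, then poll until Ollama reports them gone.

    Returns False on timeout, in which case the caller falls back to the
    medium agent instead of loading qwen3:8b.
    """
    if session is None:
        import requests  # real HTTP client when none is injected
        session = requests.Session()
    for name in models:
        # keep_alive=0 asks Ollama to unload the model immediately
        session.post(f"{base_url}/api/generate",
                     json={"model": name, "keep_alive": 0})
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        loaded = {m["name"] for m in
                  session.get(f"{base_url}/api/ps").json().get("models", [])}
        if not loaded & set(models):
            return True   # VRAM free: safe to load qwen3:8b
        time.sleep(0.5)
    return False
```

A `False` return maps to step 3 above: the request is served by the medium agent rather than risking Ollama spilling the 8b model to CPU.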
## Agents
**Medium agent**: `create_deep_agent` (deepagents) + TodoListMiddleware
Tools: `search_memory`, `get_all_memories`, `web_search`
**Complex agent**: `create_deep_agent` + TodoListMiddleware + SubAgentMiddleware
Subagents: `research` (web_search), `memory` (search_memory + get_all_memories)
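The complex agent's subagents can be sketched as plain spec dicts; the `name`/`description`/`prompt`/`tools` shape follows deepagents' subagent config format, the tool names come from this document, and the prompt strings are invented placeholders:

```python
# Subagent specs as passed to create_deep_agent(subagents=...).
# Prompts are illustrative, not from the source.
RESEARCH_SUBAGENT = {
    "name": "research",
    "description": "Multi-step web research",
    "prompt": "Research the topic with web_search and summarize findings.",
    "tools": ["web_search"],
}
MEMORY_SUBAGENT = {
    "name": "memory",
    "description": "Recall stored user facts",
    "prompt": "Look up relevant user memories before answering.",
    "tools": ["search_memory", "get_all_memories"],
}
SUBAGENTS = [RESEARCH_SUBAGENT, MEMORY_SUBAGENT]
```

Scoping each subagent to a tool subset keeps the parent qwen3:8b context focused on planning rather than raw search output.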
## Concurrency
| Semaphore | Guards |
|-----------|--------|
| `_reply_semaphore(1)` | GPU Ollama — one LLM inference at a time |
| `_memory_semaphore(1)` | GPU Ollama — one memory extraction at a time |
Memory extraction spin-waits until `_reply_semaphore` is free (60s timeout).
## External Services (from openai/ stack)
| Service | Host Port | Role |
|---------|-----------|------|
| Ollama GPU | 11436 | Reply inference + extraction (qwen2.5:1.5b) |
| Ollama CPU | 11435 | Memory embedding (nomic-embed-text) |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search |
GPU Ollama config: `OLLAMA_MAX_LOADED_MODELS=2`, `OLLAMA_NUM_PARALLEL=1`.
## Compose Stack
Config: `agap_git/adolf/docker-compose.yml`
@@ -66,20 +98,17 @@ docker compose up -d
Requires `TELEGRAM_BOT_TOKEN` in `adolf/.env`.
## Memory
- Stored per `chat_id` (Telegram user ID) as `user_id` in mem0
- Semantic search via Qdrant (cosine similarity, 768-dim nomic-embed-text vectors)
- mem0 uses qwen2.5:1.5b (on GPU Ollama) to extract structured facts before embedding
- Collection: `adolf_memories` in Qdrant
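The wiring above could be expressed as a mem0 `from_config` dict. The provider/config key shape follows mem0's config format; the `ollama-gpu`, `ollama-cpu`, and `qdrant` hostnames are placeholders for the compose service names:

```python
# Hypothetical config for mem0's Memory.from_config(), mirroring the
# document's wiring; hostnames are placeholders.
MEM0_CONFIG = {
    "llm": {  # fact extractor runs on the GPU Ollama instance
        "provider": "ollama",
        "config": {"model": "qwen2.5:1.5b",
                   "ollama_base_url": "http://ollama-gpu:11436"},
    },
    "embedder": {  # embeddings stay on CPU
        "provider": "ollama",
        "config": {"model": "nomic-embed-text",
                   "ollama_base_url": "http://ollama-cpu:11435"},
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {"collection_name": "adolf_memories",
                   "host": "qdrant", "port": 6333,
                   "embedding_model_dims": 768},
    },
}
```

With this config, something like `Memory.from_config(MEM0_CONFIG).search(query, user_id=chat_id)` would run a cosine-similarity search in the `adolf_memories` collection, keyed per Telegram user.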
## Files
```
adolf/
├── docker-compose.yml Services: deepagents, openmemory, grammy
├── Dockerfile deepagents container (Python 3.12)
├── agent.py FastAPI + three-tier routing + run_agent_task
├── router.py Router — regex + qwen2.5:1.5b classification
├── vram_manager.py VRAMManager — flush/poll/prewarm Ollama VRAM
├── agent_factory.py build_medium_agent / build_complex_agent
├── test_pipeline.py Integration tests + benchmark (easy/medium/hard)
├── .env TELEGRAM_BOT_TOKEN (not committed)
├── openmemory/
│ ├── server.py FastMCP + mem0 MCP tools