Files
AgapHost/adolf/ARCHITECTURE.md
Alvis 09a93c661e Add three-tier model routing with VRAM management and benchmark suite
- Three-tier routing: light (router answers directly ~3s), medium (qwen3:4b
  + tools ~60s), complex (/think prefix → qwen3:8b + subagents ~140s)
- Router: qwen2.5:1.5b, temp=0, regex pre-classifier + raw-text LLM classify
- VRAMManager: explicit flush/poll/prewarm to prevent Ollama CPU-spill bug
- agent_factory: build_medium_agent and build_complex_agent using deepagents
  (TodoListMiddleware + SubAgentMiddleware with research/memory subagents)
- Fix: split Telegram replies >4000 chars into multiple messages
- Benchmark: 30 questions (easy/medium/hard) — 10/10/10 verified passing
  easy→light, medium→medium, hard→complex with VRAM flush confirmed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-28 17:54:51 +00:00

145 lines
6.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Adolf
Persistent AI assistant reachable via Telegram. Three-tier model routing with GPU VRAM management.
## Architecture
```
Telegram user
↕ (long-polling)
[grammy] Node.js — port 3001
- grammY bot polls Telegram
- on message: fire-and-forget POST /chat to deepagents
- exposes MCP SSE server: tool send_telegram_message(chat_id, text)
↓ POST /chat → 202 Accepted immediately
[deepagents] Python FastAPI — port 8000
Pre-check: starts with /think? → force_complex=True, strip prefix
Router (qwen2.5:0.5b, ~1-2s, always warm in VRAM)
Structured output: {tier: light|medium|complex, confidence: 0.0-1.0, reply?: str}
- light: simple conversational → router answers directly, ~1-2s
- medium: needs memory/web search → qwen3:4b + deepagents tools
- complex: multi-step research, planning, code → qwen3:8b + subagents
force_complex always overrides to complex
complex only if confidence >= 0.85 (else downgraded to medium)
├── light ─────────── router reply used directly (no extra LLM call)
├── medium ────────── deepagents qwen3:4b + TodoList + tools
└── complex ───────── VRAM flush → deepagents qwen3:8b + TodoList + subagents
└→ background: exit_complex_mode (flush 8b, prewarm 4b+router)
send_telegram_message via grammy MCP
asyncio.create_task(store_memory_async) — spin-wait GPU idle → openmemory add_memory
↕ MCP SSE ↕ HTTP
[openmemory] Python + mem0 — port 8765 [SearXNG — port 11437]
- add_memory, search_memory, get_all_memories
- extractor: qwen2.5:1.5b on GPU Ollama (11436) — 25s
- embedder: nomic-embed-text on CPU Ollama (11435) — 50150ms
- vector store: Qdrant (port 6333), 768 dims
```
## Three-Tier Model Routing
| Tier | Model | VRAM | Trigger | Latency |
|------|-------|------|---------|---------|
| Light | qwen2.5:1.5b (router answers) | ~1.2 GB (shared with extraction) | Router classifies as light | ~24s |
| Medium | qwen3:4b | ~2.5 GB | Default; router classifies medium | ~2040s |
| Complex | qwen3:8b | ~5.5 GB | `/think` prefix | ~60120s |
**Normal VRAM** (light + medium): router/extraction(1.2, shared) + medium(2.5) = ~3.7 GB
**Complex VRAM**: 8b alone = ~5.5 GB — must flush others first
### Router model: qwen2.5:1.5b (not 0.5b)
qwen2.5:0.5b is too small for reliable classification — tends to output "medium" for everything
or produces nonsensical output. qwen2.5:1.5b is already loaded in VRAM for memory extraction,
so switching adds zero net VRAM overhead while dramatically improving accuracy.
Router uses **raw text generation** (not structured output/JSON schema):
- Ask model to output one word: `light`, `medium`, or `complex`
- Parse with simple keyword matching (fallback: `medium`)
- For `light` tier: a second call generates the reply text
## VRAM Management
GTX 1070 has 8 GB VRAM. Ollama's auto-eviction can spill models to CPU RAM permanently
(all subsequent loads stay on CPU). To prevent this:
1. **Always flush explicitly** before loading qwen3:8b (`keep_alive=0`)
2. **Verify eviction** via `/api/ps` poll (15s timeout) before proceeding
3. **Fallback**: timeout → log warning, run medium agent instead
4. **Post-complex**: flush 8b immediately, pre-warm 4b + router
```python
# Flush (force immediate unload):
POST /api/generate {"model": "qwen3:4b", "prompt": "", "keep_alive": 0}
# Pre-warm (load into VRAM for 5 min):
POST /api/generate {"model": "qwen3:4b", "prompt": "", "keep_alive": 300}
```
## Agents
**Medium agent** (`build_medium_agent`):
- `create_deep_agent` with TodoListMiddleware (auto-included)
- Tools: `search_memory`, `get_all_memories`, `web_search`
- No subagents
**Complex agent** (`build_complex_agent`):
- `create_deep_agent` with TodoListMiddleware + SubAgentMiddleware
- Tools: all agent tools
- Subagents:
- `research`: web_search only, for thorough multi-query web research
- `memory`: search_memory + get_all_memories, for comprehensive context retrieval
## Concurrency
| Semaphore | Guards | Notes |
|-----------|--------|-------|
| `_reply_semaphore(1)` | GPU Ollama (all tiers) | One LLM reply inference at a time |
| `_memory_semaphore(1)` | GPU Ollama (qwen2.5:1.5b extraction) | One memory extraction at a time |
Light path holds `_reply_semaphore` briefly (no GPU inference).
Memory extraction spin-waits until `_reply_semaphore` is free (60s timeout).
## Pipeline
1. User message → Grammy → `POST /chat` → 202 Accepted
2. Background: acquire `_reply_semaphore` → route → run agent tier → send reply
3. `asyncio.create_task(store_memory_async)` — spin-waits GPU free, then extracts memories
4. For complex: `asyncio.create_task(exit_complex_mode)` — flushes 8b, pre-warms 4b+router
## External Services (from openai/ stack)
| Service | Host Port | Role |
|---------|-----------|------|
| Ollama GPU | 11436 | All reply inference + extraction (qwen2.5:1.5b) |
| Ollama CPU | 11435 | Memory embedding (nomic-embed-text) |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search |
GPU Ollama config: `OLLAMA_MAX_LOADED_MODELS=2`, `OLLAMA_NUM_PARALLEL=1`.
## Files
```
adolf/
├── docker-compose.yml Services: deepagents, openmemory, grammy
├── Dockerfile deepagents container (Python 3.12)
├── agent.py FastAPI + three-tier routing + run_agent_task
├── router.py Router class — qwen2.5:0.5b structured output routing
├── vram_manager.py VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py build_medium_agent / build_complex_agent (deepagents)
├── .env TELEGRAM_BOT_TOKEN (not committed)
├── openmemory/
│ ├── server.py FastMCP + mem0 MCP tools
│ ├── requirements.txt
│ └── Dockerfile
└── grammy/
├── bot.mjs grammY bot + MCP SSE server
├── package.json
└── Dockerfile
```