Adolf
Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
Architecture
┌──────────────────────────────────────────────────────┐
│                   CHANNEL ADAPTERS                   │
│                                                      │
│  [Telegram/Grammy]     [CLI]      [Voice — future]   │
│          ↕               ↕               ↕           │
│          └───────────────┴───────────────┘           │
│                          ↕                           │
│             ┌─────────────────────────┐              │
│             │   GATEWAY (agent.py)    │              │
│             │      FastAPI :8000      │              │
│             │                         │              │
│             │ POST /message           │ ← all inbound│
│             │ POST /chat (legacy)     │              │
│             │ GET /reply/{id} SSE     │ ← CLI polling│
│             │ GET /health             │              │
│             │                         │              │
│             │ channels.py registry    │              │
│             │ conversation buffers    │              │
│             └──────────┬──────────────┘              │
│                        ↓                             │
│             ┌──────────────────────┐                 │
│             │      AGENT CORE      │                 │
│             │  three-tier routing  │                 │
│             │   VRAM management    │                 │
│             └──────────────────────┘                 │
│                        ↓                             │
│     channels.deliver(session_id, channel, text)      │
│          ↓                             ↓             │
│  telegram → POST grammy/send    cli → SSE queue      │
└──────────────────────────────────────────────────────┘
Channel Adapters
| Channel | session_id | Inbound | Outbound |
|---|---|---|---|
| Telegram | `tg-<chat_id>` | Grammy long-poll → `POST /message` | channels.py → `POST grammy:3001/send` |
| CLI | `cli-<user>` | `POST /message` directly | `GET /reply/{id}` SSE stream |
| Voice | `voice-<device>` | (future) | (future) |
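The CLI's outbound path reads a text/event-stream response. A minimal sketch of parsing such a stream per the SSE wire format (blank-line-separated events, `data:` fields); the exact payload the gateway emits is an assumption:

```python
def parse_sse(raw: str) -> list[str]:
    """Extract data payloads from a raw text/event-stream body.

    Events are separated by blank lines; lines beginning with
    "data:" carry the payload (multiple data lines are joined).
    """
    events = []
    for block in raw.split("\n\n"):
        data = [line[len("data:"):].strip()
                for line in block.splitlines()
                if line.startswith("data:")]
        if data:
            events.append("\n".join(data))
    return events

print(parse_sse("data: hello\n\ndata: part 1\ndata: part 2\n\n"))
# ['hello', 'part 1\npart 2']
```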
Unified Message Flow
1. Channel adapter receives message
2. POST /message {text, session_id, channel, user_id}
3. 202 Accepted immediately
4. Background: run_agent_task(message, session_id, channel)
5. Route → run agent tier → get reply text
6. channels.deliver(session_id, channel, reply_text)
- always puts reply in pending_replies[session_id] queue (for SSE)
- calls channel-specific send callback
7. GET /reply/{session_id} SSE clients receive the reply
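Step 6 can be sketched as a registry of per-session queues plus optional per-channel send callbacks. This is a hypothetical reconstruction of channels.py (the names pending_replies and deliver come from the flow above; register and the callback signature are assumptions):

```python
import queue
from collections import defaultdict
from typing import Callable, Optional

# One queue per session so SSE pollers can always pick up replies.
pending_replies: defaultdict[str, queue.Queue] = defaultdict(queue.Queue)

# Optional push callbacks per channel (e.g. telegram -> POST grammy/send).
_send_callbacks: dict[str, Callable[[str, str], None]] = {}

def register(channel: str,
             send: Optional[Callable[[str, str], None]] = None) -> None:
    """Register a channel; send(session_id, text) pushes outbound text."""
    if send is not None:
        _send_callbacks[channel] = send

def deliver(session_id: str, channel: str, text: str) -> None:
    # Always enqueue for SSE consumers of GET /reply/{session_id} ...
    pending_replies[session_id].put(text)
    # ... then push through the channel-specific callback, if any.
    cb = _send_callbacks.get(channel)
    if cb is not None:
        cb(session_id, text)
```

A session on a callback-less channel still gets its reply via the queue, which is why SSE polling works for every channel.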
Three-Tier Model Routing
| Tier | Model | VRAM | Trigger | Latency |
|---|---|---|---|---|
| Light | qwen2.5:1.5b (router answers) | ~1.2 GB | Router classifies as light | ~2–4s |
| Medium | qwen3:4b | ~2.5 GB | Default | ~20–40s |
| Complex | qwen3:8b | ~6.0 GB | /think prefix | ~60–120s |
/think prefix: forces complex tier, stripped before sending to agent.
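A sketch of the prefix handling (tier names come from the table above; the function shape is illustrative, not the actual agent.py code):

```python
def route(message: str, router_tier: str = "medium") -> tuple[str, str]:
    """Return (tier, cleaned_message).

    /think forces the complex tier and is stripped before the text
    reaches the agent; otherwise the router's classification wins.
    """
    if message.startswith("/think"):
        return "complex", message[len("/think"):].lstrip()
    return router_tier, message

print(route("/think plan my week"))      # ('complex', 'plan my week')
print(route("hi", router_tier="light"))  # ('light', 'hi')
```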
VRAM Management
GTX 1070 (8 GB). If CUDA init fails, models silently load on the CPU; Ollama must be restarted to recover.
- Flush explicitly before loading qwen3:8b (`keep_alive=0`)
- Verify eviction via `/api/ps` poll (15s timeout) before proceeding
- Fallback: timeout → run medium agent instead
- Post-complex: flush 8b, pre-warm 4b + router
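The eviction check can poll Ollama's /api/ps endpoint, which lists currently loaded models. A sketch under those assumptions (the port comes from the services table below; the helper names and the startswith match are illustrative):

```python
import json
import time
import urllib.request

OLLAMA = "http://localhost:11436"  # Ollama GPU port from the services table

def model_loaded(ps_response: dict, model: str) -> bool:
    """True if `model` still appears in an /api/ps response body."""
    return any(m.get("name", "").startswith(model)
               for m in ps_response.get("models", []))

def wait_for_eviction(model: str, timeout: float = 15.0,
                      interval: float = 1.0) -> bool:
    """Poll /api/ps until `model` is unloaded or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{OLLAMA}/api/ps") as resp:
            ps = json.load(resp)
        if not model_loaded(ps, model):
            return True
        time.sleep(interval)
    return False  # caller falls back to the medium agent
```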
Session ID Convention
- Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
- CLI: `cli-<username>` (e.g. `cli-alvis`)
Conversation history is keyed by session_id (5-turn buffer).
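The buffer could be a bounded deque per session_id; assuming a turn means one user/assistant pair, 5 turns is 10 entries (that pairing is an assumption):

```python
from collections import defaultdict, deque

TURNS = 5  # per-session history depth from the text above

# One bounded buffer per session_id holding (role, text) entries,
# so 5 user/assistant turns = 10 entries.
history: defaultdict[str, deque] = defaultdict(lambda: deque(maxlen=2 * TURNS))

def remember(session_id: str, role: str, text: str) -> None:
    history[session_id].append((role, text))

# Older turns fall off the front automatically:
for i in range(7):
    remember("tg-346967270", "user", f"msg {i}")
    remember("tg-346967270", "assistant", f"reply {i}")
print(len(history["tg-346967270"]))  # 10
print(history["tg-346967270"][0])    # ('user', 'msg 2')
```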
Files
adolf/
├── docker-compose.yml Services: deepagents, openmemory, grammy
├── Dockerfile deepagents container (Python 3.12)
├── agent.py FastAPI gateway + three-tier routing
├── channels.py Channel registry + deliver() + pending_replies
├── router.py Router class — qwen2.5:1.5b routing
├── vram_manager.py VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py build_medium_agent / build_complex_agent
├── cli.py Interactive CLI REPL client
├── wiki_research.py Batch wiki research pipeline (uses /message + SSE)
├── .env TELEGRAM_BOT_TOKEN (not committed)
├── openmemory/
│ ├── server.py FastMCP + mem0 MCP tools
│ └── Dockerfile
└── grammy/
├── bot.mjs grammY Telegram bot + POST /send HTTP endpoint
├── package.json
└── Dockerfile
External Services (from openai/ stack)
| Service | Host Port | Role |
|---|---|---|
| Ollama GPU | 11436 | All reply inference |
| Ollama CPU | 11435 | Memory embedding (nomic-embed-text) |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search |