# Adolf

Autonomous personal assistant with a multi-channel gateway, three-tier model routing, and GPU VRAM management.

## Architecture

```
┌───────────────────────────────────────────────────────┐
│                   CHANNEL ADAPTERS                    │
│                                                       │
│     [Telegram/Grammy]   [CLI]   [Voice — future]      │
│             ↕             ↕             ↕             │
│             └─────────────┴─────────────┘             │
│                           ↕                           │
│              ┌─────────────────────────┐              │
│              │   GATEWAY (agent.py)    │              │
│              │      FastAPI :8000      │              │
│              │                         │              │
│              │  POST /message          │ ← all inbound│
│              │  POST /chat (legacy)    │              │
│              │  GET /reply/{id} SSE    │ ← CLI polling│
│              │  GET /health            │              │
│              │                         │              │
│              │  channels.py registry   │              │
│              │  conversation buffers   │              │
│              └──────────┬──────────────┘              │
│                         ↓                             │
│              ┌──────────────────────┐                 │
│              │      AGENT CORE      │                 │
│              │  three-tier routing  │                 │
│              │   VRAM management    │                 │
│              └──────────────────────┘                 │
│                         ↓                             │
│      channels.deliver(session_id, channel, text)      │
│                ↓                          ↓           │
│   telegram → POST grammy/send      cli → SSE queue    │
└───────────────────────────────────────────────────────┘
```

## Channel Adapters

| Channel | session_id | Inbound | Outbound |
|----------|------------|---------|----------|
| Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
| CLI | `cli-<user>` | POST /message directly | GET /reply/{id} SSE stream |
| Voice | `voice-<device>` | (future) | (future) |

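The registry-plus-queue pattern behind the Outbound column can be sketched as follows. Only `deliver(session_id, channel, text)` and `pending_replies` appear in this document; `register`, `send_callbacks`, and the exact data structures are illustrative assumptions, not the actual channels.py code:

```python
from collections import defaultdict
from queue import Queue
from typing import Callable, Dict

# Sketch only: names other than deliver() and pending_replies are assumed.
send_callbacks: Dict[str, Callable[[str, str], None]] = {}
pending_replies: Dict[str, Queue] = defaultdict(Queue)

def register(channel: str, send: Callable[[str, str], None]) -> None:
    """Register the outbound send callback for a channel (e.g. 'telegram')."""
    send_callbacks[channel] = send

def deliver(session_id: str, channel: str, text: str) -> None:
    # Always enqueue the reply so SSE clients on GET /reply/{id} can read it.
    pending_replies[session_id].put(text)
    # Then hand it to the channel-specific transport, if one is registered.
    callback = send_callbacks.get(channel)
    if callback is not None:
        callback(session_id, text)
```

Enqueueing unconditionally is what lets a CLI SSE client observe replies even for sessions whose primary transport is Telegram.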
## Unified Message Flow

```
1. Channel adapter receives message
2. POST /message {text, session_id, channel, user_id}
3. 202 Accepted immediately
4. Background: run_agent_task(message, session_id, channel)
5. Route → run agent tier → get reply text
6. channels.deliver(session_id, channel, reply_text)
   - always puts reply in pending_replies[session_id] queue (for SSE)
   - calls channel-specific send callback
7. GET /reply/{session_id} SSE clients receive the reply
```

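The accept-then-process handoff in steps 2–4 can be sketched with plain asyncio. This is a minimal sketch with `run_agent_task` injected as a parameter; the real agent.py presumably wires the same idea through FastAPI:

```python
import asyncio

async def post_message(payload: dict, run_agent_task) -> int:
    """Sketch of POST /message: schedule the work, return 202 right away.

    `run_agent_task` is injected here for illustration; in the gateway it
    would be the routing/agent pipeline (steps 4-6 of the flow above).
    """
    asyncio.get_running_loop().create_task(
        run_agent_task(payload["text"], payload["session_id"], payload["channel"])
    )
    # Accepted: the reply arrives later via channels.deliver(), not in this response.
    return 202
```

Returning before the agent runs is what keeps slow tiers (20s and up, per the routing table) from blocking the Telegram long-poll loop.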
## Three-Tier Model Routing

| Tier | Model | VRAM | Trigger | Latency |
|------|-------|------|---------|---------|
| Light | qwen2.5:1.5b (router answers) | ~1.2 GB | Router classifies as light | ~2–4s |
| Medium | qwen3:4b | ~2.5 GB | Default | ~20–40s |
| Complex | qwen3:8b | ~6.0 GB | `/think` prefix | ~60–120s |

**`/think` prefix**: forces the complex tier; the prefix is stripped before the message is sent to the agent.

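Tier selection plus the `/think` strip can be sketched like this. `pick_tier` is a hypothetical helper, not router.py's actual API; the real routing spans router.py and agent.py:

```python
def pick_tier(text: str, router_label: str = "medium") -> tuple:
    """Return (tier, cleaned_text). Illustrative only.

    /think always wins and is stripped; otherwise the qwen2.5:1.5b router's
    label decides between light and the medium default.
    """
    if text.startswith("/think"):
        return "complex", text[len("/think"):].lstrip()
    if router_label == "light":
        return "light", text
    return "medium", text
```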
## VRAM Management

GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (otherwise the model loads on CPU).

1. Flush explicitly before loading qwen3:8b (`keep_alive=0`)
2. Verify eviction by polling `/api/ps` (15s timeout) before proceeding
3. Fallback: on timeout, run the medium agent instead
4. Post-complex: flush 8b, pre-warm 4b + router

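Step 2 amounts to a bounded poll. This sketch injects the `/api/ps` lookup as a callable, so the actual HTTP call (and VRAMManager's real interface, which this document doesn't detail) stays an assumption:

```python
import time

def wait_for_eviction(list_loaded_models, timeout=15.0, interval=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until Ollama reports no loaded models, or give up after `timeout`.

    `list_loaded_models` stands in for a GET /api/ps call returning the
    currently loaded models. Returns False on timeout, at which point the
    caller falls back to the medium agent (step 3 above).
    """
    deadline = clock() + timeout
    while True:
        if not list_loaded_models():
            return True  # VRAM is free; safe to load qwen3:8b
        if clock() >= deadline:
            return False
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the polling logic testable without a live Ollama instance.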
## Session ID Convention

- Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
- CLI: `cli-<username>` (e.g. `cli-alvis`)

Conversation history is keyed by session_id (5-turn buffer).

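The 5-turn buffer maps naturally onto a `deque(maxlen=5)` keyed by session_id. This is an illustrative sketch; the gateway's actual buffer structure isn't shown in this document:

```python
from collections import defaultdict, deque

# One bounded buffer per session_id; the oldest turn falls off at turn six.
history = defaultdict(lambda: deque(maxlen=5))

def remember(session_id: str, user_text: str, reply_text: str) -> None:
    history[session_id].append((user_text, reply_text))
```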
## Files

```
adolf/
├── docker-compose.yml   Services: deepagents, openmemory, grammy
├── Dockerfile           deepagents container (Python 3.12)
├── agent.py             FastAPI gateway + three-tier routing
├── channels.py          Channel registry + deliver() + pending_replies
├── router.py            Router class — qwen2.5:1.5b routing
├── vram_manager.py      VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py     build_medium_agent / build_complex_agent
├── cli.py               Interactive CLI REPL client
├── wiki_research.py     Batch wiki research pipeline (uses /message + SSE)
├── .env                 TELEGRAM_BOT_TOKEN (not committed)
├── openmemory/
│   ├── server.py        FastMCP + mem0 MCP tools
│   └── Dockerfile
└── grammy/
    ├── bot.mjs          grammY Telegram bot + POST /send HTTP endpoint
    ├── package.json
    └── Dockerfile
```

## External Services (from openai/ stack)

| Service | Host Port | Role |
|---------|-----------|------|
| Ollama GPU | 11436 | All reply inference |
| Ollama CPU | 11435 | Memory embedding (nomic-embed-text) |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search |