Files
adolf/ARCHITECTURE.md
Alvis 50097d6092 Embed Crawl4AI at all tiers, restore qwen3:4b medium, update docs
- Pre-routing URL fetch: any message with URLs gets content fetched
  async (httpx.AsyncClient) before routing via _fetch_urls_from_message()
- URL context and memories gathered concurrently with asyncio.gather
- Light tier upgraded to medium when URL content is present
- url_context injected into system prompt for medium and complex agents
- Complex agent retains web_search/fetch_url tools + receives pre-fetched content
- Medium model restored to qwen3:4b (was temporarily qwen2.5:1.5b)
- Unit tests added for _extract_urls
- ARCHITECTURE.md: added Tool Handling, Crawl4AI Integration, Memory Pipeline sections
- CLAUDE.md: updated request flow and Crawl4AI integration docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:49:34 +00:00

162 lines
8.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Adolf
Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
## Architecture
```
┌─────────────────────────────────────────────────────┐
│ CHANNEL ADAPTERS │
│ │
│ [Telegram/Grammy] [CLI] [Voice — future] │
│ ↕ ↕ ↕ │
│ └────────────────┴────────────┘ │
│ ↕ │
│ ┌─────────────────────────┐ │
│ │ GATEWAY (agent.py) │ │
│ │ FastAPI :8000 │ │
│ │ │ │
│ │ POST /message │ ← all inbound │
│ │ POST /chat (legacy) │ │
│ │ GET /reply/{id} SSE │ ← CLI polling │
│ │ GET /health │ │
│ │ │ │
│ │ channels.py registry │ │
│ │ conversation buffers │ │
│ └──────────┬──────────────┘ │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ AGENT CORE │ │
│ │ three-tier routing │ │
│ │ VRAM management │ │
│ └──────────────────────┘ │
│ ↓ │
│ channels.deliver(session_id, channel, text)│
│ ↓ ↓ │
│ telegram → POST grammy/send cli → SSE queue │
└─────────────────────────────────────────────────────┘
```
## Channel Adapters
| Channel | session_id | Inbound | Outbound |
|---------|-----------|---------|---------|
| Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
| CLI | `cli-<user>` | POST /message directly | GET /reply/{id} SSE stream |
| Voice | `voice-<device>` | (future) | (future) |
## Unified Message Flow
```
1. Channel adapter receives message
2. POST /message {text, session_id, channel, user_id}
3. 202 Accepted immediately
4. Background: run_agent_task(message, session_id, channel)
5. Parallel IO (asyncio.gather):
a. _fetch_urls_from_message() — Crawl4AI fetches any URLs in message
b. _retrieve_memories() — openmemory semantic search for context
6. router.route() with enriched history (url_context + memories as system msgs)
- if URL content fetched and tier=light → upgrade to medium
7. Invoke agent for tier with url_context + memories in system prompt
8. channels.deliver(session_id, channel, reply_text)
- always puts reply in pending_replies[session_id] queue (for SSE)
- calls channel-specific send callback
9. _store_memory() background task — stores turn in openmemory
10. GET /reply/{session_id} SSE clients receive the reply
```
## Tool Handling
Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime.
**Complex agent (`/think` prefix):** `web_search` and `fetch_url` are defined as `langchain_core.tools.Tool` objects and passed to `create_deep_agent()`. The deepagents library runs an agentic loop (LangGraph `create_react_agent` under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch.
**Medium agent (default):** `_DirectModel` makes a single `model.ainvoke(messages)` call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — `qwen3:4b` behaves unreliably when a tool array is present.
**Memory tools (out-of-loop):** `add_memory` and `search_memory` are LangChain MCP tool objects (via `langchain_mcp_adapters`) but are excluded from both agents' tool lists. They are called directly — `await _memory_add_tool.ainvoke(...)` — outside the agent loop, before and after each turn.
## Three-Tier Model Routing
| Tier | Model | Agent | Trigger | Latency |
|------|-------|-------|---------|---------|
| Light | `qwen2.5:1.5b` (router answers directly) | — | Regex pre-match or LLM classifies "light" | ~24s |
| Medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | `_DirectModel` — single LLM call, no tools | Default; also forced when message contains URLs | ~1020s |
| Complex | `qwen3:8b` (`DEEPAGENTS_COMPLEX_MODEL`) | `create_deep_agent` — agentic loop with tools | `/think` prefix only | ~60120s |
**`/think` prefix**: forces complex tier, stripped before sending to agent.
Complex tier is locked out unless the message starts with `/think` — any LLM classification of "complex" is downgraded to medium.
## Crawl4AI Integration
Crawl4AI runs as a Docker service (`crawl4ai:11235`) providing JS-rendered, bot-bypass page fetching.
**Pre-routing fetch (all tiers):**
- `_URL_RE` detects `https?://` URLs in any incoming message
- `_crawl4ai_fetch_async()` uses `httpx.AsyncClient` to POST `{urls: [...]}` to `/crawl`
- Up to 3 URLs fetched concurrently via `asyncio.gather`
- Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts
- If fetch succeeds and router returns light → tier upgraded to medium
**Complex agent tools:**
- `web_search`: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page text
- `fetch_url`: Crawl4AI single-URL fetch for any specific URL
## Memory Pipeline
openmemory runs as a FastMCP server (`openmemory:8765`) backed by mem0 + Qdrant + nomic-embed-text.
**Retrieval (before routing):** `_retrieve_memories()` calls `search_memory` MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit.
**Storage (after reply):** `_store_memory()` runs as an asyncio background task, calling `add_memory` with `"User: ...\nAssistant: ..."`. The extraction LLM (`qwen2.5:1.5b` on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt.
Memory tools (`add_memory`, `search_memory`, `get_all_memories`) are excluded from agent tool lists — memory management happens outside the agent loop.
## VRAM Management
GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU).
1. Flush explicitly before loading qwen3:8b (`keep_alive=0`)
2. Verify eviction via `/api/ps` poll (15s timeout) before proceeding
3. Fallback: timeout → run medium agent instead
4. Post-complex: flush 8b, pre-warm medium + router
## Session ID Convention
- Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
- CLI: `cli-<username>` (e.g. `cli-alvis`)
Conversation history is keyed by session_id (5-turn buffer).
## Files
```
adolf/
├── docker-compose.yml Services: bifrost, deepagents, openmemory, grammy, crawl4ai
├── Dockerfile deepagents container (Python 3.12)
├── agent.py FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, memory pipeline
├── channels.py Channel registry + deliver() + pending_replies
├── router.py Router class — regex + LLM tier classification
├── vram_manager.py VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py _DirectModel (medium) / create_deep_agent (complex)
├── cli.py Interactive CLI REPL client
├── wiki_research.py Batch wiki research pipeline (uses /message + SSE)
├── .env TELEGRAM_BOT_TOKEN (not committed)
├── openmemory/
│ ├── server.py FastMCP + mem0: add_memory, search_memory, get_all_memories
│ └── Dockerfile
└── grammy/
├── bot.mjs grammY Telegram bot + POST /send HTTP endpoint
├── package.json
└── Dockerfile
```
## External Services (host ports, from openai/ stack)
| Service | Host Port | Role |
|---------|-----------|------|
| Ollama GPU | 11436 | All LLM inference (via Bifrost) + VRAM management (direct) + memory extraction |
| Ollama CPU | 11435 | nomic-embed-text embeddings for openmemory |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search (used by `web_search` tool) |