- Pre-routing URL fetch: any message with URLs gets content fetched async (httpx.AsyncClient) before routing via _fetch_urls_from_message()
- URL context and memories gathered concurrently with asyncio.gather
- Light tier upgraded to medium when URL content is present
- url_context injected into system prompt for medium and complex agents
- Complex agent retains web_search/fetch_url tools + receives pre-fetched content
- Medium model restored to qwen3:4b (was temporarily qwen2.5:1.5b)
- Unit tests added for _extract_urls
- ARCHITECTURE.md: added Tool Handling, Crawl4AI Integration, Memory Pipeline sections
- CLAUDE.md: updated request flow and Crawl4AI integration docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Commands
Start all services:
docker compose up --build
Interactive CLI (requires gateway running):
python3 cli.py [--url http://localhost:8000] [--session cli-alvis] [--timeout 400]
Run integration tests:
python3 test_pipeline.py [--chat-id CHAT_ID]
# Selective sections:
python3 test_pipeline.py --bench-only # routing + memory benchmarks only (sections 10–13)
python3 test_pipeline.py --easy-only # light-tier routing benchmark
python3 test_pipeline.py --medium-only # medium-tier routing benchmark
python3 test_pipeline.py --hard-only # complex-tier + VRAM flush benchmark
python3 test_pipeline.py --memory-only # memory store/recall/dedup benchmark
python3 test_pipeline.py --no-bench # service health + single name store/recall only
Architecture
Adolf is a multi-channel personal assistant. All LLM inference is routed through Bifrost, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.
Request flow
Channel adapter → POST /message {text, session_id, channel, user_id}
  → 202 Accepted (immediate)
  → background: run_agent_task()
      → asyncio.gather(
          _fetch_urls_from_message()  ← Crawl4AI, concurrent
          _retrieve_memories()        ← openmemory search, concurrent
        )
      → router.route() → tier decision (light/medium/complex)
          if URL content fetched → upgrade light→medium
      → invoke agent for tier via Bifrost (url_context + memories in system prompt)
          deepagents:8000 → bifrost:8080/v1 → ollama:11436
      → channels.deliver(session_id, channel, reply)
          → pending_replies[session_id] queue (SSE)
          → channel-specific callback (Telegram POST, CLI no-op)
      → _store_memory() background task (openmemory)

CLI/wiki polling → GET /reply/{session_id} (SSE, blocks until reply)
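The concurrent context-gathering step in the flow above can be sketched as follows. This is a minimal illustration, not the code in agent.py: the two coroutines are hypothetical stand-ins for _fetch_urls_from_message() and _retrieve_memories(), and their signatures and return values are assumptions.

```python
import asyncio


async def fetch_urls_from_message(text: str) -> str:
    """Hypothetical stand-in for _fetch_urls_from_message()."""
    await asyncio.sleep(0.01)  # simulate Crawl4AI latency
    return "page content" if "http" in text else ""


async def retrieve_memories(text: str, user_id: str) -> str:
    """Hypothetical stand-in for _retrieve_memories()."""
    await asyncio.sleep(0.01)  # simulate openmemory search latency
    return f"memories for {user_id}"


async def gather_context(text: str, user_id: str) -> tuple[str, str]:
    # Both coroutines run concurrently; asyncio.gather returns results
    # in the order the awaitables were passed, not completion order.
    url_context, memories = await asyncio.gather(
        fetch_urls_from_message(text),
        retrieve_memories(text, user_id),
    )
    return url_context, memories


def run_example() -> tuple[str, str]:
    return asyncio.run(gather_context("see http://example.com", "alvis"))
```

Because gather preserves argument order, the caller can unpack the results positionally without tracking which task finished first.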
Bifrost integration
Bifrost (bifrost-config.json) is configured with the ollama provider pointing to the GPU Ollama instance on host port 11436. It exposes an OpenAI-compatible API at http://bifrost:8080/v1.
agent.py uses langchain_openai.ChatOpenAI with base_url=BIFROST_URL. Model names use the provider/model format that Bifrost expects: ollama/qwen3:4b, ollama/qwen3:8b, ollama/qwen2.5:1.5b. Bifrost strips the ollama/ prefix before forwarding to Ollama.
VRAMManager bypasses Bifrost and talks directly to Ollama via OLLAMA_BASE_URL (host:11436) for flush/poll/prewarm operations — Bifrost cannot manage GPU VRAM.
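The provider/model naming convention described above can be illustrated with two small helpers. Both function names are hypothetical and exist only for this sketch; the prefix stripping shown here mirrors what the text says Bifrost does, and is not Bifrost's own code.

```python
PROVIDER = "ollama"  # provider name taken from the text above


def to_bifrost_model(ollama_tag: str, provider: str = PROVIDER) -> str:
    """Build the provider/model name sent to Bifrost (hypothetical helper)."""
    return f"{provider}/{ollama_tag}"


def to_ollama_tag(bifrost_model: str, provider: str = PROVIDER) -> str:
    """Drop the provider prefix, as the text says Bifrost does before forwarding."""
    prefix = f"{provider}/"
    if bifrost_model.startswith(prefix):
        return bifrost_model[len(prefix):]
    return bifrost_model  # already a bare Ollama tag
```

Code that talks to Ollama directly, such as VRAMManager, would use the bare tag, while anything routed through Bifrost would use the prefixed name.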
Three-tier routing (router.py, agent.py)
| Tier | Model (env var) | Trigger |
|---|---|---|
| light | qwen2.5:1.5b (DEEPAGENTS_ROUTER_MODEL) | Regex pre-match or LLM classifies "light"; answered by the router model directly, no agent invoked |
| medium | qwen3:4b (DEEPAGENTS_MODEL) | Default for tool-requiring queries |
| complex | qwen3:8b (DEEPAGENTS_COMPLEX_MODEL) | /think prefix only |
The router does regex pre-classification first, then LLM classification. Complex tier is blocked unless the message starts with /think — any LLM classification of "complex" is downgraded to medium.
A global asyncio.Semaphore(1) (_reply_semaphore) serializes all LLM inference — one request at a time.
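The routing rules above can be summarized in a short sketch. It assumes the /think check runs first and uses a made-up light-tier regex; the real patterns and ordering live in router.py and may differ. The classify callable is a hypothetical stand-in for the LLM classifier.

```python
import re
from typing import Callable

# Hypothetical light-tier pattern, for illustration only.
_LIGHT_RE = re.compile(r"^\s*(hi|hello|thanks|thank you)\b", re.IGNORECASE)


def decide_tier(
    message: str,
    classify: Callable[[str], str],
    has_url_content: bool = False,
) -> str:
    """Return "light", "medium", or "complex" following the rules in the text."""
    # Complex tier is reachable only through the /think prefix.
    if message.lstrip().startswith("/think"):
        return "complex"
    # Regex pre-classification first, then LLM classification.
    tier = "light" if _LIGHT_RE.match(message) else classify(message)
    # Any other "complex" (or unexpected) classification becomes medium.
    if tier not in ("light", "medium"):
        tier = "medium"
    # The light model cannot process page content, so upgrade.
    if tier == "light" and has_url_content:
        tier = "medium"
    return tier
```

Keeping the classifier injectable makes the tier logic testable without a running model.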
Thinking mode
qwen3 models produce chain-of-thought <think>...</think> tokens via Ollama's OpenAI-compatible endpoint. Adolf controls this via system prompt prefixes:
- Medium (qwen2.5:1.5b): no thinking mode in this model; fast ~3s calls
- Complex (qwen3:8b): no prefix; thinking enabled by default, used for deep research
- Router (qwen2.5:1.5b): no thinking support in this model
_strip_think() in agent.py and router.py strips any <think> blocks from model output before returning to users.
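A minimal sketch of this kind of stripping, assuming the blocks are well-formed and not nested. The actual _strip_think() in agent.py and router.py may handle more cases, such as an unclosed tag.

```python
import re

# Non-greedy match so several separate <think> blocks are removed individually;
# DOTALL lets "." span the newlines inside a reasoning block.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_RE.sub("", text).strip()
```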
VRAM management (vram_manager.py)
Hardware: GTX 1070 (8 GB). Before running the 8b model, medium models are flushed via Ollama keep_alive=0, then /api/ps is polled (15s timeout) to confirm eviction. On timeout, falls back to medium tier. After complex reply, 8b is flushed and medium models are pre-warmed as a background task.
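The flush-then-poll step could be structured roughly as below. This is a sketch under assumptions: list_loaded is a hypothetical callable that would wrap a GET to Ollama's /api/ps and return the names of loaded models, and the function names and poll interval are illustrative, not taken from vram_manager.py.

```python
import time
from typing import Callable, Iterable


def wait_for_eviction(
    list_loaded: Callable[[], Iterable[str]],
    models: Iterable[str],
    timeout_s: float = 15.0,   # timeout from the text
    interval_s: float = 0.5,   # assumed poll interval
) -> bool:
    """Poll until none of `models` is still loaded; False on timeout."""
    targets = set(models)
    deadline = time.monotonic() + timeout_s
    while True:
        if not targets & set(list_loaded()):
            return True   # all target models evicted
        if time.monotonic() >= deadline:
            return False  # caller should fall back
        time.sleep(interval_s)


def choose_tier_after_flush(evicted: bool) -> str:
    # On timeout the request falls back to the medium tier.
    return "complex" if evicted else "medium"
```

Injecting list_loaded keeps the polling loop free of HTTP details, so it can be exercised with a fake in tests.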
Channel adapters (channels.py)
- Telegram: grammY Node.js bot (grammy/bot.mjs) long-polls Telegram → POST /message; replies delivered via POST grammy:3001/send
- CLI: cli.py posts to /message, then blocks on GET /reply/{session_id} SSE
Session IDs: tg-<chat_id> for Telegram, cli-<username> for CLI. Conversation history: 5-turn buffer per session.
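One way to picture the session ID scheme and the bounded history is the sketch below. It assumes a turn is one (user message, reply) pair and that history is kept in memory; all names here are hypothetical and not taken from the repository.

```python
from collections import defaultdict, deque

MAX_TURNS = 5  # buffer size from the text; "turn = (user, reply) pair" is an assumption


def telegram_session_id(chat_id: int) -> str:
    return f"tg-{chat_id}"


def cli_session_id(username: str) -> str:
    return f"cli-{username}"


# One bounded deque per session; the oldest turn drops off automatically.
_history: dict = defaultdict(lambda: deque(maxlen=MAX_TURNS))


def record_turn(session_id: str, user_text: str, reply: str) -> None:
    _history[session_id].append((user_text, reply))


def get_history(session_id: str) -> list:
    return list(_history[session_id])
```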
Services (docker-compose.yml)
| Service | Port | Role |
|---|---|---|
| bifrost | 8080 | LLM gateway: retries, failover, observability; config from bifrost-config.json |
| deepagents | 8000 | FastAPI gateway + agent core |
| openmemory | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
| grammy | 3001 | grammY Telegram bot + /send HTTP endpoint |
| crawl4ai | 11235 | JS-rendered page fetching |
External (from openai/ stack, host ports):
- Ollama GPU: 11436 (all reply inference via Bifrost, plus direct VRAM management)
- Ollama CPU: 11435 (nomic-embed-text embeddings for openmemory)
- Qdrant: 6333 (vector store for memories)
- SearXNG: 11437 (web search)
Bifrost config (bifrost-config.json)
The file is mounted into the bifrost container at /app/data/config.json. It declares one Ollama provider key pointing to host.docker.internal:11436 with 2 retries and 300s timeout. To add fallback providers or adjust weights, edit this file and restart the bifrost container.
Crawl4AI integration
Crawl4AI is embedded at all levels of the pipeline:
- Pre-routing (all tiers): _fetch_urls_from_message() detects URLs in any message via _URL_RE and fetches up to 3 URLs concurrently with _crawl4ai_fetch_async() (async httpx). URL content is injected as a system context block into enriched history before routing, and into the system prompt for medium/complex agents.
- Tier upgrade: if URL content is successfully fetched, light tier is upgraded to medium (the light model cannot process page content).
- Complex agent tools: web_search (SearXNG + Crawl4AI auto-fetch of top 2 results) and fetch_url (single-URL Crawl4AI fetch) remain available for the complex agent's agentic loop. Complex tier also receives the pre-fetched content in the system prompt to avoid redundant re-fetching.
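The URL detection step in the pre-routing bullet can be sketched as follows. The regex is an assumed pattern (the real _URL_RE in agent.py may differ), and the de-duplication and trailing-punctuation trimming are additions for illustration only.

```python
import re

# Assumed pattern, not the repository's _URL_RE.
_URL_RE = re.compile(r"https?://[^\s<>\"']+")
MAX_URLS = 3  # cap from the text: up to 3 URLs are fetched


def extract_urls(text: str, limit: int = MAX_URLS) -> list[str]:
    """Return up to `limit` distinct URLs in order of first appearance."""
    found: list[str] = []
    for match in _URL_RE.findall(text):
        url = match.rstrip(".,;:!?)")  # drop sentence punctuation glued to the URL
        if url and url not in found:
            found.append(url)
        if len(found) == limit:
            break
    return found
```

The resulting list would then be passed to the concurrent fetch, with each URL fetched as its own task.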
MCP tools from openmemory (add_memory, search_memory, get_all_memories) are excluded from agent tools — memory management is handled outside the agent loop.
Medium vs Complex agent
| Agent | Builder | Speed | Use case |
|---|---|---|---|
| medium | _DirectModel (single LLM call, no tools) | ~3s | General questions, conversation |
| complex | create_deep_agent (deepagents) | Slow (multi-step planner) | Deep research via /think prefix |
Key files
- agent.py: FastAPI app, lifespan wiring, run_agent_task(), Crawl4AI pre-fetch, memory pipeline, all endpoints
- bifrost-config.json: Bifrost provider config (Ollama GPU, retries, timeouts)
- channels.py: channel registry and deliver() dispatcher
- router.py: Router class with regex + LLM classification and light-tier reply generation
- vram_manager.py: VRAMManager, which flushes, polls, and pre-warms Ollama VRAM directly
- agent_factory.py: build_medium_agent (_DirectModel, single call) / build_complex_agent (create_deep_agent)
- openmemory/server.py: FastMCP + mem0 config with custom extraction/dedup prompts
- wiki_research.py: batch research pipeline using /message + SSE polling
- grammy/bot.mjs: Telegram long-poll + HTTP /send endpoint