- /stream/{session_id} SSE endpoint replaces /reply/ for CLI
- Medium tier streams per-token via astream() with in_think filtering
- CLI now runs as Docker container (Dockerfile.cli, profile:tools)
- Correct medium model to qwen3:4b with real-time think block filtering
- Add use_cases/ test category to commands section
- Update files tree and services table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Commands
Start all services:
docker compose up --build
Interactive CLI (Docker container, requires gateway running):
docker compose --profile tools run --rm -it cli
# or with options:
docker compose --profile tools run --rm -it cli python3 cli.py --url http://deepagents:8000 --session cli-alvis
Run integration tests (from tests/integration/, require all Docker services running):
python3 test_health.py # service health: deepagents, bifrost, Ollama, Qdrant, SearXNG
python3 test_memory.py # name store/recall + memory benchmark + dedup
python3 test_memory.py --name-only # only name store/recall pipeline
python3 test_memory.py --bench-only # only 5-fact store + 10-question recall
python3 test_memory.py --dedup-only # only deduplication test
python3 test_routing.py # all routing benchmarks (easy + medium + hard)
python3 test_routing.py --easy-only # light-tier routing benchmark
python3 test_routing.py --medium-only # medium-tier routing benchmark
python3 test_routing.py --hard-only # complex-tier + VRAM flush benchmark
Shared config and helpers are in tests/integration/common.py.
Use case tests (tests/use_cases/) — markdown skill files executed by Claude Code, which acts as mock user and quality evaluator. Run by reading the .md file and following its steps with tools (Bash, WebFetch, etc.).
Architecture
Adolf is a multi-channel personal assistant. All LLM inference is routed through Bifrost, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.
Request flow
Channel adapter → POST /message {text, session_id, channel, user_id}
→ 202 Accepted (immediate)
→ background: run_agent_task()
→ asyncio.gather(
_fetch_urls_from_message() ← Crawl4AI, concurrent
_retrieve_memories() ← openmemory search, concurrent
)
→ router.route() → tier decision (light/medium/complex)
if URL content fetched → upgrade light→medium
→ invoke agent for tier via Bifrost (url_context + memories in system prompt)
deepagents:8000 → bifrost:8080/v1 → ollama:11436
→ _push_stream_chunk() per token (medium streaming) / full reply (light, complex)
→ _stream_queues[session_id] asyncio.Queue
→ _end_stream() sends [DONE] sentinel
→ channels.deliver(session_id, channel, reply)
→ channel-specific callback (Telegram POST)
→ _store_memory() background task (openmemory)
CLI streaming → GET /stream/{session_id} (SSE, per-token for medium, single-chunk for others)
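The streaming half of this flow is a producer/consumer pair around one queue per session. The following is a minimal sketch of that pattern, not the code in agent.py: the function names mirror `_push_stream_chunk()` and `_end_stream()`, but the queue layout, the `data:` framing, and the `demo()` driver are assumptions made for illustration.

```python
import asyncio
from collections import defaultdict

DONE = "[DONE]"  # sentinel named in the flow above

# One queue per session. The name mirrors _stream_queues; the layout is assumed.
stream_queues: dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)


async def push_stream_chunk(session_id: str, chunk: str) -> None:
    """Producer: enqueue one token (medium) or one full reply (light, complex)."""
    await stream_queues[session_id].put(chunk)


async def end_stream(session_id: str) -> None:
    """Producer: signal that no more chunks follow for this reply."""
    await stream_queues[session_id].put(DONE)


async def sse_events(session_id: str):
    """Consumer: yield SSE-style frames until the sentinel has been sent."""
    queue = stream_queues[session_id]
    while True:
        chunk = await queue.get()
        yield f"data: {chunk}\n\n"
        if chunk == DONE:
            break


async def demo() -> list[str]:
    # Hypothetical driver: two tokens, then the end-of-stream marker.
    for token in ("Hel", "lo"):
        await push_stream_chunk("cli-alvis", token)
    await end_stream("cli-alvis")
    return [frame async for frame in sse_events("cli-alvis")]


frames = asyncio.run(demo())
```

Because light and complex replies arrive as a single chunk, the same consumer loop serves all three tiers; only the number of frames before the sentinel differs.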
Bifrost integration
Bifrost (bifrost-config.json) is configured with the ollama provider pointing to the GPU Ollama instance on host port 11436. It exposes an OpenAI-compatible API at http://bifrost:8080/v1.
agent.py uses langchain_openai.ChatOpenAI with base_url=BIFROST_URL. Model names use the provider/model format that Bifrost expects: ollama/qwen3:4b, ollama/qwen3:8b, ollama/qwen2.5:1.5b. Bifrost strips the ollama/ prefix before forwarding to Ollama.
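The naming convention can be sketched with two small helpers. Both functions are hypothetical and exist only to illustrate the `provider/model` format; Bifrost performs the prefix stripping itself.

```python
PROVIDER = "ollama"  # the single provider configured in bifrost-config.json


def to_bifrost_model(ollama_tag: str, provider: str = PROVIDER) -> str:
    """Build the provider/model name passed to the OpenAI-compatible client."""
    return f"{provider}/{ollama_tag}"


def strip_provider(model: str) -> str:
    """Illustrates what Bifrost is described as doing before forwarding."""
    _, sep, tag = model.partition("/")
    return tag if sep else model


# With langchain_openai installed, the resulting name would be used roughly as:
#   ChatOpenAI(model=to_bifrost_model("qwen3:4b"), base_url=BIFROST_URL, api_key="unused")
# The api_key value is an assumption; whether Bifrost checks it is not stated here.
medium_model = to_bifrost_model("qwen3:4b")
```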
VRAMManager bypasses Bifrost and talks directly to Ollama via OLLAMA_BASE_URL (host:11436) for flush/poll/prewarm operations — Bifrost cannot manage GPU VRAM.
Three-tier routing (router.py, agent.py)
| Tier | Model (env var) | Trigger |
|---|---|---|
| light | `qwen2.5:1.5b` (`DEEPAGENTS_ROUTER_MODEL`) | Regex pre-match or LLM classifies "light"; answered by router model directly, no agent invoked |
| medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | Default for tool-requiring queries |
| complex | `qwen3:8b` (`DEEPAGENTS_COMPLEX_MODEL`) | `/think` prefix only |
The router does regex pre-classification first, then LLM classification. Complex tier is blocked unless the message starts with /think — any LLM classification of "complex" is downgraded to medium.
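The decision rules can be condensed into one function. This is a sketch under stated assumptions, not the `Router` class itself: `_LIGHT_RE` is a made-up pattern, `classify` stands in for the LLM call, and checking `/think` before the regex is an assumed ordering.

```python
import re
from typing import Callable

# Hypothetical pattern; the real pre-match regex in router.py is not shown here.
_LIGHT_RE = re.compile(r"^\s*(hi|hello|hey|thanks|thank you)\b", re.IGNORECASE)


def route(
    message: str,
    classify: Callable[[str], str],
    url_content_fetched: bool = False,
) -> str:
    """Return "light", "medium", or "complex" for a message."""
    # Complex is reachable only through the explicit prefix.
    if message.startswith("/think"):
        return "complex"

    if _LIGHT_RE.match(message):
        tier = "light"  # regex pre-classification, no LLM call needed
    else:
        tier = classify(message)  # LLM classification
        if tier != "light":
            tier = "medium"  # "complex" (or anything unexpected) is downgraded

    # The light model cannot process page content, so fetched URLs force medium.
    if tier == "light" and url_content_fetched:
        tier = "medium"
    return tier
```

For example, `route("plan a trip", lambda m: "complex")` yields `"medium"`, since the classifier alone can never select the complex tier.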
A global asyncio.Semaphore(1) (_reply_semaphore) serializes all LLM inference — one request at a time.
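The effect of a one-slot semaphore is easiest to see with a few concurrent callers. In this sketch `guarded_inference` and `demo` are hypothetical names and `asyncio.sleep` stands in for the model call.

```python
import asyncio


async def guarded_inference(sem: asyncio.Semaphore, name: str, log: list) -> None:
    """Run one stand-in inference call while holding the single slot."""
    async with sem:
        log.append(f"{name}:start")
        await asyncio.sleep(0.01)  # placeholder for the actual LLM request
        log.append(f"{name}:end")


async def demo() -> list:
    sem = asyncio.Semaphore(1)  # one request at a time
    log: list = []
    await asyncio.gather(*(guarded_inference(sem, n, log) for n in ("a", "b", "c")))
    return log


log = asyncio.run(demo())
```

Every `start` entry in `log` is followed directly by the matching `end`, which shows that the three requests never overlap even though they were launched together.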
Thinking mode and streaming
qwen3 models produce chain-of-thought <think>...</think> tokens. Handling differs by tier:
- Medium (`qwen3:4b`): streams via `astream()`. A state machine (`in_think` flag) filters `<think>` blocks in real time; only non-think tokens are pushed to `_stream_queues` and displayed to the user.
- Complex (`qwen3:8b`): `create_deep_agent` returns a complete reply; `_strip_think()` filters think blocks before the reply is pushed as a single chunk.
- Router/light (`qwen2.5:1.5b`): no thinking support; `_strip_think()` used defensively.
_strip_think() in agent.py and router.py strips any <think> blocks from non-streaming output.
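Both filtering styles can be sketched in a few lines. The class and helper names below are hypothetical, and buffering partial tags across tokens is an assumption about how a streaming filter would need to behave, since a tag may be split between chunks; the code in agent.py may differ.

```python
import re

OPEN, CLOSE = "<think>", "</think>"


def strip_think(text: str) -> str:
    """Non-streaming variant: remove every complete think block."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)


def _partial_suffix(text: str, tag: str) -> int:
    """Length of the longest proper prefix of tag that text ends with."""
    for k in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:k]):
            return k
    return 0


class ThinkFilter:
    """Streaming variant: drop think spans token by token."""

    def __init__(self) -> None:
        self.in_think = False
        self._buf = ""  # text that might still turn out to be part of a tag

    def feed(self, token: str) -> str:
        self._buf += token
        out = []
        while self._buf:
            tag = CLOSE if self.in_think else OPEN
            idx = self._buf.find(tag)
            if idx != -1:
                if not self.in_think:
                    out.append(self._buf[:idx])  # visible text before the tag
                self._buf = self._buf[idx + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag yet: hold back only a tail that could become one.
            keep = _partial_suffix(self._buf, tag)
            cut = len(self._buf) - keep
            if not self.in_think:
                out.append(self._buf[:cut])
            self._buf = self._buf[cut:]
            break
        return "".join(out)

    def flush(self) -> str:
        """Call at end of stream: emit a held-back tail that never became a tag."""
        tail, self._buf = ("" if self.in_think else self._buf), ""
        return tail


think_filter = ThinkFilter()
tokens = ["Hi ", "<th", "ink>secret", "</think>", "there"]
visible = "".join(think_filter.feed(t) for t in tokens) + think_filter.flush()
```

Only the text returned by `feed()` would be pushed to the stream queue, so the user never sees a partial tag flash by.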
VRAM management (vram_manager.py)
Hardware: GTX 1070 (8 GB). Before the 8b model runs, medium models are flushed via Ollama keep_alive=0, then /api/ps is polled (15s timeout) to confirm eviction. If the poll times out, the request falls back to the medium tier. After a complex reply, the 8b model is flushed and medium models are pre-warmed as a background task.
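The flush-then-poll sequence might look roughly like the sketch below. It uses Ollama's documented `keep_alive: 0` unload request and the `/api/ps` listing, but the function names, the base URL default, and the polling interval are assumptions and not the actual `VRAMManager` interface.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable

OLLAMA_BASE_URL = "http://localhost:11436"  # assumption; normally read from the environment


def unload(model: str, base_url: str = OLLAMA_BASE_URL) -> None:
    """Ask Ollama to evict a model via a generate request with keep_alive=0."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10).close()


def loaded_models(base_url: str = OLLAMA_BASE_URL) -> set:
    """Names of models currently resident, as reported by GET /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return {m["name"] for m in json.load(resp).get("models", [])}


def wait_for_eviction(
    models: Iterable[str],
    list_loaded: Callable[[], set] = loaded_models,
    timeout: float = 15.0,
    interval: float = 0.5,
) -> bool:
    """Poll until none of `models` is loaded; return False if the timeout expires."""
    pending = set(models)
    deadline = time.monotonic() + timeout
    while True:
        if not pending & list_loaded():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would invoke `unload()` for each medium model, then treat a `False` result from `wait_for_eviction()` as the signal to answer on the medium tier instead. Passing `list_loaded` as a parameter keeps the polling logic testable without a running Ollama instance.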
Channel adapters (channels.py)
- Telegram: grammY Node.js bot (`grammy/bot.mjs`) long-polls Telegram → `POST /message`; replies delivered via `POST grammy:3001/send`
- CLI: `cli.py` (Docker container, `profiles: [tools]`) posts to `/message`, then streams from `GET /stream/{session_id}` SSE with Rich `Live` display and final Markdown render.
Session IDs: tg-<chat_id> for Telegram, cli-<username> for CLI. Conversation history: 5-turn buffer per session.
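A bounded per-session buffer of this kind is commonly built on `collections.deque`. The sketch below is one such arrangement; the helper names and the `"telegram"`/`"cli"` channel keys are assumptions, and only the ID prefixes and the 5-turn limit come from the description above.

```python
from collections import defaultdict, deque

MAX_TURNS = 5  # turns kept per session

# session_id -> the most recent (user_text, reply) pairs, oldest first
_history = defaultdict(lambda: deque(maxlen=MAX_TURNS))

_PREFIX = {"telegram": "tg", "cli": "cli"}  # channel keys are hypothetical


def session_id_for(channel: str, ident: str) -> str:
    """Build tg-<chat_id> or cli-<username>."""
    return f"{_PREFIX[channel]}-{ident}"


def record_turn(session_id: str, user_text: str, reply: str) -> None:
    """Append a turn; the deque silently drops the oldest one when full."""
    _history[session_id].append((user_text, reply))


def history(session_id: str) -> list:
    return list(_history[session_id])


sid = session_id_for("cli", "alvis")
for i in range(7):
    record_turn(sid, f"question {i}", f"answer {i}")
```

After seven turns the buffer holds turns 2 through 6, so prompt size stays bounded regardless of how long a conversation runs.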
Services (docker-compose.yml)
| Service | Port | Role |
|---|---|---|
| `bifrost` | 8080 | LLM gateway: retries, failover, observability; config from `bifrost-config.json` |
| `deepagents` | 8000 | FastAPI gateway + agent core |
| `openmemory` | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
| `grammy` | 3001 | grammY Telegram bot + `/send` HTTP endpoint |
| `crawl4ai` | 11235 | JS-rendered page fetching |
| `cli` | n/a | Interactive CLI container (`profiles: [tools]`), Rich streaming display |
External (from openai/ stack, host ports):
- Ollama GPU: `11436` (all reply inference via Bifrost, VRAM management direct)
- Ollama CPU: `11435` (nomic-embed-text embeddings for openmemory)
- Qdrant: `6333` (vector store for memories)
- SearXNG: `11437` (web search)
Bifrost config (bifrost-config.json)
The file is mounted into the bifrost container at /app/data/config.json. It declares one Ollama provider key pointing to host.docker.internal:11436 with 2 retries and 300s timeout. To add fallback providers or adjust weights, edit this file and restart the bifrost container.
Crawl4AI integration
Crawl4AI is embedded at all levels of the pipeline:
- Pre-routing (all tiers): `_fetch_urls_from_message()` detects URLs in any message via `_URL_RE`, fetches up to 3 URLs concurrently with `_crawl4ai_fetch_async()` (async httpx). URL content is injected as a system context block into enriched history before routing, and into the system prompt for medium/complex agents.
- Tier upgrade: if URL content is successfully fetched, light tier is upgraded to medium (light model cannot process page content).
- Complex agent tools: `web_search` (SearXNG + Crawl4AI auto-fetch of top 2 results) and `fetch_url` (single-URL Crawl4AI fetch) remain available for the complex agent's agentic loop. Complex tier also receives the pre-fetched content in system prompt to avoid redundant re-fetching.
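The pre-routing step (detect, cap at three, fetch concurrently) can be sketched as follows. `URL_RE` is a simplified stand-in for `_URL_RE`, and `fetch` is an injected coroutine standing in for `_crawl4ai_fetch_async()`; neither reflects the actual pattern or the Crawl4AI request format.

```python
import asyncio
import re
from typing import Awaitable, Callable

URL_RE = re.compile(r"https?://[^\s<>]+")  # hypothetical stand-in for _URL_RE
MAX_URLS = 3


def extract_urls(message: str, limit: int = MAX_URLS) -> list:
    """Return up to `limit` distinct URLs in order of appearance."""
    found = []
    for match in URL_RE.finditer(message):
        url = match.group(0).rstrip(".,;:!?)\"'")  # drop trailing punctuation
        if url not in found:
            found.append(url)
        if len(found) == limit:
            break
    return found


async def fetch_all(urls: list, fetch: Callable[[str], Awaitable[str]]) -> dict:
    """Fetch all URLs concurrently and keep only the ones that returned text."""
    results = await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)
    return {u: r for u, r in zip(urls, results) if isinstance(r, str) and r}


async def fake_fetch(url: str) -> str:
    # Test double: one host fails, the rest return placeholder content.
    if "bad" in url:
        raise RuntimeError("fetch failed")
    return f"content of {url}"


urls = extract_urls(
    "see https://a.example/x, https://b.example and https://a.example/x "
    "plus http://c.example http://d.example"
)
pages = asyncio.run(fetch_all(["https://ok.example", "https://bad.example"], fake_fetch))
```

A non-empty `pages` dict corresponds to the "URL content is successfully fetched" condition that upgrades light to medium; a failed fetch simply leaves that URL out.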
MCP tools from openmemory (add_memory, search_memory, get_all_memories) are excluded from agent tools — memory management is handled outside the agent loop.
Medium vs Complex agent
| Agent | Builder | Speed | Use case |
|---|---|---|---|
| medium | `_DirectModel` (single LLM call, no tools) | ~3s | General questions, conversation |
| complex | `create_deep_agent` (deepagents) | Slow (multi-step planner) | Deep research via `/think` prefix |
Key files
- `agent.py`: FastAPI app, lifespan wiring, `run_agent_task()`, Crawl4AI pre-fetch, memory pipeline, all endpoints
- `bifrost-config.json`: Bifrost provider config (Ollama GPU, retries, timeouts)
- `channels.py`: channel registry and `deliver()` dispatcher
- `router.py`: `Router` class: regex + LLM classification, light-tier reply generation
- `vram_manager.py`: `VRAMManager`: flush/poll/prewarm Ollama VRAM directly
- `agent_factory.py`: `build_medium_agent` (`_DirectModel`, single call) / `build_complex_agent` (`create_deep_agent`)
- `openmemory/server.py`: FastMCP + mem0 config with custom extraction/dedup prompts
- `wiki_research.py`: batch research pipeline using `/message` + SSE polling
- `grammy/bot.mjs`: Telegram long-poll + HTTP `/send` endpoint