Alvis 8cd41940f0 Update docs: streaming, CLI container, use_cases tests

- /stream/{session_id} SSE endpoint replaces /reply/ for CLI
- Medium tier streams per-token via astream() with in_think filtering
- CLI now runs as Docker container (Dockerfile.cli, profile:tools)
- Correct medium model to qwen3:4b with real-time think block filtering
- Add use_cases/ test category to commands section
- Update files tree and services table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-12 17:31:36 +00:00

9.1 KiB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commands

Start all services:

docker compose up --build

Interactive CLI (Docker container, requires gateway running):

docker compose --profile tools run --rm -it cli
# or with options:
docker compose --profile tools run --rm -it cli python3 cli.py --url http://deepagents:8000 --session cli-alvis

Run integration tests (from tests/integration/, require all Docker services running):

python3 test_health.py                          # service health: deepagents, bifrost, Ollama, Qdrant, SearXNG

python3 test_memory.py                          # name store/recall + memory benchmark + dedup
python3 test_memory.py --name-only              # only name store/recall pipeline
python3 test_memory.py --bench-only             # only 5-fact store + 10-question recall
python3 test_memory.py --dedup-only             # only deduplication test

python3 test_routing.py                         # all routing benchmarks (easy + medium + hard)
python3 test_routing.py --easy-only             # light-tier routing benchmark
python3 test_routing.py --medium-only           # medium-tier routing benchmark
python3 test_routing.py --hard-only             # complex-tier + VRAM flush benchmark

Shared config and helpers are in tests/integration/common.py.

Use case tests (tests/use_cases/) are Markdown skill files executed by Claude Code, which acts as both mock user and quality evaluator. To run one, read the .md file and follow its steps with tools (Bash, WebFetch, etc.).

Architecture

Adolf is a multi-channel personal assistant. All reply inference is routed through Bifrost, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.

Request flow

Channel adapter → POST /message {text, session_id, channel, user_id}
                → 202 Accepted (immediate)
                → background: run_agent_task()
                    → asyncio.gather(
                        _fetch_urls_from_message()  ← Crawl4AI, concurrent
                        _retrieve_memories()         ← openmemory search, concurrent
                      )
                    → router.route() → tier decision (light/medium/complex)
                        if URL content fetched → upgrade light→medium
                    → invoke agent for tier via Bifrost (url_context + memories in system prompt)
                        deepagents:8000 → bifrost:8080/v1 → ollama:11436
                    → _push_stream_chunk() per token (medium streaming) / full reply (light, complex)
                        → _stream_queues[session_id] asyncio.Queue
                    → _end_stream() sends [DONE] sentinel
                    → channels.deliver(session_id, channel, reply)
                        → channel-specific callback (Telegram POST)
                    → _store_memory() background task (openmemory)
CLI streaming    → GET /stream/{session_id}  (SSE, per-token for medium, single-chunk for others)
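
The streaming leg of the flow above can be sketched with one asyncio.Queue per session. This is a minimal illustration, not the code in agent.py: the names `_stream_queues`, `_push_stream_chunk`, `_end_stream` and the `[DONE]` sentinel come from the flow, while `sse_events`, the `defaultdict` storage and the exact SSE framing are assumptions.

```python
import asyncio
from collections import defaultdict

DONE = "[DONE]"  # sentinel named in the request flow

# One queue per session. Using a defaultdict is an assumption made for brevity.
_stream_queues: dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)


async def _push_stream_chunk(session_id: str, chunk: str) -> None:
    """Queue one token (medium tier) or one full reply (light, complex)."""
    await _stream_queues[session_id].put(chunk)


async def _end_stream(session_id: str) -> None:
    """Signal the consumer that no more chunks will arrive."""
    await _stream_queues[session_id].put(DONE)


async def sse_events(session_id: str):
    """Hypothetical generator behind GET /stream/{session_id}.

    Yields SSE-framed strings and stops after forwarding the sentinel.
    Chunks containing newlines would need extra framing, which is omitted here.
    """
    queue = _stream_queues[session_id]
    while True:
        chunk = await queue.get()
        yield f"data: {chunk}\n\n"
        if chunk == DONE:
            break


async def demo() -> list[str]:
    await _push_stream_chunk("cli-demo", "Hel")
    await _push_stream_chunk("cli-demo", "lo")
    await _end_stream("cli-demo")
    return [event async for event in sse_events("cli-demo")]


frames = asyncio.run(demo())
```

With this shape the producer never needs to know whether a client is connected: chunks wait in the queue until the SSE handler drains them.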

Bifrost integration

Bifrost (bifrost-config.json) is configured with the ollama provider pointing to the GPU Ollama instance on host port 11436. It exposes an OpenAI-compatible API at http://bifrost:8080/v1.

agent.py uses langchain_openai.ChatOpenAI with base_url=BIFROST_URL. Model names use the provider/model format that Bifrost expects: ollama/qwen3:4b, ollama/qwen3:8b, ollama/qwen2.5:1.5b. Bifrost strips the ollama/ prefix before forwarding to Ollama.
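
The provider/model naming convention can be illustrated with two small helpers. Both function names are hypothetical and exist only for this sketch; the prefix stripping shown here mimics what the text says Bifrost does, it is not Bifrost's code.

```python
PROVIDER = "ollama"  # provider name used in the model strings above


def to_bifrost_model(ollama_name: str, provider: str = PROVIDER) -> str:
    """Build the provider/model string that is sent to Bifrost."""
    return f"{provider}/{ollama_name}"


def strip_provider(bifrost_name: str) -> str:
    """Return the bare model name, as forwarded to Ollama.

    Only the first '/' is treated as the separator, so tags such as
    'qwen2.5:1.5b' pass through unchanged.
    """
    _provider, sep, model = bifrost_name.partition("/")
    return model if sep else bifrost_name


# The resulting string would be passed as the `model` argument of
# langchain_openai.ChatOpenAI, together with base_url=BIFROST_URL.
medium_model = to_bifrost_model("qwen3:4b")
```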

VRAMManager bypasses Bifrost and talks directly to Ollama via OLLAMA_BASE_URL (host:11436) for flush/poll/prewarm operations — Bifrost cannot manage GPU VRAM.

Three-tier routing (`router.py`, `agent.py`)

| Tier | Model (env var) | Trigger |
|---|---|---|
| light | `qwen2.5:1.5b` (`DEEPAGENTS_ROUTER_MODEL`) | Regex pre-match or LLM classifies "light"; answered by the router model directly, no agent invoked |
| medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | Default for any message that is not classified light and does not start with `/think` |
| complex | `qwen3:8b` (`DEEPAGENTS_COMPLEX_MODEL`) | `/think` prefix only |

The router does regex pre-classification first, then LLM classification. Complex tier is blocked unless the message starts with /think — any LLM classification of "complex" is downgraded to medium.
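
The decision order described above can be sketched as a pure function. This is an illustration under assumptions: `_LIGHT_RE` is a made-up pattern (the real regexes live in router.py), `classify` stands in for the LLM call, and checking the `/think` prefix before the regex is a guess about ordering.

```python
import re
from typing import Callable

# Hypothetical light-tier pattern, for illustration only.
_LIGHT_RE = re.compile(r"^\s*(hi|hello|hey|thanks|thank you)\b", re.IGNORECASE)
THINK_PREFIX = "/think"


def route(message: str, classify: Callable[[str], str]) -> str:
    """Return 'light', 'medium' or 'complex' for a message.

    `classify` is a stand-in for the LLM classifier and returns a tier name.
    """
    # Complex tier is reachable only through the explicit prefix.
    if message.startswith(THINK_PREFIX):
        return "complex"
    # Cheap regex pre-classification runs before any LLM call.
    if _LIGHT_RE.search(message):
        return "light"
    tier = classify(message)
    # An LLM verdict of "complex" (or anything unexpected) becomes medium.
    return tier if tier in ("light", "medium") else "medium"


tier = route("Compare these two papers", classify=lambda _msg: "complex")
```

Keeping the classifier injectable makes the gate easy to test without a running model.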

A global asyncio.Semaphore(1) (_reply_semaphore) serializes all LLM inference — one request at a time.
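
The effect of a single-permit semaphore can be seen in a small self-contained sketch. The `infer` coroutine and its sleep are placeholders for a real LLM call; only the `asyncio.Semaphore(1)` pattern is taken from the text.

```python
import asyncio


async def demo() -> int:
    reply_semaphore = asyncio.Semaphore(1)  # same idea as _reply_semaphore
    active = 0  # requests currently inside the guarded section
    peak = 0    # highest value `active` ever reached

    async def infer(prompt: str) -> str:
        nonlocal active, peak
        async with reply_semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the model call
            active -= 1
            return prompt.upper()

    # Three requests are started together but run one at a time.
    await asyncio.gather(*(infer(p) for p in ("a", "b", "c")))
    return peak


peak_concurrency = asyncio.run(demo())
```

Requests that arrive while another is in flight simply wait at `async with`, which fits a single GPU that cannot serve two generations at once.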

Thinking mode and streaming

qwen3 models produce chain-of-thought <think>...</think> tokens. Handling differs by tier:

Medium (qwen3:4b): streams via astream(). A state machine (in_think flag) filters <think> blocks in real time — only non-think tokens are pushed to _stream_queues and displayed to the user.
Complex (qwen3:8b): create_deep_agent returns a complete reply; _strip_think() filters think blocks before the reply is pushed as a single chunk.
Router/light (qwen2.5:1.5b): no thinking support; _strip_think() used defensively.

_strip_think() in agent.py and router.py strips any <think> blocks from non-streaming output.
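
Both filtering styles can be sketched as follows. This is not the code in agent.py: the `ThinkFilter` class, its buffering of partial tags and the `strip_think` regex are assumptions about one reasonable way to implement an `in_think` state machine and a non-streaming strip.

```python
import re

OPEN, CLOSE = "<think>", "</think>"


def _partial_suffix(text: str, tag: str) -> int:
    """Length of the longest suffix of `text` that is a proper prefix of `tag`."""
    for size in range(min(len(text), len(tag) - 1), 0, -1):
        if text.endswith(tag[:size]):
            return size
    return 0


class ThinkFilter:
    """Streaming filter that drops everything between <think> and </think>."""

    def __init__(self) -> None:
        self.in_think = False
        self._buf = ""  # holds text that might be the start of a tag

    def feed(self, token: str) -> str:
        """Consume one streamed token and return the text safe to display."""
        self._buf += token
        out = []
        while self._buf:
            tag = CLOSE if self.in_think else OPEN
            idx = self._buf.find(tag)
            if idx != -1:
                if not self.in_think:
                    out.append(self._buf[:idx])
                self._buf = self._buf[idx + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: keep back a tail that could still become one.
            keep = _partial_suffix(self._buf, tag)
            cut = len(self._buf) - keep
            if not self.in_think:
                out.append(self._buf[:cut])
            self._buf = self._buf[cut:]
            break
        return "".join(out)

    def flush(self) -> str:
        """Return any held-back visible text once the stream has ended."""
        rest = "" if self.in_think else self._buf
        self._buf = ""
        return rest


def strip_think(text: str) -> str:
    """Non-streaming variant: remove complete think blocks from a reply."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


flt = ThinkFilter()
tokens = ["<thi", "nk>plan", "ning</th", "ink>Hel", "lo"]
visible = "".join(flt.feed(t) for t in tokens) + flt.flush()
```

The buffering matters because a tokenizer may split a tag across chunks; a filter that only compares whole tokens against `<think>` would leak fragments in that case.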

VRAM management (`vram_manager.py`)

Hardware: GTX 1070 (8 GB). Before the 8b model runs, the medium models are flushed by sending Ollama keep_alive=0, then /api/ps is polled (15s timeout) to confirm eviction. If the poll times out, the request falls back to the medium tier. After a complex reply, the 8b model is flushed and the medium models are pre-warmed as a background task.
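
The flush-then-poll sequence can be sketched with the standard library. This is an assumption-laden illustration rather than the VRAMManager code: the function names, the `localhost` base URL and the 0.5s poll interval are invented here, and the real class is presumably async. The two Ollama calls used (a generate request with `keep_alive` set to 0, and `GET /api/ps`) are part of Ollama's HTTP API.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable

OLLAMA_BASE_URL = "http://localhost:11436"  # assumed value for this sketch


def unload_model(model: str, base_url: str = OLLAMA_BASE_URL) -> None:
    """Ask Ollama to evict a model by sending a request with keep_alive=0."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10).close()


def loaded_models(base_url: str = OLLAMA_BASE_URL) -> set[str]:
    """Return the names of models currently resident, per /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return {m["name"] for m in json.load(resp).get("models", [])}


def wait_for_eviction(
    models: Iterable[str],
    list_loaded: Callable[[], set[str]] = loaded_models,
    timeout: float = 15.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until none of `models` is loaded; False means the poll timed out."""
    wanted = set(models)
    deadline = time.monotonic() + timeout
    while True:
        if not wanted & list_loaded():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Offline demonstration with a fake /api/ps that clears on the third poll.
_states = iter([{"qwen3:4b"}, {"qwen3:4b"}, set()])
evicted = wait_for_eviction(
    ["qwen3:4b"], list_loaded=lambda: next(_states), sleep=lambda _s: None
)
```

A caller would treat a False result as the signal to fall back to the medium tier instead of loading the 8b model.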

Channel adapters (`channels.py`)

Telegram: Grammy Node.js bot (grammy/bot.mjs) long-polls Telegram → POST /message; replies delivered via POST grammy:3001/send
CLI: cli.py (Docker container, profiles: [tools]) posts to /message, then streams from GET /stream/{session_id} SSE with Rich Live display and final Markdown render.
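
On the client side, reading the stream comes down to parsing `data:` lines until the sentinel arrives. The helper below is a hypothetical sketch, not the code in cli.py; it assumes one `data:` line per event and a `[DONE]` payload as the end marker, and it leaves out the HTTP request and the Rich display.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str], done: str = "[DONE]") -> Iterator[str]:
    """Yield the payload of each SSE data line until the sentinel is seen."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; lines starting with ':' are comments.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):]
        # The SSE format allows one optional space after the colon.
        if payload.startswith(" "):
            payload = payload[1:]
        if payload == done:
            return
        yield payload


sample = ["data: Hel\n", "\n", ": keep-alive\n", "data: lo\n", "\n", "data: [DONE]\n"]
reply = "".join(iter_sse_data(sample))
```

Any iterable of text lines works as input, so the same function can be fed from a streaming HTTP response or from a recorded fixture in a test.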

Session IDs: tg-<chat_id> for Telegram, cli-<username> for CLI. Conversation history: 5-turn buffer per session.
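
A bounded per-session history is straightforward to sketch with `collections.deque`. The storage layout, the helper names and the choice to count one user message plus one reply as a turn are all assumptions made for illustration.

```python
from collections import defaultdict, deque

MAX_TURNS = 5  # buffer size stated above

# session_id -> the most recent (user_text, reply) pairs, oldest first
_history: dict[str, deque] = defaultdict(lambda: deque(maxlen=MAX_TURNS))

_PREFIXES = {"telegram": "tg", "cli": "cli"}


def session_id_for(channel: str, ident: str) -> str:
    """Build a session ID such as tg-<chat_id> or cli-<username>."""
    return f"{_PREFIXES[channel]}-{ident}"


def remember_turn(session_id: str, user_text: str, reply: str) -> None:
    """Append a turn; the deque silently drops the oldest one when full."""
    _history[session_id].append((user_text, reply))


sid = session_id_for("cli", "alvis")
for i in range(7):
    remember_turn(sid, f"question {i}", f"answer {i}")

recent = list(_history[sid])
```

After seven turns only the last five remain, so the prompt size stays bounded however long a conversation runs.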

Services (`docker-compose.yml`)

| Service | Port | Role |
|---|---|---|
| `bifrost` | 8080 | LLM gateway: retries, failover, observability; config from `bifrost-config.json` |
| `deepagents` | 8000 | FastAPI gateway + agent core |
| `openmemory` | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
| `grammy` | 3001 | grammY Telegram bot + `/send` HTTP endpoint |
| `crawl4ai` | 11235 | JS-rendered page fetching |
| `cli` | none | Interactive CLI container (`profiles: [tools]`), Rich streaming display |

External (from openai/ stack, host ports):

Ollama GPU: 11436 — all reply inference (via Bifrost) + VRAM management (direct)
Ollama CPU: 11435 — nomic-embed-text embeddings for openmemory
Qdrant: 6333 — vector store for memories
SearXNG: 11437 — web search

Bifrost config (`bifrost-config.json`)

The file is mounted into the bifrost container at /app/data/config.json. It declares one Ollama provider key pointing to host.docker.internal:11436 with 2 retries and 300s timeout. To add fallback providers or adjust weights, edit this file and restart the bifrost container.

Crawl4AI integration

Crawl4AI is used at three points in the pipeline:

Pre-routing (all tiers): _fetch_urls_from_message() detects URLs in any message via _URL_RE, fetches up to 3 URLs concurrently with _crawl4ai_fetch_async() (async httpx). URL content is injected as a system context block into enriched history before routing, and into the system prompt for medium/complex agents.
Tier upgrade: if URL content is successfully fetched, light tier is upgraded to medium (light model cannot process page content).
Complex agent tools: web_search (SearXNG + Crawl4AI auto-fetch of top 2 results) and fetch_url (single-URL Crawl4AI fetch) remain available for the complex agent's agentic loop. Complex tier also receives the pre-fetched content in system prompt to avoid redundant re-fetching.
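
The pre-routing fetch and the tier upgrade can be sketched like this. The regex, the `fetch` callback (a stand-in for `_crawl4ai_fetch_async()`), the context block format and the `upgrade_tier` helper are assumptions; only the three-URL cap, the concurrency and the light-to-medium rule come from the list above.

```python
import asyncio
import re
from typing import Awaitable, Callable

_URL_RE = re.compile(r"https?://[^\s<>\"']+")  # illustrative pattern only
MAX_URLS = 3


async def fetch_urls_from_message(
    text: str, fetch: Callable[[str], Awaitable[str]]
) -> str:
    """Fetch up to MAX_URLS distinct URLs concurrently and join the results."""
    urls = list(dict.fromkeys(_URL_RE.findall(text)))[:MAX_URLS]
    if not urls:
        return ""
    # return_exceptions=True keeps one failed fetch from discarding the rest.
    results = await asyncio.gather(
        *(fetch(url) for url in urls), return_exceptions=True
    )
    blocks = [
        f"### {url}\n{body}"
        for url, body in zip(urls, results)
        if isinstance(body, str) and body
    ]
    return "\n\n".join(blocks)


def upgrade_tier(tier: str, url_context: str) -> str:
    """Light is bumped to medium only when page content was actually fetched."""
    return "medium" if tier == "light" and url_context else tier


async def fake_fetch(url: str) -> str:
    """Stub fetcher so the sketch runs without a Crawl4AI service."""
    if "bad" in url:
        raise RuntimeError("fetch failed")
    return f"content of {url}"


message = (
    "see https://a.example/x and https://bad.example "
    "and https://b.example/y and https://c.example/z"
)
url_context = asyncio.run(fetch_urls_from_message(message, fake_fetch))
final_tier = upgrade_tier("light", url_context)
```

Because the upgrade depends on the fetched content rather than on the mere presence of a URL, a message whose links all fail stays on the light tier.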

MCP tools from openmemory (add_memory, search_memory, get_all_memories) are excluded from agent tools — memory management is handled outside the agent loop.

Medium vs Complex agent

| Agent | Builder | Speed | Use case |
|---|---|---|---|
| medium | `_DirectModel` (single LLM call, no tools) | ~3s | General questions, conversation |
| complex | `create_deep_agent` (deepagents) | Slow (multi-step planner) | Deep research via `/think` prefix |

Key files

agent.py — FastAPI app, lifespan wiring, run_agent_task(), Crawl4AI pre-fetch, memory pipeline, all endpoints
bifrost-config.json — Bifrost provider config (Ollama GPU, retries, timeouts)
channels.py — channel registry and deliver() dispatcher
router.py — Router class: regex + LLM classification, light-tier reply generation
vram_manager.py — VRAMManager: flush/poll/prewarm Ollama VRAM directly
agent_factory.py — build_medium_agent (_DirectModel, single call) / build_complex_agent (create_deep_agent)
openmemory/server.py — FastMCP + mem0 config with custom extraction/dedup prompts
wiki_research.py — batch research pipeline using /message + SSE polling
grammy/bot.mjs — Telegram long-poll + HTTP /send endpoint
