Files

Alvis eba805f787 Update docs: fast tools, routecheck service, commute tool

- Request flow: add fast_tool_runner.run_matching() to pre-flight gather
- New Fast Tools section: WeatherTool + CommuteTool table, extension guide
- New routecheck section: captcha UI, internal API, proxy requirements
- Services table: add routecheck:8090
- Files tree: add fast_tools.py, routecheck/, updated .env note

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-13 07:10:30 +00:00

11 KiB

Raw Blame History

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commands

Start all services:

docker compose up --build

Interactive CLI (Docker container, requires gateway running):

docker compose --profile tools run --rm -it cli
# or with options:
docker compose --profile tools run --rm -it cli python3 cli.py --url http://deepagents:8000 --session cli-alvis

Run integration tests (from tests/integration/, require all Docker services running):

python3 test_health.py                          # service health: deepagents, bifrost, Ollama, Qdrant, SearXNG

python3 test_memory.py                          # name store/recall + memory benchmark + dedup
python3 test_memory.py --name-only              # only name store/recall pipeline
python3 test_memory.py --bench-only             # only 5-fact store + 10-question recall
python3 test_memory.py --dedup-only             # only deduplication test

python3 test_routing.py                         # all routing benchmarks (easy + medium + hard)
python3 test_routing.py --easy-only             # light-tier routing benchmark
python3 test_routing.py --medium-only           # medium-tier routing benchmark
python3 test_routing.py --hard-only             # complex-tier + VRAM flush benchmark

Shared config and helpers are in tests/integration/common.py.

Use case tests (tests/use_cases/) — markdown skill files executed by Claude Code, which acts as mock user and quality evaluator. Run by reading the .md file and following its steps with tools (Bash, WebFetch, etc.).

Architecture

Adolf is a multi-channel personal assistant. All LLM inference is routed through Bifrost, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.

Request flow

Channel adapter → POST /message {text, session_id, channel, user_id}
                → 202 Accepted (immediate)
                → background: run_agent_task()
                    → asyncio.gather(
                        _fetch_urls_from_message()          ← Crawl4AI, concurrent
                        _retrieve_memories()                 ← openmemory search, concurrent
                        _fast_tool_runner.run_matching()     ← FastTools (weather, commute), concurrent
                      )
                    → router.route() → tier decision (light/medium/complex)
                        fast tool match → force medium
                        if URL content fetched → upgrade light→medium
                    → invoke agent for tier via Bifrost (url_context + memories in system prompt)
                        deepagents:8000 → bifrost:8080/v1 → ollama:11436
                    → _push_stream_chunk() per token (medium streaming) / full reply (light, complex)
                        → _stream_queues[session_id] asyncio.Queue
                    → _end_stream() sends [DONE] sentinel
                    → channels.deliver(session_id, channel, reply)
                        → channel-specific callback (Telegram POST)
                    → _store_memory() background task (openmemory)
CLI streaming    → GET /stream/{session_id}  (SSE, per-token for medium, single-chunk for others)

Bifrost integration

Bifrost (bifrost-config.json) is configured with the ollama provider pointing to the GPU Ollama instance on host port 11436. It exposes an OpenAI-compatible API at http://bifrost:8080/v1.

agent.py uses langchain_openai.ChatOpenAI with base_url=BIFROST_URL. Model names use the provider/model format that Bifrost expects: ollama/qwen3:4b, ollama/qwen3:8b, ollama/qwen2.5:1.5b. Bifrost strips the ollama/ prefix before forwarding to Ollama.

VRAMManager bypasses Bifrost and talks directly to Ollama via OLLAMA_BASE_URL (host:11436) for flush/poll/prewarm operations — Bifrost cannot manage GPU VRAM.

Three-tier routing (`router.py`, `agent.py`)

Tier	Model (env var)	Trigger
light	`qwen2.5:1.5b` (`DEEPAGENTS_ROUTER_MODEL`)	Regex pre-match or LLM classifies "light" — answered by router model directly, no agent invoked
medium	`qwen3:4b` (`DEEPAGENTS_MODEL`)	Default for tool-requiring queries
complex	`qwen3:8b` (`DEEPAGENTS_COMPLEX_MODEL`)	`/think` prefix only

The router does regex pre-classification first, then LLM classification. Complex tier is blocked unless the message starts with /think — any LLM classification of "complex" is downgraded to medium.

A global asyncio.Semaphore(1) (_reply_semaphore) serializes all LLM inference — one request at a time.

Thinking mode and streaming

qwen3 models produce chain-of-thought <think>...</think> tokens. Handling differs by tier:

Medium (qwen3:4b): streams via astream(). A state machine (in_think flag) filters <think> blocks in real time — only non-think tokens are pushed to _stream_queues and displayed to the user.
Complex (qwen3:8b): create_deep_agent returns a complete reply; _strip_think() filters think blocks before the reply is pushed as a single chunk.
Router/light (qwen2.5:1.5b): no thinking support; _strip_think() used defensively.

_strip_think() in agent.py and router.py strips any <think> blocks from non-streaming output.

VRAM management (`vram_manager.py`)

Hardware: GTX 1070 (8 GB). Before running the 8b model, medium models are flushed via Ollama keep_alive=0, then /api/ps is polled (15s timeout) to confirm eviction. On timeout, falls back to medium tier. After complex reply, 8b is flushed and medium models are pre-warmed as a background task.

Channel adapters (`channels.py`)

Telegram: Grammy Node.js bot (grammy/bot.mjs) long-polls Telegram → POST /message; replies delivered via POST grammy:3001/send
CLI: cli.py (Docker container, profiles: [tools]) posts to /message, then streams from GET /stream/{session_id} SSE with Rich Live display and final Markdown render.

Session IDs: tg-<chat_id> for Telegram, cli-<username> for CLI. Conversation history: 5-turn buffer per session.

Services (`docker-compose.yml`)

Service	Port	Role
`bifrost`	8080	LLM gateway — retries, failover, observability; config from `bifrost-config.json`
`deepagents`	8000	FastAPI gateway + agent core
`openmemory`	8765	FastMCP server + mem0 memory tools (Qdrant-backed)
`grammy`	3001	grammY Telegram bot + `/send` HTTP endpoint
`crawl4ai`	11235	JS-rendered page fetching
`routecheck`	8090	Local routing web service — image captcha UI + Yandex Routing API backend
`cli`	—	Interactive CLI container (`profiles: [tools]`), Rich streaming display

External (from openai/ stack, host ports):

Ollama GPU: 11436 — all reply inference (via Bifrost) + VRAM management (direct)
Ollama CPU: 11435 — nomic-embed-text embeddings for openmemory
Qdrant: 6333 — vector store for memories
SearXNG: 11437 — web search

Bifrost config (`bifrost-config.json`)

The file is mounted into the bifrost container at /app/data/config.json. It declares one Ollama provider key pointing to host.docker.internal:11436 with 2 retries and 300s timeout. To add fallback providers or adjust weights, edit this file and restart the bifrost container.

Crawl4AI integration

Crawl4AI is embedded at all levels of the pipeline:

Pre-routing (all tiers): _fetch_urls_from_message() detects URLs in any message via _URL_RE, fetches up to 3 URLs concurrently with _crawl4ai_fetch_async() (async httpx). URL content is injected as a system context block into enriched history before routing, and into the system prompt for medium/complex agents.
Tier upgrade: if URL content is successfully fetched, light tier is upgraded to medium (light model cannot process page content).
Complex agent tools: web_search (SearXNG + Crawl4AI auto-fetch of top 2 results) and fetch_url (single-URL Crawl4AI fetch) remain available for the complex agent's agentic loop. Complex tier also receives the pre-fetched content in system prompt to avoid redundant re-fetching.

MCP tools from openmemory (add_memory, search_memory, get_all_memories) are excluded from agent tools — memory management is handled outside the agent loop.

Fast Tools (`fast_tools.py`)

Pre-flight tools that run before the LLM in the asyncio.gather alongside URL fetch and memory retrieval. Each tool has a regex matches() classifier and an async run() that returns a context string injected into the system prompt. The router uses FastToolRunner.any_matches() to force medium tier when a tool matches.

Tool	Trigger	Data source
`WeatherTool`	weather/forecast/temperature keywords	SearXNG query `"погода Балашиха сейчас"` — Russian sources return °C
`CommuteTool`	commute/traffic/arrival time keywords	`routecheck:8090/api/route` — Yandex Routing API, Balashikha→Moscow center

To add a new fast tool: subclass FastTool in fast_tools.py, add an instance to _fast_tool_runner in agent.py.

`routecheck` service (`routecheck/app.py`)

Local web service that exposes Yandex Routing API behind an image captcha. Two access paths:

Web UI (localhost:8090): solve PIL-generated arithmetic captcha → query any two lat/lon points
Internal API: GET /api/route?from=lat,lon&to=lat,lon&token=ROUTECHECK_TOKEN — bypasses captcha, used by CommuteTool

Requires .env: YANDEX_ROUTING_KEY (free tier from developer.tech.yandex.ru) and ROUTECHECK_TOKEN. The container routes Yandex API calls through the host HTTPS proxy (host.docker.internal:56928).

Medium vs Complex agent

Agent	Builder	Speed	Use case
medium	`_DirectModel` (single LLM call, no tools)	~3s	General questions, conversation
complex	`create_deep_agent` (deepagents)	Slow — multi-step planner	Deep research via `/think` prefix

Key files

agent.py — FastAPI app, lifespan wiring, run_agent_task(), Crawl4AI pre-fetch, fast tools, memory pipeline, all endpoints
fast_tools.py — FastTool base class, FastToolRunner, WeatherTool, CommuteTool
routecheck/app.py — captcha UI + /api/route Yandex proxy
bifrost-config.json — Bifrost provider config (Ollama GPU, retries, timeouts)
channels.py — channel registry and deliver() dispatcher
router.py — Router class: regex + LLM classification, light-tier reply generation
vram_manager.py — VRAMManager: flush/poll/prewarm Ollama VRAM directly
agent_factory.py — build_medium_agent (_DirectModel, single call) / build_complex_agent (create_deep_agent)
openmemory/server.py — FastMCP + mem0 config with custom extraction/dedup prompts
wiki_research.py — batch research pipeline using /message + SSE polling
grammy/bot.mjs — Telegram long-poll + HTTP /send endpoint

11 KiB Raw Blame History