Files
adolf/CLAUDE.md
Alvis 50097d6092 Embed Crawl4AI at all tiers, restore qwen3:4b medium, update docs
- Pre-routing URL fetch: any message with URLs gets content fetched
  async (httpx.AsyncClient) before routing via _fetch_urls_from_message()
- URL context and memories gathered concurrently with asyncio.gather
- Light tier upgraded to medium when URL content is present
- url_context injected into system prompt for medium and complex agents
- Complex agent retains web_search/fetch_url tools + receives pre-fetched content
- Medium model restored to qwen3:4b (was temporarily qwen2.5:1.5b)
- Unit tests added for _extract_urls
- ARCHITECTURE.md: added Tool Handling, Crawl4AI Integration, Memory Pipeline sections
- CLAUDE.md: updated request flow and Crawl4AI integration docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:49:34 +00:00

144 lines
7.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands
**Start all services:**
```bash
docker compose up --build
```
**Interactive CLI (requires gateway running):**
```bash
python3 cli.py [--url http://localhost:8000] [--session cli-alvis] [--timeout 400]
```
**Run integration tests:**
```bash
python3 test_pipeline.py [--chat-id CHAT_ID]
# Selective sections:
python3 test_pipeline.py --bench-only # routing + memory benchmarks only (sections 1013)
python3 test_pipeline.py --easy-only # light-tier routing benchmark
python3 test_pipeline.py --medium-only # medium-tier routing benchmark
python3 test_pipeline.py --hard-only # complex-tier + VRAM flush benchmark
python3 test_pipeline.py --memory-only # memory store/recall/dedup benchmark
python3 test_pipeline.py --no-bench # service health + single name store/recall only
```
## Architecture
Adolf is a multi-channel personal assistant. All LLM inference is routed through **Bifrost**, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.
### Request flow
```
Channel adapter → POST /message {text, session_id, channel, user_id}
→ 202 Accepted (immediate)
→ background: run_agent_task()
→ asyncio.gather(
_fetch_urls_from_message() ← Crawl4AI, concurrent
_retrieve_memories() ← openmemory search, concurrent
)
→ router.route() → tier decision (light/medium/complex)
if URL content fetched → upgrade light→medium
→ invoke agent for tier via Bifrost (url_context + memories in system prompt)
deepagents:8000 → bifrost:8080/v1 → ollama:11436
→ channels.deliver(session_id, channel, reply)
→ pending_replies[session_id] queue (SSE)
→ channel-specific callback (Telegram POST, CLI no-op)
→ _store_memory() background task (openmemory)
CLI/wiki polling → GET /reply/{session_id} (SSE, blocks until reply)
```
### Bifrost integration
Bifrost (`bifrost-config.json`) is configured with the `ollama` provider pointing to the GPU Ollama instance on host port 11436. It exposes an OpenAI-compatible API at `http://bifrost:8080/v1`.
`agent.py` uses `langchain_openai.ChatOpenAI` with `base_url=BIFROST_URL`. Model names use the `provider/model` format that Bifrost expects: `ollama/qwen3:4b`, `ollama/qwen3:8b`, `ollama/qwen2.5:1.5b`. Bifrost strips the `ollama/` prefix before forwarding to Ollama.
`VRAMManager` bypasses Bifrost and talks directly to Ollama via `OLLAMA_BASE_URL` (host:11436) for flush/poll/prewarm operations — Bifrost cannot manage GPU VRAM.
### Three-tier routing (`router.py`, `agent.py`)
| Tier | Model (env var) | Trigger |
|------|-----------------|---------|
| light | `qwen2.5:1.5b` (`DEEPAGENTS_ROUTER_MODEL`) | Regex pre-match or LLM classifies "light" — answered by router model directly, no agent invoked |
| medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | Default for tool-requiring queries |
| complex | `qwen3:8b` (`DEEPAGENTS_COMPLEX_MODEL`) | `/think ` prefix only |
The router does regex pre-classification first, then LLM classification. Complex tier is blocked unless the message starts with `/think ` — any LLM classification of "complex" is downgraded to medium.
A global `asyncio.Semaphore(1)` (`_reply_semaphore`) serializes all LLM inference — one request at a time.
### Thinking mode
qwen3 models produce chain-of-thought `<think>...</think>` tokens via Ollama's OpenAI-compatible endpoint. Adolf controls this via system prompt prefixes:
- **Medium** (`qwen2.5:1.5b`): no thinking mode in this model; fast ~3s calls
- **Complex** (`qwen3:8b`): no prefix — thinking enabled by default, used for deep research
- **Router** (`qwen2.5:1.5b`): no thinking support in this model
`_strip_think()` in `agent.py` and `router.py` strips any `<think>` blocks from model output before returning to users.
### VRAM management (`vram_manager.py`)
Hardware: GTX 1070 (8 GB). Before running the 8b model, medium models are flushed via Ollama `keep_alive=0`, then `/api/ps` is polled (15s timeout) to confirm eviction. On timeout, falls back to medium tier. After complex reply, 8b is flushed and medium models are pre-warmed as a background task.
### Channel adapters (`channels.py`)
- **Telegram**: Grammy Node.js bot (`grammy/bot.mjs`) long-polls Telegram → `POST /message`; replies delivered via `POST grammy:3001/send`
- **CLI**: `cli.py` posts to `/message`, then blocks on `GET /reply/{session_id}` SSE
Session IDs: `tg-<chat_id>` for Telegram, `cli-<username>` for CLI. Conversation history: 5-turn buffer per session.
### Services (`docker-compose.yml`)
| Service | Port | Role |
|---------|------|------|
| `bifrost` | 8080 | LLM gateway — retries, failover, observability; config from `bifrost-config.json` |
| `deepagents` | 8000 | FastAPI gateway + agent core |
| `openmemory` | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
| `grammy` | 3001 | grammY Telegram bot + `/send` HTTP endpoint |
| `crawl4ai` | 11235 | JS-rendered page fetching |
External (from `openai/` stack, host ports):
- Ollama GPU: `11436` — all reply inference (via Bifrost) + VRAM management (direct)
- Ollama CPU: `11435` — nomic-embed-text embeddings for openmemory
- Qdrant: `6333` — vector store for memories
- SearXNG: `11437` — web search
### Bifrost config (`bifrost-config.json`)
The file is mounted into the bifrost container at `/app/data/config.json`. It declares one Ollama provider key pointing to `host.docker.internal:11436` with 2 retries and 300s timeout. To add fallback providers or adjust weights, edit this file and restart the bifrost container.
### Crawl4AI integration
Crawl4AI is embedded at all levels of the pipeline:
- **Pre-routing (all tiers)**: `_fetch_urls_from_message()` detects URLs in any message via `_URL_RE`, fetches up to 3 URLs concurrently with `_crawl4ai_fetch_async()` (async httpx). URL content is injected as a system context block into enriched history before routing, and into the system prompt for medium/complex agents.
- **Tier upgrade**: if URL content is successfully fetched, light tier is upgraded to medium (light model cannot process page content).
- **Complex agent tools**: `web_search` (SearXNG + Crawl4AI auto-fetch of top 2 results) and `fetch_url` (single-URL Crawl4AI fetch) remain available for the complex agent's agentic loop. Complex tier also receives the pre-fetched content in system prompt to avoid redundant re-fetching.
MCP tools from openmemory (`add_memory`, `search_memory`, `get_all_memories`) are **excluded** from agent tools — memory management is handled outside the agent loop.
### Medium vs Complex agent
| Agent | Builder | Speed | Use case |
|-------|---------|-------|----------|
| medium | `_DirectModel` (single LLM call, no tools) | ~3s | General questions, conversation |
| complex | `create_deep_agent` (deepagents) | Slow — multi-step planner | Deep research via `/think` prefix |
### Key files
- `agent.py` — FastAPI app, lifespan wiring, `run_agent_task()`, Crawl4AI pre-fetch, memory pipeline, all endpoints
- `bifrost-config.json` — Bifrost provider config (Ollama GPU, retries, timeouts)
- `channels.py` — channel registry and `deliver()` dispatcher
- `router.py``Router` class: regex + LLM classification, light-tier reply generation
- `vram_manager.py``VRAMManager`: flush/poll/prewarm Ollama VRAM directly
- `agent_factory.py``build_medium_agent` (`_DirectModel`, single call) / `build_complex_agent` (`create_deep_agent`)
- `openmemory/server.py` — FastMCP + mem0 config with custom extraction/dedup prompts
- `wiki_research.py` — batch research pipeline using `/message` + SSE polling
- `grammy/bot.mjs` — Telegram long-poll + HTTP `/send` endpoint