
Alvis 50097d6092 Embed Crawl4AI at all tiers, restore qwen3:4b medium, update docs

- Pre-routing URL fetch: any message with URLs gets content fetched
  async (httpx.AsyncClient) before routing via _fetch_urls_from_message()
- URL context and memories gathered concurrently with asyncio.gather
- Light tier upgraded to medium when URL content is present
- url_context injected into system prompt for medium and complex agents
- Complex agent retains web_search/fetch_url tools + receives pre-fetched content
- Medium model restored to qwen3:4b (was temporarily qwen2.5:1.5b)
- Unit tests added for _extract_urls
- ARCHITECTURE.md: added Tool Handling, Crawl4AI Integration, Memory Pipeline sections
- CLAUDE.md: updated request flow and Crawl4AI integration docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-12 15:49:34 +00:00

7.9 KiB


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commands

Start all services:

docker compose up --build

Interactive CLI (requires gateway running):

python3 cli.py [--url http://localhost:8000] [--session cli-alvis] [--timeout 400]

Run integration tests:

python3 test_pipeline.py [--chat-id CHAT_ID]

# Selective sections:
python3 test_pipeline.py --bench-only      # routing + memory benchmarks only (sections 10–13)
python3 test_pipeline.py --easy-only       # light-tier routing benchmark
python3 test_pipeline.py --medium-only     # medium-tier routing benchmark
python3 test_pipeline.py --hard-only       # complex-tier + VRAM flush benchmark
python3 test_pipeline.py --memory-only     # memory store/recall/dedup benchmark
python3 test_pipeline.py --no-bench        # service health + single name store/recall only

Architecture

Adolf is a multi-channel personal assistant. All LLM inference is routed through Bifrost, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.

Request flow

Channel adapter → POST /message {text, session_id, channel, user_id}
                → 202 Accepted (immediate)
                → background: run_agent_task()
                    → asyncio.gather(
                        _fetch_urls_from_message()  ← Crawl4AI, concurrent
                        _retrieve_memories()         ← openmemory search, concurrent
                      )
                    → router.route() → tier decision (light/medium/complex)
                        if URL content fetched → upgrade light→medium
                    → invoke agent for tier via Bifrost (url_context + memories in system prompt)
                        deepagents:8000 → bifrost:8080/v1 → ollama:11436
                    → channels.deliver(session_id, channel, reply)
                        → pending_replies[session_id] queue (SSE)
                        → channel-specific callback (Telegram POST, CLI no-op)
                    → _store_memory() background task (openmemory)
CLI/wiki polling → GET /reply/{session_id}  (SSE, blocks until reply)
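The concurrent context-gathering step in the flow above can be sketched as follows. This is a minimal illustration, not the code in agent.py: the two coroutines are hypothetical stand-ins (the real ones call Crawl4AI and openmemory), and their signatures are assumptions.

```python
import asyncio


async def fetch_urls_from_message(text: str) -> str:
    """Hypothetical stand-in for the Crawl4AI pre-fetch (no network here)."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return "page content" if "http" in text else ""


async def retrieve_memories(text: str, user_id: str) -> str:
    """Hypothetical stand-in for the openmemory search."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"memories for {user_id}"


async def gather_context(text: str, user_id: str) -> tuple[str, str]:
    # Both lookups start together; gather returns results in argument order.
    url_context, memories = await asyncio.gather(
        fetch_urls_from_message(text),
        retrieve_memories(text, user_id),
    )
    return url_context, memories
```

Because both awaits overlap, the wait before routing is roughly the slower of the two lookups rather than their sum.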

Bifrost integration

Bifrost (bifrost-config.json) is configured with the ollama provider pointing to the GPU Ollama instance on host port 11436. It exposes an OpenAI-compatible API at http://bifrost:8080/v1.

agent.py uses langchain_openai.ChatOpenAI with base_url=BIFROST_URL. Model names use the provider/model format that Bifrost expects: ollama/qwen3:4b, ollama/qwen3:8b, ollama/qwen2.5:1.5b. Bifrost strips the ollama/ prefix before forwarding to Ollama.
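The provider/model naming convention can be illustrated with two small helpers. Both function names are hypothetical and are not taken from agent.py or from Bifrost; the second only mimics the prefix stripping described above.

```python
def to_bifrost_model(model: str, provider: str = "ollama") -> str:
    """Build the provider/model name sent to the gateway."""
    return f"{provider}/{model}"


def strip_provider(name: str) -> str:
    """Drop the leading provider segment, if any, as the gateway is described to do."""
    _provider, sep, model = name.partition("/")
    return model if sep else name
```

For example, `to_bifrost_model("qwen3:4b")` yields `ollama/qwen3:4b`, and `strip_provider` maps it back to `qwen3:4b`.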

VRAMManager bypasses Bifrost and talks directly to Ollama via OLLAMA_BASE_URL (host:11436) for flush/poll/prewarm operations — Bifrost cannot manage GPU VRAM.

Three-tier routing (`router.py`, `agent.py`)

| Tier | Model (env var) | Trigger |
|---|---|---|
| light | `qwen2.5:1.5b` (`DEEPAGENTS_ROUTER_MODEL`) | Regex pre-match or LLM classifies "light"; answered by the router model directly, no agent invoked |
| medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | Default for tool-requiring queries |
| complex | `qwen3:8b` (`DEEPAGENTS_COMPLEX_MODEL`) | `/think` prefix only |

The router does regex pre-classification first, then LLM classification. Complex tier is blocked unless the message starts with /think — any LLM classification of "complex" is downgraded to medium.

A global asyncio.Semaphore(1) (_reply_semaphore) serializes all LLM inference — one request at a time.
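The routing rules above can be sketched as a single decision function. This is a sketch under assumptions: the light-tier regex is a made-up example (the real patterns in router.py are not shown here), `llm_label` is a hypothetical callable standing in for the router model, and checking `/think` before the regex is an assumed ordering.

```python
import re
from typing import Callable

# Hypothetical light-tier pattern: short greetings and thanks only.
_LIGHT_RE = re.compile(r"^\s*(hi|hello|hey|thanks|thank you)\b[\s!.]*$", re.IGNORECASE)


def decide_tier(text: str, llm_label: Callable[[str], str]) -> str:
    # The /think prefix is the only way into the complex tier.
    if text.lstrip().startswith("/think"):
        return "complex"
    # Regex pre-classification short-circuits the LLM call.
    if _LIGHT_RE.match(text):
        return "light"
    label = llm_label(text)
    # An LLM verdict of "complex" is downgraded; unknown labels default to medium.
    return label if label in ("light", "medium") else "medium"
```

Passing the classifier in as a callable keeps the rule logic testable without a running model.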

Thinking mode

qwen3 models produce chain-of-thought <think>...</think> tokens via Ollama's OpenAI-compatible endpoint. Adolf controls this via system prompt prefixes:

Medium (qwen3:4b): thinking disabled via system prompt prefix; fast ~3s calls
Complex (qwen3:8b): no prefix — thinking enabled by default, used for deep research
Router (qwen2.5:1.5b): no thinking support in this model

_strip_think() in agent.py and router.py strips any <think> blocks from model output before returning to users.
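A minimal sketch of such a stripping step, assuming a plain regex is sufficient; the actual `_strip_think()` implementation may differ, for example in how it treats an unclosed tag.

```python
import re

# Non-greedy match so several separate <think> blocks are removed independently;
# DOTALL lets a block span multiple lines.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_RE.sub("", text).strip()
```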

VRAM management (`vram_manager.py`)

Hardware: GTX 1070 (8 GB). Before the 8b model runs, the medium models are flushed via Ollama keep_alive=0, then /api/ps is polled (15 s timeout) to confirm eviction. If the poll times out, the request falls back to the medium tier. After a complex reply, the 8b model is flushed and the medium models are pre-warmed as a background task.
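The poll-until-evicted step can be sketched as below. This is an illustration, not the VRAMManager code: `list_loaded` is a hypothetical callable that would wrap a GET to Ollama's /api/ps and return the loaded model names, and the `sleep` and `clock` parameters exist only to make the loop testable.

```python
import time
from typing import Callable, Iterable


def wait_for_eviction(
    list_loaded: Callable[[], Iterable[str]],
    models: Iterable[str],
    timeout: float = 15.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once none of `models` is loaded, False if the timeout expires."""
    deadline = clock() + timeout
    targets = set(models)
    while True:
        if not targets & set(list_loaded()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def tier_after_flush(evicted: bool) -> str:
    # On timeout the request falls back to the medium tier.
    return "complex" if evicted else "medium"
```

A caller would issue the keep_alive=0 flush first, then run the complex agent only if `wait_for_eviction` returns True.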

Channel adapters (`channels.py`)

Telegram: Grammy Node.js bot (grammy/bot.mjs) long-polls Telegram → POST /message; replies delivered via POST grammy:3001/send
CLI: cli.py posts to /message, then blocks on GET /reply/{session_id} SSE

Session IDs: tg-<chat_id> for Telegram, cli-<username> for CLI. Conversation history: 5-turn buffer per session.
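A per-session rolling buffer like the one described can be sketched with a bounded deque. The names here are hypothetical, and treating one "turn" as a (user message, reply) pair is an assumption.

```python
from collections import defaultdict, deque

MAX_TURNS = 5  # matches the 5-turn buffer described above

# session_id -> bounded deque; the oldest turn is dropped automatically.
_history: dict[str, deque] = defaultdict(lambda: deque(maxlen=MAX_TURNS))


def session_id(channel: str, ident: object) -> str:
    """Build a session ID such as tg-<chat_id> or cli-<username>."""
    prefix = {"telegram": "tg", "cli": "cli"}[channel]
    return f"{prefix}-{ident}"


def remember_turn(sid: str, user_text: str, reply: str) -> None:
    _history[sid].append((user_text, reply))


def recent_turns(sid: str) -> list:
    # .get avoids creating an empty entry for unknown sessions.
    return list(_history.get(sid, ()))
```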

Services (`docker-compose.yml`)

| Service | Port | Role |
|---|---|---|
| `bifrost` | 8080 | LLM gateway: retries, failover, observability; config from `bifrost-config.json` |
| `deepagents` | 8000 | FastAPI gateway + agent core |
| `openmemory` | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
| `grammy` | 3001 | grammY Telegram bot + `/send` HTTP endpoint |
| `crawl4ai` | 11235 | JS-rendered page fetching |

External (from openai/ stack, host ports):

Ollama GPU: 11436 — all reply inference (via Bifrost) + VRAM management (direct)
Ollama CPU: 11435 — nomic-embed-text embeddings for openmemory
Qdrant: 6333 — vector store for memories
SearXNG: 11437 — web search

Bifrost config (`bifrost-config.json`)

The file is mounted into the bifrost container at /app/data/config.json. It declares one Ollama provider key pointing to host.docker.internal:11436 with 2 retries and 300s timeout. To add fallback providers or adjust weights, edit this file and restart the bifrost container.

Crawl4AI integration

Crawl4AI is embedded at all levels of the pipeline:

Pre-routing (all tiers): _fetch_urls_from_message() detects URLs in any message via _URL_RE, fetches up to 3 URLs concurrently with _crawl4ai_fetch_async() (async httpx). URL content is injected as a system context block into enriched history before routing, and into the system prompt for medium/complex agents.
Tier upgrade: if URL content is successfully fetched, light tier is upgraded to medium (light model cannot process page content).
Complex agent tools: web_search (SearXNG + Crawl4AI auto-fetch of top 2 results) and fetch_url (single-URL Crawl4AI fetch) remain available for the complex agent's agentic loop. Complex tier also receives the pre-fetched content in system prompt to avoid redundant re-fetching.
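The URL detection and tier-upgrade rules above can be sketched as follows. The regex is an assumed approximation of `_URL_RE`, whose real pattern is not shown here; de-duplication and trailing-punctuation trimming are also assumptions.

```python
import re

# Assumed pattern: http(s) URLs up to whitespace or a bracket character.
_URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")
MAX_URLS = 3  # at most 3 URLs are fetched per message


def extract_urls(text: str, limit: int = MAX_URLS) -> list[str]:
    """Return up to `limit` distinct URLs in order of appearance."""
    urls: list[str] = []
    for match in _URL_RE.finditer(text):
        url = match.group(0).rstrip(".,;:!?")  # drop sentence punctuation
        if url not in urls:
            urls.append(url)
        if len(urls) == limit:
            break
    return urls


def effective_tier(tier: str, url_context: str) -> str:
    # The light model cannot process page content, so upgrade when content exists.
    return "medium" if tier == "light" and url_context else tier
```

Note that the upgrade depends on content actually being fetched, not merely on a URL being present in the message.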

MCP tools from openmemory (add_memory, search_memory, get_all_memories) are excluded from agent tools — memory management is handled outside the agent loop.
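Excluding the memory tools from an agent's tool list could look like the sketch below. The `Tool` dataclass is a hypothetical placeholder for whatever tool objects the MCP client returns; only the three tool names come from the text above.

```python
from dataclasses import dataclass

MEMORY_TOOL_NAMES = {"add_memory", "search_memory", "get_all_memories"}


@dataclass
class Tool:
    """Hypothetical minimal tool record; real tool objects carry more fields."""
    name: str


def agent_tools(all_tools: list[Tool]) -> list[Tool]:
    """Keep every tool except the openmemory ones handled outside the agent loop."""
    return [tool for tool in all_tools if tool.name not in MEMORY_TOOL_NAMES]
```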

Medium vs Complex agent

| Agent | Builder | Speed | Use case |
|---|---|---|---|
| medium | `_DirectModel` (single LLM call, no tools) | ~3s | General questions, conversation |
| complex | `create_deep_agent` (deepagents) | Slow (multi-step planner) | Deep research via `/think` prefix |

Key files

agent.py — FastAPI app, lifespan wiring, run_agent_task(), Crawl4AI pre-fetch, memory pipeline, all endpoints
bifrost-config.json — Bifrost provider config (Ollama GPU, retries, timeouts)
channels.py — channel registry and deliver() dispatcher
router.py — Router class: regex + LLM classification, light-tier reply generation
vram_manager.py — VRAMManager: flush/poll/prewarm Ollama VRAM directly
agent_factory.py — build_medium_agent (_DirectModel, single call) / build_complex_agent (create_deep_agent)
openmemory/server.py — FastMCP + mem0 config with custom extraction/dedup prompts
wiki_research.py — batch research pipeline using /message + SSE polling
grammy/bot.mjs — Telegram long-poll + HTTP /send endpoint
