
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands
**Start all services:**
```bash
docker compose up --build
```
**Interactive CLI (requires gateway running):**
```bash
python3 cli.py [--url http://localhost:8000] [--session cli-alvis] [--timeout 400]
```
**Run integration tests:**
```bash
python3 test_pipeline.py [--chat-id CHAT_ID]
# Selective sections:
python3 test_pipeline.py --bench-only # routing + memory benchmarks only (sections 10–13)
python3 test_pipeline.py --easy-only # light-tier routing benchmark
python3 test_pipeline.py --medium-only # medium-tier routing benchmark
python3 test_pipeline.py --hard-only # complex-tier + VRAM flush benchmark
python3 test_pipeline.py --memory-only # memory store/recall/dedup benchmark
python3 test_pipeline.py --no-bench # service health + single name store/recall only
```
## Architecture
Adolf is a multi-channel personal assistant. All LLM inference is routed through **Bifrost**, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.
### Request flow
```
Channel adapter → POST /message {text, session_id, channel, user_id}
→ 202 Accepted (immediate)
→ background: run_agent_task()
→ router.route() → tier decision (light/medium/complex)
→ invoke agent for tier via Bifrost
deepagents:8000 → bifrost:8080/v1 → ollama:11436
→ channels.deliver(session_id, channel, reply)
→ pending_replies[session_id] queue (SSE)
→ channel-specific callback (Telegram POST, CLI no-op)
CLI/wiki polling → GET /reply/{session_id} (SSE, blocks until reply)
```
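The fire-and-forget shape of this flow can be sketched in plain asyncio (hypothetical simplified names; the real service wires this through FastAPI background tasks and SSE):

```python
# Minimal sketch of the 202-then-background-reply flow described above.
# run_agent_task/deliver/pending_replies mirror the diagram; bodies are stand-ins.
import asyncio

pending_replies: dict[str, asyncio.Queue] = {}

async def deliver(session_id: str, reply: str) -> None:
    # Queue the reply; the SSE endpoint would drain this per session.
    pending_replies.setdefault(session_id, asyncio.Queue()).put_nowait(reply)

async def run_agent_task(text: str, session_id: str) -> None:
    # Stand-in for router.route() + tier agent invocation via Bifrost.
    await deliver(session_id, f"echo: {text}")

async def post_message(text: str, session_id: str) -> int:
    # The HTTP handler returns 202 immediately; inference runs later.
    asyncio.create_task(run_agent_task(text, session_id))
    return 202

async def demo() -> tuple[int, str]:
    status = await post_message("hi", "cli-alvis")
    reply = await pending_replies.setdefault("cli-alvis", asyncio.Queue()).get()
    return status, reply
```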
### Bifrost integration
Bifrost (`bifrost-config.json`) is configured with the `ollama` provider pointing to the GPU Ollama instance on host port 11436. It exposes an OpenAI-compatible API at `http://bifrost:8080/v1`.
`agent.py` uses `langchain_openai.ChatOpenAI` with `base_url=BIFROST_URL`. Model names use the `provider/model` format that Bifrost expects: `ollama/qwen3:4b`, `ollama/qwen3:8b`, `ollama/qwen2.5:1.5b`. Bifrost strips the `ollama/` prefix before forwarding to Ollama.
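A tiny helper illustrating the `provider/model` convention (illustrative only — Bifrost performs this stripping server-side before forwarding to Ollama):

```python
def split_bifrost_model(name: str) -> tuple[str, str]:
    """Split Bifrost's 'provider/model' form, e.g. 'ollama/qwen3:8b'.

    Sketch of the naming convention only; Bifrost itself strips the
    provider prefix before forwarding the bare model name to Ollama.
    """
    provider, sep, model = name.partition("/")
    if not sep:
        raise ValueError(f"expected provider/model, got {name!r}")
    return provider, model
```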
`VRAMManager` bypasses Bifrost and talks directly to Ollama via `OLLAMA_BASE_URL` (host:11436) for flush/poll/prewarm operations — Bifrost cannot manage GPU VRAM.
### Three-tier routing (`router.py`, `agent.py`)
| Tier | Model (env var) | Trigger |
|------|-----------------|---------|
| light | `qwen2.5:1.5b` (`DEEPAGENTS_ROUTER_MODEL`) | Regex pre-match or LLM classifies "light" — answered by router model directly, no agent invoked |
| medium | `qwen2.5:1.5b` (`DEEPAGENTS_MODEL`) | Default for tool-requiring queries |
| complex | `qwen3:8b` (`DEEPAGENTS_COMPLEX_MODEL`) | `/think ` prefix only |
The router does regex pre-classification first, then LLM classification. Complex tier is blocked unless the message starts with `/think ` — any LLM classification of "complex" is downgraded to medium.
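The gating rule above can be sketched as (assumed simplification of the real `Router` logic):

```python
def pick_tier(message: str, llm_label: str) -> str:
    """Sketch of the tier gate: /think is the only path to complex.

    llm_label is the router LLM's classification ('light'/'medium'/'complex');
    names and exact shape are assumptions, not the real router.py API.
    """
    if message.startswith("/think "):
        return "complex"          # explicit opt-in to the 8b model
    if llm_label == "complex":
        return "medium"           # downgrade: complex requires /think
    return llm_label              # 'light' or 'medium'
```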
A global `asyncio.Semaphore(1)` (`_reply_semaphore`) serializes all LLM inference — one request at a time.
### Thinking mode
qwen3 models produce chain-of-thought `<think>...</think>` tokens via Ollama's OpenAI-compatible endpoint. Adolf's handling differs per tier:
- **Medium** (`qwen2.5:1.5b`): no thinking mode in this model; fast ~3s calls
- **Complex** (`qwen3:8b`): no prefix — thinking enabled by default, used for deep research
- **Router** (`qwen2.5:1.5b`): no thinking support in this model
`_strip_think()` in `agent.py` and `router.py` strips any `<think>` blocks from model output before returning to users.
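A minimal version of that stripping (sketch only; the real helper lives in `agent.py`/`router.py`):

```python
import re

# Chain-of-thought blocks qwen3 emits before the visible answer.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    # Drop all <think>...</think> spans and surrounding whitespace.
    return _THINK_RE.sub("", text).strip()
```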
### VRAM management (`vram_manager.py`)
Hardware: GTX 1070 (8 GB). Before running the 8b model, medium models are flushed via Ollama `keep_alive=0`, then `/api/ps` is polled (15s timeout) to confirm eviction. On timeout, falls back to medium tier. After complex reply, 8b is flushed and medium models are pre-warmed as a background task.
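The flush-then-poll step can be sketched with injectable timing (hypothetical helper; the real `VRAMManager` wraps Ollama's `/api/ps` in the check callable):

```python
import time

def wait_for_eviction(model_loaded, timeout=15.0, interval=0.5,
                      now=time.monotonic, sleep=time.sleep):
    """Poll until a model disappears from Ollama's process list (sketch).

    model_loaded is a callable wrapping the /api/ps check; clock and
    sleep are injectable for testing. Returning False corresponds to
    the timeout path where the service falls back to the medium tier.
    """
    deadline = now() + timeout
    while now() < deadline:
        if not model_loaded():
            return True   # VRAM freed — safe to load the 8b model
        sleep(interval)
    return False          # eviction did not complete in time
```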
### Channel adapters (`channels.py`)
- **Telegram**: Grammy Node.js bot (`grammy/bot.mjs`) long-polls Telegram → `POST /message`; replies delivered via `POST grammy:3001/send`
- **CLI**: `cli.py` posts to `/message`, then blocks on `GET /reply/{session_id}` SSE
Session IDs: `tg-<chat_id>` for Telegram, `cli-<username>` for CLI. Conversation history: 5-turn buffer per session.
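The rolling history buffer amounts to something like this (exact shape assumed; one turn = one user/assistant pair):

```python
from collections import defaultdict, deque

# 5-turn rolling conversation history per session; older turns fall off.
history: defaultdict[str, deque] = defaultdict(lambda: deque(maxlen=5))

def record_turn(session_id: str, user: str, assistant: str) -> None:
    # deque(maxlen=5) silently discards the oldest turn on overflow.
    history[session_id].append((user, assistant))
```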
### Services (`docker-compose.yml`)
| Service | Port | Role |
|---------|------|------|
| `bifrost` | 8080 | LLM gateway — retries, failover, observability; config from `bifrost-config.json` |
| `deepagents` | 8000 | FastAPI gateway + agent core |
| `openmemory` | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
| `grammy` | 3001 | grammY Telegram bot + `/send` HTTP endpoint |
| `crawl4ai` | 11235 | JS-rendered page fetching |
External (from `openai/` stack, host ports):
- Ollama GPU: `11436` — all reply inference (via Bifrost) + VRAM management (direct)
- Ollama CPU: `11435` — nomic-embed-text embeddings for openmemory
- Qdrant: `6333` — vector store for memories
- SearXNG: `11437` — web search
### Bifrost config (`bifrost-config.json`)
The file is mounted into the bifrost container at `/app/data/config.json`. It declares one Ollama provider key pointing to `host.docker.internal:11436` with 2 retries and 300s timeout. To add fallback providers or adjust weights, edit this file and restart the bifrost container.
### Agent tools
`web_search`: SearXNG search + Crawl4AI auto-fetch of top 2 results → combined snippet + full page content.
`fetch_url`: Crawl4AI single-URL fetch.
MCP tools from openmemory (`add_memory`, `search_memory`, `get_all_memories`) are **excluded** from agent tools — memory management is handled outside the agent loop.
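The out-of-agent shape is roughly the following (all names here — `handle_with_memory`, the `store`/`agent` interfaces — are assumptions for illustration, not the real API):

```python
import asyncio

async def handle_with_memory(agent, store, text: str, user_id: str) -> str:
    """Sketch of memory handled outside the agent loop.

    Relevant memories are pre-fetched and injected as context before the
    agent runs; storage happens as a fire-and-forget background task
    after the reply, so the agent itself never sees memory tools.
    """
    memories = await store.search(user_id, text)          # pre-fetch
    reply = await agent(text, context=memories)           # inject context
    asyncio.create_task(store.add(user_id, text, reply))  # background write
    return reply
```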
### Medium vs Complex agent
| Agent | Builder | Speed | Use case |
|-------|---------|-------|----------|
| medium | `_DirectModel` (single LLM call, no tools) | ~3s | General questions, conversation |
| complex | `create_deep_agent` (deepagents) | Slow — multi-step planner | Deep research via `/think` prefix |
### Key files
- `agent.py` — FastAPI app, lifespan wiring, `run_agent_task()`, all endpoints
- `bifrost-config.json` — Bifrost provider config (Ollama GPU, retries, timeouts)
- `channels.py` — channel registry and `deliver()` dispatcher
- `router.py` — `Router` class: regex + LLM classification, light-tier reply generation
- `vram_manager.py` — `VRAMManager`: flush/poll/prewarm Ollama VRAM directly
- `agent_factory.py` — `build_medium_agent` / `build_complex_agent` via `create_deep_agent()`
- `openmemory/server.py` — FastMCP + mem0 config with custom extraction/dedup prompts
- `wiki_research.py` — batch research pipeline using `/message` + SSE polling
- `grammy/bot.mjs` — Telegram long-poll + HTTP `/send` endpoint