alvis 3fb90ae083 Skip _reply_semaphore in no_inference mode

No GPU inference happens in this mode, so serialization is not needed.
Without this, timed-out routing benchmark queries hold the semaphore
and cascade-block all subsequent queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-24 07:40:07 +00:00

.claude/rules

Switch from Bifrost to LiteLLM; add Matrix channel; update rules

2026-03-24 02:14:13 +00:00

benchmarks

routing benchmark: 1s strict deadline per query

2026-03-24 07:35:13 +00:00

grammy

wiki search people tested pipeline

2026-03-05 11:22:34 +00:00

openmemory

Split CLAUDE.md per official Claude Code recommendations

2026-03-13 07:15:51 +00:00

routecheck

Split CLAUDE.md per official Claude Code recommendations

2026-03-13 07:15:51 +00:00

tests

Merge pull request 'Remove Bifrost: replace test 4 with LiteLLM health check' (#14) from fix/remove-bifrost into main

2026-03-24 02:48:40 +00:00

.gitignore

Move benchmark scripts into benchmarks/ subdir

2026-03-24 02:02:46 +00:00

agent_factory.py

Integrate Bifrost LLM gateway, add test suite, implement memory pipeline

2026-03-12 13:50:12 +00:00

agent.py

Skip _reply_semaphore in no_inference mode

2026-03-24 07:40:07 +00:00

bifrost-config.json

Switch from Bifrost to LiteLLM; add Matrix channel; update rules

2026-03-24 02:14:13 +00:00

channels.py

Switch from Bifrost to LiteLLM; add Matrix channel; update rules

2026-03-24 02:14:13 +00:00

CLAUDE.md

Update docs: add benchmarks/ section, fix complex tier description

2026-03-24 02:13:14 +00:00

cli.py

Add Rich token streaming: server SSE + CLI live display + CLI container

2026-03-12 17:26:52 +00:00

docker-compose.yml

Switch from Bifrost to LiteLLM; add Matrix channel; update rules

2026-03-24 02:14:13 +00:00

Dockerfile

Add routecheck service and CommuteTool fast tool

2026-03-13 07:08:48 +00:00

Dockerfile.cli

Add Rich token streaming: server SSE + CLI live display + CLI container

2026-03-12 17:26:52 +00:00

fast_tools.py

WeatherTool: fetch open-meteo directly, skip LLM for fast tool replies

2026-03-15 09:42:55 +00:00

hello_world.py

wiki search people tested pipeline

2026-03-05 11:22:34 +00:00

langgraph.md

wiki search people tested pipeline

2026-03-05 11:22:34 +00:00

potential-directions.md

Switch extraction model to qwen2.5:1.5b, fix mem0migrations dims, update tests

2026-02-23 05:11:29 +00:00

pytest.ini

Integrate Bifrost LLM gateway, add test suite, implement memory pipeline

2026-03-12 13:50:12 +00:00

README.md

Update docs: add benchmarks/ section, fix complex tier description

2026-03-24 02:13:14 +00:00

reasoning.md

wiki search people tested pipeline

2026-03-05 11:22:34 +00:00

router.py

Fix routing: add Russian tech def patterns to light, strengthen medium smart home

2026-03-24 02:45:42 +00:00

vram_manager.py

Add three-tier model routing with VRAM management and benchmark suite

2026-02-28 17:54:51 +00:00

wiki_research.py

wiki search people tested pipeline

2026-03-05 11:22:34 +00:00

README.md

Adolf

Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.

Architecture

┌─────────────────────────────────────────────────────┐
│                 CHANNEL ADAPTERS                    │
│                                                     │
│  [Telegram/Grammy]   [CLI]   [Voice — future]       │
│       ↕                ↕            ↕               │
│       └────────────────┴────────────┘               │
│                        ↕                            │
│          ┌─────────────────────────┐                │
│          │   GATEWAY  (agent.py)   │                │
│          │   FastAPI  :8000        │                │
│          │                         │                │
│          │  POST /message          │  ← all inbound │
│          │  POST /chat  (legacy)   │                │
│          │  GET  /stream/{id} SSE  │  ← token stream│
│          │  GET  /reply/{id}  SSE  │  ← legacy poll │
│          │  GET  /health           │                │
│          │                         │                │
│          │  channels.py registry   │                │
│          │  conversation buffers   │                │
│          └──────────┬──────────────┘                │
│                     ↓                               │
│          ┌──────────────────────┐                   │
│          │    AGENT CORE        │                   │
│          │  three-tier routing  │                   │
│          │  VRAM management     │                   │
│          └──────────────────────┘                   │
│                     ↓                               │
│          channels.deliver(session_id, channel, text)│
│               ↓                    ↓                │
│    telegram → POST grammy/send   cli → SSE queue    │
└─────────────────────────────────────────────────────┘

Channel Adapters

Channel	session_id	Inbound	Outbound
Telegram	`tg-<chat_id>`	Grammy long-poll → POST /message	channels.py → POST grammy:3001/send
CLI	`cli-<user>`	POST /message directly	GET /stream/{id} SSE — Rich Live streaming
Voice	`voice-<device>`	(future)	(future)
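
As a rough illustration of the inbound side, the sketch below builds (but does not send) a `POST /message` request from a CLI-style adapter. The payload fields follow the message flow described in this README; the `localhost` host and the helper name `build_message_request` are assumptions made for the example.

```python
import json
import urllib.request

# Assumed base URL: the architecture diagram only states that the gateway listens on :8000.
GATEWAY_URL = "http://localhost:8000"


def build_message_request(text: str, session_id: str, channel: str, user_id: str) -> urllib.request.Request:
    """Build the POST /message request a channel adapter would send (hypothetical helper)."""
    payload = {
        "text": text,
        "session_id": session_id,
        "channel": channel,
        "user_id": user_id,
    }
    return urllib.request.Request(
        f"{GATEWAY_URL}/message",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_message_request("hello", "cli-alvis", "cli", "alvis")
# urllib.request.urlopen(req) would send it; per the flow in this README the
# gateway should answer 202 Accepted and deliver the reply asynchronously.
```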

Unified Message Flow

1. Channel adapter receives message
2. POST /message {text, session_id, channel, user_id}
3. 202 Accepted immediately
4. Background: run_agent_task(message, session_id, channel)
5. Parallel IO (asyncio.gather):
   a. _fetch_urls_from_message()       — Crawl4AI fetches any URLs in message
   b. _retrieve_memories()             — openmemory semantic search for context
   c. _fast_tool_runner.run_matching() — FastTools (weather, commute) if pattern matches
6. router.route() with enriched history (url_context + fast_context + memories)
   - fast tool match → force medium (real-time data, no point routing to light)
   - if URL content fetched and tier=light → upgrade to medium
7. Invoke agent for tier with url_context + memories in system prompt
8. Token streaming:
   - medium: astream() pushes per-token chunks to _stream_queues[session_id]; <think> blocks filtered in real time
   - light/complex: full reply pushed as single chunk after completion
   - _end_stream() sends [DONE] sentinel
9. channels.deliver(session_id, channel, reply_text) — Telegram callback
10. _store_memory() background task — stores turn in openmemory
11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown
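
Steps 5 and 6 can be sketched roughly as follows. The three coroutines are stand-in stubs, not the real helpers from agent.py, and `preflight` and `adjust_tier` are hypothetical names used only to show the concurrency and the tier override rules.

```python
import asyncio


# Stubs standing in for the real pre-flight helpers (assumed return shapes: plain strings).
async def fetch_urls(message: str) -> str:
    await asyncio.sleep(0.01)
    return "page text" if "http" in message else ""


async def retrieve_memories(message: str) -> str:
    await asyncio.sleep(0.01)
    return "user lives in Balashikha"


async def run_fast_tools(message: str) -> str:
    await asyncio.sleep(0.01)
    return "weather: +3 C" if "weather" in message.lower() else ""


async def preflight(message: str) -> dict:
    # The three lookups run concurrently; gather returns results in argument order.
    url_context, memories, fast_context = await asyncio.gather(
        fetch_urls(message), retrieve_memories(message), run_fast_tools(message)
    )
    return {"url_context": url_context, "memories": memories, "fast_context": fast_context}


def adjust_tier(routed_tier: str, ctx: dict) -> str:
    # A fast tool match forces medium; fetched URL content upgrades light to medium.
    if ctx["fast_context"]:
        return "medium"
    if ctx["url_context"] and routed_tier == "light":
        return "medium"
    return routed_tier
```

Because the lookups are independent I/O calls, the pre-flight latency is bounded by the slowest of the three rather than their sum.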

Tool Handling

Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime.

Complex agent: web_search and fetch_url are defined as langchain_core.tools.Tool objects and passed to create_deep_agent(). The deepagents library runs an agentic loop (LangGraph create_react_agent under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch.

Medium agent (default): _DirectModel makes a single model.ainvoke(messages) call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — qwen3:4b behaves unreliably when a tool array is present.

Memory tools (out-of-loop): add_memory and search_memory are LangChain MCP tool objects (via langchain_mcp_adapters) but are excluded from both agents' tool lists. They are called directly — await _memory_add_tool.ainvoke(...) — outside the agent loop, before and after each turn.

Three-Tier Model Routing

Tier	Model	Agent	Trigger	Latency
Light	`qwen2.5:1.5b` (router answers directly)	—	Regex pre-match or 3-way embedding classifies "light"	~2–4s
Medium	`qwen3:4b` (`DEEPAGENTS_MODEL`)	`_DirectModel` — single LLM call, no tools	Default; also forced when message contains URLs	~10–20s
Complex	`deepseek/deepseek-r1:free` via LiteLLM (`DEEPAGENTS_COMPLEX_MODEL`)	`create_deep_agent` — agentic loop with tools	Auto-classified by embedding similarity	~30–90s

Routing is fully automatic via 3-way cosine similarity over pre-embedded utterance centroids (light / medium / complex). No prefix required. Use adolf-deep model name to force complex tier via API.

Deep research queries such as исследуй ("research"), изучи все ("study everything"), or напиши подробный ("write a detailed") reach the complex tier automatically, through the regex pre-classifier and embedding similarity.
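
The centroid-based classification can be illustrated with a minimal sketch. It uses toy 3-dimensional vectors in place of real embeddings, and the function names are assumptions; the actual router.py may differ in how it embeds utterances and breaks ties.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of the example utterance embeddings for one tier."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(query_vec: list[float], centroids: dict[str, list[float]]) -> str:
    """Return the tier whose centroid is most similar to the query embedding."""
    return max(centroids, key=lambda tier: cosine(query_vec, centroids[tier]))


# Toy embeddings; a real router would embed example utterances per tier once at startup.
centroids = {
    "light": centroid([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "medium": centroid([[0.0, 1.0, 0.0]]),
    "complex": centroid([[0.0, 0.0, 1.0]]),
}
```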

Fast Tools (`fast_tools.py`)

Pre-flight tools that run concurrently with URL fetch and memory retrieval before any LLM call. Each tool has two methods:

matches(message) → bool — regex classifier; also used by Router to force medium tier
run(message) → str — async fetch returning a context block injected into system prompt

FastToolRunner holds all tools. any_matches() is called by the Router at step 0a; run_matching() is called in the pre-flight asyncio.gather in run_agent_task().

Tool	Pattern	Source	Context returned
`WeatherTool`	weather/forecast/temperature/snow/rain	SearXNG `"погода Балашиха сейчас"`	Current conditions in °C from Russian weather sites
`CommuteTool`	commute/traffic/arrival/пробки	`routecheck:8090/api/route` (Yandex Routing API)	Drive time with/without traffic, Balashikha→Moscow

To add a new fast tool: subclass FastTool in fast_tools.py, implement name/matches/run, add an instance to _fast_tool_runner in agent.py.
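
A minimal sketch of that extension point is shown below. The `FastTool` and `FastToolRunner` classes here are assumed shapes reconstructed from the description above, not copies of fast_tools.py, and `ClockTool` is a hypothetical tool invented for the example.

```python
import asyncio
import re
from abc import ABC, abstractmethod
from datetime import datetime


class FastTool(ABC):
    """Assumed interface: a name, a regex classifier, and an async fetch."""

    name: str

    @abstractmethod
    def matches(self, message: str) -> bool: ...

    @abstractmethod
    async def run(self, message: str) -> str: ...


class ClockTool(FastTool):
    """Hypothetical tool: answers time questions without any network call."""

    name = "clock"
    _pattern = re.compile(r"\b(what time|current time)\b", re.IGNORECASE)

    def matches(self, message: str) -> bool:
        return bool(self._pattern.search(message))

    async def run(self, message: str) -> str:
        return f"[clock] Local time is {datetime.now():%H:%M}"


class FastToolRunner:
    """Assumed runner shape: holds tools, reports matches, runs matching tools."""

    def __init__(self, tools: list[FastTool]):
        self.tools = list(tools)

    def any_matches(self, message: str) -> bool:
        return any(tool.matches(message) for tool in self.tools)

    async def run_matching(self, message: str) -> str:
        hits = [tool for tool in self.tools if tool.matches(message)]
        if not hits:
            return ""
        # Matching tools run concurrently; their context blocks are joined in order.
        blocks = await asyncio.gather(*(tool.run(message) for tool in hits))
        return "\n".join(blocks)
```

Keeping `matches()` synchronous and cheap is what lets the router call it before any I/O starts.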

routecheck Service (`routecheck/`)

Local web service on port 8090. Exists because Yandex Routing API free tier requires a web UI that uses the API.

Web UI (http://localhost:8090): PIL-generated arithmetic captcha → lat/lon form → travel time result.

Internal API: GET /api/route?from=lat,lon&to=lat,lon&token=ROUTECHECK_TOKEN — bypasses captcha, used by CommuteTool. The ROUTECHECK_TOKEN shared secret is set in .env and passed to both routecheck and deepagents containers.

Yandex API calls are routed through the host HTTPS proxy (host.docker.internal:56928) since the container has no direct external internet access.

Requires .env: YANDEX_ROUTING_KEY (free from developer.tech.yandex.ru) + ROUTECHECK_TOKEN.
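
For illustration, the internal API URL described above could be assembled like this. The query parameter names come from this README; the coordinates, the placeholder token, and the helper name `build_route_url` are assumptions, and the response format is not shown because it is not documented here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Service address as seen from another container, per the Fast Tools table.
ROUTECHECK_BASE = "http://routecheck:8090"


def build_route_url(origin: tuple[float, float], dest: tuple[float, float], token: str) -> str:
    """Build the GET /api/route URL (hypothetical helper; points are lat, lon pairs)."""
    query = urlencode({
        "from": f"{origin[0]},{origin[1]}",
        "to": f"{dest[0]},{dest[1]}",
        "token": token,  # ROUTECHECK_TOKEN from .env in a real deployment
    })
    return f"{ROUTECHECK_BASE}/api/route?{query}"


# Illustrative coordinates only (roughly Balashikha and central Moscow).
url = build_route_url((55.7963, 37.9382), (55.7558, 37.6173), "example-token")
```

`urlencode` percent-encodes the comma in each coordinate pair, which a standard query-string parser decodes back to `lat,lon`.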

Crawl4AI Integration

Crawl4AI runs as a Docker service (crawl4ai:11235) providing JS-rendered, bot-bypass page fetching.

Pre-routing fetch (all tiers):

_URL_RE detects https?:// URLs in any incoming message
_crawl4ai_fetch_async() uses httpx.AsyncClient to POST {urls: [...]} to /crawl
Up to 3 URLs fetched concurrently via asyncio.gather
Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts
If fetch succeeds and router returns light → tier upgraded to medium
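
The detection and truncation limits above can be sketched as follows. The exact pattern behind `_URL_RE` is not shown in this README, so the regex, the trailing-punctuation cleanup, and the helper names are assumptions.

```python
import re

# Assumed pattern; the real _URL_RE is only described as matching https?:// URLs.
_URL_RE = re.compile(r"https?://\S+")
MAX_URLS = 3      # up to 3 URLs fetched per message
MAX_CHARS = 3000  # up to 3000 characters kept per URL


def extract_urls(message: str) -> list[str]:
    """Return at most MAX_URLS URLs found in the message, in order of appearance."""
    urls = []
    for raw in _URL_RE.findall(message):
        url = raw.rstrip(".,;:!?)")  # assumption: drop sentence punctuation glued to the URL
        if url not in urls:
            urls.append(url)
    return urls[:MAX_URLS]


def build_url_context(pages: dict[str, str]) -> str:
    """Join fetched pages into one context block, truncating each page's text."""
    return "\n\n".join(f"[{url}]\n{text[:MAX_CHARS]}" for url, text in pages.items())
```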

Complex agent tools:

web_search: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page text
fetch_url: Crawl4AI single-URL fetch for any specific URL

Memory Pipeline

openmemory runs as a FastMCP server (openmemory:8765) backed by mem0 + Qdrant + nomic-embed-text.

Retrieval (before routing): _retrieve_memories() calls search_memory MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit.

Storage (after reply): _store_memory() runs as an asyncio background task, calling add_memory with "User: ...\nAssistant: ...". The extraction LLM (qwen2.5:1.5b on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt.

Memory tools (add_memory, search_memory, get_all_memories) are excluded from agent tool lists — memory management happens outside the agent loop.
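
A small sketch of the two data-shaping steps follows. The result shape (a list of dicts with `memory` and `score` keys) is an assumption about what `search_memory` returns, and both helper names are hypothetical.

```python
SCORE_THRESHOLD = 0.5  # retrieval threshold stated above


def filter_memories(results: list[dict], threshold: float = SCORE_THRESHOLD) -> list[str]:
    """Keep memory texts whose similarity score meets the threshold (assumed result shape)."""
    return [r["memory"] for r in results if r.get("score", 0.0) >= threshold]


def format_turn(user_text: str, assistant_text: str) -> str:
    """Format one conversation turn the way it is handed to add_memory."""
    return f"User: {user_text}\nAssistant: {assistant_text}"
```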

VRAM Management

GPU: GTX 1070 (8 GB). If CUDA initialization fails, the model loads on the CPU and Ollama must be restarted.

Flush explicitly before loading qwen3:8b (keep_alive=0)
Verify eviction via /api/ps poll (15s timeout) before proceeding
Fallback: timeout → run medium agent instead
Post-complex: flush 8b, pre-warm medium + router
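
The eviction check and fallback could look roughly like the sketch below. The `list_loaded` callable is injected so the example stays self-contained; in practice it would wrap a request to Ollama's /api/ps. The function names and the polling interval are assumptions.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_eviction(
    list_loaded: Callable[[], Awaitable[list[str]]],
    model: str,
    timeout: float = 15.0,
    interval: float = 0.5,
) -> bool:
    """Poll until `model` is no longer reported as loaded, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        if model not in await list_loaded():
            return True   # eviction confirmed, safe to load the next model
        if time.monotonic() >= deadline:
            return False  # still resident after the timeout
        await asyncio.sleep(interval)


def tier_after_flush(evicted: bool) -> str:
    # Fallback rule from the list above: on timeout, run the medium agent instead.
    return "complex" if evicted else "medium"
```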

Session ID Convention

Telegram: tg-<chat_id> (e.g. tg-346967270)
CLI: cli-<username> (e.g. cli-alvis)

Conversation history is keyed by session_id (5-turn buffer).
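
One simple way to model this convention and the 5-turn buffer is a dictionary of bounded deques. This is a sketch under assumed names, not the gateway's actual data structure.

```python
from collections import defaultdict, deque

MAX_TURNS = 5  # buffer size stated above

# Channel name to session prefix, following the convention in this section.
_PREFIXES = {"telegram": "tg", "cli": "cli", "voice": "voice"}

# One bounded buffer per session; the oldest turn is dropped automatically.
_history: dict[str, deque] = defaultdict(lambda: deque(maxlen=MAX_TURNS))


def session_id_for(channel: str, identifier) -> str:
    """Build a session_id such as tg-346967270 or cli-alvis (hypothetical helper)."""
    return f"{_PREFIXES[channel]}-{identifier}"


def remember_turn(session_id: str, user_text: str, reply_text: str) -> None:
    """Append one (user, assistant) turn to the session's buffer."""
    _history[session_id].append((user_text, reply_text))


def recent_turns(session_id: str) -> list[tuple[str, str]]:
    """Return the buffered turns for a session, oldest first."""
    return list(_history[session_id])
```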

Files

adolf/
├── docker-compose.yml      Services: deepagents, openmemory, grammy, crawl4ai, routecheck, cli
├── Dockerfile              deepagents container (Python 3.12)
├── Dockerfile.cli          CLI container (python:3.12-slim + rich)
├── agent.py                FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, fast tools, memory pipeline
├── fast_tools.py           FastTool base, FastToolRunner, WeatherTool, CommuteTool
├── channels.py             Channel registry + deliver() + pending_replies
├── router.py               Router class — regex + LLM tier classification, FastToolRunner integration
├── vram_manager.py         VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py        _DirectModel (medium) / create_deep_agent (complex)
├── cli.py                  Interactive CLI REPL — Rich Live streaming + Markdown render
├── wiki_research.py        Batch wiki research pipeline (uses /message + SSE)
├── benchmarks/
│   ├── run_benchmark.py    Routing accuracy benchmark — 120 queries across 3 tiers
│   ├── run_voice_benchmark.py  Voice path benchmark
│   ├── benchmark.json      Query dataset (gitignored)
│   └── results_latest.json Last run results (gitignored)
├── .env                    TELEGRAM_BOT_TOKEN, ROUTECHECK_TOKEN, YANDEX_ROUTING_KEY (not committed)
├── routecheck/
│   ├── app.py              FastAPI: image captcha + /api/route Yandex proxy
│   └── Dockerfile
├── tests/
│   ├── integration/        Standalone integration test scripts (common.py + test_*.py)
│   └── use_cases/          Claude Code skill markdown files — Claude acts as user + evaluator
├── openmemory/
│   ├── server.py           FastMCP + mem0: add_memory, search_memory, get_all_memories
│   └── Dockerfile
└── grammy/
    ├── bot.mjs             grammY Telegram bot + POST /send HTTP endpoint
    ├── package.json
    └── Dockerfile

External Services (host ports, from openai/ stack)

Service	Host Port	Role
LiteLLM	4000	LLM proxy — all inference goes through here (`LITELLM_URL` env var)
Ollama GPU	11436	GPU inference backend + VRAM management (direct) + memory extraction
Ollama CPU	11435	nomic-embed-text embeddings for openmemory
Langfuse	3200	LLM observability — traces all requests via LiteLLM callbacks
Qdrant	6333	Vector store for memories
SearXNG	11437	Web search (used by `web_search` tool)
