40 Commits

887d4b8d90 voice benchmark: rename --dry-run → --no-inference, fix log extraction
- --no-inference applies to all tiers (not just complex)
- metadata key: dry_run → no_inference
- extract_tier_from_logs: forward iteration (not reversed), updated regex
- GPU check skipped when --no-inference
- Fix TypeError in misclassified print when actual=None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:58:05 +00:00
4e6d3090c2 Remove benchmark.json from gitignore — dataset is now tracked
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:53:35 +00:00
5b09a99a7f Routing: 100% accuracy on realistic home assistant dataset
- router.py: skip light reply generation when no_inference=True;
  add control words (да/нет/стоп/отмена/повтори/подожди/etc.) to _LIGHT_PATTERNS
- agent.py: pass no_inference to router.route(); skip preflight IO in no_inference mode
- benchmarks/benchmark.json: replace definition-heavy queries with realistic
  Alexa/Google-Home style queries (greetings, smart home, timers, shopping,
  weather, personal memory, cooking) — 30 light / 60 medium / 30 complex

Routing benchmark: 120/120 (100%), all under 0.1s per query

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:53:01 +00:00
3fb90ae083 Skip _reply_semaphore in no_inference mode
No GPU inference happens in this mode, so serialization is not needed.
Without this, timed-out routing benchmark queries hold the semaphore
and cascade-block all subsequent queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:40:07 +00:00
4d37ac65b2 Skip preflight IO (memory/URL/fast-tools) when no_inference=True
In no_inference mode only the routing decision matters — fetching
memories and URLs adds latency without affecting the classification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:37:55 +00:00
b7d5896076 routing benchmark: 1s strict deadline per query
QUERY_TIMEOUT=1s — classification and routing must complete within
1 second or the query is recorded as 'timeout'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:35:13 +00:00
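The strict deadline described above can be sketched like this (names are illustrative; the real benchmark script may differ):

```python
import asyncio

QUERY_TIMEOUT = 1.0  # strict per-query deadline, in seconds

async def run_query(send, query: str) -> str:
    # Classification + routing must finish within QUERY_TIMEOUT,
    # otherwise the query is recorded as 'timeout' and the run moves on.
    try:
        return await asyncio.wait_for(send(query), timeout=QUERY_TIMEOUT)
    except asyncio.TimeoutError:
        return "timeout"
```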
fc53632c7b Merge pull request 'feat: rename dry_run to no_inference for all tiers' (#17) from worktree-agent-afc013ce into main
Reviewed-on: #17
2026-03-24 07:27:04 +00:00
47a1166be6 Merge pull request 'feat: rename --dry-run to --no-inference in run_benchmark.py' (#18) from feat/no-inference-benchmark into main
Reviewed-on: #18
2026-03-24 07:26:44 +00:00
74e5b1758d Merge pull request 'feat: add run_routing_benchmark.py — routing-only benchmark' (#19) from feat/routing-benchmark into main
Reviewed-on: #19
2026-03-24 07:26:31 +00:00
0fbdbf3a5e Add run_routing_benchmark.py — dedicated routing-only benchmark
Tests routing accuracy for all tiers with no_inference=True hardcoded.
Fast (QUERY_TIMEOUT=30s), no GPU check, shares benchmark.json dataset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:25:16 +00:00
77db739819 Rename --dry-run to --no-inference, apply to all tiers in run_benchmark.py
No-inference mode now skips LLM for all tiers (not just complex),
GPU check is auto-skipped, and the metadata key matches agent.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 03:49:09 +00:00
9c2f27eed4 Rename dry_run → no_inference, extend to all tiers in agent.py
When no_inference=True, routing decision is captured but all LLM
inference is skipped — yields constant "I don't know" immediately.
Also disables fast-tool short-circuit so routing path always runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 03:43:42 +00:00
a363347ae5 Merge pull request 'Fix routing: add Russian tech def patterns to light, strengthen medium smart home' (#13) from fix/routing-accuracy into main
Reviewed-on: #13
2026-03-24 02:51:17 +00:00
1d2787766e Merge pull request 'Remove Bifrost: replace test 4 with LiteLLM health check' (#14) from fix/remove-bifrost into main
Reviewed-on: #14
2026-03-24 02:48:40 +00:00
abf792a2ec Remove Bifrost: replace test 4 with LiteLLM health check
- Remove BIFROST constant and fetch_bifrost_logs() from common.py
- Add LITELLM constant (localhost:4000)
- Replace test_memory.py test 4 (Bifrost pass-through) with LiteLLM health check

Fixes #5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:46:01 +00:00
537e927146 Fix routing: add Russian tech def patterns to light, strengthen medium smart home
- _LIGHT_PATTERNS: add что\s+такое, что\s+означает, сколько бит/байт,
  compound greetings (привет, как дела) — these fell through to embedding
  which sometimes misclassified short Russian phrases as medium
- _MEDIUM_PATTERNS: add non-verb-first smart home patterns (свет/лампочка
  as subject, режим/сцена commands) for benchmark queries with different phrasing

Fixes #8, #9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:45:42 +00:00
186e16284b Merge pull request 'Fix tier logging: capture actual_tier, fix parse_run_block regex, remove reply_text truncation' (#11) from fix/tier-logging into main
Reviewed-on: #11
2026-03-24 02:44:35 +00:00
0b428e4ada Merge pull request 'Fix benchmark log extraction: first tier match, increase log tail to 300' (#12) from fix/benchmark-log-extraction into main
Reviewed-on: #12
2026-03-24 02:43:26 +00:00
98095679be Fix benchmark log extraction: first tier match, increase log tail to 300
- Remove reversed() from extract_tier_from_logs: first match = routing decision
  (dry-run complex logs tier=complex early, then overwrites with tier=medium at done)
- Increase log tail from 80→300 to handle concurrent log activity

Fixes #7, #10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:42:27 +00:00
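The forward-iteration extraction described above can be sketched like this (the regex and log line format are assumptions based on the log lines quoted in these commits):

```python
import re

_TIER_RE = re.compile(r"tier=(light|medium|complex|fast)")

def extract_tier_from_logs(log_lines):
    # Forward iteration: the FIRST tier= line is the routing decision.
    # In dry-run, a complex query logs tier=complex early and then logs
    # tier=medium at completion, so the LAST match would be wrong.
    for line in log_lines:
        m = _TIER_RE.search(line)
        if m:
            return m.group(1)
    return None
```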
8ef4897869 Fix tier logging: capture actual_tier, fix parse_run_block regex, remove reply_text truncation
- Add tier_capture param to _run_agent_pipeline; append tier after determination
- Capture actual_tier in run_agent_task from tier_capture list
- Log tier in replied-in line: [agent] replied in Xs tier=Y
- Remove reply_text[:200] truncation (was breaking benchmark keyword matching)
- Update parse_run_block regex to match new log format; llm/send fields now None

Fixes #1, #3, #4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:41:59 +00:00
Alvis
1f5e272600 Switch from Bifrost to LiteLLM; add Matrix channel; update rules
Infrastructure:
- docker-compose.yml: replace bifrost container with LiteLLM proxy
  (host.docker.internal:4000); complex model → deepseek-r1:free via
  OpenRouter; add Matrix URL env var; mount logs volume
- bifrost-config.json: add auth_config + postgres config_store (archived)

Routing:
- router.py: full semantic 3-tier classifier rewrite — nomic-embed-text
  centroids for light/medium/complex; regex pre-classifiers for all tiers;
  Russian utterance sets expanded
- agent.py: wire LiteLLM URL; add dry_run support; add Matrix channel

Channels:
- channels.py: add Matrix adapter (_matrix_send via mx- session prefix)

Rules / docs:
- agent-pipeline.md: remove /think prefix requirement; document automatic
  complex tier classification
- llm-inference.md: update BIFROST_URL → LITELLM_URL references; add
  remote model note for complex tier
- ARCHITECTURE.md: deleted (superseded by README.md)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:14:13 +00:00
Alvis
54cb940279 Update docs: add benchmarks/ section, fix complex tier description
- CLAUDE.md: add benchmark commands (run_benchmark.py flags, dry-run,
  categories, voice benchmark)
- README.md: add benchmarks/ to Files tree; fix incorrect claim that
  complex tier requires /think prefix — it is auto-classified via regex
  and embedding similarity; fix "Complex agent (/think prefix)" heading

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:13:14 +00:00
Alvis
bd951f943f Move benchmark scripts into benchmarks/ subdir
- benchmarks/run_benchmark.py (was run_benchmark.py)
- benchmarks/run_voice_benchmark.py (was run_voice_benchmark.py)
- Scripts use Path(__file__).parent so paths resolve correctly in subdir
- .gitignore updated: ignore benchmarks/benchmark.json,
  results_latest.json, voice_results*.json, voice_audio/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:02:46 +00:00
Alvis
ab68bba935 Add routing benchmark scripts; gitignore dataset and results
- run_benchmark.py: sends queries to /message, extracts tier= from docker
  logs, reports per-tier accuracy, saves results_latest.json
- run_voice_benchmark.py: voice path benchmark
- .gitignore: ignore benchmark.json (dataset) and results_latest.json
  (runtime output); benchmark scripts are tracked, data files are not

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:00:17 +00:00
Alvis
3ae1cefbd4 WeatherTool: fetch open-meteo directly, skip LLM for fast tool replies
- Replace SearXNG search with direct open-meteo.com API call (no key needed)
- WeatherTool now returns a ready-to-deliver reply string
- agent.py: short-circuit router+LLM when fast tools return a result (tier=fast)
- router.py: fast tool match no longer triggers light reply generation

Weather latency: 105-190s → ~1s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 09:42:55 +00:00
Alvis
957360f6ce Restructure CLAUDE.md per official Claude Code recommendations
CLAUDE.md: 178→25 lines — commands + @ARCHITECTURE.md import only

Rules split into .claude/rules/ (load at startup, topic-scoped):
  llm-inference.md  — Bifrost-only, semaphore, model name format, timeouts
  agent-pipeline.md — tier rules, no tools in medium, memory outside loop
  fast-tools.md     — extension guide (path-scoped: fast_tools.py + agent.py)
  secrets.md        — .env keys, Vaultwarden, no hardcoding

Path-scoped rule: fast-tools.md only loads when editing fast_tools.py or agent.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-13 07:19:09 +00:00
Alvis
3ed47b45da Split CLAUDE.md per official Claude Code recommendations
CLAUDE.md: lean — commands, key conventions, fast tool guide, @ARCHITECTURE.md import
routecheck/CLAUDE.md: purpose, access paths, env vars, gotchas
openmemory/CLAUDE.md: tools, two Ollama instances, prompts, notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-13 07:15:51 +00:00
Alvis
eba805f787 Update docs: fast tools, routecheck service, commute tool
- Request flow: add fast_tool_runner.run_matching() to pre-flight gather
- New Fast Tools section: WeatherTool + CommuteTool table, extension guide
- New routecheck section: captcha UI, internal API, proxy requirements
- Services table: add routecheck:8090
- Files tree: add fast_tools.py, routecheck/, updated .env note

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-13 07:10:30 +00:00
Alvis
32089ed596 Add routecheck service and CommuteTool fast tool
routecheck/ — FastAPI service (port 8090):
  - Image captcha (PIL: arithmetic problem, noise, wave distortion)
  - POST /api/captcha/new + /api/captcha/solve → short-lived token
  - GET /api/route?from=lat,lon&to=lat,lon&token=... → Yandex Routing API
  - Internal bypass via INTERNAL_TOKEN env var (for CommuteTool)
  - HTTPS proxy forwarded to reach Yandex API from container

CommuteTool (fast_tools.py):
  - Matches commute/traffic/arrival time queries
  - Calls routecheck /api/route with ROUTECHECK_TOKEN
  - Hardcoded route: Balashikha home → Moscow center
  - Returns traffic-adjusted travel time + delay annotation

Needs: YANDEX_ROUTING_KEY + ROUTECHECK_TOKEN in .env

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 07:08:48 +00:00
Alvis
d2ca1926f8 WeatherTool: use Russian query for Celsius sources
'погода Балашиха сейчас' returns Russian weather sites (gismeteo,
meteotrend) that report in °C, vs English queries which return
Fahrenheit snippets that the model misreads as Celsius.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 06:25:53 +00:00
Alvis
af181ba7ec Rename RealTimeSearchTool → WeatherTool, fetch Balashikha weather via SearXNG
WeatherTool queries SearXNG with a fixed 'weather Balashikha Moscow now'
query instead of passing the user message as-is. SearXNG has external
internet access and returns snippets with actual current conditions.
Direct wttr.in fetch not possible — deepagents container has no external
internet routing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 05:40:10 +00:00
Alvis
f5fc2e9bfb Introduce FastTools: pre-flight classifier + context enrichment
New fast_tools.py module:
- FastTool base class (matches + run interface)
- RealTimeSearchTool: SearXNG search for weather/news/prices/scores
- FastToolRunner: classifier that checks all tools, runs matching
  ones concurrently and returns combined context

Router accepts FastToolRunner; any_matches() forces medium tier
before LLM classification (replaces _MEDIUM_FORCE_PATTERNS regex).

agent.py: _REALTIME_RE and _searxng_search_async removed; pre-flight
gather now includes fast_tool_runner.run_matching() alongside URL
fetch and memory retrieval.

To add a new fast tool: subclass FastTool, add to the list in agent.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 05:18:44 +00:00
Alvis
436299f7e2 Add real-time query handling: pre-search enrichment + routing fix
- router.py: add _MEDIUM_FORCE_PATTERNS to block weather/news/price
  queries from light tier regardless of LLM classification
- agent.py: add _REALTIME_RE and _searxng_search_async(); real-time
  queries now run SearXNG search concurrently with URL fetch + memory
  retrieval, injecting snippets into medium system prompt
- tests/use_cases/weather_now.md: use case test for weather queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 05:08:08 +00:00
Alvis
8cd41940f0 Update docs: streaming, CLI container, use_cases tests
- /stream/{session_id} SSE endpoint replaces /reply/ for CLI
- Medium tier streams per-token via astream() with in_think filtering
- CLI now runs as Docker container (Dockerfile.cli, profile:tools)
- Correct medium model to qwen3:4b with real-time think block filtering
- Add use_cases/ test category to commands section
- Update files tree and services table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 17:31:36 +00:00
Alvis
b04e8a0925 Add Rich token streaming: server SSE + CLI live display + CLI container
Server (agent.py):
- _stream_queues: per-session asyncio.Queue for token chunks
- _push_stream_chunk() / _end_stream() helpers
- Medium tier: astream() with <think> block filtering — real token streaming
- Light tier: full reply pushed as single chunk then [DONE]
- Complex tier: full reply pushed after agent completes then [DONE]
- GET /stream/{session_id} SSE endpoint (data: <chunk>\n\n, data: [DONE]\n\n)
- medium_model promoted to module-level global for astream() access

CLI (cli.py):
- stream_reply(): reads /stream/ SSE, renders tokens live with Rich Live (transient)
- Final reply rendered as Markdown after stream completes
- os.getlogin() replaced with os.getenv("USER") for container compatibility

Dockerfile.cli + docker-compose cli service (profiles: tools):
- Run: docker compose --profile tools run --rm -it cli

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 17:26:52 +00:00
Alvis
edc9a96f7a Add use_cases test category as Claude Code skill instructions
Use cases are markdown files that Claude Code reads, executes step by step
using its tools, and evaluates with its own judgment — not assertion scripts.

- cli_startup.md: pipe EOF into cli.py, verify banner and exit code 0
- apple_pie_research.md: /think query → complex tier → web_search + fetch →
  evaluate recipe quality, sources, and structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 17:01:13 +00:00
Alvis
a35ba83db7 Add use_cases test category with CLI startup test
tests/use_cases/ holds scenario-driven tests run by the Claude Code agent,
which acts as both the test runner and mock user. Each test prints a
structured transcript; Claude evaluates correctness.

First test: test_cli_startup.py — spawns cli.py with a subprocess, reads
the welcome banner, sends EOF, and verifies exit code 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 16:10:04 +00:00
Alvis
021104f510 Split monolithic test_pipeline.py into focused integration test scripts
- common.py: shared config, URL constants, benchmark questions, all helpers
  (get, post_json, check_sse, qdrant_count, fetch_logs, parse_run_block, wait_for, etc.)
- test_health.py: service health checks (deepagents, bifrost, GPU/CPU Ollama, Qdrant, SearXNG)
- test_memory.py: name store/recall pipeline, memory benchmark (5 facts + 10 recalls), dedup test
- test_routing.py: easy/medium/hard tier routing benchmarks with --easy/medium/hard-only flags
- Removed test_pipeline.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 16:02:57 +00:00
Alvis
50097d6092 Embed Crawl4AI at all tiers, restore qwen3:4b medium, update docs
- Pre-routing URL fetch: any message with URLs gets content fetched
  async (httpx.AsyncClient) before routing via _fetch_urls_from_message()
- URL context and memories gathered concurrently with asyncio.gather
- Light tier upgraded to medium when URL content is present
- url_context injected into system prompt for medium and complex agents
- Complex agent retains web_search/fetch_url tools + receives pre-fetched content
- Medium model restored to qwen3:4b (was temporarily qwen2.5:1.5b)
- Unit tests added for _extract_urls
- ARCHITECTURE.md: added Tool Handling, Crawl4AI Integration, Memory Pipeline sections
- CLAUDE.md: updated request flow and Crawl4AI integration docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:49:34 +00:00
Alvis
f9618a9bbf Integrate Bifrost LLM gateway, add test suite, implement memory pipeline
- Add Bifrost (maximhq/bifrost) as LLM gateway: all inference routes through
  bifrost:8080/v1 with retry logic and observability; VRAMManager keeps direct
  Ollama access for VRAM flush/prewarm operations
- Switch medium model from qwen3:4b to qwen2.5:1.5b (direct call, no tools)
  via _DirectModel wrapper; complex keeps create_deep_agent with qwen3:8b
- Implement out-of-agent memory pipeline: _retrieve_memories pre-fetches
  relevant context (injected into all tiers), _store_memory runs as background
  task after each reply writing to openmemory/Qdrant
- Add tests/unit/ with 133 tests covering router, channels, vram_manager,
  agent helpers; move integration test to tests/integration/
- Add bifrost-config.json with GPU Ollama (qwen2.5:0.5b/1.5b, qwen3:4b/8b,
  gemma3:4b) and CPU Ollama providers
- Integration test 28/29 pass (only grammy fails — no TELEGRAM_BOT_TOKEN)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 13:50:12 +00:00
45 changed files with 5250 additions and 1468 deletions

.claude/rules/agent-pipeline.md

@@ -0,0 +1,22 @@
# Agent Pipeline Rules
## Tiers
- Routing is fully automatic: router classifies into light/medium/complex via 3-way embedding similarity.
- Complex tier is reached automatically for deep research queries — no prefix required.
- Medium is the default tier. Light is only for trivial static-knowledge queries matched by regex or embedding.
- Light tier is automatically upgraded to medium when URL content is pre-fetched or a fast tool matches.
- `tier_override` API parameter still allows callers to force a specific tier (e.g. `adolf-deep` model → complex).
## Medium agent
- `_DirectModel` makes a single `ainvoke()` call with no tool schema. Do not add tools to the medium agent.
- `qwen3:4b` behaves unreliably when a tool array is present in the request — inject context via system prompt instead.
## Memory
- `add_memory` and `search_memory` are called directly in `run_agent_task()`, outside the agent loop.
- Never add memory tools to any agent's tool list.
- Memory storage (`_store_memory`) runs as an asyncio background task after the semaphore is released.
## Fast tools
- `FastToolRunner.run_matching()` runs in the pre-flight `asyncio.gather` alongside URL fetch and memory retrieval.
- Fast tool results are injected as a system prompt block, not returned to the user directly.
- When `any_matches()` is true, the router forces medium tier before LLM classification.

.claude/rules/fast-tools.md

@@ -0,0 +1,24 @@
---
paths:
- "fast_tools.py"
- "agent.py"
---
# Fast Tools — Extension Guide
To add a new fast tool:
1. In `fast_tools.py`, subclass `FastTool` and implement:
- `name` (str property) — unique identifier, used in logs
- `matches(message: str) -> bool` — regex or logic; keep it cheap, runs on every message
- `run(message: str) -> str` — async fetch; return a short context block or `""` on failure; never raise
2. In `agent.py`, add an instance to the `_fast_tool_runner` list (module level, after env vars are defined).
3. The router will automatically force medium tier when `matches()` returns true — no router changes needed.
## Constraints
- `run()` must return in under 15s — it runs in the pre-flight gather that blocks routing.
- Return `""` or a `[tool error: ...]` string on failure — never raise exceptions.
- Keep returned context under ~1000 chars — larger contexts slow down `qwen3:4b` streaming significantly.
- The deepagents container has no direct external internet. Use SearXNG (`host.docker.internal:11437`) or internal services.
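Following the guide, a new tool might look like this (both the `FastTool` stand-in and `TimeTool` are hypothetical; the real base class lives in fast_tools.py):

```python
import re

class FastTool:
    # Minimal stand-in for the real base class in fast_tools.py.
    name: str = "base"
    def matches(self, message: str) -> bool:
        raise NotImplementedError
    async def run(self, message: str) -> str:
        raise NotImplementedError

class TimeTool(FastTool):
    # Hypothetical example tool following the three-step guide above.
    name = "time"
    _RE = re.compile(r"\b(который час|сколько времени|what time)\b", re.I)

    def matches(self, message: str) -> bool:
        return bool(self._RE.search(message))  # cheap: runs on every message

    async def run(self, message: str) -> str:
        try:
            import datetime
            now = datetime.datetime.now().strftime("%H:%M")
            return f"[time] current local time: {now}"  # short context block
        except Exception:
            return ""  # never raise; empty string on failure
```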

.claude/rules/llm-inference.md

@@ -0,0 +1,8 @@
# LLM Inference Rules
- All LLM calls must use `base_url=LITELLM_URL` (points to LiteLLM at `host.docker.internal:4000/v1`). Never call Ollama directly for inference.
- `_reply_semaphore` (asyncio.Semaphore(1)) serializes all GPU inference. Never bypass it or add a second semaphore.
- Local Ollama models use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen2.5:1.5b`. Remote models (e.g. OpenRouter) use their full LiteLLM name: `openrouter/deepseek-r1`.
- Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them.
- `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional — LiteLLM cannot manage VRAM.
- Complex tier uses a remote model (`DEEPAGENTS_COMPLEX_MODEL`) — no VRAM management is needed for it.
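A sketch of how these rules compose (the payload builder and the injected `post` callable are illustrative; only the LiteLLM URL and the model-name format come from the rules above):

```python
import asyncio

LITELLM_URL = "http://host.docker.internal:4000/v1"
_reply_semaphore = asyncio.Semaphore(1)  # single-GPU serialization

def build_chat_request(model: str, messages: list, timeout: float) -> dict:
    # OpenAI-compatible chat payload sent to LiteLLM. Local Ollama models
    # take the "ollama/" prefix; remote ones use their full LiteLLM name.
    return {
        "url": f"{LITELLM_URL}/chat/completions",
        "json": {"model": model, "messages": messages},
        "timeout": timeout,
    }

async def call_llm(post, model: str, messages: list, timeout: float = 180.0):
    # `post` is an injected async HTTP call (e.g. httpx.AsyncClient.post),
    # so that every inference request goes through the one semaphore.
    async with _reply_semaphore:
        req = build_chat_request(model, messages, timeout)
        return await post(req["url"], json=req["json"], timeout=req["timeout"])
```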

.claude/rules/secrets.md

@@ -0,0 +1,7 @@
# Secrets and Environment
- `.env` is required at project root and must never be committed. It is in `.gitignore`.
- Required keys: `TELEGRAM_BOT_TOKEN`, `ROUTECHECK_TOKEN`, `YANDEX_ROUTING_KEY`.
- `ROUTECHECK_TOKEN` is a shared secret between `deepagents` and `routecheck` containers — generate once with `python3 -c "import uuid; print(uuid.uuid4())"`.
- All tokens are stored in Vaultwarden (AI collection). Fetch with `bw get password "<NAME>"` — see `~/.claude/CLAUDE.md` for the full procedure.
- Do not hardcode tokens, URLs, or credentials anywhere in source code.

.gitignore

@@ -0,0 +1,7 @@
__pycache__/
*.pyc
logs/*.jsonl
adolf_tuning_data/voice_audio/
benchmarks/results_latest.json
benchmarks/voice_results*.json
benchmarks/voice_audio/

ARCHITECTURE.md (deleted)

@@ -1,118 +0,0 @@
# Adolf
Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
## Architecture
```
┌─────────────────────────────────────────────────────┐
│ CHANNEL ADAPTERS │
│ │
│ [Telegram/Grammy] [CLI] [Voice — future] │
│ ↕ ↕ ↕ │
│ └────────────────┴────────────┘ │
│ ↕ │
│ ┌─────────────────────────┐ │
│ │ GATEWAY (agent.py) │ │
│ │ FastAPI :8000 │ │
│ │ │ │
│ │ POST /message │ ← all inbound │
│ │ POST /chat (legacy) │ │
│ │ GET /reply/{id} SSE │ ← CLI polling │
│ │ GET /health │ │
│ │ │ │
│ │ channels.py registry │ │
│ │ conversation buffers │ │
│ └──────────┬──────────────┘ │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ AGENT CORE │ │
│ │ three-tier routing │ │
│ │ VRAM management │ │
│ └──────────────────────┘ │
│ ↓ │
│ channels.deliver(session_id, channel, text)│
│ ↓ ↓ │
│ telegram → POST grammy/send cli → SSE queue │
└─────────────────────────────────────────────────────┘
```
## Channel Adapters
| Channel | session_id | Inbound | Outbound |
|---------|-----------|---------|---------|
| Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
| CLI | `cli-<user>` | POST /message directly | GET /reply/{id} SSE stream |
| Voice | `voice-<device>` | (future) | (future) |
## Unified Message Flow
```
1. Channel adapter receives message
2. POST /message {text, session_id, channel, user_id}
3. 202 Accepted immediately
4. Background: run_agent_task(message, session_id, channel)
5. Route → run agent tier → get reply text
6. channels.deliver(session_id, channel, reply_text)
- always puts reply in pending_replies[session_id] queue (for SSE)
- calls channel-specific send callback
7. GET /reply/{session_id} SSE clients receive the reply
```
## Three-Tier Model Routing
| Tier | Model | VRAM | Trigger | Latency |
|------|-------|------|---------|---------|
| Light | qwen2.5:1.5b (router answers) | ~1.2 GB | Router classifies as light | ~2-4s |
| Medium | qwen3:4b | ~2.5 GB | Default | ~20-40s |
| Complex | qwen3:8b | ~6.0 GB | `/think` prefix | ~60-120s |
**`/think` prefix**: forces complex tier, stripped before sending to agent.
## VRAM Management
GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU).
1. Flush explicitly before loading qwen3:8b (`keep_alive=0`)
2. Verify eviction via `/api/ps` poll (15s timeout) before proceeding
3. Fallback: timeout → run medium agent instead
4. Post-complex: flush 8b, pre-warm 4b + router
## Session ID Convention
- Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
- CLI: `cli-<username>` (e.g. `cli-alvis`)
Conversation history is keyed by session_id (5-turn buffer).
## Files
```
adolf/
├── docker-compose.yml Services: deepagents, openmemory, grammy
├── Dockerfile deepagents container (Python 3.12)
├── agent.py FastAPI gateway + three-tier routing
├── channels.py Channel registry + deliver() + pending_replies
├── router.py Router class — qwen2.5:1.5b routing
├── vram_manager.py VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py build_medium_agent / build_complex_agent
├── cli.py Interactive CLI REPL client
├── wiki_research.py Batch wiki research pipeline (uses /message + SSE)
├── .env TELEGRAM_BOT_TOKEN (not committed)
├── openmemory/
│ ├── server.py FastMCP + mem0 MCP tools
│ └── Dockerfile
└── grammy/
├── bot.mjs grammY Telegram bot + POST /send HTTP endpoint
├── package.json
└── Dockerfile
```
## External Services (from openai/ stack)
| Service | Host Port | Role |
|---------|-----------|------|
| Ollama GPU | 11436 | All reply inference |
| Ollama CPU | 11435 | Memory embedding (nomic-embed-text) |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search |

CLAUDE.md

@@ -0,0 +1,41 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands
```bash
# Start all services
docker compose up --build
# Interactive CLI (requires services running)
docker compose --profile tools run --rm -it cli
# Integration tests — run from tests/integration/, require all services up
python3 test_health.py
python3 test_memory.py [--name-only|--bench-only|--dedup-only]
python3 test_routing.py [--easy-only|--medium-only|--hard-only]
# Use case tests — read the .md file and follow its steps as Claude Code
# example: read tests/use_cases/weather_now.md and execute it
# Routing benchmark — measures tier classification accuracy across 120 queries
# Run from benchmarks/ — Adolf must be running. DO NOT run during active use (holds GPU).
cd benchmarks
python3 run_benchmark.py # full run (120 queries)
python3 run_benchmark.py --tier light # light tier only (30 queries)
python3 run_benchmark.py --tier medium # medium tier only (50 queries)
python3 run_benchmark.py --tier complex --dry-run # complex tier, medium model (no API cost)
python3 run_benchmark.py --category smart_home_control
python3 run_benchmark.py --ids 1,2,3
python3 run_benchmark.py --list-categories
# Voice benchmark
python3 run_voice_benchmark.py
# benchmark.json (dataset) and results_latest.json are gitignored — not committed
```
## Architecture
@README.md

Dockerfile

@@ -2,9 +2,9 @@ FROM python:3.12-slim
 WORKDIR /app
-RUN pip install --no-cache-dir deepagents langchain-ollama langgraph \
+RUN pip install --no-cache-dir deepagents langchain-openai langgraph \
     fastapi uvicorn langchain-mcp-adapters langchain-community httpx
-COPY agent.py channels.py vram_manager.py router.py agent_factory.py hello_world.py .
+COPY agent.py channels.py vram_manager.py router.py agent_factory.py fast_tools.py hello_world.py ./
 CMD ["uvicorn", "agent:app", "--host", "0.0.0.0", "--port", "8000"]

Dockerfile.cli

@@ -0,0 +1,9 @@
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir rich
COPY cli.py .
CMD ["python3", "cli.py"]

README.md

@@ -0,0 +1,208 @@
# Adolf
Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
## Architecture
```
┌─────────────────────────────────────────────────────┐
│ CHANNEL ADAPTERS │
│ │
│ [Telegram/Grammy] [CLI] [Voice — future] │
│ ↕ ↕ ↕ │
│ └────────────────┴────────────┘ │
│ ↕ │
│ ┌─────────────────────────┐ │
│ │ GATEWAY (agent.py) │ │
│ │ FastAPI :8000 │ │
│ │ │ │
│ │ POST /message │ ← all inbound │
│ │ POST /chat (legacy) │ │
│ │ GET /stream/{id} SSE │ ← token stream│
│ │ GET /reply/{id} SSE │ ← legacy poll │
│ │ GET /health │ │
│ │ │ │
│ │ channels.py registry │ │
│ │ conversation buffers │ │
│ └──────────┬──────────────┘ │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ AGENT CORE │ │
│ │ three-tier routing │ │
│ │ VRAM management │ │
│ └──────────────────────┘ │
│ ↓ │
│ channels.deliver(session_id, channel, text)│
│ ↓ ↓ │
│ telegram → POST grammy/send cli → SSE queue │
└─────────────────────────────────────────────────────┘
```
## Channel Adapters
| Channel | session_id | Inbound | Outbound |
|---------|-----------|---------|---------|
| Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
| CLI | `cli-<user>` | POST /message directly | GET /stream/{id} SSE — Rich Live streaming |
| Voice | `voice-<device>` | (future) | (future) |
## Unified Message Flow
```
1. Channel adapter receives message
2. POST /message {text, session_id, channel, user_id}
3. 202 Accepted immediately
4. Background: run_agent_task(message, session_id, channel)
5. Parallel IO (asyncio.gather):
a. _fetch_urls_from_message() — Crawl4AI fetches any URLs in message
b. _retrieve_memories() — openmemory semantic search for context
c. _fast_tool_runner.run_matching() — FastTools (weather, commute) if pattern matches
6. router.route() with enriched history (url_context + fast_context + memories)
- fast tool match → force medium (real-time data, no point routing to light)
- if URL content fetched and tier=light → upgrade to medium
7. Invoke agent for tier with url_context + memories in system prompt
8. Token streaming:
- medium: astream() pushes per-token chunks to _stream_queues[session_id]; <think> blocks filtered in real time
- light/complex: full reply pushed as single chunk after completion
- _end_stream() sends [DONE] sentinel
9. channels.deliver(session_id, channel, reply_text) — Telegram callback
10. _store_memory() background task — stores turn in openmemory
11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown
```
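Step 5 above can be sketched as follows (the three helpers are stand-ins that simulate IO; only the gather pattern reflects agent.py):

```python
import asyncio

# Stand-ins for the real helpers in agent.py, each simulating IO latency.
async def _fetch_urls_from_message(message: str) -> str:
    await asyncio.sleep(0.01)
    return ""  # no URLs in the message

async def _retrieve_memories(session_id: str, message: str) -> str:
    await asyncio.sleep(0.01)
    return "user lives in Balashikha"

async def _run_matching_fast_tools(message: str) -> str:
    await asyncio.sleep(0.01)
    return ""  # no fast tool matched

async def preflight(message: str, session_id: str):
    # The three IO tasks run concurrently via asyncio.gather, so total
    # pre-flight latency is the slowest task, not the sum of all three.
    return await asyncio.gather(
        _fetch_urls_from_message(message),
        _retrieve_memories(session_id, message),
        _run_matching_fast_tools(message),
    )
```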
## Tool Handling
Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime.
**Complex agent:** `web_search` and `fetch_url` are defined as `langchain_core.tools.Tool` objects and passed to `create_deep_agent()`. The deepagents library runs an agentic loop (LangGraph `create_react_agent` under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch.
**Medium agent (default):** `_DirectModel` makes a single `model.ainvoke(messages)` call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — `qwen3:4b` behaves unreliably when a tool array is present.
**Memory tools (out-of-loop):** `add_memory` and `search_memory` are LangChain MCP tool objects (via `langchain_mcp_adapters`) but are excluded from both agents' tool lists. They are called directly — `await _memory_add_tool.ainvoke(...)` — outside the agent loop, before and after each turn.
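The out-of-loop call pattern can be sketched as below; `FakeMemoryTool` is a hypothetical stub with the same `ainvoke` shape as the real MCP tool object:

```python
import asyncio

class FakeMemoryTool:
    """Hypothetical stand-in for the add_memory MCP tool (same ainvoke shape)."""
    def __init__(self):
        self.calls = []

    async def ainvoke(self, payload: dict) -> str:
        self.calls.append(payload)
        return "ok"

async def store_turn(memory_add_tool, user_msg, assistant_reply, user_id):
    # Called after the turn completes, outside the agent loop —
    # the agents never see add_memory in their tool lists.
    text = f"User: {user_msg}\nAssistant: {assistant_reply}"
    await memory_add_tool.ainvoke({"text": text, "user_id": user_id})

tool = FakeMemoryTool()
asyncio.run(store_turn(tool, "my name is Alvis", "Got it, I'll remember that.", "alvis"))
```

Keeping memory out of the tool schema means the unreliable small models never have to decide when to call it.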
## Three-Tier Model Routing
| Tier | Model | Agent | Trigger | Latency |
|------|-------|-------|---------|---------|
| Light | `qwen2.5:1.5b` (router answers directly) | — | Regex pre-match or 3-way embedding classifies "light" | ~2–4s |
| Medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | `_DirectModel` — single LLM call, no tools | Default; also forced when message contains URLs | ~10–20s |
| Complex | `deepseek/deepseek-r1:free` via LiteLLM (`DEEPAGENTS_COMPLEX_MODEL`) | `create_deep_agent` — agentic loop with tools | Auto-classified by embedding similarity | ~30–90s |
Routing is fully automatic: a regex pre-classifier catches obvious deep-research phrasing (`исследуй`, `изучи все`, `напиши подробный`, etc.), and 3-way cosine similarity over pre-embedded utterance centroids (light / medium / complex) handles everything else. No prefix is required. Use the `adolf-deep` model name to force the complex tier via the API.
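The embedding route can be sketched as nearest-centroid classification over cosine similarity. The vectors below are toy 2-D stand-ins; the real router embeds utterances with nomic-embed-text:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(embedding, centroids):
    """Pick the tier whose pre-embedded utterance centroid is nearest."""
    return max(centroids, key=lambda tier: cosine(embedding, centroids[tier]))

# Toy centroids standing in for averages of pre-embedded example utterances
centroids = {"light": [1.0, 0.0], "medium": [0.0, 1.0], "complex": [-1.0, 0.0]}
tier = classify([0.9, 0.1], centroids)
```

The regex pre-match short-circuits this entirely for unambiguous phrasings, so the embedding call only pays off on the long tail.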
## Fast Tools (`fast_tools.py`)
Pre-flight tools that run concurrently with URL fetch and memory retrieval before any LLM call. Each tool has two methods:
- `matches(message) → bool` — regex classifier; also used by `Router` to force medium tier
- `run(message) → str` — async fetch returning a context block injected into system prompt
`FastToolRunner` holds all tools. `any_matches()` is called by the Router during tier selection (step 6 of the flow above); `run_matching()` is called in the pre-flight `asyncio.gather` (step 5) in `run_agent_task()`.
| Tool | Pattern | Source | Context returned |
|------|---------|--------|-----------------|
| `WeatherTool` | weather/forecast/temperature/snow/rain | SearXNG `"погода Балашиха сейчас"` | Current conditions in °C from Russian weather sites |
| `CommuteTool` | commute/traffic/arrival/пробки | `routecheck:8090/api/route` (Yandex Routing API) | Drive time with/without traffic, Balashikha→Moscow |
**To add a new fast tool:** subclass `FastTool` in `fast_tools.py`, implement `name`/`matches`/`run`, add an instance to `_fast_tool_runner` in `agent.py`.
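A sketch of what such a subclass looks like. The `FastTool` base shown here is a minimal stand-in, since `fast_tools.py` is not reproduced on this page, and `CurrencyTool` is a hypothetical example tool:

```python
import asyncio
import re

class FastTool:
    """Minimal stand-in base mirroring the matches/run contract above."""
    name = "base"
    def matches(self, message: str) -> bool:
        raise NotImplementedError
    async def run(self, message: str) -> str:
        raise NotImplementedError

class CurrencyTool(FastTool):
    """Hypothetical tool: matches exchange-rate questions."""
    name = "currency"
    _pattern = re.compile(r"курс|exchange rate", re.IGNORECASE)

    def matches(self, message: str) -> bool:
        return bool(self._pattern.search(message))

    async def run(self, message: str) -> str:
        # A real tool would fetch live data here; return a context block.
        return "Live data: USD/RUB rate placeholder"

tool = CurrencyTool()
matched = tool.matches("what is the exchange rate today?")
context = asyncio.run(tool.run("what is the exchange rate today?"))
```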
## routecheck Service (`routecheck/`)
Local web service on port 8090. Exists because Yandex Routing API free tier requires a web UI that uses the API.
**Web UI** (`http://localhost:8090`): PIL-generated arithmetic captcha → lat/lon form → travel time result.
**Internal API**: `GET /api/route?from=lat,lon&to=lat,lon&token=ROUTECHECK_TOKEN` — bypasses captcha, used by `CommuteTool`. The `ROUTECHECK_TOKEN` shared secret is set in `.env` and passed to both `routecheck` and `deepagents` containers.
Yandex API calls are routed through the host HTTPS proxy (`host.docker.internal:56928`) since the container has no direct external internet access.
**Requires** `.env`: `YANDEX_ROUTING_KEY` (free from `developer.tech.yandex.ru`) + `ROUTECHECK_TOKEN`.
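The `CommuteTool` request reduces to plain URL construction; the parameter names come from the endpoint description above, while the coordinates and token below are placeholders:

```python
from urllib.parse import urlencode

def build_route_url(base_url: str, frm: str, to: str, token: str) -> str:
    """Build the GET /api/route request that bypasses the captcha."""
    qs = urlencode({"from": frm, "to": to, "token": token})
    return f"{base_url}/api/route?{qs}"

url = build_route_url("http://routecheck:8090", "55.79,37.93", "55.75,37.62", "secret")
```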
## Crawl4AI Integration
Crawl4AI runs as a Docker service (`crawl4ai:11235`) providing JS-rendered, bot-bypass page fetching.
**Pre-routing fetch (all tiers):**
- `_URL_RE` detects `https?://` URLs in any incoming message
- `_crawl4ai_fetch_async()` uses `httpx.AsyncClient` to POST `{urls: [...]}` to `/crawl`
- Up to 3 URLs fetched concurrently via `asyncio.gather`
- Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts
- If fetch succeeds and router returns light → tier upgraded to medium
**Complex agent tools:**
- `web_search`: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page text
- `fetch_url`: Crawl4AI single-URL fetch for any specific URL
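Response handling for `/crawl` reduces to a small parsing step. This sketch assumes the response shape `results[0].success` / `markdown.raw_markdown`, which is the shape consumed in `agent.py`:

```python
def extract_markdown(crawl_response: dict, limit: int = 3000) -> str:
    """Pull clean markdown out of a Crawl4AI /crawl JSON response."""
    results = crawl_response.get("results", [])
    if not results or not results[0].get("success"):
        return ""  # failed fetch contributes nothing to the context block
    md_obj = results[0].get("markdown") or {}
    md = md_obj.get("raw_markdown") if isinstance(md_obj, dict) else str(md_obj)
    return (md or "")[:limit]

resp = {"results": [{"success": True, "markdown": {"raw_markdown": "# Page\nbody"}}]}
snippet = extract_markdown(resp)
```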
## Memory Pipeline
openmemory runs as a FastMCP server (`openmemory:8765`) backed by mem0 + Qdrant + nomic-embed-text.
**Retrieval (before routing):** `_retrieve_memories()` calls `search_memory` MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit.
**Storage (after reply):** `_store_memory()` runs as an asyncio background task, calling `add_memory` with `"User: ...\nAssistant: ..."`. The extraction LLM (`qwen2.5:1.5b` on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt.
Memory tools (`add_memory`, `search_memory`, `get_all_memories`) are excluded from agent tool lists — memory management happens outside the agent loop.
## VRAM Management
GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU).
1. Flush explicitly before loading qwen3:8b (`keep_alive=0`)
2. Verify eviction via `/api/ps` poll (15s timeout) before proceeding
3. Fallback: timeout → run medium agent instead
4. Post-complex: flush 8b, pre-warm medium + router
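Step 2 above can be sketched as a poll loop with an injectable status callable standing in for Ollama's `/api/ps`:

```python
import time

def wait_for_eviction(loaded_models, model, timeout=15.0, interval=0.1):
    """Poll until `model` disappears from VRAM, or time out.

    `loaded_models` is a callable returning the set of currently loaded
    model names — a stand-in for polling Ollama's /api/ps endpoint.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if model not in loaded_models():
            return True  # eviction confirmed, safe to load the big model
        time.sleep(interval)
    return False  # caller falls back to the medium agent

# Stub: qwen3:4b is already gone, so the poll returns immediately.
evicted = wait_for_eviction(lambda: {"qwen3:8b"}, "qwen3:4b", timeout=0.5)
```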
## Session ID Convention
- Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
- CLI: `cli-<username>` (e.g. `cli-alvis`)
Conversation history is keyed by session_id (5-turn buffer).
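The convention reduces to two small helpers (hypothetical names; the codebase builds these strings inline):

```python
_PREFIXES = {"telegram": "tg", "cli": "cli", "voice": "voice"}

def make_session_id(channel: str, ident: str) -> str:
    # e.g. ("telegram", "346967270") -> "tg-346967270"
    return f"{_PREFIXES[channel]}-{ident}"

def channel_of(session_id: str) -> str:
    """Recover the channel family from a session_id prefix."""
    prefix = session_id.split("-", 1)[0]
    return {v: k for k, v in _PREFIXES.items()}[prefix]

sid = make_session_id("telegram", "346967270")
```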
## Files
```
adolf/
├── docker-compose.yml Services: deepagents, openmemory, grammy, crawl4ai, routecheck, cli
├── Dockerfile deepagents container (Python 3.12)
├── Dockerfile.cli CLI container (python:3.12-slim + rich)
├── agent.py FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, fast tools, memory pipeline
├── fast_tools.py FastTool base, FastToolRunner, WeatherTool, CommuteTool
├── channels.py Channel registry + deliver() + pending_replies
├── router.py Router class — regex + LLM tier classification, FastToolRunner integration
├── vram_manager.py VRAMManager — flush/prewarm/poll Ollama VRAM
├── agent_factory.py _DirectModel (medium) / create_deep_agent (complex)
├── cli.py Interactive CLI REPL — Rich Live streaming + Markdown render
├── wiki_research.py Batch wiki research pipeline (uses /message + SSE)
├── benchmarks/
│ ├── run_benchmark.py Routing accuracy benchmark — 120 queries across 3 tiers
│ ├── run_voice_benchmark.py Voice path benchmark
│ ├── benchmark.json Query dataset (tracked)
│ └── results_latest.json Last run results (gitignored)
├── .env TELEGRAM_BOT_TOKEN, ROUTECHECK_TOKEN, YANDEX_ROUTING_KEY (not committed)
├── routecheck/
│ ├── app.py FastAPI: image captcha + /api/route Yandex proxy
│ └── Dockerfile
├── tests/
│ ├── integration/ Standalone integration test scripts (common.py + test_*.py)
│ └── use_cases/ Claude Code skill markdown files — Claude acts as user + evaluator
├── openmemory/
│ ├── server.py FastMCP + mem0: add_memory, search_memory, get_all_memories
│ └── Dockerfile
└── grammy/
├── bot.mjs grammY Telegram bot + POST /send HTTP endpoint
├── package.json
└── Dockerfile
```
## External Services (host ports, from openai/ stack)
| Service | Host Port | Role |
|---------|-----------|------|
| LiteLLM | 4000 | LLM proxy — all inference goes through here (`LITELLM_URL` env var) |
| Ollama GPU | 11436 | GPU inference backend + VRAM management (direct) + memory extraction |
| Ollama CPU | 11435 | nomic-embed-text embeddings for openmemory |
| Langfuse | 3200 | LLM observability — traces all requests via LiteLLM callbacks |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search (used by `web_search` tool) |

## agent.py
@@ -1,7 +1,9 @@
import asyncio
import json as _json_module
import os
import time
-from contextlib import asynccontextmanager
+from contextlib import asynccontextmanager, nullcontext
from pathlib import Path
from fastapi import FastAPI, BackgroundTasks, Request
from fastapi.responses import JSONResponse, StreamingResponse
@@ -10,7 +12,14 @@ from pydantic import BaseModel
import re as _re
import httpx as _httpx
from langchain_ollama import ChatOllama
_URL_RE = _re.compile(r'https?://[^\s<>"\']+')
def _extract_urls(text: str) -> list[str]:
return _URL_RE.findall(text)
from openai import AsyncOpenAI
from langchain_openai import ChatOpenAI
from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain_community.utilities import SearxSearchWrapper
from langchain_core.tools import Tool
@@ -18,23 +27,120 @@ from langchain_core.tools import Tool
from vram_manager import VRAMManager
from router import Router
from agent_factory import build_medium_agent, build_complex_agent
from fast_tools import FastToolRunner, WeatherTool, CommuteTool
import channels
# LiteLLM proxy — all LLM inference goes through here
LITELLM_URL = os.getenv("LITELLM_URL", "http://host.docker.internal:4000/v1")
LITELLM_API_KEY = os.getenv("LITELLM_API_KEY", "dummy")
# Direct Ollama URL — used only by VRAMManager for flush/prewarm/poll
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
-ROUTER_MODEL = os.getenv("DEEPAGENTS_ROUTER_MODEL", "qwen2.5:0.5b")
+ROUTER_MODEL = os.getenv("DEEPAGENTS_ROUTER_MODEL", "qwen2.5:1.5b")
MEDIUM_MODEL = os.getenv("DEEPAGENTS_MODEL", "qwen3:4b")
COMPLEX_MODEL = os.getenv("DEEPAGENTS_COMPLEX_MODEL", "qwen3:8b")
SEARXNG_URL = os.getenv("SEARXNG_URL", "http://host.docker.internal:11437")
OPENMEMORY_URL = os.getenv("OPENMEMORY_URL", "http://openmemory:8765")
CRAWL4AI_URL = os.getenv("CRAWL4AI_URL", "http://crawl4ai:11235")
ROUTECHECK_URL = os.getenv("ROUTECHECK_URL", "http://routecheck:8090")
ROUTECHECK_TOKEN = os.getenv("ROUTECHECK_TOKEN", "")
MAX_HISTORY_TURNS = 5
_conversation_buffers: dict[str, list] = {}
# ── Interaction logging (RLHF data collection) ─────────────────────────────────
_LOG_DIR = Path(os.getenv("ADOLF_LOG_DIR", "/app/logs"))
_INTERACTIONS_LOG = _LOG_DIR / "interactions.jsonl"
def _ensure_log_dir() -> None:
try:
_LOG_DIR.mkdir(parents=True, exist_ok=True)
except Exception as e:
print(f"[log] cannot create log dir {_LOG_DIR}: {e}", flush=True)
async def _log_interaction(
session_id: str,
channel: str,
tier: str,
input_text: str,
response_text: str | None,
latency_ms: int,
metadata: dict | None = None,
) -> None:
"""Append one interaction record to the JSONL log for future RLHF/finetuning."""
record = {
"ts": time.time(),
"session_id": session_id,
"channel": channel,
"tier": tier,
"input": input_text,
"output": response_text or "",
"latency_ms": latency_ms,
}
if metadata:
record["metadata"] = metadata
try:
_ensure_log_dir()
with open(_INTERACTIONS_LOG, "a", encoding="utf-8") as f:
f.write(_json_module.dumps(record, ensure_ascii=False) + "\n")
except Exception as e:
print(f"[log] write error: {e}", flush=True)
# Per-session streaming queues — filled during inference, read by /stream/{session_id}
_stream_queues: dict[str, asyncio.Queue] = {}
async def _push_stream_chunk(session_id: str, chunk: str) -> None:
q = _stream_queues.setdefault(session_id, asyncio.Queue())
await q.put(chunk)
async def _end_stream(session_id: str) -> None:
q = _stream_queues.setdefault(session_id, asyncio.Queue())
await q.put("[DONE]")
async def _crawl4ai_fetch_async(url: str) -> str:
"""Async fetch via Crawl4AI — JS-rendered, bot-bypass, returns clean markdown."""
try:
async with _httpx.AsyncClient(timeout=60) as client:
r = await client.post(f"{CRAWL4AI_URL}/crawl", json={"urls": [url]})
r.raise_for_status()
results = r.json().get("results", [])
if not results or not results[0].get("success"):
return ""
md_obj = results[0].get("markdown") or {}
md = md_obj.get("raw_markdown") if isinstance(md_obj, dict) else str(md_obj)
return (md or "")[:5000]
except Exception as e:
return f"[fetch error: {e}]"
async def _fetch_urls_from_message(message: str) -> str:
"""If message contains URLs, fetch their content concurrently via Crawl4AI.
Returns a formatted context block, or '' if no URLs or all fetches fail."""
urls = _extract_urls(message)
if not urls:
return ""
# Fetch up to 3 URLs concurrently
results = await asyncio.gather(*[_crawl4ai_fetch_async(u) for u in urls[:3]])
parts = []
for url, content in zip(urls[:3], results):
if content and not content.startswith("[fetch error"):
parts.append(f"### {url}\n{content[:3000]}")
if not parts:
return ""
return "User's message contains URLs. Fetched content:\n\n" + "\n\n".join(parts)
# /no_think at the start of the system prompt disables qwen3 chain-of-thought.
# create_deep_agent prepends our system_prompt before BASE_AGENT_PROMPT, so
# /no_think lands at position 0 and is respected by qwen3 models via Ollama.
MEDIUM_SYSTEM_PROMPT = (
-"You are a helpful AI assistant. "
-"Use web_search for questions about current events or facts you don't know. "
-"Reply concisely."
+"You are a helpful AI assistant. Reply concisely. "
+"If asked to remember a fact or name, simply confirm: 'Got it, I'll remember that.'"
)
COMPLEX_SYSTEM_PROMPT = (
@@ -49,11 +155,20 @@ COMPLEX_SYSTEM_PROMPT = (
"NEVER invent URLs. End with: **Sources checked: N**"
)
medium_model = None
medium_agent = None
complex_agent = None
router: Router = None
vram_manager: VRAMManager = None
mcp_client = None
_memory_add_tool = None
_memory_search_tool = None
# Fast tools run before the LLM — classifier + context enricher
_fast_tool_runner = FastToolRunner([
WeatherTool(),
CommuteTool(routecheck_url=ROUTECHECK_URL, internal_token=ROUTECHECK_TOKEN),
])
# GPU mutex: one LLM inference at a time
_reply_semaphore = asyncio.Semaphore(1)
@@ -61,25 +176,37 @@ _reply_semaphore = asyncio.Semaphore(1)
@asynccontextmanager
async def lifespan(app: FastAPI):
global medium_agent, complex_agent, router, vram_manager, mcp_client
global medium_model, medium_agent, complex_agent, router, vram_manager, mcp_client, \
_memory_add_tool, _memory_search_tool
# Register channel adapters
channels.register_defaults()
# Three model instances
router_model = ChatOllama(
model=ROUTER_MODEL, base_url=OLLAMA_BASE_URL, think=False, num_ctx=4096,
# All three models route through Bifrost → Ollama GPU.
router_model = ChatOpenAI(
model=f"ollama/{ROUTER_MODEL}",
base_url=LITELLM_URL,
api_key=LITELLM_API_KEY,
temperature=0,
timeout=30,
)
medium_model = ChatOllama(
model=MEDIUM_MODEL, base_url=OLLAMA_BASE_URL, think=False, num_ctx=8192
embedder = AsyncOpenAI(base_url=LITELLM_URL, api_key=LITELLM_API_KEY)
medium_model = ChatOpenAI(
model=f"ollama/{MEDIUM_MODEL}",
base_url=LITELLM_URL,
api_key=LITELLM_API_KEY,
timeout=180,
)
complex_model = ChatOllama(
model=COMPLEX_MODEL, base_url=OLLAMA_BASE_URL, think=True, num_ctx=16384
complex_model = ChatOpenAI(
model=COMPLEX_MODEL, # full model name — may be remote (OpenRouter) or local ollama/*
base_url=LITELLM_URL,
api_key=LITELLM_API_KEY,
timeout=600,
)
vram_manager = VRAMManager(base_url=OLLAMA_BASE_URL)
router = Router(model=router_model)
router = Router(model=router_model, embedder=embedder, fast_tool_runner=_fast_tool_runner)
await router.initialize()
mcp_connections = {
"openmemory": {"transport": "sse", "url": f"{OPENMEMORY_URL}/sse"},
@@ -97,6 +224,13 @@ async def lifespan(app: FastAPI):
agent_tools = [t for t in mcp_tools if t.name not in ("add_memory", "search_memory", "get_all_memories")]
# Expose memory tools directly so run_agent_task can call them outside the agent loop
for t in mcp_tools:
if t.name == "add_memory":
_memory_add_tool = t
elif t.name == "search_memory":
_memory_search_tool = t
searx = SearxSearchWrapper(searx_host=SEARXNG_URL)
def _crawl4ai_fetch(url: str) -> str:
@@ -187,13 +321,15 @@ async def lifespan(app: FastAPI):
)
print(
f"[agent] three-tier: router={ROUTER_MODEL} | medium={MEDIUM_MODEL} | complex={COMPLEX_MODEL}",
f"[agent] litellm={LITELLM_URL} | router=semantic(ollama/{ROUTER_MODEL}+nomic-embed-text) | "
f"medium=ollama/{MEDIUM_MODEL} | complex={COMPLEX_MODEL}",
flush=True,
)
print(f"[agent] agent tools: {[t.name for t in agent_tools]}", flush=True)
yield
medium_model = None
medium_agent = None
complex_agent = None
router = None
@@ -222,13 +358,19 @@ class ChatRequest(BaseModel):
# ── helpers ────────────────────────────────────────────────────────────────────
def _strip_think(text: str) -> str:
"""Strip qwen3 chain-of-thought blocks that appear inline in content
when using Ollama's OpenAI-compatible endpoint (/v1/chat/completions)."""
return _re.sub(r"<think>.*?</think>", "", text, flags=_re.DOTALL).strip()
def _extract_final_text(result) -> str | None:
msgs = result.get("messages", [])
for m in reversed(msgs):
if type(m).__name__ == "AIMessage" and getattr(m, "content", ""):
-return m.content
+return _strip_think(m.content)
if isinstance(result, dict) and result.get("output"):
-return result["output"]
+return _strip_think(result["output"])
return None
@@ -244,60 +386,176 @@ def _log_messages(result):
print(f"[agent] {role}{tc['name']}({tc['args']})", flush=True)
# ── core task ──────────────────────────────────────────────────────────────────
# ── memory helpers ─────────────────────────────────────────────────────────────
async def run_agent_task(message: str, session_id: str, channel: str = "telegram"):
print(f"[agent] queued: {message[:80]!r} chat={session_id}", flush=True)
def _resolve_user_id(session_id: str) -> str:
"""Map any session_id to a canonical user identity for openmemory.
All channels share the same memory pool for the single user."""
return "alvis"
force_complex = False
clean_message = message
if message.startswith("/think "):
force_complex = True
clean_message = message[len("/think "):]
print("[agent] /think prefix → force_complex=True", flush=True)
async with _reply_semaphore:
async def _store_memory(session_id: str, user_msg: str, assistant_reply: str) -> None:
"""Store a conversation turn in openmemory (runs as a background task)."""
if _memory_add_tool is None:
return
t0 = time.monotonic()
try:
text = f"User: {user_msg}\nAssistant: {assistant_reply}"
user_id = _resolve_user_id(session_id)
await _memory_add_tool.ainvoke({"text": text, "user_id": user_id})
print(f"[memory] stored in {time.monotonic() - t0:.1f}s", flush=True)
except Exception as e:
print(f"[memory] error: {e}", flush=True)
async def _retrieve_memories(message: str, session_id: str) -> str:
"""Search openmemory for relevant context. Returns formatted string or ''."""
if _memory_search_tool is None:
return ""
try:
user_id = _resolve_user_id(session_id)
result = await _memory_search_tool.ainvoke({"query": message, "user_id": user_id})
if result and result.strip() and result.strip() != "[]":
return f"Relevant memories:\n{result}"
except Exception:
pass
return ""
# ── core pipeline ──────────────────────────────────────────────────────────────
from typing import AsyncGenerator
async def _run_agent_pipeline(
message: str,
history: list[dict],
session_id: str,
tier_override: str | None = None,
no_inference: bool = False,
tier_capture: list | None = None,
) -> AsyncGenerator[str, None]:
"""Core pipeline: pre-flight → routing → inference. Yields text chunks.
tier_override: "light" | "medium" | "complex" | None (auto-route)
no_inference: if True, routing decision is still made but inference is skipped — yields "I don't know" immediately
Caller is responsible for scheduling _store_memory after consuming all chunks.
"""
async with (nullcontext() if no_inference else _reply_semaphore):
t0 = time.monotonic()
history = _conversation_buffers.get(session_id, [])
clean_message = message
print(f"[agent] running: {clean_message[:80]!r}", flush=True)
tier, light_reply = await router.route(clean_message, history, force_complex)
print(f"[agent] tier={tier} message={clean_message[:60]!r}", flush=True)
# Fetch URL content, memories, and fast-tool context concurrently
# Skip preflight IO in no_inference mode — only routing decision needed
if no_inference:
url_context = memories = fast_context = None
else:
url_context, memories, fast_context = await asyncio.gather(
_fetch_urls_from_message(clean_message),
_retrieve_memories(clean_message, session_id),
_fast_tool_runner.run_matching(clean_message),
)
if url_context:
print(f"[agent] crawl4ai: {len(url_context)} chars fetched", flush=True)
if fast_context:
names = _fast_tool_runner.matching_names(clean_message)
print(f"[agent] fast_tools={names}: {len(fast_context)} chars injected", flush=True)
# Build enriched history
enriched_history = list(history)
if url_context:
enriched_history = [{"role": "system", "content": url_context}] + enriched_history
if fast_context:
enriched_history = [{"role": "system", "content": fast_context}] + enriched_history
if memories:
enriched_history = [{"role": "system", "content": memories}] + enriched_history
final_text = None
try:
if tier == "light":
final_text = light_reply
llm_elapsed = time.monotonic() - t0
print(f"[agent] light path: answered by router", flush=True)
llm_elapsed = 0.0
elif tier == "medium":
system_prompt = MEDIUM_SYSTEM_PROMPT
result = await medium_agent.ainvoke({
"messages": [
try:
# Short-circuit: fast tool already has the answer
if fast_context and tier_override is None and not url_context and not no_inference:
tier = "fast"
final_text = fast_context
llm_elapsed = time.monotonic() - t0
names = _fast_tool_runner.matching_names(clean_message)
print(f"[agent] tier=fast tools={names} — delivering directly", flush=True)
yield final_text
else:
# Determine tier
if tier_override in ("light", "medium", "complex"):
tier = tier_override
light_reply = None
if tier_override == "light":
tier, light_reply = await router.route(clean_message, enriched_history, no_inference=no_inference)
tier = "light"
else:
tier, light_reply = await router.route(clean_message, enriched_history, no_inference=no_inference)
if url_context and tier == "light":
tier = "medium"
light_reply = None
print("[agent] URL in message → upgraded light→medium", flush=True)
print(f"[agent] tier={tier} message={clean_message[:60]!r}", flush=True)
if tier_capture is not None:
tier_capture.append(tier)
if no_inference:
yield "I don't know"
return
if tier == "light":
final_text = light_reply
llm_elapsed = time.monotonic() - t0
print("[agent] light path: answered by router", flush=True)
yield final_text
elif tier == "medium":
system_prompt = MEDIUM_SYSTEM_PROMPT
if memories:
system_prompt += "\n\n" + memories
if url_context:
system_prompt += "\n\n" + url_context
if fast_context:
system_prompt += "\n\nLive web search results (use these to answer):\n\n" + fast_context
in_think = False
response_parts = []
async for chunk in medium_model.astream([
{"role": "system", "content": system_prompt},
*history,
{"role": "user", "content": clean_message},
]
})
llm_elapsed = time.monotonic() - t0
_log_messages(result)
final_text = _extract_final_text(result)
]):
token = chunk.content or ""
if not token:
continue
if in_think:
if "</think>" in token:
in_think = False
after = token.split("</think>", 1)[1]
if after:
yield after
response_parts.append(after)
else:
if "<think>" in token:
in_think = True
before = token.split("<think>", 1)[0]
if before:
yield before
response_parts.append(before)
else:
yield token
response_parts.append(token)
else: # complex
ok = await vram_manager.enter_complex_mode()
if not ok:
print("[agent] complex→medium fallback (eviction timeout)", flush=True)
tier = "medium"
result = await medium_agent.ainvoke({
"messages": [
{"role": "system", "content": MEDIUM_SYSTEM_PROMPT},
*history,
{"role": "user", "content": clean_message},
]
})
else:
llm_elapsed = time.monotonic() - t0
final_text = "".join(response_parts).strip() or None
else: # complex — remote model, no VRAM management needed
system_prompt = COMPLEX_SYSTEM_PROMPT.format(user_id=session_id)
if url_context:
system_prompt += "\n\n[Pre-fetched URL content from user's message:]\n" + url_context
result = await complex_agent.ainvoke({
"messages": [
{"role": "system", "content": system_prompt},
@@ -305,38 +563,90 @@ async def run_agent_task(message: str, session_id: str, channel: str = "telegram
{"role": "user", "content": clean_message},
]
})
asyncio.create_task(vram_manager.exit_complex_mode())
llm_elapsed = time.monotonic() - t0
_log_messages(result)
final_text = _extract_final_text(result)
llm_elapsed = time.monotonic() - t0
_log_messages(result)
final_text = _extract_final_text(result)
if final_text:
yield final_text
except Exception as e:
import traceback
llm_elapsed = time.monotonic() - t0
print(f"[agent] error after {llm_elapsed:.1f}s for chat {session_id}: {e}", flush=True)
print(f"[agent] error after {llm_elapsed:.1f}s for {session_id}: {e}", flush=True)
traceback.print_exc()
# Deliver reply through the originating channel
print(f"[agent] pipeline done in {time.monotonic() - t0:.1f}s tier={tier if 'tier' in dir() else '?'}", flush=True)
# Store memory as side-effect (non-blocking, best-effort)
if final_text:
t1 = time.monotonic()
await channels.deliver(session_id, channel, final_text)
send_elapsed = time.monotonic() - t1
print(
f"[agent] replied in {time.monotonic() - t0:.1f}s "
f"(llm={llm_elapsed:.1f}s, send={send_elapsed:.1f}s) tier={tier}",
flush=True,
)
print(f"[agent] reply_text: {final_text}", flush=True)
asyncio.create_task(_store_memory(session_id, clean_message, final_text))
# ── core task (Telegram / Matrix / CLI wrapper) ─────────────────────────────────
async def run_agent_task(
message: str,
session_id: str,
channel: str = "telegram",
metadata: dict | None = None,
):
print(f"[agent] queued: {message[:80]!r} chat={session_id}", flush=True)
t0 = time.monotonic()
meta = metadata or {}
no_inference = bool(meta.get("no_inference", False))
is_benchmark = bool(meta.get("benchmark", False))
history = _conversation_buffers.get(session_id, [])
final_text = None
actual_tier = "unknown"
tier_capture: list = []
async for chunk in _run_agent_pipeline(message, history, session_id, no_inference=no_inference, tier_capture=tier_capture):
await _push_stream_chunk(session_id, chunk)
if final_text is None:
final_text = chunk
else:
print("[agent] warning: no text reply from agent", flush=True)
final_text += chunk
await _end_stream(session_id)
actual_tier = tier_capture[0] if tier_capture else "unknown"
elapsed_ms = int((time.monotonic() - t0) * 1000)
if final_text:
final_text = final_text.strip()
# Skip channel delivery for benchmark sessions (no Telegram spam)
if not is_benchmark:
try:
await channels.deliver(session_id, channel, final_text)
except Exception as e:
print(f"[agent] delivery error (non-fatal): {e}", flush=True)
print(f"[agent] replied in {elapsed_ms / 1000:.1f}s tier={actual_tier}", flush=True)
print(f"[agent] reply_text: {final_text}", flush=True)
# Update conversation buffer
if final_text:
buf = _conversation_buffers.get(session_id, [])
buf.append({"role": "user", "content": clean_message})
buf.append({"role": "assistant", "content": final_text})
_conversation_buffers[session_id] = buf[-(MAX_HISTORY_TURNS * 2):]
buf = _conversation_buffers.get(session_id, [])
buf.append({"role": "user", "content": message})
buf.append({"role": "assistant", "content": final_text})
_conversation_buffers[session_id] = buf[-(MAX_HISTORY_TURNS * 2):]
# Log interaction for RLHF data collection (skip benchmark sessions to avoid noise)
if not is_benchmark:
asyncio.create_task(_log_interaction(
session_id=session_id,
channel=channel,
tier=actual_tier,
input_text=message,
response_text=final_text,
latency_ms=elapsed_ms,
metadata=meta if meta else None,
))
else:
print("[agent] warning: no text reply from agent", flush=True)
# ── endpoints ──────────────────────────────────────────────────────────────────
@@ -348,7 +658,7 @@ async def message(request: InboundMessage, background_tasks: BackgroundTasks):
return JSONResponse(status_code=503, content={"error": "Agent not ready"})
session_id = request.session_id
channel = request.channel
background_tasks.add_task(run_agent_task, request.text, session_id, channel)
background_tasks.add_task(run_agent_task, request.text, session_id, channel, request.metadata)
return JSONResponse(status_code=202, content={"status": "accepted"})
@@ -374,13 +684,132 @@ async def reply_stream(session_id: str):
try:
text = await asyncio.wait_for(q.get(), timeout=900)
# Escape newlines so entire reply fits in one SSE data line
yield f"data: {text.replace(chr(10), '\\n').replace(chr(13), '')}\n\n"
yield f"data: {text.replace(chr(10), chr(92) + 'n').replace(chr(13), '')}\n\n"
except asyncio.TimeoutError:
yield "data: [timeout]\n\n"
return StreamingResponse(event_generator(), media_type="text/event-stream")
@app.get("/stream/{session_id}")
async def stream_reply(session_id: str):
"""
SSE endpoint — streams reply tokens as they are generated.
Each chunk: data: <token>\\n\\n
Signals completion: data: [DONE]\\n\\n
Medium tier: real token-by-token streaming (think blocks filtered out).
Light and complex tiers: full reply delivered as one chunk then [DONE].
"""
q = _stream_queues.setdefault(session_id, asyncio.Queue())
async def event_generator():
try:
while True:
chunk = await asyncio.wait_for(q.get(), timeout=900)
escaped = chunk.replace("\n", "\\n").replace("\r", "")
yield f"data: {escaped}\n\n"
if chunk == "[DONE]":
break
except asyncio.TimeoutError:
yield "data: [DONE]\n\n"
return StreamingResponse(event_generator(), media_type="text/event-stream")
@app.get("/health")
async def health():
return {"status": "ok", "agent_ready": medium_agent is not None}
# ── OpenAI-compatible API (for OpenWebUI and other clients) ────────────────────
_TIER_MAP = {
"adolf": None,
"adolf-light": "light",
"adolf-medium": "medium",
"adolf-deep": "complex",
}
@app.get("/v1/models")
async def list_models():
return {
"object": "list",
"data": [
{"id": "adolf", "object": "model", "owned_by": "adolf"},
{"id": "adolf-light", "object": "model", "owned_by": "adolf"},
{"id": "adolf-medium", "object": "model", "owned_by": "adolf"},
{"id": "adolf-deep", "object": "model", "owned_by": "adolf"},
],
}
@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
if medium_agent is None:
return JSONResponse(status_code=503, content={"error": {"message": "Agent not ready", "type": "server_error"}})
body = await request.json()
model = body.get("model", "adolf")
messages = body.get("messages", [])
stream = body.get("stream", True)
# Extract current user message and history
user_messages = [m for m in messages if m.get("role") == "user"]
if not user_messages:
return JSONResponse(status_code=400, content={"error": {"message": "No user message", "type": "invalid_request_error"}})
current_message = user_messages[-1]["content"]
# History = everything before the last user message (excluding system messages from OpenWebUI)
last_user_idx = len(messages) - 1 - next(
i for i, m in enumerate(reversed(messages)) if m.get("role") == "user"
)
history = [m for m in messages[:last_user_idx] if m.get("role") in ("user", "assistant")]
session_id = request.headers.get("X-Session-Id", "owui-default")
tier_override = _TIER_MAP.get(model)
import json as _json
import uuid as _uuid
response_id = f"chatcmpl-{_uuid.uuid4().hex[:12]}"
if stream:
async def event_stream():
# Opening chunk with role
opening = {
"id": response_id, "object": "chat.completion.chunk",
"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]
}
yield f"data: {_json.dumps(opening)}\n\n"
async for chunk in _run_agent_pipeline(current_message, history, session_id, tier_override):
data = {
"id": response_id, "object": "chat.completion.chunk",
"choices": [{"index": 0, "delta": {"content": chunk}, "finish_reason": None}]
}
yield f"data: {_json.dumps(data)}\n\n"
# Final chunk
final = {
"id": response_id, "object": "chat.completion.chunk",
"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]
}
yield f"data: {_json.dumps(final)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(event_stream(), media_type="text/event-stream")
else:
# Non-streaming: collect all chunks
parts = []
async for chunk in _run_agent_pipeline(current_message, history, session_id, tier_override):
if chunk:
parts.append(chunk)
full_text = "".join(parts).strip()
return {
"id": response_id, "object": "chat.completion",
"choices": [{"index": 0, "message": {"role": "assistant", "content": full_text}, "finish_reason": "stop"}],
"model": model,
}
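The endpoint above emits OpenAI-style `chat.completion.chunk` SSE lines. A minimal client-side parser for such a buffer (a sketch assuming only the `data: {json}` framing shown above; `parse_sse_content` is not part of the repo) could look like:

```python
import json

def parse_sse_content(buffer: str) -> str:
    """Concatenate delta.content fields from an OpenAI-style SSE buffer."""
    parts = []
    for line in buffer.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # terminal sentinel emitted by the endpoint
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)
```

This mirrors what a consumer such as OpenWebUI does with the stream: the opening role-only chunk and the empty final chunk carry no `content` and are skipped.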


@@ -1,13 +1,21 @@
 from deepagents import create_deep_agent
+
+class _DirectModel:
+    """Thin wrapper: single LLM call, no tools, same ainvoke interface as a graph."""
+    def __init__(self, model):
+        self._model = model
+    async def ainvoke(self, input_dict: dict) -> dict:
+        messages = input_dict["messages"]
+        response = await self._model.ainvoke(messages)
+        return {"messages": list(messages) + [response]}
+
 def build_medium_agent(model, agent_tools: list, system_prompt: str):
-    """Medium agent: create_deep_agent with TodoList planning, no subagents."""
-    return create_deep_agent(
-        model=model,
-        tools=agent_tools,
-        system_prompt=system_prompt,
-    )
+    """Medium agent: single LLM call, no tools — fast ~3s response."""
+    return _DirectModel(model)
 def build_complex_agent(model, agent_tools: list, system_prompt: str):
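The `_DirectModel` wrapper preserves the `ainvoke({"messages": ...})` contract of a compiled graph while skipping tools entirely. With a stub model (hypothetical, for illustration only) the round-trip behaves like:

```python
import asyncio

class _StubModel:
    """Stand-in for a chat model: echoes the last message back."""
    async def ainvoke(self, messages):
        return {"role": "assistant", "content": f"echo: {messages[-1]['content']}"}

class _DirectModel:
    """Same shape as in the diff above: one LLM call, graph-compatible ainvoke."""
    def __init__(self, model):
        self._model = model
    async def ainvoke(self, input_dict: dict) -> dict:
        messages = input_dict["messages"]
        response = await self._model.ainvoke(messages)
        return {"messages": list(messages) + [response]}

# The caller sees the same output dict it would get from a graph.
result = asyncio.run(_DirectModel(_StubModel()).ainvoke(
    {"messages": [{"role": "user", "content": "hi"}]}
))
```

The design point: callers that already drive a deep-agent graph need no changes when the medium tier is swapped to a single direct call.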

benchmarks/benchmark.json Normal file

@@ -0,0 +1,137 @@
{
"description": "Adolf routing benchmark — домашние сценарии, Alexa/Google-Home стиль, русский язык",
"tiers": {
"light": "Приветствия, прощания, подтверждения, простые разговорные фразы. Не требуют поиска или действий.",
"medium": "Управление домом, погода/пробки, таймеры, напоминания, покупки, личная память, быстрые вопросы.",
"complex": "Глубокое исследование, сравнение технологий, подробные руководства с несколькими источниками."
},
"queries": [
{"id": 1, "tier": "light", "category": "greetings", "query": "привет"},
{"id": 2, "tier": "light", "category": "greetings", "query": "пока"},
{"id": 3, "tier": "light", "category": "greetings", "query": "спасибо"},
{"id": 4, "tier": "light", "category": "greetings", "query": "привет, как дела?"},
{"id": 5, "tier": "light", "category": "greetings", "query": "окей"},
{"id": 6, "tier": "light", "category": "greetings", "query": "добрый вечер"},
{"id": 7, "tier": "light", "category": "greetings", "query": "доброе утро"},
{"id": 8, "tier": "light", "category": "greetings", "query": "добрый день"},
{"id": 9, "tier": "light", "category": "greetings", "query": "hi"},
{"id": 10, "tier": "light", "category": "greetings", "query": "thanks"},
{"id": 11, "tier": "light", "category": "greetings", "query": "отлично, спасибо"},
{"id": 12, "tier": "light", "category": "greetings", "query": "понятно"},
{"id": 13, "tier": "light", "category": "greetings", "query": "ясно"},
{"id": 14, "tier": "light", "category": "greetings", "query": "ладно"},
{"id": 15, "tier": "light", "category": "greetings", "query": "договорились"},
{"id": 16, "tier": "light", "category": "greetings", "query": "good morning"},
{"id": 17, "tier": "light", "category": "greetings", "query": "good night"},
{"id": 18, "tier": "light", "category": "greetings", "query": "всё понятно"},
{"id": 19, "tier": "light", "category": "greetings", "query": "да"},
{"id": 20, "tier": "light", "category": "greetings", "query": "нет"},
{"id": 21, "tier": "light", "category": "greetings", "query": "не нужно"},
{"id": 22, "tier": "light", "category": "greetings", "query": "отмена"},
{"id": 23, "tier": "light", "category": "greetings", "query": "стоп"},
{"id": 24, "tier": "light", "category": "greetings", "query": "подожди"},
{"id": 25, "tier": "light", "category": "greetings", "query": "повтори"},
{"id": 26, "tier": "light", "category": "greetings", "query": "ты тут?"},
{"id": 27, "tier": "light", "category": "greetings", "query": "слышишь меня?"},
{"id": 28, "tier": "light", "category": "greetings", "query": "всё ок"},
{"id": 29, "tier": "light", "category": "greetings", "query": "хорошо"},
{"id": 30, "tier": "light", "category": "greetings", "query": "пожалуйста"},
{"id": 31, "tier": "medium", "category": "weather_commute", "query": "какая сегодня погода в Балашихе"},
{"id": 32, "tier": "medium", "category": "weather_commute", "query": "пойдет ли сегодня дождь"},
{"id": 33, "tier": "medium", "category": "weather_commute", "query": "какая температура на улице сейчас"},
{"id": 34, "tier": "medium", "category": "weather_commute", "query": "будет ли снег сегодня"},
{"id": 35, "tier": "medium", "category": "weather_commute", "query": "погода на завтра"},
{"id": 36, "tier": "medium", "category": "weather_commute", "query": "сколько ехать до Москвы сейчас"},
{"id": 37, "tier": "medium", "category": "weather_commute", "query": "какие пробки на дороге до Москвы"},
{"id": 38, "tier": "medium", "category": "weather_commute", "query": "время в пути на работу"},
{"id": 39, "tier": "medium", "category": "weather_commute", "query": "есть ли пробки сейчас"},
{"id": 40, "tier": "medium", "category": "weather_commute", "query": "стоит ли брать зонтик"},
{"id": 41, "tier": "medium", "category": "smart_home_control", "query": "включи свет в гостиной"},
{"id": 42, "tier": "medium", "category": "smart_home_control", "query": "выключи свет на кухне"},
{"id": 43, "tier": "medium", "category": "smart_home_control", "query": "какая температура дома"},
{"id": 44, "tier": "medium", "category": "smart_home_control", "query": "установи температуру 22 градуса"},
{"id": 45, "tier": "medium", "category": "smart_home_control", "query": "включи свет в спальне на 50 процентов"},
{"id": 46, "tier": "medium", "category": "smart_home_control", "query": "выключи все лампочки"},
{"id": 47, "tier": "medium", "category": "smart_home_control", "query": "какие устройства сейчас включены"},
{"id": 48, "tier": "medium", "category": "smart_home_control", "query": "закрыты ли все окна"},
{"id": 49, "tier": "medium", "category": "smart_home_control", "query": "включи вентилятор в детской"},
{"id": 50, "tier": "medium", "category": "smart_home_control", "query": "есть ли кто-нибудь дома"},
{"id": 51, "tier": "medium", "category": "smart_home_control", "query": "включи ночной режим"},
{"id": 52, "tier": "medium", "category": "smart_home_control", "query": "какое потребление электричества сегодня"},
{"id": 53, "tier": "medium", "category": "smart_home_control", "query": "выключи телевизор"},
{"id": 54, "tier": "medium", "category": "smart_home_control", "query": "открой шторы в гостиной"},
{"id": 55, "tier": "medium", "category": "smart_home_control", "query": "установи будильник на 7 утра"},
{"id": 56, "tier": "medium", "category": "smart_home_control", "query": "включи кофемашину"},
{"id": 57, "tier": "medium", "category": "smart_home_control", "query": "выключи свет во всём доме"},
{"id": 58, "tier": "medium", "category": "smart_home_control", "query": "сколько у нас датчиков движения"},
{"id": 59, "tier": "medium", "category": "smart_home_control", "query": "состояние всех дверных замков"},
{"id": 60, "tier": "medium", "category": "smart_home_control", "query": "включи режим кино в гостиной"},
{"id": 61, "tier": "medium", "category": "smart_home_control", "query": "прибавь яркость в детской"},
{"id": 62, "tier": "medium", "category": "smart_home_control", "query": "закрой все шторы"},
{"id": 63, "tier": "medium", "category": "smart_home_control", "query": "кто последний открывал входную дверь"},
{"id": 64, "tier": "medium", "category": "smart_home_control", "query": "заблокируй входную дверь"},
{"id": 65, "tier": "medium", "category": "smart_home_control", "query": "покажи камеру у входа"},
{"id": 66, "tier": "medium", "category": "timers_reminders", "query": "поставь таймер на 10 минут"},
{"id": 67, "tier": "medium", "category": "timers_reminders", "query": "напомни мне позвонить врачу в 15:00"},
{"id": 68, "tier": "medium", "category": "timers_reminders", "query": "поставь будильник на завтра в 6:30"},
{"id": 69, "tier": "medium", "category": "timers_reminders", "query": "напомни выключить плиту через 20 минут"},
{"id": 70, "tier": "medium", "category": "timers_reminders", "query": "сколько времени осталось на таймере"},
{"id": 71, "tier": "medium", "category": "shopping_cooking", "query": "добавь молоко в список покупок"},
{"id": 72, "tier": "medium", "category": "shopping_cooking", "query": "что есть в списке покупок"},
{"id": 73, "tier": "medium", "category": "shopping_cooking", "query": "добавь хлеб и яйца в список покупок"},
{"id": 74, "tier": "medium", "category": "shopping_cooking", "query": "сколько граммов муки нужно для блинов на 4 человека"},
{"id": 75, "tier": "medium", "category": "shopping_cooking", "query": "какой рецепт борща ты знаешь"},
{"id": 76, "tier": "medium", "category": "personal_memory", "query": "как меня зовут"},
{"id": 77, "tier": "medium", "category": "personal_memory", "query": "где я живу"},
{"id": 78, "tier": "medium", "category": "personal_memory", "query": "что мы обсуждали в прошлый раз"},
{"id": 79, "tier": "medium", "category": "personal_memory", "query": "что ты знаешь о моем домашнем сервере"},
{"id": 80, "tier": "medium", "category": "personal_memory", "query": "напомни, какие сервисы я запускаю"},
{"id": 81, "tier": "medium", "category": "personal_memory", "query": "что я говорил о своей сети"},
{"id": 82, "tier": "medium", "category": "personal_memory", "query": "что я просил тебя запомнить"},
{"id": 83, "tier": "medium", "category": "quick_info", "query": "какой сейчас курс биткоина"},
{"id": 84, "tier": "medium", "category": "quick_info", "query": "курс доллара к рублю сейчас"},
{"id": 85, "tier": "medium", "category": "quick_info", "query": "есть ли проблемы у Cloudflare сегодня"},
{"id": 86, "tier": "medium", "category": "quick_info", "query": "какая последняя версия Docker"},
{"id": 87, "tier": "medium", "category": "quick_info", "query": "какие новые функции в Home Assistant 2024"},
{"id": 88, "tier": "medium", "category": "quick_info", "query": "как проверить использование диска в Linux"},
{"id": 89, "tier": "medium", "category": "quick_info", "query": "как перезапустить Docker контейнер"},
{"id": 90, "tier": "medium", "category": "quick_info", "query": "как посмотреть логи Docker контейнера"},
{"id": 91, "tier": "complex", "category": "infrastructure", "query": "исследуй и сравни Proxmox, Unraid и TrueNAS для домашней лаборатории"},
{"id": 92, "tier": "complex", "category": "infrastructure", "query": "напиши подробное руководство по безопасности домашнего сервера, подключенного к интернету"},
{"id": 93, "tier": "complex", "category": "infrastructure", "query": "исследуй все доступные дашборды для самохостинга и сравни их функции"},
{"id": 94, "tier": "complex", "category": "infrastructure", "query": "исследуй лучший стек мониторинга для самохостинга в 2024 году со всеми вариантами"},
{"id": 95, "tier": "complex", "category": "infrastructure", "query": "сравни все системы резервного копирования для Linux: Restic, Borg, Duplicati, Timeshift"},
{"id": 96, "tier": "complex", "category": "infrastructure", "query": "напиши полное руководство по настройке обратного прокси Caddy для домашнего сервера с SSL"},
{"id": 97, "tier": "complex", "category": "network", "query": "исследуй и сравни WireGuard, OpenVPN и Tailscale для домашней VPN с детальными плюсами и минусами"},
{"id": 98, "tier": "complex", "category": "network", "query": "исследуй лучшие практики сегментации домашней сети с VLAN и правилами файрвола"},
{"id": 99, "tier": "complex", "category": "network", "query": "изучи все самохостируемые DNS решения и их возможности"},
{"id": 100, "tier": "complex", "category": "network", "query": "исследуй лучшие самохостируемые системы мониторинга сети: Zabbix, Grafana, Prometheus, Netdata"},
{"id": 101, "tier": "complex", "category": "home_assistant", "query": "исследуй и сравни все платформы умного дома: Home Assistant, OpenHAB и Domoticz"},
{"id": 102, "tier": "complex", "category": "home_assistant", "query": "изучи лучшие Zigbee координаторы и их совместимость с Home Assistant в 2024 году"},
{"id": 103, "tier": "complex", "category": "home_assistant", "query": "напиши детальный отчет о поддержке протокола Matter и совместимых устройствах"},
{"id": 104, "tier": "complex", "category": "home_assistant", "query": "исследуй все способы интеграции умных ламп с Home Assistant: Zigbee, WiFi, Bluetooth"},
{"id": 105, "tier": "complex", "category": "home_assistant", "query": "найди и сравни все варианты датчиков движения для умного дома с оценками и ценами"},
{"id": 106, "tier": "complex", "category": "home_assistant", "query": "напиши подробное руководство по настройке автоматизаций в Home Assistant для умного освещения"},
{"id": 107, "tier": "complex", "category": "home_assistant", "query": "исследуй все варианты голосового управления умным домом на русском языке, включая локальные решения"},
{"id": 108, "tier": "complex", "category": "home_assistant", "query": "исследуй все протоколы умного дома и их плюсы и минусы: Zigbee, Z-Wave, WiFi, Thread, Bluetooth"},
{"id": 109, "tier": "complex", "category": "media_files", "query": "исследуй и сравни все самохостируемые решения для хранения фотографий с детальным сравнением функций"},
{"id": 110, "tier": "complex", "category": "media_files", "query": "изучи лучшие самохостируемые медиасерверы: Jellyfin, Plex и Emby — с характеристиками и отзывами"},
{"id": 111, "tier": "complex", "category": "media_files", "query": "сравни все самохостируемые облачные хранилища: Nextcloud, Seafile, Owncloud — производительность и функции"},
{"id": 112, "tier": "complex", "category": "research", "query": "исследуй последние достижения в локальном LLM инференсе и оборудовании для него"},
{"id": 113, "tier": "complex", "category": "research", "query": "изучи лучшие опенсорс альтернативы Google сервисов для приватного домашнего окружения"},
{"id": 114, "tier": "complex", "category": "research", "query": "изучи все варианты локального запуска языковых моделей на видеокарте 8 ГБ VRAM"},
{"id": 115, "tier": "complex", "category": "research", "query": "найди и сравни все фреймворки для создания локальных AI ассистентов с открытым исходным кодом"},
{"id": 116, "tier": "complex", "category": "research", "query": "изучи все доступные локальные ассистенты с голосовым управлением на русском языке"},
{"id": 117, "tier": "complex", "category": "infrastructure", "query": "изучи свежие CVE и уязвимости в популярном самохостируемом ПО: Gitea, Nextcloud, Jellyfin"},
{"id": 118, "tier": "complex", "category": "infrastructure", "query": "напиши детальное сравнение систем управления конфигурацией: Ansible, Salt, Puppet для домашнего окружения"},
{"id": 119, "tier": "complex", "category": "network", "query": "исследуй все самохостируемые решения для блокировки рекламы: Pi-hole, AdGuard Home, NextDNS"},
{"id": 120, "tier": "complex", "category": "research", "query": "напиши подробный отчет о технологиях синтеза речи с открытым исходным кодом на русском языке"}
]
}
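The dataset's tier split (30 light / 60 medium / 30 complex) is easy to sanity-check before a run; `count_tiers` below is a hypothetical helper, not part of the repo, shown here against a tiny inline sample:

```python
from collections import Counter

def count_tiers(queries: list[dict]) -> dict[str, int]:
    """Tally how many benchmark queries fall into each routing tier."""
    return dict(Counter(q["tier"] for q in queries))

# Against the full benchmark.json this should report
# 30 light / 60 medium / 30 complex; here, a small sample:
sample = [
    {"id": 1, "tier": "light", "query": "привет"},
    {"id": 2, "tier": "light", "query": "пока"},
    {"id": 31, "tier": "medium", "query": "погода на завтра"},
    {"id": 91, "tier": "complex", "query": "исследуй и сравни Proxmox, Unraid и TrueNAS"},
]
```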

benchmarks/run_benchmark.py Normal file

@@ -0,0 +1,316 @@
#!/usr/bin/env python3
"""
Adolf routing benchmark.
Sends each query to Adolf's /message endpoint, waits briefly for the routing
decision to appear in docker logs, then records the actual tier.
Usage:
python3 run_benchmark.py [options]
python3 run_benchmark.py --tier light|medium|complex
python3 run_benchmark.py --category <name>
python3 run_benchmark.py --ids 1,2,3
python3 run_benchmark.py --list-categories
python3 run_benchmark.py --no-inference # skip all LLM inference — routing decisions only, all tiers
IMPORTANT: Always check GPU is free before running. This script does it automatically.
Adolf must be running at http://localhost:8000.
"""
import argparse
import asyncio
import json
import re
import subprocess
import sys
import time
from pathlib import Path
import httpx
ADOLF_URL = "http://localhost:8000"
OLLAMA_URL = "http://localhost:11436" # GPU Ollama
DATASET = Path(__file__).parent / "benchmark.json"
RESULTS = Path(__file__).parent / "results_latest.json"
# Max time to wait for each query to fully complete via SSE stream
QUERY_TIMEOUT = 300 # seconds — generous to handle GPU semaphore waits
# Memory thresholds
MIN_FREE_RAM_MB = 1500 # abort if less than this is free
MIN_FREE_VRAM_MB = 500 # warn if less than this is free on GPU
# ── Pre-flight checks ──────────────────────────────────────────────────────────
def check_ram() -> tuple[bool, str]:
"""Check available system RAM. Returns (ok, message)."""
try:
with open("/proc/meminfo") as f:
info = {}
for line in f:
parts = line.split()
if len(parts) >= 2:
info[parts[0].rstrip(":")] = int(parts[1])
free_mb = (info.get("MemAvailable", 0)) // 1024
total_mb = info.get("MemTotal", 0) // 1024
msg = f"RAM: {free_mb} MB free / {total_mb} MB total"
if free_mb < MIN_FREE_RAM_MB:
return False, f"CRITICAL: {msg} — need at least {MIN_FREE_RAM_MB} MB free"
return True, msg
except Exception as e:
return True, f"RAM check failed (non-fatal): {e}"
def check_gpu() -> tuple[bool, str]:
"""Check GPU VRAM via Ollama /api/ps. Returns (ok, message)."""
try:
r = httpx.get(f"{OLLAMA_URL}/api/ps", timeout=5)
r.raise_for_status()
data = r.json()
models = data.get("models", [])
if models:
names = [m.get("name", "?") for m in models]
sizes_mb = [m.get("size_vram", 0) // (1024 * 1024) for m in models]
loaded = ", ".join(f"{n} ({s}MB)" for n, s in zip(names, sizes_mb))
total_vram = sum(sizes_mb)
if total_vram > 7000:
return False, f"GPU BUSY: models loaded = {loaded} — total VRAM used {total_vram}MB. Wait for models to unload."
return True, f"GPU: models loaded = {loaded} (total {total_vram}MB VRAM)"
return True, "GPU: idle (no models loaded)"
except httpx.ConnectError:
return True, "GPU check skipped (Ollama not reachable at localhost:11436)"
except Exception as e:
return True, f"GPU check failed (non-fatal): {e}"
def preflight_checks(skip_gpu_check: bool = False) -> bool:
"""Run all pre-flight checks. Returns True if safe to proceed."""
print("\n── Pre-flight checks ──────────────────────────────────────────")
ram_ok, ram_msg = check_ram()
print(f" {'✓' if ram_ok else '✗'} {ram_msg}")
if not ram_ok:
print("\nABORTING: not enough RAM. Free up memory before running benchmark.")
return False
if not skip_gpu_check:
gpu_ok, gpu_msg = check_gpu()
print(f" {'✓' if gpu_ok else '✗'} {gpu_msg}")
if not gpu_ok:
print("\nABORTING: GPU is busy. Wait for current inference to finish, then retry.")
return False
print(" ✓ All checks passed.\n")
return True
# ── Log helpers ────────────────────────────────────────────────────────────────
def get_log_tail(n: int = 50) -> str:
result = subprocess.run(
["docker", "logs", "deepagents", "--tail", str(n)],
capture_output=True, text=True,
)
return result.stdout + result.stderr
def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
"""Find new tier= lines that appeared after we sent the query."""
before_lines = set(logs_before.splitlines())
new_lines = [l for l in logs_after.splitlines() if l not in before_lines]
for line in new_lines:
m = re.search(r"tier=(\w+(?:\s*\(no-inference\))?)", line)
if m:
tier_raw = m.group(1)
# Normalise: "complex (no-inference)" → "complex"
return tier_raw.split()[0]
return None
# ── Request helpers ────────────────────────────────────────────────────────────
async def post_message(
client: httpx.AsyncClient,
query_id: int,
query: str,
no_inference: bool = False,
) -> bool:
payload = {
"text": query,
"session_id": f"benchmark-{query_id}",
"channel": "cli",
"user_id": "benchmark",
"metadata": {"no_inference": no_inference, "benchmark": True},
}
try:
r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
r.raise_for_status()
return True
except Exception as e:
print(f" POST_ERROR: {e}", end="")
return False
# ── Dataset ────────────────────────────────────────────────────────────────────
def load_dataset() -> list[dict]:
with open(DATASET) as f:
return json.load(f)["queries"]
def filter_queries(queries, tier, category, ids):
if tier:
queries = [q for q in queries if q["tier"] == tier]
if category:
queries = [q for q in queries if q["category"] == category]
if ids:
queries = [q for q in queries if q["id"] in ids]
return queries
# ── Main run ───────────────────────────────────────────────────────────────────
async def run(queries: list[dict], no_inference: bool = False) -> list[dict]:
results = []
async with httpx.AsyncClient() as client:
try:
r = await client.get(f"{ADOLF_URL}/health", timeout=5)
r.raise_for_status()
except Exception as e:
print(f"ERROR: Adolf not reachable: {e}", file=sys.stderr)
sys.exit(1)
total = len(queries)
correct = 0
dry_label = " [NO-INFERENCE: routing only]" if no_inference else ""
print(f"\nRunning {total} queries{dry_label}\n")
print(f"{'ID':>3} {'EXPECTED':8} {'ACTUAL':8} {'OK':3} {'TIME':6} {'CATEGORY':22} QUERY")
print("─" * 110)
for q in queries:
qid = q["id"]
expected = q["tier"]
category = q["category"]
query_text = q["query"]
session_id = f"benchmark-{qid}"
print(f"{qid:>3} {expected:8} ", end="", flush=True)
logs_before = get_log_tail(300)
t0 = time.monotonic()
ok_post = await post_message(client, qid, query_text, no_inference=no_inference)
if not ok_post:
print(f"{'?':8} {'ERR':3} {'?':6} {category:22} {query_text[:40]}")
results.append({"id": qid, "expected": expected, "actual": "error", "ok": False, "elapsed": 0.0, "category": category, "query": query_text, "no_inference": no_inference})
continue
# Wait for query to complete via SSE stream (handles GPU semaphore waits)
try:
async with client.stream(
"GET", f"{ADOLF_URL}/stream/{session_id}", timeout=QUERY_TIMEOUT
) as sse:
async for line in sse.aiter_lines():
if "data: [DONE]" in line:
break
except Exception:
pass # timeout or connection issue — check logs anyway
# Now the query is done — check logs for tier
await asyncio.sleep(0.3)
logs_after = get_log_tail(300)
actual = extract_tier_from_logs(logs_before, logs_after)
elapsed = time.monotonic() - t0
match = actual == expected or (actual == "fast" and expected == "medium")
if match:
correct += 1
mark = "✓" if match else "✗"
actual_str = actual or "?"
print(f"{actual_str:8} {mark:3} {elapsed:5.1f}s {category:22} {query_text[:40]}")
results.append({
"id": qid,
"expected": expected,
"actual": actual_str,
"ok": match,
"elapsed": round(elapsed, 1),
"category": category,
"query": query_text,
"no_inference": no_inference,
})
print("─" * 110)
accuracy = correct / total * 100 if total else 0
print(f"\nAccuracy: {correct}/{total} ({accuracy:.0f}%)")
for tier_name in ["light", "medium", "complex"]:
tier_qs = [r for r in results if r["expected"] == tier_name]
if tier_qs:
tier_ok = sum(1 for r in tier_qs if r["ok"])
print(f" {tier_name:8}: {tier_ok}/{len(tier_qs)}")
wrong = [r for r in results if not r["ok"]]
if wrong:
print(f"\nMisclassified ({len(wrong)}):")
for r in wrong:
print(f" id={r['id']:3} expected={r['expected']:8} actual={r['actual']:8} {r['query'][:60]}")
with open(RESULTS, "w") as f:
json.dump(results, f, indent=2, ensure_ascii=False)
print(f"\nResults saved to {RESULTS}")
return results
def main():
parser = argparse.ArgumentParser(
description="Adolf routing benchmark",
epilog="IMPORTANT: Always check GPU is free before running. This is done automatically."
)
parser.add_argument("--tier", choices=["light", "medium", "complex"])
parser.add_argument("--category")
parser.add_argument("--ids", help="Comma-separated IDs")
parser.add_argument("--list-categories", action="store_true")
parser.add_argument(
"--no-inference",
action="store_true",
help="Skip LLM inference for all tiers — only routing decisions are tested (no GPU/API cost)",
)
parser.add_argument(
"--skip-gpu-check",
action="store_true",
help="Skip GPU availability check (use only if you know GPU is free)",
)
args = parser.parse_args()
queries = load_dataset()
if args.list_categories:
cats = sorted(set(q["category"] for q in queries))
tiers = {t: sum(1 for q in queries if q["tier"] == t) for t in ["light", "medium", "complex"]}
print(f"Total: {len(queries)} | Tiers: {tiers}")
print(f"Categories: {cats}")
return
# ALWAYS check GPU and RAM before running
if not preflight_checks(skip_gpu_check=args.no_inference or args.skip_gpu_check):
sys.exit(1)
ids = [int(i) for i in args.ids.split(",")] if args.ids else None
queries = filter_queries(queries, args.tier, args.category, ids)
if not queries:
print("No queries match filters.")
sys.exit(1)
asyncio.run(run(queries, no_inference=args.no_inference))
if __name__ == "__main__":
main()

benchmarks/run_routing_benchmark.py Normal file

@@ -0,0 +1,218 @@
#!/usr/bin/env python3
"""
Adolf routing benchmark — tests routing decisions only, no LLM inference.
Sends each query with no_inference=True, waits for the routing decision to
appear in docker logs, and records whether the correct tier was selected.
Usage:
python3 run_routing_benchmark.py [options]
python3 run_routing_benchmark.py --tier light|medium|complex
python3 run_routing_benchmark.py --category <name>
python3 run_routing_benchmark.py --ids 1,2,3
python3 run_routing_benchmark.py --list-categories
No GPU check needed — inference is disabled for all queries.
Adolf must be running at http://localhost:8000.
"""
import argparse
import asyncio
import json
import re
import subprocess
import sys
import time
from pathlib import Path
import httpx
ADOLF_URL = "http://localhost:8000"
DATASET = Path(__file__).parent / "benchmark.json"
RESULTS = Path(__file__).parent / "routing_results_latest.json"
QUERY_TIMEOUT = 1 # 1s strict deadline — routing must decide within 1 second
# ── Log helpers ────────────────────────────────────────────────────────────────
def get_log_tail(n: int = 50) -> str:
result = subprocess.run(
["docker", "logs", "deepagents", "--tail", str(n)],
capture_output=True, text=True,
)
return result.stdout + result.stderr
def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
"""Find new tier= lines that appeared after we sent the query."""
before_lines = set(logs_before.splitlines())
new_lines = [line for line in logs_after.splitlines() if line not in before_lines]
for line in new_lines:
m = re.search(r"tier=(\w+(?:\s*\(no-inference\))?)", line)
if m:
tier_raw = m.group(1)
return tier_raw.split()[0]
return None
# ── Request helpers ────────────────────────────────────────────────────────────
async def post_message(client: httpx.AsyncClient, query_id: int, query: str) -> bool:
payload = {
"text": query,
"session_id": f"routing-bench-{query_id}",
"channel": "cli",
"user_id": "benchmark",
"metadata": {"no_inference": True, "benchmark": True},
}
try:
r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
r.raise_for_status()
return True
except Exception as e:
print(f" POST_ERROR: {e}", end="")
return False
# ── Dataset ────────────────────────────────────────────────────────────────────
def load_dataset() -> list[dict]:
with open(DATASET) as f:
return json.load(f)["queries"]
def filter_queries(queries, tier, category, ids):
if tier:
queries = [q for q in queries if q["tier"] == tier]
if category:
queries = [q for q in queries if q["category"] == category]
if ids:
queries = [q for q in queries if q["id"] in ids]
return queries
# ── Main run ───────────────────────────────────────────────────────────────────
async def run(queries: list[dict]) -> list[dict]:
results = []
async with httpx.AsyncClient() as client:
try:
r = await client.get(f"{ADOLF_URL}/health", timeout=5)
r.raise_for_status()
except Exception as e:
print(f"ERROR: Adolf not reachable: {e}", file=sys.stderr)
sys.exit(1)
total = len(queries)
correct = 0
print(f"\nRunning {total} queries [NO-INFERENCE: routing only]\n")
print(f"{'ID':>3} {'EXPECTED':8} {'ACTUAL':8} {'OK':3} {'TIME':6} {'CATEGORY':22} QUERY")
print("─" * 110)
for q in queries:
qid = q["id"]
expected = q["tier"]
category = q["category"]
query_text = q["query"]
session_id = f"routing-bench-{qid}"
print(f"{qid:>3} {expected:8} ", end="", flush=True)
logs_before = get_log_tail(300)
t0 = time.monotonic()
ok_post = await post_message(client, qid, query_text)
if not ok_post:
print(f"{'?':8} {'ERR':3} {'?':6} {category:22} {query_text[:40]}")
results.append({"id": qid, "expected": expected, "actual": "error", "ok": False, "elapsed": 0.0, "category": category, "query": query_text})
continue
try:
async with client.stream(
"GET", f"{ADOLF_URL}/stream/{session_id}", timeout=QUERY_TIMEOUT
) as sse:
async for line in sse.aiter_lines():
if "data: [DONE]" in line:
break
except Exception:
pass # timeout or connection issue — check logs anyway
logs_after = get_log_tail(300)
actual = extract_tier_from_logs(logs_before, logs_after)
if actual is None:
actual = "timeout"
elapsed = time.monotonic() - t0
match = actual == expected or (actual == "fast" and expected == "medium")
if match:
correct += 1
mark = "✓" if match else "✗"
actual_str = actual
print(f"{actual_str:8} {mark:3} {elapsed:5.1f}s {category:22} {query_text[:40]}")
results.append({
"id": qid,
"expected": expected,
"actual": actual_str,
"ok": match,
"elapsed": round(elapsed, 1),
"category": category,
"query": query_text,
})
print("─" * 110)
accuracy = correct / total * 100 if total else 0
print(f"\nAccuracy: {correct}/{total} ({accuracy:.0f}%)")
for tier_name in ["light", "medium", "complex"]:
tier_qs = [r for r in results if r["expected"] == tier_name]
if tier_qs:
tier_ok = sum(1 for r in tier_qs if r["ok"])
print(f" {tier_name:8}: {tier_ok}/{len(tier_qs)}")
wrong = [r for r in results if not r["ok"]]
if wrong:
print(f"\nMisclassified ({len(wrong)}):")
for r in wrong:
print(f" id={r['id']:3} expected={r['expected']:8} actual={r['actual']:8} {r['query'][:60]}")
with open(RESULTS, "w") as f:
json.dump(results, f, indent=2, ensure_ascii=False)
print(f"\nResults saved to {RESULTS}")
return results
def main():
parser = argparse.ArgumentParser(
description="Adolf routing benchmark — routing decisions only, no LLM inference",
)
parser.add_argument("--tier", choices=["light", "medium", "complex"])
parser.add_argument("--category")
parser.add_argument("--ids", help="Comma-separated IDs")
parser.add_argument("--list-categories", action="store_true")
args = parser.parse_args()
queries = load_dataset()
if args.list_categories:
cats = sorted(set(q["category"] for q in queries))
tiers = {t: sum(1 for q in queries if q["tier"] == t) for t in ["light", "medium", "complex"]}
print(f"Total: {len(queries)} | Tiers: {tiers}")
print(f"Categories: {cats}")
return
ids = [int(i) for i in args.ids.split(",")] if args.ids else None
queries = filter_queries(queries, args.tier, args.category, ids)
if not queries:
print("No queries match filters.")
sys.exit(1)
asyncio.run(run(queries))
if __name__ == "__main__":
main()
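The `--tier`/`--category`/`--ids` filters compose as successive narrowing passes; this sketch adapts `filter_queries` from the scripts (keyword defaults added for convenience) to make that easy to verify offline:

```python
def filter_queries(queries, tier=None, category=None, ids=None):
    """Same narrowing logic as in the benchmark scripts: each filter is optional."""
    if tier:
        queries = [q for q in queries if q["tier"] == tier]
    if category:
        queries = [q for q in queries if q["category"] == category]
    if ids:
        queries = [q for q in queries if q["id"] in ids]
    return queries

qs = [
    {"id": 1, "tier": "light", "category": "greetings"},
    {"id": 41, "tier": "medium", "category": "smart_home_control"},
    {"id": 66, "tier": "medium", "category": "timers_reminders"},
]
```

Because the filters intersect, `--tier medium --ids 66` selects only queries that satisfy both conditions.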

benchmarks/run_voice_benchmark.py Normal file

@@ -0,0 +1,425 @@
#!/usr/bin/env python3
"""
Adolf voice benchmark.
Pipeline for each query:
1. Synthesize query text → WAV via Silero TTS (localhost:8881)
2. Transcribe WAV → text via faster-whisper STT (localhost:8880)
3. Send transcription to Adolf → check routing tier
4. Report: WER per query, routing accuracy vs text baseline
Usage:
python3 run_voice_benchmark.py [options]
python3 run_voice_benchmark.py --tier light|medium|complex
python3 run_voice_benchmark.py --ids 1,2,3
python3 run_voice_benchmark.py --no-inference # skip LLM inference — routing only, all tiers
IMPORTANT: Always check GPU is free before running. Done automatically.
Services required:
- Adolf: http://localhost:8000
- Silero TTS: http://localhost:8881 (openai/silero-tts container)
- faster-whisper: http://localhost:8880 (faster-whisper container)
"""
import argparse
import asyncio
import io
import json
import re
import subprocess
import sys
import tempfile
import time
import unicodedata
from pathlib import Path
import httpx
ADOLF_URL = "http://localhost:8000"
OLLAMA_URL = "http://localhost:11436"
TTS_URL = "http://localhost:8881" # Silero TTS — OpenAI-compatible /v1/audio/speech
STT_URL = "http://localhost:8880" # faster-whisper — OpenAI-compatible /v1/audio/transcriptions
DATASET = Path(__file__).parent / "benchmark.json"
RESULTS_DIR = Path(__file__).parent
TIER_WAIT = 15 # seconds to wait for tier= in docker logs
MIN_FREE_RAM_MB = 1500
MIN_FREE_VRAM_MB = 500
# ── Pre-flight ─────────────────────────────────────────────────────────────────
def check_ram() -> tuple[bool, str]:
try:
with open("/proc/meminfo") as f:
info = {}
for line in f:
parts = line.split()
if len(parts) >= 2:
info[parts[0].rstrip(":")] = int(parts[1])
free_mb = info.get("MemAvailable", 0) // 1024
total_mb = info.get("MemTotal", 0) // 1024
msg = f"RAM: {free_mb} MB free / {total_mb} MB total"
if free_mb < MIN_FREE_RAM_MB:
return False, f"CRITICAL: {msg} — need at least {MIN_FREE_RAM_MB} MB free"
return True, msg
except Exception as e:
return True, f"RAM check failed (non-fatal): {e}"
def check_gpu() -> tuple[bool, str]:
try:
r = httpx.get(f"{OLLAMA_URL}/api/ps", timeout=5)
r.raise_for_status()
data = r.json()
models = data.get("models", [])
if models:
names = [m.get("name", "?") for m in models]
sizes_mb = [m.get("size_vram", 0) // (1024 * 1024) for m in models]
loaded = ", ".join(f"{n} ({s}MB)" for n, s in zip(names, sizes_mb))
total_vram = sum(sizes_mb)
if total_vram > 7000:
return False, f"GPU BUSY: {loaded} — {total_vram}MB VRAM used. Wait for models to unload."
return True, f"GPU: {loaded} ({total_vram}MB VRAM)"
return True, "GPU: idle"
except httpx.ConnectError:
return True, "GPU check skipped (Ollama not reachable)"
except Exception as e:
return True, f"GPU check failed (non-fatal): {e}"
def check_services() -> tuple[bool, str]:
"""Check TTS and STT are reachable."""
msgs = []
ok = True
for name, url, path in [("TTS", TTS_URL, "/"), ("STT", STT_URL, "/")]:
try:
r = httpx.get(url + path, timeout=5)
msgs.append(f"{name}: reachable (HTTP {r.status_code})")
except Exception as e:
msgs.append(f"{name}: NOT REACHABLE — {e}")
ok = False
return ok, " | ".join(msgs)
def preflight_checks(skip_gpu_check: bool = False) -> bool:
print("\n── Pre-flight checks ──────────────────────────────────────────")
ram_ok, ram_msg = check_ram()
print(f" {'✓' if ram_ok else '✗'} {ram_msg}")
if not ram_ok:
print("\nABORTING: not enough RAM.")
return False
if not skip_gpu_check:
gpu_ok, gpu_msg = check_gpu()
print(f" {'✓' if gpu_ok else '✗'} {gpu_msg}")
if not gpu_ok:
print("\nABORTING: GPU is busy.")
return False
svc_ok, svc_msg = check_services()
print(f" {'✓' if svc_ok else '✗'} {svc_msg}")
if not svc_ok:
print("\nABORTING: required voice services not running.")
print("Start them with: cd /home/alvis/agap_git/openai && docker compose up -d faster-whisper silero-tts")
return False
print(" All checks passed.\n")
return True
# ── TTS ────────────────────────────────────────────────────────────────────────
async def synthesize(client: httpx.AsyncClient, text: str) -> bytes | None:
"""Synthesize text to WAV via Silero TTS (OpenAI-compatible /v1/audio/speech)."""
try:
r = await client.post(
f"{TTS_URL}/v1/audio/speech",
json={"model": "tts-1", "input": text, "voice": "alloy", "response_format": "wav"},
timeout=30,
)
r.raise_for_status()
return r.content
except Exception as e:
print(f"\n [TTS error: {e}]", end="")
return None
# ── STT ────────────────────────────────────────────────────────────────────────
async def transcribe(client: httpx.AsyncClient, wav_bytes: bytes) -> str | None:
"""Transcribe WAV to text via faster-whisper (OpenAI-compatible /v1/audio/transcriptions)."""
try:
files = {"file": ("audio.wav", wav_bytes, "audio/wav")}
data = {"model": "whisper-1", "language": "ru", "response_format": "json"}
r = await client.post(
f"{STT_URL}/v1/audio/transcriptions",
files=files,
data=data,
timeout=60,
)
r.raise_for_status()
result = r.json()
return result.get("text", "").strip()
except Exception as e:
print(f"\n [STT error: {e}]", end="")
return None
# ── WER ────────────────────────────────────────────────────────────────────────
def normalize(text: str) -> str:
"""Lowercase, strip punctuation, normalize unicode for WER calculation."""
text = unicodedata.normalize("NFC", text.lower())
text = re.sub(r"[^\w\s]", " ", text)
return re.sub(r"\s+", " ", text).strip()
def word_error_rate(reference: str, hypothesis: str) -> float:
"""Compute WER between reference and hypothesis."""
ref = normalize(reference).split()
hyp = normalize(hypothesis).split()
if not ref:
return 0.0 if not hyp else 1.0
# Dynamic programming edit distance
d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
for i in range(len(ref) + 1):
d[i][0] = i
for j in range(len(hyp) + 1):
d[0][j] = j
for i in range(1, len(ref) + 1):
for j in range(1, len(hyp) + 1):
if ref[i - 1] == hyp[j - 1]:
d[i][j] = d[i - 1][j - 1]
else:
d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
return d[len(ref)][len(hyp)] / len(ref)
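The WER math above is easy to sanity-check in isolation. The snippet below is a standalone copy of the two functions (duplicated here only so it runs on its own) plus one worked case:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace — mirrors the benchmark's normalize()
    text = unicodedata.normalize("NFC", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Levenshtein distance over words (substitution, insertion, deletion all cost 1)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words → WER 0.25
print(word_error_rate("включи свет на кухне", "включи свет на кухня"))  # → 0.25
```

Note that because `normalize()` strips punctuation and case first, "Привет, мир!" vs "привет мир" scores a perfect 0.0.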
# ── Adolf interaction ──────────────────────────────────────────────────────────
def get_log_tail(n: int = 60) -> str:
result = subprocess.run(
["docker", "logs", "deepagents", "--tail", str(n)],
capture_output=True, text=True,
)
return result.stdout + result.stderr
def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
before_lines = set(logs_before.splitlines())
new_lines = [line for line in logs_after.splitlines() if line not in before_lines]
for line in new_lines:
m = re.search(r"tier=(\w+(?:\s*\(no-inference\))?)", line)
if m:
return m.group(1).split()[0]
return None
async def post_to_adolf(
client: httpx.AsyncClient,
query_id: int,
text: str,
no_inference: bool = False,
) -> bool:
payload = {
"text": text,
"session_id": f"voice-bench-{query_id}",
"channel": "cli",
"user_id": "benchmark",
"metadata": {"no_inference": no_inference, "benchmark": True, "voice": True},
}
try:
r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
r.raise_for_status()
return True
except Exception as e:
print(f"\n [Adolf error: {e}]", end="")
return False
# ── Dataset ────────────────────────────────────────────────────────────────────
def load_dataset() -> list[dict]:
with open(DATASET) as f:
return json.load(f)["queries"]
def filter_queries(queries, tier, category, ids):
if tier:
queries = [q for q in queries if q["tier"] == tier]
if category:
queries = [q for q in queries if q["category"] == category]
if ids:
queries = [q for q in queries if q["id"] in ids]
return queries
# ── Main run ───────────────────────────────────────────────────────────────────
async def run(queries: list[dict], no_inference: bool = False, save_audio: bool = False) -> None:
async with httpx.AsyncClient() as client:
# Check Adolf
try:
r = await client.get(f"{ADOLF_URL}/health", timeout=5)
r.raise_for_status()
except Exception as e:
print(f"ERROR: Adolf not reachable: {e}", file=sys.stderr)
sys.exit(1)
total = len(queries)
results = []
dry_label = " [NO-INFERENCE: routing only]" if no_inference else ""
print(f"Voice benchmark: {total} queries{dry_label}\n")
print(f"{'ID':>3} {'EXP':8} {'ACT':8} {'OK':3} {'WER':5} {'TRANSCRIPT'}")
print("─" * 100)
total_wer = 0.0
wer_count = 0
correct = 0
for q in queries:
qid = q["id"]
expected = q["tier"]
original = q["query"]
print(f"{qid:>3} {expected:8} ", end="", flush=True)
# Step 1: TTS
wav = await synthesize(client, original)
if wav is None:
print(f"{'?':8} {'ERR':3} {'?':5} [TTS failed]")
results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "wer": None, "error": "tts"})
continue
if save_audio:
audio_path = RESULTS_DIR / "voice_audio" / f"{qid}.wav"
audio_path.parent.mkdir(exist_ok=True)
audio_path.write_bytes(wav)
# Step 2: STT
transcript = await transcribe(client, wav)
if transcript is None:
print(f"{'?':8} {'ERR':3} {'?':5} [STT failed]")
results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "wer": None, "error": "stt"})
continue
# Calculate WER
wer = word_error_rate(original, transcript)
total_wer += wer
wer_count += 1
# Step 3: Send to Adolf
logs_before = get_log_tail(60)
t0 = time.monotonic()
ok_post = await post_to_adolf(client, qid, transcript, no_inference=no_inference)
if not ok_post:
print(f"{'?':8} {'ERR':3} {wer:4.2f} {transcript[:50]}")
results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "wer": wer, "transcript": transcript})
continue
# Step 4: Wait for routing decision
actual = None
for _ in range(TIER_WAIT * 2):
await asyncio.sleep(0.5)
logs_after = get_log_tail(60)
actual = extract_tier_from_logs(logs_before, logs_after)
if actual and actual in ("light", "medium", "complex", "fast"):
break
elapsed = time.monotonic() - t0
match = actual == expected or (actual == "fast" and expected == "medium")
if match:
correct += 1
mark = "✓" if match else "✗"
actual_str = actual or "?"
print(f"{actual_str:8} {mark:3} {wer:4.2f} {transcript[:60]}")
results.append({
"id": qid,
"expected": expected,
"actual": actual_str,
"ok": match,
"wer": round(wer, 3),
"original": original,
"transcript": transcript,
"elapsed": round(elapsed, 1),
"no_inference": no_inference,
})
await asyncio.sleep(0.5)
print("─" * 100)
# Summary
accuracy = correct / total * 100 if total else 0
avg_wer = total_wer / wer_count * 100 if wer_count else 0
print(f"\nRouting accuracy: {correct}/{total} ({accuracy:.0f}%)")
print(f"Average WER: {avg_wer:.1f}% (lower is better; 0% = perfect transcription)")
for tier_name in ["light", "medium", "complex"]:
tier_qs = [r for r in results if r["expected"] == tier_name]
if tier_qs:
tier_ok = sum(1 for r in tier_qs if r["ok"])
tier_wers = [r["wer"] for r in tier_qs if r.get("wer") is not None]
avg = sum(tier_wers) / len(tier_wers) * 100 if tier_wers else 0
print(f" {tier_name:8}: routing {tier_ok}/{len(tier_qs)} avg WER {avg:.1f}%")
wrong = [r for r in results if not r["ok"]]
if wrong:
print(f"\nMisclassified after voice ({len(wrong)}):")
for r in wrong:
print(f" id={r['id']:3} expected={r.get('expected') or '?':8} actual={r.get('actual') or '?':8} transcript={r.get('transcript','')[:50]}")
high_wer = [r for r in results if r.get("wer") and r["wer"] > 0.3]
if high_wer:
print(f"\nHigh WER queries (>30%) — transcription quality issues:")
for r in high_wer:
print(f" id={r['id']:3} WER={r['wer']*100:.0f}% original: {r.get('original','')[:50]}")
print(f" transcript: {r.get('transcript','')[:50]}")
# Save results
ts = int(time.time())
out_path = RESULTS_DIR / f"voice_results_{ts}.json"
latest_path = RESULTS_DIR / "voice_results_latest.json"
with open(out_path, "w") as f:
json.dump({"summary": {"accuracy": accuracy, "avg_wer": avg_wer, "total": total}, "results": results}, f, indent=2, ensure_ascii=False)
with open(latest_path, "w") as f:
json.dump({"summary": {"accuracy": accuracy, "avg_wer": avg_wer, "total": total}, "results": results}, f, indent=2, ensure_ascii=False)
print(f"\nResults saved to {latest_path}")
def main():
parser = argparse.ArgumentParser(
description="Adolf voice benchmark — TTS→STT→routing pipeline",
epilog="Requires: Silero TTS (port 8881) and faster-whisper (port 8880) running."
)
parser.add_argument("--tier", choices=["light", "medium", "complex"])
parser.add_argument("--category")
parser.add_argument("--ids", help="Comma-separated IDs")
parser.add_argument("--no-inference", action="store_true",
help="Skip LLM inference for all tiers — routing decisions only (no GPU/API cost)")
parser.add_argument("--save-audio", action="store_true",
help="Save synthesized WAV files to voice_audio/ directory")
parser.add_argument("--skip-gpu-check", action="store_true")
args = parser.parse_args()
if not preflight_checks(skip_gpu_check=args.skip_gpu_check or args.no_inference):
sys.exit(1)
queries = load_dataset()
ids = [int(i) for i in args.ids.split(",")] if args.ids else None
queries = filter_queries(queries, args.tier, args.category, ids)
if not queries:
print("No queries match filters.")
sys.exit(1)
asyncio.run(run(queries, no_inference=args.no_inference, save_audio=args.save_audio))
if __name__ == "__main__":
main()

bifrost-config.json Normal file

@@ -0,0 +1,75 @@
{
"auth_config": {
"is_enabled": true,
"admin_username": "admin",
"admin_password": "env.BIFROST_ADMIN_PASSWORD"
},
"config_store": {
"enabled": true,
"type": "postgres",
"config": {
"host": "bifrost-db",
"port": "5432",
"user": "bifrost",
"password": "bifrost",
"db_name": "bifrost",
"ssl_mode": "disable"
}
},
"client": {
"drop_excess_requests": false
},
"providers": {
"ollama": {
"keys": [
{
"name": "ollama-gpu",
"value": "dummy",
"models": [
"qwen2.5:0.5b",
"qwen2.5:1.5b",
"qwen3:4b",
"gemma3:4b",
"qwen3:8b"
],
"weight": 1.0
}
],
"network_config": {
"base_url": "http://host.docker.internal:11436",
"default_request_timeout_in_seconds": 300,
"max_retries": 2,
"retry_backoff_initial_ms": 500,
"retry_backoff_max_ms": 10000
}
},
"ollama-cpu": {
"keys": [
{
"name": "ollama-cpu-key",
"value": "dummy",
"models": [
"gemma3:1b",
"qwen2.5:1.5b",
"qwen2.5:3b"
],
"weight": 1.0
}
],
"network_config": {
"base_url": "http://host.docker.internal:11435",
"default_request_timeout_in_seconds": 120,
"max_retries": 2,
"retry_backoff_initial_ms": 500,
"retry_backoff_max_ms": 10000
},
"custom_provider_config": {
"base_provider_type": "openai",
"allowed_requests": {
"chat_completion": true,
"chat_completion_stream": true
}
}
}
}
}


@@ -49,6 +49,7 @@ async def deliver(session_id: str, channel: str, text: str) -> None:
# ── built-in channel adapters ─────────────────────────────────────────────────
GRAMMY_URL = os.getenv("GRAMMY_URL", "http://grammy:3001")
MATRIX_URL = os.getenv("MATRIX_URL", "http://matrix:3002")
async def _telegram_send(session_id: str, text: str) -> None:
@@ -64,12 +65,26 @@ async def _telegram_send(session_id: str, text: str) -> None:
)
async def _matrix_send(session_id: str, text: str) -> None:
"""Send reply to Matrix via the matrix adapter POST /send endpoint."""
room_id = session_id.removeprefix("mx-")
MAX_MTX = 4000
chunks = [text[i:i + MAX_MTX] for i in range(0, len(text), MAX_MTX)]
async with httpx.AsyncClient(timeout=15) as client:
for chunk in chunks:
await client.post(
f"{MATRIX_URL}/send",
json={"room_id": room_id, "text": chunk},
)
async def _cli_send(session_id: str, text: str) -> None:
"""CLI replies are handled entirely through the pending_replies queue — no-op here."""
pass
def register_defaults() -> None:
"""Register the built-in Telegram and CLI channel adapters."""
"""Register the built-in Telegram, Matrix, and CLI channel adapters."""
register("telegram", _telegram_send)
register("matrix", _matrix_send)
register("cli", _cli_send)

cli.py

@@ -1,9 +1,9 @@
#!/usr/bin/env python3
"""
Adolf CLI — interactive REPL for the multi-channel gateway.
Adolf CLI — interactive REPL with Rich streaming display.
Usage:
python3 cli.py [--url http://localhost:8000] [--session cli-alvis]
python3 cli.py [--url http://deepagents:8000] [--session cli-alvis]
"""
import argparse
@@ -12,7 +12,13 @@ import os
import sys
import urllib.request
GATEWAY = "http://localhost:8000"
from rich.console import Console
from rich.live import Live
from rich.markdown import Markdown
from rich.text import Text
GATEWAY = "http://deepagents:8000"
console = Console()
def post_message(gateway: str, text: str, session_id: str) -> None:
@@ -20,7 +26,7 @@ def post_message(gateway: str, text: str, session_id: str) -> None:
"text": text,
"session_id": session_id,
"channel": "cli",
"user_id": os.getlogin(),
"user_id": os.getenv("USER", "user"),
}).encode()
req = urllib.request.Request(
f"{gateway}/message",
@@ -30,33 +36,49 @@ def post_message(gateway: str, text: str, session_id: str) -> None:
)
with urllib.request.urlopen(req, timeout=10) as r:
if r.status != 202:
print(f"[error] gateway returned {r.status}", file=sys.stderr)
console.print(f"[red][error] gateway returned {r.status}[/red]")
sys.exit(1)
def wait_for_reply(gateway: str, session_id: str, timeout: int = 400) -> str:
"""Open SSE stream and return first data event."""
def stream_reply(gateway: str, session_id: str, timeout: int = 400) -> str:
"""
Open the /stream/{session_id} SSE endpoint and display tokens live with
Rich. Returns the full assembled reply text.
"""
req = urllib.request.Request(
f"{gateway}/reply/{session_id}",
f"{gateway}/stream/{session_id}",
headers={"Accept": "text/event-stream"},
)
buffer = ""
with urllib.request.urlopen(req, timeout=timeout + 5) as r:
for raw_line in r:
line = raw_line.decode("utf-8").rstrip("\n")
if line.startswith("data:"):
return line[5:].strip().replace("\\n", "\n")
return ""
with Live(Text(""), console=console, refresh_per_second=20, transient=True) as live:
for raw_line in r:
line = raw_line.decode("utf-8").rstrip("\n")
if not line.startswith("data:"):
continue
chunk = line[5:].strip()
if chunk == "[DONE]":
break
chunk = chunk.replace("\\n", "\n")
buffer += chunk
live.update(Text(buffer))
# Render the complete reply as Markdown once streaming is done
console.print(Markdown(buffer))
return buffer
def main():
parser = argparse.ArgumentParser(description="Adolf CLI")
parser.add_argument("--url", default=GATEWAY, help="Gateway URL")
parser.add_argument("--session", default=f"cli-{os.getlogin()}", help="Session ID")
parser.add_argument("--session", default=f"cli-{os.getenv('USER', 'user')}",
help="Session ID")
parser.add_argument("--timeout", type=int, default=400, help="Reply timeout (seconds)")
args = parser.parse_args()
print(f"Adolf CLI (session={args.session}, gateway={args.url})")
print("Type your message and press Enter. Ctrl+C or Ctrl+D to exit.\n")
console.print(f"[bold]Adolf CLI[/bold] (session=[cyan]{args.session}[/cyan], "
f"gateway=[cyan]{args.url}[/cyan])")
console.print("Type your message and press Enter. Ctrl+C or Ctrl+D to exit.\n")
try:
while True:
@@ -68,12 +90,11 @@ def main():
continue
post_message(args.url, text, args.session)
print("...", end="", flush=True)
reply = wait_for_reply(args.url, args.session, timeout=args.timeout)
print(f"\r{reply}\n")
stream_reply(args.url, args.session, timeout=args.timeout)
console.print()
except KeyboardInterrupt:
print("\nbye")
console.print("\n[dim]bye[/dim]")
if __name__ == "__main__":


@@ -6,19 +6,29 @@ services:
- "8000:8000"
environment:
- PYTHONUNBUFFERED=1
# LiteLLM proxy — all LLM inference goes through here
- LITELLM_URL=http://host.docker.internal:4000/v1
- LITELLM_API_KEY=sk-fjQC1BxAiGFSMs
# Direct Ollama GPU URL — used only by VRAMManager for flush/prewarm
- OLLAMA_BASE_URL=http://host.docker.internal:11436
- DEEPAGENTS_MODEL=qwen3:4b
- DEEPAGENTS_COMPLEX_MODEL=qwen3:8b
- DEEPAGENTS_COMPLEX_MODEL=deepseek/deepseek-r1:free
- DEEPAGENTS_ROUTER_MODEL=qwen2.5:1.5b
- SEARXNG_URL=http://host.docker.internal:11437
- GRAMMY_URL=http://grammy:3001
- MATRIX_URL=http://host.docker.internal:3002
- CRAWL4AI_URL=http://crawl4ai:11235
- ROUTECHECK_URL=http://routecheck:8090
- ROUTECHECK_TOKEN=${ROUTECHECK_TOKEN}
volumes:
- ./logs:/app/logs
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on:
- openmemory
- grammy
- crawl4ai
- routecheck
restart: unless-stopped
openmemory:
@@ -27,8 +37,9 @@ services:
ports:
- "8765:8765"
environment:
# Extraction LLM (qwen2.5:1.5b) runs on GPU after reply — fast 2-5s extraction
# Extraction LLM runs on GPU — qwen2.5:1.5b for speed (~3s)
- OLLAMA_GPU_URL=http://host.docker.internal:11436
- OLLAMA_EXTRACTION_MODEL=qwen2.5:1.5b
# Embedding (nomic-embed-text) runs on CPU — fast enough for search (50-150ms)
- OLLAMA_CPU_URL=http://host.docker.internal:11435
extra_hosts:
@@ -45,6 +56,33 @@ services:
- DEEPAGENTS_URL=http://deepagents:8000
restart: unless-stopped
cli:
build:
context: .
dockerfile: Dockerfile.cli
container_name: cli
environment:
- DEEPAGENTS_URL=http://deepagents:8000
depends_on:
- deepagents
stdin_open: true
tty: true
profiles:
- tools
routecheck:
build: ./routecheck
container_name: routecheck
ports:
- "8090:8090"
environment:
- YANDEX_ROUTING_KEY=${YANDEX_ROUTING_KEY}
- INTERNAL_TOKEN=${ROUTECHECK_TOKEN}
- HTTPS_PROXY=http://host.docker.internal:56928
extra_hosts:
- "host.docker.internal:host-gateway"
restart: unless-stopped
crawl4ai:
image: unclecode/crawl4ai:latest
container_name: crawl4ai

fast_tools.py Normal file

@@ -0,0 +1,188 @@
"""
Fast Tools — pre-flight tools invoked by a classifier before the main LLM call.
Each FastTool has:
- matches(message) → bool : regex classifier that determines if this tool applies
- run(message) → str : async fetch that returns enrichment context
FastToolRunner holds a list of FastTools. The Router uses any_matches() to force
the tier to medium before LLM classification. run_agent_task() calls run_matching()
to build extra context that is injected into the system prompt.
To add a new fast tool:
1. Subclass FastTool, implement name/matches/run
2. Add an instance to the list passed to FastToolRunner in agent.py
"""
import asyncio
import re
from abc import ABC, abstractmethod
import httpx
class FastTool(ABC):
"""Base class for all pre-flight fast tools."""
@property
@abstractmethod
def name(self) -> str: ...
@abstractmethod
def matches(self, message: str) -> bool: ...
@abstractmethod
async def run(self, message: str) -> str: ...
_WMO_CODES = {
0: "clear sky", 1: "mainly clear", 2: "partly cloudy", 3: "overcast",
45: "fog", 48: "icy fog",
51: "light drizzle", 53: "drizzle", 55: "heavy drizzle",
61: "light rain", 63: "rain", 65: "heavy rain",
71: "light snow", 73: "snow", 75: "heavy snow", 77: "snow grains",
80: "light showers", 81: "showers", 82: "heavy showers",
85: "snow showers", 86: "heavy snow showers",
95: "thunderstorm", 96: "thunderstorm with hail", 99: "thunderstorm with heavy hail",
}
class WeatherTool(FastTool):
"""
Fetches current weather for Balashikha, Moscow region directly from open-meteo.com.
No API key required. Returns a ready-to-deliver reply — no LLM reformatting needed.
"""
_PATTERN = re.compile(
r"\b(weather|forecast|temperature|rain(ing)?|snow(ing)?|humidity|wind\s*speed"
r"|холодно|тепло|погода|прогноз погоды"
r"|how (hot|cold|warm) is it|what.?s the (weather|temp)|dress for the weather)\b",
re.IGNORECASE,
)
_URL = (
"https://api.open-meteo.com/v1/forecast"
"?latitude=55.7963&longitude=37.9382"
"&current=temperature_2m,apparent_temperature,relative_humidity_2m"
",wind_speed_10m,weather_code"
"&wind_speed_unit=ms"
)
@property
def name(self) -> str:
return "weather"
def matches(self, message: str) -> bool:
return bool(self._PATTERN.search(message))
async def run(self, message: str) -> str:
try:
async with httpx.AsyncClient(timeout=10) as client:
r = await client.get(self._URL)
r.raise_for_status()
c = r.json()["current"]
except Exception as e:
return f"[weather error: {e}]"
temp = c["temperature_2m"]
feels = c["apparent_temperature"]
humidity = c["relative_humidity_2m"]
wind = c["wind_speed_10m"]
condition = _WMO_CODES.get(c.get("weather_code", 0), "unknown")
return (
f"Balashikha: {condition}, {temp:+.0f}°C (feels like {feels:+.0f}°C), "
f"wind {wind:.1f} m/s, humidity {humidity}%."
)
class CommuteTool(FastTool):
"""
Returns real-time driving time from home (Balashikha) to a destination
using Yandex traffic data via the local routecheck service.
Triggered by queries about commute time, arrival, or road traffic.
The routecheck service handles Yandex API auth and the HTTPS proxy.
"""
_PATTERN = re.compile(
r"\b(commute|arrival time|how long.{0,20}(drive|get|travel|reach)"
r"|сколько.{0,20}(ехать|добираться|минут)"
r"|пробки|traffic|road.{0,10}now|drive to (work|office|center|москва|moscow)"
r"|when (will i|do i) (arrive|get there|reach))\b",
re.IGNORECASE,
)
# Home: Balashikha. Default destination: Moscow city center.
_HOME = "55.7963,37.9382"
_DEST = "55.7558,37.6173"
def __init__(self, routecheck_url: str, internal_token: str):
self._url = routecheck_url.rstrip("/")
self._token = internal_token
@property
def name(self) -> str:
return "commute"
def matches(self, message: str) -> bool:
return bool(self._PATTERN.search(message))
async def run(self, message: str) -> str:
if not self._token:
return "[commute: ROUTECHECK_TOKEN not configured]"
try:
async with httpx.AsyncClient(timeout=15) as client:
r = await client.get(
f"{self._url}/api/route",
params={"from": self._HOME, "to": self._DEST, "token": self._token},
)
r.raise_for_status()
d = r.json()
except Exception as e:
return f"[commute error: {e}]"
traffic = d["duration_traffic_min"]
normal = d["duration_min"]
dist = d["distance_km"]
delay = traffic - normal
lines = [
f"Current drive time from Balashikha to Moscow center:",
f" With traffic: {traffic} min",
f" Without traffic: {normal} min",
f" Distance: {dist} km",
]
if delay > 5:
lines.append(f" Traffic delay: +{delay} min")
return "\n".join(lines)
class FastToolRunner:
"""
Classifier + executor for fast tools.
Used in two places:
- Router.route(): any_matches() forces medium tier before LLM classification
- run_agent_task(): run_matching() builds enrichment context in the pre-flight gather
"""
def __init__(self, tools: list[FastTool]):
self._tools = tools
def any_matches(self, message: str) -> bool:
"""True if any fast tool applies to this message."""
return any(t.matches(message) for t in self._tools)
def matching_names(self, message: str) -> list[str]:
"""Names of tools that match this message (for logging)."""
return [t.name for t in self._tools if t.matches(message)]
async def run_matching(self, message: str) -> str:
"""Run all matching tools concurrently and return combined context."""
matching = [t for t in self._tools if t.matches(message)]
if not matching:
return ""
results = await asyncio.gather(*[t.run(message) for t in matching])
parts = [r for r in results if r and not r.startswith("[")]
return "\n\n".join(parts)
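Following the recipe in the module docstring, adding a new fast tool is a small subclass. The sketch below uses a hypothetical `TimeTool` (not part of the codebase) that needs no network access; the `FastTool` base class is copied inline only so the snippet runs standalone:

```python
import asyncio
import re
from abc import ABC, abstractmethod
from datetime import datetime

class FastTool(ABC):
    # Trimmed copy of the base class from fast_tools.py
    @property
    @abstractmethod
    def name(self) -> str: ...
    @abstractmethod
    def matches(self, message: str) -> bool: ...
    @abstractmethod
    async def run(self, message: str) -> str: ...

class TimeTool(FastTool):
    """Hypothetical tool: answers 'what time is it' without touching an LLM."""
    _PATTERN = re.compile(
        r"\b(what time|current time|который час|сколько времени)\b",
        re.IGNORECASE,
    )

    @property
    def name(self) -> str:
        return "time"

    def matches(self, message: str) -> bool:
        return bool(self._PATTERN.search(message))

    async def run(self, message: str) -> str:
        return f"Local time: {datetime.now():%H:%M}"

tool = TimeTool()
print(tool.matches("который час?"))              # True
print(asyncio.run(tool.run("what time is it")))
```

Step 2 of the recipe would then add `TimeTool()` to the list passed to `FastToolRunner` in agent.py.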

openmemory/CLAUDE.md Normal file

@@ -0,0 +1,26 @@
# openmemory
FastMCP server wrapping mem0 for persistent per-session memory, backed by Qdrant + nomic-embed-text.
## Tools exposed (MCP)
- `add_memory(text, user_id)` — extract facts from a conversation turn and store in Qdrant
- `search_memory(query, user_id)` — semantic search, returns results with score ≥ 0.5
- `get_all_memories(user_id)` — dump all stored memories for a session
These are called directly by `agent.py` (outside the agent loop), never exposed to the LLM as tools.
## Two Ollama instances
- **GPU** (`OLLAMA_GPU_URL`, port 11436) — extraction model (`qwen2.5:1.5b`): pulls facts from conversation text
- **CPU** (`OLLAMA_CPU_URL`, port 11435) — embedding model (`nomic-embed-text`): 50–150 ms per query
## Prompts
Custom `EXTRACTION_PROMPT` starts with `/no_think` to suppress qwen3 chain-of-thought and get clean JSON output. Custom `UPDATE_MEMORY_PROMPT` handles deduplication — mem0 merges new facts with existing ones rather than creating duplicates.
## Notes
- Qdrant collection is created automatically on first use
- Memory is keyed by `user_id` which equals `session_id` in `agent.py`
- Extraction runs after the reply is sent (background task) — GPU contention with medium model is avoided since the semaphore is released before `_store_memory()` is scheduled


@@ -6,6 +6,7 @@ from mem0 import Memory
# Extraction LLM — GPU Ollama (qwen3:4b, same model as medium agent)
# Runs after reply when GPU is idle; spin-wait in agent.py prevents contention
OLLAMA_GPU_URL = os.getenv("OLLAMA_GPU_URL", "http://host.docker.internal:11436")
EXTRACTION_MODEL = os.getenv("OLLAMA_EXTRACTION_MODEL", "qwen2.5:1.5b")
# Embedding — CPU Ollama (nomic-embed-text, 137 MB RAM)
# Used for both search (50-150ms, acceptable) and store-time embedding
@@ -94,7 +95,7 @@ config = {
"llm": {
"provider": "ollama",
"config": {
"model": "qwen3:4b",
"model": EXTRACTION_MODEL,
"ollama_base_url": OLLAMA_GPU_URL,
"temperature": 0.1, # consistent JSON output
},

pytest.ini Normal file

@@ -0,0 +1,4 @@
[pytest]
testpaths = tests/unit
pythonpath = .
asyncio_mode = auto

routecheck/CLAUDE.md Normal file

@@ -0,0 +1,25 @@
# routecheck
FastAPI service providing a Yandex Routing API proxy behind an image captcha.
## Purpose
Yandex Routing API free tier requires a website that uses the API. This service is that website.
It also exposes an internal endpoint (`/api/route`) used by `CommuteTool` in `fast_tools.py`.
## Two access paths
- **Web UI** (`/`): solve PIL arithmetic captcha → get a token → query any two lat/lon points
- **Internal API**: `GET /api/route?from=lat,lon&to=lat,lon&token=$ROUTECHECK_TOKEN` — no captcha
## Key env vars
- `YANDEX_ROUTING_KEY` — from developer.tech.yandex.ru, Router API, free tier
- `INTERNAL_TOKEN` — equals `ROUTECHECK_TOKEN` from root `.env`; shared with deepagents
- `HTTPS_PROXY` — set to `http://host.docker.internal:56928`; container has no direct external internet
## Notes
- Captchas expire after 5 min, route tokens after 1 hour, both stored in-memory (restart clears them)
- Yandex API expects `lon,lat` order (not `lat,lon`) — `app.py` swaps automatically
- Captcha image endpoint: `GET /captcha/image/{id}` — regenerates on each call with random noise
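The `lat,lon` → `lon,lat` swap noted above is pure string work and easy to get backwards. A minimal sketch of what `app.py` does (the helper name is illustrative, not the actual function in the file):

```python
def to_yandex_waypoints(from_coords: str, to_coords: str) -> str:
    """Convert two 'lat,lon' strings to the 'lon,lat|lon,lat' order Yandex expects."""
    from_lat, from_lon = map(float, from_coords.split(","))
    to_lat, to_lon = map(float, to_coords.split(","))
    return f"{from_lon},{from_lat}|{to_lon},{to_lat}"

# Home (Balashikha) → Moscow center, the same pair CommuteTool uses
print(to_yandex_waypoints("55.7963,37.9382", "55.7558,37.6173"))
# → 37.9382,55.7963|37.6173,55.7558
```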

routecheck/Dockerfile Normal file

@@ -0,0 +1,6 @@
FROM python:3.12-slim
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends fonts-dejavu-core && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir fastapi uvicorn pillow httpx
COPY app.py .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8090"]

routecheck/app.py Normal file

@@ -0,0 +1,377 @@
"""
RouteCheck — local routing web service with image captcha.
Endpoints:
GET / — web UI
GET /captcha/image/{id} — PNG captcha image
POST /api/captcha/new — create captcha, return {id}
POST /api/captcha/solve — {id, answer} → {token} or 400
GET /api/route — ?from=lat,lon&to=lat,lon&token=...
token = solved captcha token OR INTERNAL_TOKEN env var
"""
import io
import math
import os
import random
import string
import time
import uuid
from typing import Optional
import httpx
from fastapi import FastAPI, HTTPException, Query
from fastapi.responses import HTMLResponse, JSONResponse, StreamingResponse
from PIL import Image, ImageDraw, ImageFilter, ImageFont
from pydantic import BaseModel
app = FastAPI(title="RouteCheck")
# ── Config ─────────────────────────────────────────────────────────────────────
YANDEX_KEY = os.getenv("YANDEX_ROUTING_KEY", "")
INTERNAL_TOKEN = os.getenv("INTERNAL_TOKEN", "")
HTTPS_PROXY = os.getenv("HTTPS_PROXY", "")
CAPTCHA_TTL = 300 # seconds a captcha is valid
TOKEN_TTL = 3600 # seconds a solved token is valid
# ── In-memory captcha store ────────────────────────────────────────────────────
_captchas: dict[str, dict] = {} # id → {answer, token, expires}
_tokens: dict[str, float] = {} # token → expires
def _purge():
now = time.time()
for k in list(_captchas.keys()):
if _captchas[k]["expires"] < now:
del _captchas[k]
for k in list(_tokens.keys()):
if _tokens[k] < now:
del _tokens[k]
# ── Captcha image generation ───────────────────────────────────────────────────
def _rand_color(dark=False):
if dark:
return tuple(random.randint(0, 100) for _ in range(3))
return tuple(random.randint(140, 255) for _ in range(3))
def _make_captcha_image(text: str) -> bytes:
W, H = 220, 80
img = Image.new("RGB", (W, H), color=_rand_color())
draw = ImageDraw.Draw(img)
# Background noise: random lines
for _ in range(8):
x1, y1 = random.randint(0, W), random.randint(0, H)
x2, y2 = random.randint(0, W), random.randint(0, H)
draw.line([(x1, y1), (x2, y2)], fill=_rand_color(dark=True), width=2)
# Background noise: random dots
for _ in range(300):
x, y = random.randint(0, W), random.randint(0, H)
draw.point((x, y), fill=_rand_color(dark=True))
# Draw each character with slight random offset and rotation
try:
font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 36)
except Exception:
font = ImageFont.load_default()
char_w = W // (len(text) + 2)
for i, ch in enumerate(text):
x = char_w + i * char_w + random.randint(-4, 4)
y = (H - 40) // 2 + random.randint(-6, 6)
# Draw shadow
draw.text((x + 2, y + 2), ch, font=font, fill=_rand_color(dark=True))
draw.text((x, y), ch, font=font, fill=_rand_color(dark=True))
# Wavy distortion via pixel manipulation
pixels = img.load()
for x in range(W):
shift = int(4 * math.sin(x / 15.0))
col = [pixels[x, y] for y in range(H)]
for y in range(H):
pixels[x, y] = col[(y - shift) % H]
img = img.filter(ImageFilter.SMOOTH)
buf = io.BytesIO()
img.save(buf, format="PNG")
return buf.getvalue()
def _generate_problem() -> tuple[str, int]:
"""Return (display_text, answer)."""
ops = [
lambda a, b: (f"{a} + {b} = ?", a + b),
lambda a, b: (f"{a} × {b} = ?", a * b),
lambda a, b: (f"{max(a,b)} − {min(a,b)} = ?", max(a, b) - min(a, b)),
]
op = random.choice(ops)
a, b = random.randint(2, 9), random.randint(2, 9)
text, answer = op(a, b)
return text, answer
# ── Routes ─────────────────────────────────────────────────────────────────────
@app.get("/", response_class=HTMLResponse)
async def index():
return HTML_PAGE
@app.get("/captcha/image/{captcha_id}")
async def captcha_image(captcha_id: str):
_purge()
entry = _captchas.get(captcha_id)
if not entry:
raise HTTPException(404, "Captcha not found or expired")
png = _make_captcha_image(entry["problem"])
return StreamingResponse(io.BytesIO(png), media_type="image/png",
headers={"Cache-Control": "no-store"})
class CaptchaNewResponse(BaseModel):
id: str
@app.post("/api/captcha/new")
async def captcha_new():
_purge()
problem_text, answer = _generate_problem()
cid = str(uuid.uuid4())
_captchas[cid] = {
"problem": problem_text,
"answer": answer,
"expires": time.time() + CAPTCHA_TTL,
}
return {"id": cid}
class SolveRequest(BaseModel):
id: str
answer: int
@app.post("/api/captcha/solve")
async def captcha_solve(req: SolveRequest):
_purge()
entry = _captchas.get(req.id)
if not entry:
raise HTTPException(400, "Captcha expired or not found")
if entry["answer"] != req.answer:
raise HTTPException(400, "Wrong answer")
token = str(uuid.uuid4())
_tokens[token] = time.time() + TOKEN_TTL
del _captchas[req.id]
return {"token": token}
@app.get("/api/route")
async def route(
from_coords: str = Query(..., alias="from", description="lat,lon"),
to_coords: str = Query(..., alias="to", description="lat,lon"),
token: str = Query(...),
):
_purge()
# Auth: internal service token or valid captcha token
if token != INTERNAL_TOKEN:
if token not in _tokens:
raise HTTPException(401, "Invalid or expired token — solve captcha first")
if not YANDEX_KEY:
raise HTTPException(503, "YANDEX_ROUTING_KEY not configured")
# Parse coords
try:
from_lat, from_lon = map(float, from_coords.split(","))
to_lat, to_lon = map(float, to_coords.split(","))
except ValueError:
raise HTTPException(400, "coords must be lat,lon")
# Yandex Routing API expects lon,lat order
waypoints = f"{from_lon},{from_lat}|{to_lon},{to_lat}"
transport = httpx.AsyncHTTPTransport(proxy=HTTPS_PROXY) if HTTPS_PROXY else None
async with httpx.AsyncClient(timeout=15, transport=transport) as client:
try:
r = await client.get(
"https://api.routing.yandex.net/v2/route",
params={"apikey": YANDEX_KEY, "waypoints": waypoints, "mode": "driving"},
)
except Exception as e:
raise HTTPException(502, f"Yandex API unreachable: {e}")
if r.status_code != 200:
raise HTTPException(502, f"Yandex API error {r.status_code}: {r.text[:200]}")
data = r.json()
try:
leg = data["route"]["legs"][0]
duration_s = leg["duration"]
duration_traffic_s = leg.get("duration_in_traffic", duration_s)
distance_m = leg["distance"]
except (KeyError, IndexError) as e:
raise HTTPException(502, f"Unexpected Yandex response: {e}; {str(data)[:200]}")
return {
"duration_min": round(duration_s / 60),
"duration_traffic_min": round(duration_traffic_s / 60),
"distance_km": round(distance_m / 1000, 1),
}
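The coordinate handling in /api/route is easy to get backwards: the user supplies `lat,lon`, but the waypoints string must be `lon,lat`. A standalone sketch of just that conversion (`build_waypoints` is a hypothetical helper extracted for illustration):

```python
def build_waypoints(from_coords: str, to_coords: str) -> str:
    """Parse two 'lat,lon' strings and emit 'lon,lat|lon,lat' in the order the routing API expects."""
    from_lat, from_lon = map(float, from_coords.split(","))
    to_lat, to_lon = map(float, to_coords.split(","))
    return f"{from_lon},{from_lat}|{to_lon},{to_lat}"

print(build_waypoints("55.7963,37.9382", "55.7558,37.6173"))
# 37.9382,55.7963|37.6173,55.7558
```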
# ── HTML ───────────────────────────────────────────────────────────────────────
HTML_PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>RouteCheck</title>
<style>
* { box-sizing: border-box; margin: 0; padding: 0; }
body { font-family: system-ui, sans-serif; background: #0f172a; color: #e2e8f0; min-height: 100vh;
display: flex; align-items: center; justify-content: center; }
.card { background: #1e293b; border-radius: 12px; padding: 2rem; width: 420px;
box-shadow: 0 20px 60px rgba(0,0,0,.5); }
h1 { font-size: 1.4rem; font-weight: 700; color: #38bdf8; margin-bottom: .3rem; }
.sub { color: #94a3b8; font-size: .85rem; margin-bottom: 1.5rem; }
label { display: block; font-size: .8rem; color: #94a3b8; margin-bottom: .3rem; margin-top: 1rem; }
input { width: 100%; background: #0f172a; border: 1px solid #334155; border-radius: 6px;
color: #e2e8f0; padding: .55rem .75rem; font-size: .95rem; outline: none; }
input:focus { border-color: #38bdf8; }
button { width: 100%; margin-top: 1.2rem; padding: .7rem; background: #0ea5e9;
border: none; border-radius: 6px; color: #fff; font-size: 1rem;
font-weight: 600; cursor: pointer; transition: background .2s; }
button:hover { background: #0284c7; }
button:disabled { background: #334155; cursor: default; }
.captcha-row { display: flex; gap: .75rem; align-items: center; margin-top: 1rem; }
.captcha-row img { border-radius: 6px; border: 1px solid #334155; cursor: pointer; }
.captcha-row input { flex: 1; }
.result { margin-top: 1.2rem; background: #0f172a; border-radius: 8px; padding: 1rem;
border-left: 3px solid #38bdf8; display: none; }
.result .big { font-size: 1.6rem; font-weight: 700; color: #38bdf8; }
.result .label { font-size: .8rem; color: #64748b; margin-top: .2rem; }
.result .row { display: flex; gap: 1.5rem; margin-top: .8rem; }
.result .metric { flex: 1; }
.result .metric .val { font-size: 1.1rem; font-weight: 600; }
.error { color: #f87171; margin-top: .8rem; font-size: .85rem; display: none; }
.step { display: none; }
.step.active { display: block; }
a.refresh { font-size: .75rem; color: #38bdf8; text-decoration: none; display: block;
margin-top: .4rem; }
a.refresh:hover { text-decoration: underline; }
</style>
</head>
<body>
<div class="card">
<h1>RouteCheck</h1>
<p class="sub">Real-time driving time with Yandex traffic data</p>
<!-- Step 1: captcha -->
<div class="step active" id="step-captcha">
<label>Prove you are human</label>
<div class="captcha-row">
<img id="captcha-img" src="" alt="captcha" width="160" height="60"
title="Click to refresh" onclick="loadCaptcha()">
<input id="captcha-ans" type="number" placeholder="Answer" min="0" max="999">
</div>
<a class="refresh" href="#" onclick="loadCaptcha();return false;">↻ New challenge</a>
<div class="error" id="captcha-err">Wrong answer, try again.</div>
<button id="captcha-btn" onclick="solveCaptcha()">Verify →</button>
</div>
<!-- Step 2: route query -->
<div class="step" id="step-route">
<label>From (lat, lon)</label>
<input id="from" type="text" placeholder="55.7963, 37.9382" value="55.7963, 37.9382">
<label>To (lat, lon)</label>
<input id="to" type="text" placeholder="55.7558, 37.6173" value="55.7558, 37.6173">
<button id="route-btn" onclick="queryRoute()">Get travel time</button>
<div class="error" id="route-err"></div>
<div class="result" id="result">
<div class="big" id="res-traffic"></div>
<div class="label">with current traffic</div>
<div class="row">
<div class="metric"><div class="val" id="res-normal"></div>
<div class="label">without traffic</div></div>
<div class="metric"><div class="val" id="res-dist"></div>
<div class="label">distance</div></div>
</div>
</div>
</div>
</div>
<script>
let captchaId = null;
let routeToken = null;
async function loadCaptcha() {
const r = await fetch('/api/captcha/new', {method: 'POST'});
const d = await r.json();
captchaId = d.id;
document.getElementById('captcha-img').src = '/captcha/image/' + captchaId + '?t=' + Date.now();
document.getElementById('captcha-ans').value = '';
document.getElementById('captcha-err').style.display = 'none';
}
async function solveCaptcha() {
const ans = parseInt(document.getElementById('captcha-ans').value);
if (isNaN(ans)) return;
const btn = document.getElementById('captcha-btn');
btn.disabled = true;
const r = await fetch('/api/captcha/solve', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({id: captchaId, answer: ans})
});
if (r.ok) {
const d = await r.json();
routeToken = d.token;
document.getElementById('step-captcha').classList.remove('active');
document.getElementById('step-route').classList.add('active');
} else {
document.getElementById('captcha-err').style.display = 'block';
loadCaptcha();
}
btn.disabled = false;
}
async function queryRoute() {
const from = document.getElementById('from').value.trim();
const to = document.getElementById('to').value.trim();
const btn = document.getElementById('route-btn');
const err = document.getElementById('route-err');
err.style.display = 'none';
document.getElementById('result').style.display = 'none';
btn.disabled = true;
btn.textContent = 'Fetching…';
const r = await fetch(`/api/route?from=${encodeURIComponent(from)}&to=${encodeURIComponent(to)}&token=${routeToken}`);
btn.disabled = false;
btn.textContent = 'Get travel time';
if (!r.ok) {
const d = await r.json();
err.textContent = d.detail || 'Error';
err.style.display = 'block';
return;
}
const d = await r.json();
document.getElementById('res-traffic').textContent = d.duration_traffic_min + ' min';
document.getElementById('res-normal').textContent = d.duration_min + ' min';
document.getElementById('res-dist').textContent = d.distance_km + ' km';
document.getElementById('result').style.display = 'block';
}
loadCaptcha();
document.getElementById('captcha-ans').addEventListener('keydown', e => {
if (e.key === 'Enter') solveCaptcha();
});
</script>
</body>
</html>
"""

router.py

@@ -1,10 +1,38 @@
import asyncio
import re
import math
from typing import Optional
from openai import AsyncOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from fast_tools import FastToolRunner
# ── Regex pre-classifier ─────────────────────────────────────────────────────
# Catches obvious light-tier patterns before calling the LLM.
# Keyed by regex → compiled pattern.
# ── Regex pre-classifiers ─────────────────────────────────────────────────────
# Complex: keyword triggers that reliably signal deep multi-source research
_COMPLEX_PATTERNS = re.compile(
r"(?:^|\s)("
r"research|investigate|deep.dive|think carefully"
r"|write a (?:detailed|comprehensive|full|thorough|complete)"
r"|compare all|find and (?:compare|summarize|analyze)"
r"|in[- ]depth analysis|comprehensive guide"
r"|detailed (?:report|analysis|comparison|breakdown|overview)"
r"|everything about|all (?:major|available|self-hosted|open.source)"
r"|pros and cons|with (?:sources|citations|references)"
# Russian complex research keywords (no trailing \b — stems like подробн match подробное/подробный)
r"|исследуй|изучи все|сравни все|найди и сравни|найди и опиши"
r"|напиши подробн|напиши детальн|напиши полн"
r"|подробный отчет|детальн\w+ (?:анализ|сравнение|отчет)"
r"|подробное (?:руководство|сравнение)|полное руководство"
r"|все варианты|все способы|все доступные|все самохостируемые|все платформы"
r"|лучшие практики|все инструменты|все решения|все протоколы"
r"|найди детальн|найди и кратко опиши"
r"|изучи свежие|изучи лучши|изучи все"
r"|сравни все\b"
r")",
re.IGNORECASE,
)
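A few spot checks on the complex-trigger idea, using an abridged copy of the pattern (only a handful of branches reproduced here; the compiled regex above has many more):

```python
import re

# Abridged sketch of _COMPLEX_PATTERNS: keyword triggers preceded by start-of-string or whitespace
complex_pat = re.compile(
    r"(?:^|\s)("
    r"research|investigate|deep.dive"
    r"|write a (?:detailed|comprehensive|full|thorough|complete)"
    r"|pros and cons"
    r")",
    re.IGNORECASE,
)

print(bool(complex_pat.search("please research everything about X")))  # True
print(bool(complex_pat.search("Write a detailed report on Y")))        # True
print(bool(complex_pat.search("what is Docker")))                      # False
```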
# Light: trivial queries that need no tools or memory
_LIGHT_PATTERNS = re.compile(
r"^("
# Greetings / farewells
@@ -14,35 +42,316 @@ _LIGHT_PATTERNS = re.compile(
r"|thanks?|thank you|thx|ty|ok|okay|k|cool|great|awesome|perfect|sounds good|got it|nice|sure"
r"|how are you|how are you\?|how are you doing(\s+today)?[?!.]*"
r"|what.?s up"
# Calendar facts: "what day comes after X?" / "what comes after X?"
# Calendar facts
r"|what\s+day\s+(comes\s+after|follows|is\s+after)\s+\w+[?!.]*"
r"|what\s+comes\s+after\s+\w+[?!.]*"
# Acronym expansions: "what does X stand for?"
# Acronym expansions
r"|what\s+does\s+\w+\s+stand\s+for[?!.]*"
# Russian greetings / farewells / acknowledgements
r"|привет|пока|спасибо|здравствуй|здравствуйте|добрый день|добрый вечер|доброе утро"
r"|окей|хорошо|отлично|понятно|ок|ладно|договорились|спс|благодарю"
r"|пожалуйста|не за что|всё понятно|ясно"
r"|как дела|как ты|как жизнь|всё хорошо|всё ок"
# Assistant control words / confirmations
r"|да|нет|стоп|отмена|отменить|подожди|повтори|повторить|не нужно|не надо"
r"|слышишь\s+меня|ты\s+тут|отлично[,!]?\s+спасибо"
r"|yes|no|stop|cancel|wait|repeat"
# Russian tech definitions — static knowledge (no tools needed)
r"|что\s+такое\s+\S+"
r"|что\s+означает\s+\S+"
r"|сколько\s+(?:бит|байт|байтов|мегабайт|мегабайтов|гигабайт|гигабайтов)(?:\s+\w+)*"
# Compound Russian greetings
r"|привет[,!]?\s+как\s+дела"
r"|добрый\s+(?:день|вечер|утро)[,!]?\s+как\s+дела"
r")[\s!.?]*$",
re.IGNORECASE,
)
# ── LLM classification prompt ─────────────────────────────────────────────────
CLASSIFY_PROMPT = """Classify the message. Output ONLY one word: light, medium, or complex.
# ── Semantic router utterances ────────────────────────────────────────────────
# These are embedded at startup. New messages are classified by cosine
# similarity — whichever tier's centroid is closest wins.
_LIGHT_UTTERANCES = [
# General facts (English)
"what is 2+2",
"what is the capital of France",
"name the three primary colors",
"tell me a short joke",
"is the sky blue",
"is water wet",
"how many days in a week",
"what is the speed of light",
"what is the boiling point of water",
"spell the word beautiful",
"what color is the ocean",
"how many inches in a foot",
"who wrote hamlet",
"what is pi",
"what year did world war two end",
"what is the largest planet",
"how many continents are there",
"what does DNA stand for",
"what language do they speak in Brazil",
"what is the square root of 144",
# Tech definitions — static knowledge (English)
"what is Docker",
"what is a VPN",
"what is SSH",
"what is a reverse proxy",
"what is an API",
"what is a firewall",
"what is a container",
"what is DNS",
"what is HTTPS",
"what is a load balancer",
"what is Kubernetes",
"what is Git",
"what is a network port",
"what is an IP address",
"what is a subnet mask",
"what is the OSI model",
"how many bits in a byte",
"how many bytes in a gigabyte",
"what is TCP",
"what is a REST API",
# Russian — static facts and definitions
"что такое IP-адрес",
"что такое VPN",
"что такое Docker",
"что такое DNS",
"что такое SSH",
"что означает API",
"сколько байт в гигабайте",
"сколько бит в байте",
"что такое Zigbee",
"что такое Z-Wave",
"что такое брандмауэр",
"что такое виртуальная машина",
"что такое обратный прокси",
"привет",
"пока",
"спасибо",
"как дела",
"что такое Matter протокол",
"сколько планет в солнечной системе",
"чему равно число Пи",
# Russian — more static definitions
"что такое TCP/IP",
"что такое подсеть",
"скорость света",
"сколько дней в году",
"что такое Kubernetes",
"что такое Git",
"что такое REST API",
"что такое TCP",
"что такое UDP",
"что такое VLAN",
"сколько мегабайт в гигабайте",
"что такое процессор",
"что такое оперативная память",
"что такое виртуализация",
"что такое Linux",
"что такое умный дом",
"что такое Home Assistant",
"что такое Matter",
]
LIGHT = answerable from general knowledge, no internet needed:
what is 2+2 / what is the capital of France / name the three primary colors
tell me a short joke / is the sky blue / is water wet
_MEDIUM_UTTERANCES = [
# English — current data, memory, actions
"what is the weather today",
"what is the bitcoin price right now",
"what are the latest news",
"what did we talk about last time",
"what is my name",
"where do I live",
"what do you know about me",
"what did I tell you before",
"what is the current temperature outside",
"remind me what I said about my project",
"search for the latest iPhone release",
"find me a restaurant nearby",
"turn on the lights in the living room",
"turn off all lights",
"set temperature to 22 degrees",
"what is the current traffic to Moscow",
"check if anyone is home",
"what devices are currently on",
"look up my public IP address",
"show me recent news about Proxmox",
# Russian — weather and commute
"какая сегодня погода в Балашихе",
"пойдет ли сегодня дождь",
"какая температура на улице сейчас",
"погода на завтра",
"будет ли снег сегодня",
"сколько ехать до Москвы сейчас",
"какие пробки на дороге до Москвы",
"время в пути на работу",
"есть ли пробки сейчас",
"стоит ли брать зонтик",
# Russian — smart home control
"включи свет в гостиной",
"выключи свет на кухне",
"какая температура дома",
"установи температуру 22 градуса",
"выключи все лампочки",
"какие устройства сейчас включены",
"включи ночной режим",
"открой шторы в гостиной",
"включи свет в спальне на 50 процентов",
"выключи свет во всём доме",
"включи вентилятор в детской",
"закрыты ли все окна",
"выключи телевизор",
"какое потребление электричества сегодня",
"включи кофемашину",
"сколько у нас датчиков движения",
"состояние всех дверных замков",
"есть ли кто-нибудь дома",
"установи будильник на 7 утра",
# Russian — personal memory
"как меня зовут",
"где я живу",
"что мы обсуждали в прошлый раз",
"что ты знаешь о моем домашнем сервере",
"напомни, какие сервисы я запускаю",
"что я просил тебя запомнить",
"что я говорил о своей сети",
# Russian — current info lookups requiring network/tools
"какой сейчас курс биткоина",
"курс доллара к рублю сейчас",
"какая последняя версия Docker",
"как перезапустить Docker контейнер",
"как посмотреть логи Docker контейнера",
"какие новые функции в Home Assistant 2024",
"есть ли проблемы у Cloudflare сегодня",
"какие новые Zigbee устройства вышли в 2024 году",
"найди хороший опенсорс менеджер фотографий",
"последние новости Proxmox",
"напиши bash команду для поиска больших файлов",
"как вывести список всех запущенных контейнеров",
"как проверить использование диска в Linux",
]
MEDIUM = requires web search or the user's stored memories:
current weather / today's news / Bitcoin price / what did we talk about
what is my name / where do I live / what is my job / do I have any pets
what do you know about me / what are my preferences / what did I tell you
_COMPLEX_UTTERANCES = [
# English
"research everything about Elon Musk's recent projects and investments",
"write a detailed report on climate change solutions with sources",
"investigate the history and current state of quantum computing",
"find and summarize the latest academic papers on transformer architectures",
"analyze in depth the pros and cons of nuclear energy with citations",
"research the background and controversies around this person",
"compare all major cloud providers with detailed pricing and features",
"write a comprehensive biography of this historical figure",
"investigate what caused the 2008 financial crisis with multiple sources",
"research the best programming languages in 2024 with detailed comparison",
"find everything published about this medical condition and treatments",
"do a deep dive into the latest developments in artificial general intelligence",
"research and compare all options for starting a business in Europe",
"investigate recent news and controversies around this company",
"write a thorough analysis of geopolitical tensions in the Middle East",
"find detailed information on the side effects and studies for this medication",
"research the top 10 JavaScript frameworks with benchmarks and community data",
"investigate who is funding AI research and what their goals are",
"write a detailed market analysis for the electric vehicle industry",
"research everything you can find about this startup or technology",
# Russian — deep research
"исследуй и сравни все варианты умного домашнего освещения",
"напиши подробный отчет о протоколах умного дома",
"изучи все самохостируемые медиасерверы и сравни их",
"исследуй лучшие практики безопасности домашнего сервера",
"сравни все системы резервного копирования для Linux",
"напиши детальное сравнение WireGuard и OpenVPN",
"исследуй все варианты голосового управления на русском языке",
"изучи все опенсорс альтернативы Google сервисам",
"напиши подробный анализ локальных языковых моделей",
"исследуй лучшие инструменты мониторинга для домашнего сервера",
# Russian — more deep research queries matching benchmark
"исследуй и сравни Proxmox, Unraid и TrueNAS для домашней лаборатории",
"напиши подробное руководство по безопасности домашнего сервера",
"исследуй все доступные дашборды для самохостинга и сравни их",
"найди детальные бенчмарки ARM одноплатных компьютеров для домашней лаборатории",
"исследуй лучший стек мониторинга для самохостинга в 2024 году",
"исследуй и сравни WireGuard, OpenVPN и Tailscale для домашней сети",
"исследуй лучшие практики сегментации домашней сети с VLAN",
"изучи все самохостируемые DNS решения и их возможности",
"исследуй и сравни все платформы умного дома: Home Assistant и другие",
"изучи лучшие Zigbee координаторы и их совместимость с Home Assistant",
"напиши детальный отчет о поддержке протокола Matter и совместимости устройств",
"исследуй все способы интеграции умных ламп с Home Assistant",
"найди и сравни все варианты датчиков движения для умного дома",
"исследуй и сравни все самохостируемые решения для хранения фотографий",
"изучи лучшие самохостируемые медиасерверы: Jellyfin, Plex и Emby",
"исследуй последние достижения в локальном LLM инференсе и обзор моделей",
"изучи лучшие опенсорс альтернативы Google сервисов для приватности",
"найди и кратко опиши все крупные самохостируемые менеджеры паролей",
"напиши детальный анализ текущего состояния AI ассистентов для самохостинга",
"исследуй и сравни все инструменты оркестрации контейнеров для домашней лаборатории",
"изучи лучшие подходы к автоматическому резервному копированию в Linux",
"исследуй и сравни все самохостируемые инструменты личных финансов",
"изучи свежие CVE и уязвимости в популярном самохостируемом ПО",
"напиши подробное руководство по настройке автоматизаций в Home Assistant",
"исследуй все варианты голосового управления умным домом на русском языке",
"сравни все системы резервного копирования для Linux: Restic, BorgBackup и другие",
"исследуй лучшие самохостируемые системы мониторинга сети: Zabbix, Grafana",
"изучи все варианты локального запуска языковых моделей на видеокарте",
"напиши подробный отчет о технологиях синтеза речи с открытым исходным кодом",
"исследуй все способы интеграции умных розеток с мониторингом потребления",
"напиши полное руководство по настройке обратного прокси Caddy",
"исследуй лучшие практики написания Docker Compose файлов для продакшена",
"сравни все самохостируемые облачные хранилища: Nextcloud, Seafile и другие",
"изучи все доступные локальные ассистенты с голосовым управлением",
"исследуй все самохостируемые решения для блокировки рекламы: Pi-hole, AdGuard",
"напиши детальное сравнение систем управления конфигурацией: Ansible, Puppet",
"исследуй все протоколы умного дома и их плюсы и минусы: Zigbee, Z-Wave, Matter",
"найди и сравни все фреймворки для создания локальных AI ассистентов",
"исследуй лучшие решения для автоматического управления медиатекой",
"изучи все варианты самохостируемых систем учёта расходов с возможностью импорта",
"напиши сравнение всех вариантов самохостинга для хранения и синхронизации файлов",
"исследуй все открытые протоколы для умного дома и их экосистемы",
"изучи лучшие инструменты для автоматизации домашней инфраструктуры",
]
COMPLEX = /think prefix only:
/think compare frameworks / /think plan a trip
Message: {message}
Output (one word only — light, medium, or complex):"""
# Medium: queries that require tools, actions, or real-time data (not static knowledge)
_MEDIUM_PATTERNS = re.compile(
r"(?:"
# Russian smart home commands — always need HA integration
r"(?:включи|выключи|открой|закрой|установи|поставь|убавь|прибавь|переключи)\s"
r"|(?:какая|какой|какое|каково)\s+(?:температура|влажность|потребление|состояние|статус)\s"
r"|(?:сколько|есть ли)\s.*(?:датчик|устройств|замк)"
# Russian memory queries
r"|как меня зовут|где я живу|что мы обсуждали|что я говорил|что я просил"
r"|напомни\b|что ты знаешь обо мне"
# Russian current info
r"|курс (?:доллара|биткоина|евро|рубл)"
r"|(?:последние |свежие )?новости\b"
r"|(?:погода|температура)\s+(?:на завтра|на неделю)"
# Smart home commands that don't use verb-first pattern
r"|(?:свет|лампочк|освещени)\w*\s+(?:включ|выключ|убавь|прибавь)"
r"|(?:дома|в доме|по всему дому)\s+(?:свет|лампочк)"
r"|(?:режим|сцена)\s+(?:ночной|утренний|вечерний|кинотеатр)"
r")",
re.IGNORECASE,
)
LIGHT_REPLY_PROMPT = """You are a helpful Telegram assistant. Answer briefly and naturally (1-3 sentences). Be friendly."""
_EMBED_MODEL = "ollama/nomic-embed-text"
def _cosine(a: list[float], b: list[float]) -> float:
dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(x * x for x in b))
if norm_a == 0 or norm_b == 0:
return 0.0
return dot / (norm_a * norm_b)
def _centroid(embeddings: list[list[float]]) -> list[float]:
n = len(embeddings)
dim = len(embeddings[0])
return [sum(embeddings[i][d] for i in range(n)) / n for d in range(dim)]
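The classifier's geometry is just centroids plus cosine similarity; a toy sketch with 2-D "embeddings" (real vectors come from nomic-embed-text, so the numbers here are illustrative only):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

light = centroid([[1.0, 0.1], [0.9, 0.0]])   # toy cluster near the x-axis
medium = centroid([[0.1, 1.0], [0.0, 0.9]])  # toy cluster near the y-axis
query = [0.8, 0.2]                            # closer to the "light" cluster
scores = [("light", cosine(query, light)), ("medium", cosine(query, medium))]
print(max(scores, key=lambda s: s[1])[0])  # light
```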
def _format_history(history: list[dict]) -> str:
if not history:
@@ -55,64 +364,97 @@ def _format_history(history: list[dict]) -> str:
return "\n".join(lines)
def _parse_tier(text: str) -> str:
"""Extract tier from raw model output. Default to medium."""
t = text.strip().lower()
snippet = t[:60]
if "complex" in snippet:
return "complex"
if "medium" in snippet:
return "medium"
if "light" in snippet:
return "light"
# Model invented a descriptive category (e.g. "simplefact", "trivial", "basic") →
# treat as light since it recognised the question doesn't need tools
if any(w in snippet for w in ("simple", "fact", "trivial", "basic", "easy", "general")):
return "light"
return "medium" # safe default
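The fallback parser tolerates chatty model output; a quick behavioral sketch (a standalone copy of the same logic, kept separate from the module):

```python
def parse_tier(text: str) -> str:
    """Mirror of _parse_tier above: scan the first 60 chars, default to medium."""
    snippet = text.strip().lower()[:60]
    for tier in ("complex", "medium", "light"):
        if tier in snippet:
            return tier
    # Descriptive synonyms the model sometimes invents instead of a tier word
    if any(w in snippet for w in ("simple", "fact", "trivial", "basic", "easy", "general")):
        return "light"
    return "medium"

print(parse_tier("Light."))                # light
print(parse_tier("that's a trivial one"))  # light, via descriptive synonym
print(parse_tier("hmm, not sure"))         # medium, the safe default
```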
class Router:
def __init__(self, model):
self.model = model
def __init__(
self,
model,
embedder: AsyncOpenAI,
fast_tool_runner: FastToolRunner | None = None,
):
self.model = model # qwen2.5:1.5b — used only for generating light replies
self._embedder = embedder
self._fast_tool_runner = fast_tool_runner
self._light_centroid: list[float] | None = None
self._medium_centroid: list[float] | None = None
self._complex_centroid: list[float] | None = None
async def initialize(self) -> None:
"""Pre-compute utterance embeddings. Call once at startup. Retries until LiteLLM is ready."""
print("[router] embedding utterances for semantic classifier...", flush=True)
texts = _LIGHT_UTTERANCES + _MEDIUM_UTTERANCES + _COMPLEX_UTTERANCES
for attempt in range(10):
try:
resp = await self._embedder.embeddings.create(model=_EMBED_MODEL, input=texts)
embeddings = [item.embedding for item in resp.data]
n_light = len(_LIGHT_UTTERANCES)
n_medium = len(_MEDIUM_UTTERANCES)
self._light_centroid = _centroid(embeddings[:n_light])
self._medium_centroid = _centroid(embeddings[n_light:n_light + n_medium])
self._complex_centroid = _centroid(embeddings[n_light + n_medium:])
print("[router] semantic classifier ready (3-tier)", flush=True)
return
except Exception as e:
print(f"[router] embedding attempt {attempt+1}/10 failed: {e}", flush=True)
await asyncio.sleep(3)
print("[router] WARNING: could not initialize semantic classifier — will default to medium", flush=True)
async def _classify_by_embedding(self, message: str) -> str:
"""Embed message and return 'light', 'medium', or 'complex' based on centroid similarity."""
if self._light_centroid is None or self._medium_centroid is None or self._complex_centroid is None:
return "medium"
try:
resp = await self._embedder.embeddings.create(model=_EMBED_MODEL, input=[message])
emb = resp.data[0].embedding
score_light = _cosine(emb, self._light_centroid)
score_medium = _cosine(emb, self._medium_centroid)
score_complex = _cosine(emb, self._complex_centroid)
tier = max(
[("light", score_light), ("medium", score_medium), ("complex", score_complex)],
key=lambda x: x[1],
)[0]
print(
f"[router] semantic: light={score_light:.3f} medium={score_medium:.3f} "
f"complex={score_complex:.3f} → {tier}",
flush=True,
)
return tier
except Exception as e:
print(f"[router] embedding classify error, defaulting to medium: {e}", flush=True)
return "medium"
async def route(
self,
message: str,
history: list[dict],
force_complex: bool = False,
no_inference: bool = False,
) -> tuple[str, Optional[str]]:
"""
Returns (tier, reply_or_None).
For light tier: also generates the reply with a second call.
For light tier: also generates the reply inline (unless no_inference=True).
For medium/complex: reply is None.
"""
if force_complex:
return "complex", None
# Step 0: regex pre-classification for obvious light patterns
if _LIGHT_PATTERNS.match(message.strip()):
print(f"[router] regex→light", flush=True)
return await self._generate_light_reply(message, history)
# Step 1: LLM classification with raw text output
try:
classify_response = await self.model.ainvoke([
HumanMessage(content=CLASSIFY_PROMPT.format(message=message)),
])
raw = classify_response.content or ""
raw = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
tier = _parse_tier(raw)
if tier == "complex" and not message.startswith("/think"):
tier = "medium"
print(f"[router] raw={raw[:30]!r} → tier={tier}", flush=True)
except Exception as e:
print(f"[router] classify error, defaulting to medium: {e}", flush=True)
if self._fast_tool_runner and self._fast_tool_runner.any_matches(message.strip()):
names = self._fast_tool_runner.matching_names(message.strip())
print(f"[router] fast_tool_match={names} → medium", flush=True)
return "medium", None
if tier != "light":
if _LIGHT_PATTERNS.match(message.strip()):
print("[router] regex→light", flush=True)
if no_inference:
return "light", None
return await self._generate_light_reply(message, history)
if _COMPLEX_PATTERNS.search(message.strip()):
print("[router] regex→complex", flush=True)
return "complex", None
if _MEDIUM_PATTERNS.search(message.strip()):
print("[router] regex→medium", flush=True)
return "medium", None
tier = await self._classify_by_embedding(message)
if tier != "light" or no_inference:
return tier, None
return await self._generate_light_reply(message, history)
@@ -120,7 +462,7 @@ class Router:
async def _generate_light_reply(
self, message: str, history: list[dict]
) -> tuple[str, Optional[str]]:
"""Generate a short reply using the router model for light-tier messages."""
"""Generate a short reply using qwen2.5:1.5b for light-tier messages."""
history_text = _format_history(history)
context = f"\nConversation history:\n{history_text}" if history else ""
try:

File diff suppressed because it is too large

tests/__init__.py Normal file

tests/integration/common.py Normal file

@@ -0,0 +1,259 @@
"""
Shared config, helpers, and utilities for Adolf integration tests.
"""
import http.client
import json
import re
import subprocess
import time
import urllib.request
# ── config ────────────────────────────────────────────────────────────────────
DEEPAGENTS = "http://localhost:8000"
LITELLM = "http://localhost:4000"
OPENMEMORY = "http://localhost:8765"
GRAMMY_HOST = "localhost"
GRAMMY_PORT = 3001
OLLAMA_GPU = "http://localhost:11436"
OLLAMA_CPU = "http://localhost:11435"
QDRANT = "http://localhost:6333"
SEARXNG = "http://localhost:11437"
COMPOSE_FILE = "/home/alvis/adolf/docker-compose.yml"
DEFAULT_CHAT_ID = "346967270"
NAMES = [
"Maximilian", "Cornelius", "Zephyr", "Archibald", "Balthazar",
"Ignatius", "Lysander", "Octavian", "Reginald", "Sylvester",
]
BENCHMARK = {
"easy": [
"hi",
"what is 2+2?",
"what is the capital of France?",
"tell me a short joke",
"how are you doing today?",
"thanks!",
"what day comes after Wednesday?",
"name the three primary colors",
"is the sky blue?",
"what does CPU stand for?",
],
"medium": [
"what is the current weather in Berlin?",
"find the latest news about artificial intelligence",
"what is the current price of Bitcoin?",
"search for a good pasta carbonara recipe",
"what movies are in theaters this week?",
"find Python tutorials for beginners",
"who won the last FIFA World Cup?",
"do you remember what we talked about before?",
"search for the best coffee shops in Tokyo",
"what is happening in the tech industry this week?",
"what's the weather like today?",
],
"hard": [
"/think compare the top 3 Python web frameworks (Django, FastAPI, Flask) and recommend one for a production REST API",
"/think research the history of artificial intelligence and create a timeline of key milestones",
"/think plan a 7-day trip to Japan with daily itinerary, accommodation suggestions, and budget breakdown",
"/think analyze microservices vs monolithic architecture: pros, cons, and when to choose each",
"/think write a Python script that reads a CSV file, cleans the data, and generates summary statistics",
"/think research quantum computing: explain the key concepts and how it differs from classical computing",
"/think compare PostgreSQL, MongoDB, and Redis — when to use each and what are the trade-offs?",
"/think create a comprehensive Docker deployment guide covering best practices for production",
"/think research climate change: summarize the latest IPCC findings and key data points",
"/think design a REST API with authentication, rate limiting, and proper error handling — provide architecture and code outline",
],
}
# ── terminal colours ──────────────────────────────────────────────────────────
PASS = "\033[32mPASS\033[0m"
FAIL = "\033[31mFAIL\033[0m"
INFO = "\033[36mINFO\033[0m"
WARN = "\033[33mWARN\033[0m"
# ── result helpers ────────────────────────────────────────────────────────────
def report(results: list, name: str, ok: bool, detail: str = ""):
tag = PASS if ok else FAIL
print(f" [{tag}] {name}" + (f"{detail}" if detail else ""))
results.append((name, ok))
def print_summary(results: list):
print(f"\n{''*55}")
total = len(results)
passed = sum(1 for _, ok in results if ok)
failed = total - passed
print(f"Results: {passed}/{total} passed", end="")
if failed:
print(f" ({failed} failed)\n")
print("Failed checks:")
for name, ok in results:
if not ok:
print(f" - {name}")
else:
print(" — all good")
print()
def tf(v):
"""Format timing value."""
return f"{v:6.2f}s" if v is not None else " n/a"
# ── HTTP helpers ──────────────────────────────────────────────────────────────
def get(url, timeout=5):
with urllib.request.urlopen(urllib.request.Request(url), timeout=timeout) as r:
return r.status, r.read().decode()
def post_json(url, payload, timeout=10):
data = json.dumps(payload).encode()
req = urllib.request.Request(
url, data=data,
headers={"Content-Type": "application/json"},
method="POST",
)
with urllib.request.urlopen(req, timeout=timeout) as r:
return r.status, json.loads(r.read().decode())
def check_sse(host, port, path):
try:
conn = http.client.HTTPConnection(host, port, timeout=5)
conn.request("GET", path, headers={"Accept": "text/event-stream"})
r = conn.getresponse()
conn.close()
return r.status == 200, f"HTTP {r.status}"
except Exception as e:
return False, str(e)
def qdrant_count():
try:
_, body = get(f"{QDRANT}/collections/adolf_memories")
return json.loads(body).get("result", {}).get("points_count", 0)
except Exception:
return 0
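# Assumed response shape for GET /collections/<name> (a sketch; nesting as the
# Qdrant REST API is used elsewhere in this file):
#   {"status": "ok", "result": {"points_count": 42, "config": {...}}}
# qdrant_count() swallows errors and returns 0 so callers can diff point
# counts without guarding.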
# ── log helpers ───────────────────────────────────────────────────────────────
def fetch_logs(since_s=600):
"""Return deepagents log lines from the last since_s seconds."""
try:
r = subprocess.run(
["docker", "compose", "-f", COMPOSE_FILE, "logs", "deepagents",
f"--since={int(since_s)}s", "--no-log-prefix"],
capture_output=True, text=True, timeout=15,
)
return r.stdout.splitlines()
except Exception:
return []
def parse_run_block(lines, msg_prefix):
"""
Scan log lines for the LAST '[agent] running: <msg_prefix>' block.
Extracts reply timing, tier, and memory timing from that block.
Returns dict or None if the reply has not appeared in logs yet.
Dict keys:
reply_total, tier, reply_text — from "[agent] replied in ..."
llm, send — reserved for per-stage timings; currently always None
memory_s — from "[memory] stored in ..."
memory_error — True if "[memory] error" found
"""
search = msg_prefix[:50]
start_idx = None
for i, line in enumerate(lines):
if "[agent] running:" in line and search in line:
start_idx = i # keep updating — we want the LAST occurrence
if start_idx is None:
return None
block = lines[start_idx:]
last_ai_text = None
reply_data = None
for j, line in enumerate(block):
if "AIMessage:" in line and "" not in line:
txt = line.split("AIMessage:", 1)[-1].strip()
if txt:
last_ai_text = txt
m = re.search(r"replied in ([\d.]+)s(?:\s+tier=(\w+))?", line)
if m:
tier = m.group(2) if m.group(2) else "unknown"
reply_data = {
"reply_total": float(m.group(1)),
"llm": None,
"send": None,
"tier": tier,
"reply_text": last_ai_text,
"memory_s": None,
"memory_error": False,
"_j": j,
}
break
if reply_data is not None:
next_lines = block[reply_data["_j"] + 1: reply_data["_j"] + 3]
for line in next_lines:
if line.startswith("[agent] reply_text:"):
reply_data["reply_text"] = line[len("[agent] reply_text:"):].strip()
break
if reply_data is None:
return None
for line in block[reply_data["_j"] + 1:]:
mm = re.search(r"\[memory\] stored in ([\d.]+)s", line)
if mm:
reply_data["memory_s"] = float(mm.group(1))
break
if "[memory] error" in line:
reply_data["memory_error"] = True
break
return reply_data
def wait_for(label, msg_prefix, timeout_s=200, need_memory=True):
"""
Poll deepagents logs until the message is fully processed.
Shows a live progress line. Returns timing dict or None on timeout.
"""
t_start = time.monotonic()
deadline = t_start + timeout_s
tick = 0
last_result = None
while time.monotonic() < deadline:
since = int(time.monotonic() - t_start) + 90
lines = fetch_logs(since_s=since)
result = parse_run_block(lines, msg_prefix)
if result:
last_result = result
has_mem = result["memory_s"] is not None or result["memory_error"]
if (not need_memory) or has_mem:
elapsed = time.monotonic() - t_start
print(f"\r [{label}] done after {elapsed:.0f}s{' ' * 30}")
return result
time.sleep(4)
tick += 1
rem = int(deadline - time.monotonic())
if last_result:
phase = "waiting for memory..." if need_memory else "done"
else:
phase = "waiting for LLM reply..."
print(f"\r [{label}] {tick*4}s elapsed, {rem}s left — {phase} ", end="", flush=True)
print(f"\r [{label}] TIMEOUT after {timeout_s}s{' ' * 30}")
return None
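# Minimal self-check for parse_run_block on a hypothetical log excerpt.
# The line shapes below are assumptions inferred from the regexes above;
# this runs only when the module is executed directly, never on import.
if __name__ == "__main__":
    _demo = [
        "[agent] running: what is 2+2?",
        "[agent] replied in 1.42s tier=light",
        "[memory] stored in 0.31s",
    ]
    _r = parse_run_block(_demo, "what is 2+2?")
    assert _r is not None and _r["tier"] == "light"
    assert _r["reply_total"] == 1.42 and _r["memory_s"] == 0.31
    print("parse_run_block self-check OK")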


@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
Adolf service health integration tests.
Checks:
1. deepagents /health — agent_ready
1b. openmemory /sse reachable
1c. grammy /sse reachable
2. Bifrost /health, /v1/models, direct inference, deepagents startup log
3. GPU Ollama — reachable, qwen3:8b present
4. CPU Ollama — reachable, nomic-embed-text present
5. Qdrant — reachable, adolf_memories collection, vector dims=768
6. SearXNG — reachable, JSON results, latency < 5s
Usage:
python3 test_health.py
"""
import json
import sys
import time
import urllib.request
from common import (
DEEPAGENTS, BIFROST, GRAMMY_HOST, GRAMMY_PORT,
OLLAMA_GPU, OLLAMA_CPU, QDRANT, SEARXNG, COMPOSE_FILE,
INFO, FAIL,
report, print_summary, tf,
get, post_json, check_sse, fetch_logs,
)
results = []
timings = {}
# ── 1. Service health ─────────────────────────────────────────────────────────
print(f"\n[{INFO}] 1. Service health")
t0 = time.monotonic()
try:
status, body = get(f"{DEEPAGENTS}/health")
data = json.loads(body)
ok = status == 200 and data.get("agent_ready") is True
report(results, "deepagents /health — agent_ready", ok,
f"agent_ready={data.get('agent_ready')}")
except Exception as e:
report(results, "deepagents /health", False, str(e))
ok, detail = check_sse("localhost", 8765, "/sse")
report(results, "openmemory /sse reachable", ok, detail)
ok, detail = check_sse(GRAMMY_HOST, GRAMMY_PORT, "/sse")
report(results, "grammy /sse reachable", ok, detail)
timings["health_check"] = time.monotonic() - t0
# ── 2. Bifrost gateway ────────────────────────────────────────────────────────
print(f"\n[{INFO}] 2. Bifrost gateway (port 8080)")
t0 = time.monotonic()
try:
status, body = get(f"{BIFROST}/health", timeout=5)
report(results, "Bifrost /health reachable", status == 200, f"HTTP {status}")
except Exception as e:
report(results, "Bifrost /health reachable", False, str(e))
try:
status, body = get(f"{BIFROST}/v1/models", timeout=5)
data = json.loads(body)
model_ids = [m.get("id", "") for m in data.get("data", [])]
gpu_models = [m for m in model_ids if m.startswith("ollama/")]
report(results, "Bifrost lists ollama GPU models", len(gpu_models) > 0,
f"found: {gpu_models}")
for expected in ["ollama/qwen3:4b", "ollama/qwen3:8b", "ollama/qwen2.5:1.5b"]:
report(results, f" model {expected} listed", expected in model_ids)
except Exception as e:
report(results, "Bifrost /v1/models", False, str(e))
print(f" [bifrost-infer] POST /v1/chat/completions → ollama/qwen2.5:0.5b ...")
t_infer = time.monotonic()
try:
infer_payload = {
"model": "ollama/qwen2.5:0.5b",
"messages": [{"role": "user", "content": "Reply with exactly one word: pong"}],
"max_tokens": 16,
}
data = json.dumps(infer_payload).encode()
req = urllib.request.Request(
f"{BIFROST}/v1/chat/completions",
data=data,
headers={"Content-Type": "application/json"},
method="POST",
)
with urllib.request.urlopen(req, timeout=60) as r:
infer_status = r.status
infer_body = json.loads(r.read().decode())
infer_elapsed = time.monotonic() - t_infer
reply_content = infer_body.get("choices", [{}])[0].get("message", {}).get("content", "")
used_model = infer_body.get("model", "")
report(results, "Bifrost → Ollama GPU inference succeeds",
infer_status == 200 and bool(reply_content),
f"{infer_elapsed:.1f}s model={used_model!r} reply={reply_content[:60]!r}")
timings["bifrost_direct_infer"] = infer_elapsed
except Exception as e:
report(results, "Bifrost → Ollama GPU inference succeeds", False, str(e))
timings["bifrost_direct_infer"] = None
try:
import subprocess
r = subprocess.run(
["docker", "compose", "-f", COMPOSE_FILE, "logs", "deepagents",
"--since=3600s", "--no-log-prefix"],
capture_output=True, text=True, timeout=10,
)
log_lines = r.stdout.splitlines()
bifrost_line = next(
(l for l in log_lines if "[agent] bifrost=" in l and "bifrost:8080" in l),
None,
)
report(results, "deepagents startup log confirms bifrost URL",
bifrost_line is not None,
bifrost_line.strip() if bifrost_line else "line not found in logs")
if bifrost_line:
has_prefix = "router=ollama/" in bifrost_line and "medium=ollama/" in bifrost_line
report(results, "deepagents model names use ollama/ prefix", has_prefix,
bifrost_line.strip())
except Exception as e:
report(results, "deepagents startup log check", False, str(e))
timings["bifrost_check"] = time.monotonic() - t0
# ── 3. GPU Ollama ─────────────────────────────────────────────────────────────
print(f"\n[{INFO}] 3. GPU Ollama (port 11436)")
t0 = time.monotonic()
try:
status, body = get(f"{OLLAMA_GPU}/api/tags")
models = [m["name"] for m in json.loads(body).get("models", [])]
has_qwen = any("qwen3" in m for m in models)
report(results, "GPU Ollama reachable", True, f"models: {models}")
report(results, "qwen3:8b present", has_qwen)
except Exception as e:
report(results, "GPU Ollama reachable", False, str(e))
report(results, "qwen3:8b present", False, "skipped")
timings["gpu_ollama_ping"] = time.monotonic() - t0
# ── 4. CPU Ollama ─────────────────────────────────────────────────────────────
print(f"\n[{INFO}] 4. CPU Ollama (port 11435)")
t0 = time.monotonic()
try:
status, body = get(f"{OLLAMA_CPU}/api/tags")
models = [m["name"] for m in json.loads(body).get("models", [])]
has_embed = any("nomic-embed-text" in m for m in models)
report(results, "CPU Ollama reachable", True, f"models: {models}")
report(results, "nomic-embed-text present", has_embed)
except Exception as e:
report(results, "CPU Ollama reachable", False, str(e))
report(results, "nomic-embed-text present", False, "skipped")
timings["cpu_ollama_ping"] = time.monotonic() - t0
# ── 5. Qdrant ─────────────────────────────────────────────────────────────────
print(f"\n[{INFO}] 5. Qdrant (port 6333)")
t0 = time.monotonic()
try:
status, body = get(f"{QDRANT}/collections")
cols = [c["name"] for c in json.loads(body).get("result", {}).get("collections", [])]
report(results, "Qdrant reachable", True, f"collections: {cols}")
report(results, "adolf_memories collection exists", "adolf_memories" in cols)
except Exception as e:
report(results, "Qdrant reachable", False, str(e))
report(results, "adolf_memories collection exists", False, "skipped")
try:
status, body = get(f"{QDRANT}/collections/adolf_memories")
info = json.loads(body).get("result", {})
dims = info.get("config", {}).get("params", {}).get("vectors", {}).get("size")
report(results, "vector dims = 768", dims == 768, f"got {dims}")
except Exception as e:
report(results, "adolf_memories collection info", False, str(e))
timings["qdrant_ping"] = time.monotonic() - t0
# ── 6. SearXNG ────────────────────────────────────────────────────────────────
print(f"\n[{INFO}] 6. SearXNG (port 11437)")
t0 = time.monotonic()
try:
status, body = get(f"{SEARXNG}/search?q=test&format=json", timeout=15)
elapsed = time.monotonic() - t0
n = len(json.loads(body).get("results", []))
report(results, "SearXNG reachable + JSON results", status == 200 and n > 0,
f"{n} results in {elapsed:.1f}s")
report(results, "SearXNG response < 5s", elapsed < 5, f"{elapsed:.2f}s")
timings["searxng_latency"] = elapsed
except Exception as e:
report(results, "SearXNG reachable", False, str(e))
report(results, "SearXNG response < 5s", False, "skipped")
timings["searxng_latency"] = None
timings["searxng_check"] = time.monotonic() - t0
# ── summary ───────────────────────────────────────────────────────────────────
print_summary(results)
sys.exit(0 if all(ok for _, ok in results) else 1)


@@ -0,0 +1,437 @@
#!/usr/bin/env python3
"""
Adolf memory integration tests.
Tests:
1. Name store — POST "remember that your name is <RandomName>"
2. Qdrant point — verifies a new vector was written after store
3. Name recall — POST "what is your name?" → reply must contain <RandomName>
4. LiteLLM — verifies LiteLLM proxy is reachable (replaced Bifrost)
5. Timing profile — breakdown of store and recall latencies
6. Memory benchmark — store 5 personal facts, recall with 10 questions
7. Dedup test — same fact stored twice must not grow Qdrant by 2 points
Usage:
python3 test_memory.py [--chat-id CHAT_ID] [--name-only] [--bench-only] [--dedup-only]
"""
import argparse
import json
import random
import subprocess
import sys
import time
import urllib.request
from common import (
DEEPAGENTS, LITELLM, QDRANT, COMPOSE_FILE, DEFAULT_CHAT_ID,
NAMES,
INFO, PASS, FAIL, WARN,
report, print_summary, tf,
get, post_json, qdrant_count, fetch_logs,
parse_run_block, wait_for,
)
# ── args ──────────────────────────────────────────────────────────────────────
parser = argparse.ArgumentParser(description="Adolf memory integration tests")
parser.add_argument("--chat-id", default=DEFAULT_CHAT_ID)
parser.add_argument("--name-only", action="store_true", help="Run only the name store/recall test")
parser.add_argument("--bench-only", action="store_true", help="Run only the memory benchmark")
parser.add_argument("--dedup-only", action="store_true", help="Run only the deduplication test")
args = parser.parse_args()
CHAT_ID = args.chat_id
_only = args.name_only or args.bench_only or args.dedup_only
_run_name = not _only or args.name_only
_run_bench = not _only or args.bench_only
_run_dedup = not _only or args.dedup_only
results = []
timings = {}
random_name = random.choice(NAMES)
TEST_CHAT_ID = f"{CHAT_ID}-{random_name.lower()}"
if _run_name:
print(f"\n Test name : \033[1m{random_name}\033[0m")
print(f" Chat ID : {TEST_CHAT_ID}")
# ── 1. Name store / recall pipeline ─────────────────────────────────────────
if _run_name:
print(f"\n[{INFO}] 1. Name store / recall pipeline")
store_msg = f"remember that your name is {random_name}"
recall_msg = "what is your name?"
# Clear memories so each run starts clean
try:
post_json(f"{QDRANT}/collections/adolf_memories/points/delete",
{"filter": {}}, timeout=5)
except Exception:
pass
pts_before = qdrant_count()
print(f" Qdrant points before: {pts_before}")
# ── 1. Store ──────────────────────────────────────────────────────────────
print(f"\n [store] '{store_msg}'")
t_store = time.monotonic()
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": store_msg, "chat_id": TEST_CHAT_ID}, timeout=5)
t_accept = time.monotonic() - t_store
report(results, "POST /chat (store) returns 202 immediately",
status == 202 and t_accept < 1, f"status={status}, t={t_accept:.3f}s")
timings["store_http_accept"] = t_accept
except Exception as e:
report(results, "POST /chat (store)", False, str(e))
print_summary(results)
sys.exit(1)
store = wait_for("store", store_msg, timeout_s=220, need_memory=True)
if store:
timings.update({
"store_llm": store["llm"],
"store_send": store["send"],
"store_reply": store["reply_total"],
"store_memory": store["memory_s"],
})
report(results, "Agent replied to store message", True,
f"{store['reply_total']:.1f}s total llm={store['llm']:.1f}s "
f"send={store['send']:.1f}s tier={store['tier']}")
if store["memory_s"] is not None:
report(results, "Memory stored without error", True, f"{store['memory_s']:.1f}s")
elif store["memory_error"]:
report(results, "Memory stored without error", False, "error in [memory] log")
else:
report(results, "Memory stored without error", False, "not found in logs")
print(f" Store reply: {store['reply_text']!r}")
else:
report(results, "Agent replied to store message", False, "timeout")
report(results, "Memory stored without error", False, "timeout")
print_summary(results)
sys.exit(1)
# ── 2. Qdrant point check ─────────────────────────────────────────────────
pts_after = qdrant_count()
new_pts = pts_after - pts_before
report(results, "New memory point(s) added to Qdrant", new_pts > 0,
f"{pts_before}{pts_after} (+{new_pts})")
timings["qdrant_new_points"] = new_pts
# ── 3. Recall ─────────────────────────────────────────────────────────────
print(f"\n [recall] '{recall_msg}'")
t_recall = time.monotonic()
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": recall_msg, "chat_id": TEST_CHAT_ID}, timeout=5)
t_accept2 = time.monotonic() - t_recall
report(results, "POST /chat (recall) returns 202 immediately",
status == 202 and t_accept2 < 1, f"status={status}, t={t_accept2:.3f}s")
timings["recall_http_accept"] = t_accept2
except Exception as e:
report(results, "POST /chat (recall)", False, str(e))
recall = wait_for("recall", recall_msg, timeout_s=160, need_memory=False)
if recall:
timings.update({
"recall_llm": recall["llm"],
"recall_send": recall["send"],
"recall_reply": recall["reply_total"],
})
report(results, "Agent replied to recall message", True,
f"{recall['reply_total']:.1f}s total llm={recall['llm']:.1f}s "
f"send={recall['send']:.1f}s tier={recall['tier']}")
reply_text = recall["reply_text"] or ""
name_in_reply = random_name.lower() in reply_text.lower()
report(results, f"Reply contains '{random_name}'", name_in_reply,
f"reply: {reply_text[:120]!r}")
else:
report(results, "Agent replied to recall message", False, "timeout")
report(results, f"Reply contains '{random_name}'", False, "no reply")
# ── 4. LiteLLM proxy reachable (replaced Bifrost) ─────────────────────────
try:
status, _ = get(f"{LITELLM}/health", timeout=5)
litellm_ok = status == 200
except Exception:
litellm_ok = False
report(results, "LiteLLM proxy reachable", litellm_ok)
# ── 5. Timing profile ─────────────────────────────────────────────────────
print(f"\n[{INFO}] 5. Timing profile")
W = 36
print(f"\n {'Stage':<{W}} {'Time':>8}")
print(f" {''*W} {''*8}")
for label, key in [
("[GPU] HTTP accept — store turn", "store_http_accept"),
("[GPU] qwen3:Xb inference — store turn", "store_llm"),
("[GPU] Telegram send — store turn", "store_send"),
("[GPU] Total reply latency — store", "store_reply"),
("[GPU] qwen2.5:1.5b+embed — async mem", "store_memory"),
]:
print(f" {label:<{W}} {tf(timings.get(key)):>8}")
print(f" {''*W} {''*8}")
for label, key in [
("[GPU] HTTP accept — recall turn", "recall_http_accept"),
("[GPU] qwen3:Xb inference — recall", "recall_llm"),
("[GPU] Telegram send — recall turn", "recall_send"),
("[GPU] Total reply latency — recall", "recall_reply"),
]:
print(f" {label:<{W}} {tf(timings.get(key)):>8}")
print(f"\n Bottleneck analysis (each █ ≈ 5s):")
print(f" {''*(W+12)}")
candidates = [
("[GPU] qwen3:Xb — store reply ", timings.get("store_llm") or 0),
("[GPU] qwen3:Xb — recall reply", timings.get("recall_llm") or 0),
("[GPU] qwen2.5:1.5b+embed (async)", timings.get("store_memory") or 0),
]
candidates.sort(key=lambda x: x[1], reverse=True)
for label, t in candidates:
bar = "" * min(int(t / 5), 24)
total_pipeline = (timings.get("store_reply") or 0) + (timings.get("store_memory") or 0)
pct = f" {t/total_pipeline*100:4.0f}%" if total_pipeline > 0 else ""
print(f" {label} {t:6.1f}s {bar}{pct}")
print()
# ── 6. Memory benchmark ───────────────────────────────────────────────────────
if _run_bench:
_mem_name = random.choice(["Alice", "Bruno", "Camille", "Diego", "Elena",
"Farid", "Greta", "Hiroshi", "Irina", "Jonas"])
_mem_city = random.choice(["Tokyo", "Berlin", "Cairo", "Sydney", "Oslo",
"Nairobi", "Lisbon", "Seoul", "Montreal", "Bangkok"])
_mem_allergy = random.choice(["nuts", "gluten", "dairy", "shellfish", "eggs"])
_mem_job = random.choice([
("software engineer", "startup"),
("data scientist", "research lab"),
("product manager", "tech company"),
("DevOps engineer", "cloud provider"),
])
_mem_lang = random.choice(["Python", "Rust", "Go", "TypeScript", "Kotlin"])
_mem_pet_name = random.choice(["Whiskers", "Biscuit", "Mango", "Pebble", "Shadow",
"Noodle", "Cheddar", "Cosmo", "Pippin", "Ziggy"])
print(f"\n[{INFO}] 6. Memory benchmark")
print(f" name={_mem_name} city={_mem_city} allergy={_mem_allergy} "
f"job={_mem_job[0]}@{_mem_job[1]} lang={_mem_lang} pet={_mem_pet_name}")
print(f" Storing 5 facts, then querying with 10 recall questions")
print(f" Chat ID: {CHAT_ID}")
print()
# Wipe collection and restart openmemory for a clean slate
try:
req = urllib.request.Request(f"{QDRANT}/collections/adolf_memories", method="DELETE")
with urllib.request.urlopen(req, timeout=5):
pass
print(f" [{INFO}] Wiped adolf_memories collection")
except Exception as e:
print(f" [{WARN}] Could not wipe collection: {e}")
try:
subprocess.run(
["docker", "compose", "-f", COMPOSE_FILE, "restart", "openmemory"],
capture_output=True, timeout=30,
)
time.sleep(6)
print(f" [{INFO}] Restarted openmemory — fresh collection ready")
except Exception as e:
print(f" [{WARN}] Could not restart openmemory: {e}")
MEMORY_FACTS = [
f"My name is {_mem_name} and I live in {_mem_city}",
f"I prefer vegetarian food and I'm allergic to {_mem_allergy}",
f"I work as a {_mem_job[0]} at a {_mem_job[1]}",
f"My favorite programming language is {_mem_lang}",
f"I have a cat named {_mem_pet_name}",
]
MEMORY_RECALLS = [
("What is my name?", [_mem_name.lower()]),
("Where do I live?", [_mem_city.lower()]),
("Do I have any food allergies?", [_mem_allergy.lower()]),
("What is my job?", [_mem_job[0].split()[0].lower()]),
("What programming language do I prefer?", [_mem_lang.lower()]),
("Do I have any pets?", [_mem_pet_name.lower()]),
("Am I vegetarian or do I eat meat?", ["vegetarian"]),
("What city am I in?", [_mem_city.lower()]),
("Tell me what you know about me", [_mem_name.lower(), _mem_city.lower()]),
("What's the name of my pet?", [_mem_pet_name.lower()]),
]
STORE_TIMEOUT = 180
RECALL_TIMEOUT = 180
print(f" Storing {len(MEMORY_FACTS)} facts...")
store_ok = 0
for i, fact in enumerate(MEMORY_FACTS, 1):
print(f" [mem-store-{i:02d}] {fact!r}")
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": fact, "chat_id": CHAT_ID}, timeout=5)
if status != 202:
print(f" → [{FAIL}] POST returned {status}")
continue
except Exception as e:
print(f" → [{FAIL}] POST error: {e}")
continue
found = wait_for(f"mem-store-{i:02d}", fact, timeout_s=STORE_TIMEOUT, need_memory=True)
if found:
store_ok += 1
print(f" → [{PASS}] stored tier={found['tier']} mem={found['memory_s']}s")
else:
print(f" → [{FAIL}] timeout")
report(results, f"All memory facts stored ({store_ok}/{len(MEMORY_FACTS)})",
store_ok == len(MEMORY_FACTS))
# Wait for async extraction to settle
print(f"\n Waiting for memory extraction to settle (up to 60s)...")
_prev_count = -1
_stable_ticks = 0
_cur_count = 0
for _ in range(30):
time.sleep(2)
try:
_, body = get(f"{QDRANT}/collections/adolf_memories")
_cur_count = json.loads(body).get("result", {}).get("points_count", 0)
except Exception:
_cur_count = _prev_count
if _cur_count == _prev_count:
_stable_ticks += 1
if _stable_ticks >= 3:
break
else:
_stable_ticks = 0
_prev_count = _cur_count
print(f" Memory settled: {_cur_count} points in Qdrant")
print(f"\n Querying with {len(MEMORY_RECALLS)} recall questions...")
recall_results = []
for i, (question, keywords) in enumerate(MEMORY_RECALLS, 1):
print(f" [mem-recall-{i:02d}] {question!r}")
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": question, "chat_id": CHAT_ID}, timeout=5)
if status != 202:
print(f" → [{FAIL}] POST returned {status}")
recall_results.append((question, keywords, None, False))
continue
except Exception as e:
print(f" → [{FAIL}] POST error: {e}")
recall_results.append((question, keywords, None, False))
continue
t_start = time.monotonic()
found = None
while time.monotonic() - t_start < RECALL_TIMEOUT:
since = int(time.monotonic() - t_start) + 30
lines = fetch_logs(since_s=since)
found = parse_run_block(lines, question)
if found:
break
time.sleep(2)
if not found:
print(f" → [{FAIL}] timeout")
recall_results.append((question, keywords, None, False))
continue
reply_text = (found.get("reply_text") or "").lower()
hit_keywords = [kw for kw in keywords if kw.lower() in reply_text]
passed = len(hit_keywords) == len(keywords)
tag_str = PASS if passed else WARN
missing = [kw for kw in keywords if kw.lower() not in reply_text]
detail = f"tier={found['tier']} lat={found['reply_total']:.1f}s"
if missing:
detail += f" missing keywords: {missing}"
print(f" → [{tag_str}] {detail}")
recall_results.append((question, keywords, found.get("reply_text"), passed))
time.sleep(1)
print(f"\n {'#':<4} {'Pass':<5} {'Question':<45} {'Keywords'}")
print(f" {''*4} {''*5} {''*45} {''*30}")
for idx, (q, kws, reply, ok) in enumerate(recall_results, 1):
ok_str = "" if ok else ""
print(f" {ok_str} {idx:<3} {'yes' if ok else 'no':<5} {q[:45]:<45} {kws}")
recall_pass = sum(1 for _, _, _, ok in recall_results if ok)
total_recall = len(recall_results)
print(f"\n Memory recall score: {recall_pass}/{total_recall}")
report(results, f"Memory recall ({recall_pass}/{total_recall} keywords found)",
recall_pass == total_recall,
f"{recall_pass}/{total_recall} questions had all expected keywords in reply")
# ── 7. Deduplication test ─────────────────────────────────────────────────────
if _run_dedup:
print(f"\n[{INFO}] 7. Memory deduplication test")
print(f" Sends the same fact twice — Qdrant point count must not increase by 2")
print(f" Chat ID: {CHAT_ID}")
print()
DEDUP_TIMEOUT = 120
_dedup_fact = f"My lucky number is {random.randint(1000, 9999)}"
print(f" Fact: {_dedup_fact!r}")
pts_before = qdrant_count()
print(f" Qdrant points before: {pts_before}")
print(f" [dedup-1] sending fact (first time)")
found1 = None
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": _dedup_fact, "chat_id": CHAT_ID}, timeout=5)
if status != 202:
report(results, "Dedup: first POST accepted", False, f"status={status}")
else:
found1 = wait_for("dedup-1", _dedup_fact, timeout_s=DEDUP_TIMEOUT, need_memory=True)
if found1:
print(f" [dedup-1] stored tier={found1['tier']} mem={found1['memory_s']}s")
else:
print(f" [dedup-1] timeout")
except Exception as e:
report(results, "Dedup: first POST accepted", False, str(e))
pts_after_first = qdrant_count()
new_first = pts_after_first - pts_before
print(f" Qdrant after first send: {pts_before}{pts_after_first} (+{new_first})")
print(f" [dedup-2] sending same fact (second time)")
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": _dedup_fact, "chat_id": CHAT_ID}, timeout=5)
if status != 202:
report(results, "Dedup: second POST accepted", False, f"status={status}")
else:
found2 = wait_for("dedup-2", _dedup_fact, timeout_s=DEDUP_TIMEOUT, need_memory=True)
if found2:
print(f" [dedup-2] stored tier={found2['tier']} mem={found2['memory_s']}s")
else:
print(f" [dedup-2] timeout")
except Exception as e:
report(results, "Dedup: second POST accepted", False, str(e))
pts_after_second = qdrant_count()
new_second = pts_after_second - pts_after_first
print(f" Qdrant after second send: {pts_after_first}{pts_after_second} (+{new_second})")
dedup_ok = new_second <= new_first
report(results, "Deduplication: second identical fact not added to Qdrant", dedup_ok,
f"first send: +{new_first} pts, second send: +{new_second} pts (want second ≤ first)")
# ── summary ───────────────────────────────────────────────────────────────────
print_summary(results)
sys.exit(0 if all(ok for _, ok in results) else 1)


@@ -0,0 +1,317 @@
#!/usr/bin/env python3
"""
Adolf tier routing benchmark.
Tests:
easy — 10 questions that must route to 'light' tier
medium — 11 questions that must route to 'medium' (light acceptable for some; complex = fail)
hard — 10 /think questions that must route to 'complex' (medium fallback acceptable)
Usage:
python3 test_routing.py [--chat-id CHAT_ID]
[--easy-only] # only easy benchmark
[--medium-only] # only medium benchmark
[--hard-only] # only hard benchmark
"""
import argparse
import sys
import time
from common import (
DEEPAGENTS, COMPOSE_FILE, DEFAULT_CHAT_ID,
BENCHMARK,
INFO, PASS, FAIL, WARN,
report, print_summary,
post_json, fetch_logs,
parse_run_block,
)
# ── args ──────────────────────────────────────────────────────────────────────
parser = argparse.ArgumentParser(description="Adolf routing benchmark")
parser.add_argument("--chat-id", default=DEFAULT_CHAT_ID)
parser.add_argument("--easy-only", action="store_true")
parser.add_argument("--medium-only", action="store_true")
parser.add_argument("--hard-only", action="store_true")
args = parser.parse_args()
CHAT_ID = args.chat_id
_only = args.easy_only or args.medium_only or args.hard_only
_run_easy = not _only or args.easy_only
_run_medium = not _only or args.medium_only
_run_hard = not _only or args.hard_only
results = []
# ── easy benchmark ────────────────────────────────────────────────────────────
if _run_easy:
print(f"\n[{INFO}] Easy routing benchmark")
print(f" {len(BENCHMARK['easy'])} questions — all must route to 'light'")
print(f" Chat ID: {CHAT_ID}")
print()
bench_results = []
LIGHT_TIMEOUT = 60
for i, question in enumerate(BENCHMARK["easy"], 1):
tag = f"easy-{i:02d}"
print(f" [{tag}] {question[:55]!r}")
t_send = time.monotonic()
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": question, "chat_id": CHAT_ID}, timeout=5)
if status != 202:
print(f" → [{FAIL}] POST returned {status}")
bench_results.append((question, "?", None, False))
continue
except Exception as e:
print(f" → [{FAIL}] POST error: {e}")
bench_results.append((question, "?", None, False))
continue
t_start = time.monotonic()
found = None
while time.monotonic() - t_start < LIGHT_TIMEOUT:
since = int(time.monotonic() - t_start) + 30
lines = fetch_logs(since_s=since)
found = parse_run_block(lines, question)
if found:
break
time.sleep(1)
if not found:
print(f" → [{FAIL}] no reply within {LIGHT_TIMEOUT}s")
bench_results.append((question, "timeout", None, False))
continue
tier = found.get("tier", "unknown")
is_light = (tier == "light")
tag_str = PASS if is_light else FAIL
print(f" → [{tag_str}] tier={tier} latency={found['reply_total']:.1f}s llm={found['llm']:.1f}s")
bench_results.append((question, tier, found["reply_total"], is_light))
time.sleep(1)
print(f"\n {'#':<4} {'Tier':<8} {'Latency':>8} {'Question'}")
print(f" {''*4} {''*8} {''*8} {''*50}")
for idx, (q, tier, lat, ok) in enumerate(bench_results, 1):
lat_str = f"{lat:.1f}s" if lat is not None else "timeout"
ok_str = "" if ok else ""
print(f" {ok_str} {idx:<3} {tier:<8} {lat_str:>8} {q[:50]!r}")
light_count = sum(1 for _, _, _, ok in bench_results if ok)
total_bench = len(bench_results)
lats = [lat for _, _, lat, ok in bench_results if ok and lat is not None]
avg_lat = sum(lats) / len(lats) if lats else 0
print(f"\n Light-path score: {light_count}/{total_bench}")
if lats:
print(f" Avg latency (light): {avg_lat:.1f}s min={min(lats):.1f}s max={max(lats):.1f}s")
report(results, f"All easy questions routed to light ({light_count}/{total_bench})",
light_count == total_bench,
f"{light_count}/{total_bench} via light path, avg {avg_lat:.1f}s")
# ── medium benchmark ──────────────────────────────────────────────────────────
if _run_medium:
print(f"\n[{INFO}] Medium routing benchmark")
print(f" {len(BENCHMARK['medium'])} questions — must route to medium (light ok for some; complex = fail)")
print(f" Chat ID: {CHAT_ID}")
print()
LIGHT_ACCEPTABLE = {
"who won the last FIFA World Cup?",
"search for a good pasta carbonara recipe",
"find Python tutorials for beginners",
"search for the best coffee shops in Tokyo",
}
med_results = []
MEDIUM_TIMEOUT = 120
for i, question in enumerate(BENCHMARK["medium"], 1):
tag = f"med-{i:02d}"
print(f" [{tag}] {question[:60]!r}")
t_send = time.monotonic()
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": question, "chat_id": CHAT_ID}, timeout=5)
if status != 202:
print(f" → [{FAIL}] POST returned {status}")
med_results.append((question, "?", None, False))
continue
except Exception as e:
print(f" → [{FAIL}] POST error: {e}")
med_results.append((question, "?", None, False))
continue
t_start = time.monotonic()
found = None
while time.monotonic() - t_start < MEDIUM_TIMEOUT:
since = int(time.monotonic() - t_start) + 60
lines = fetch_logs(since_s=since)
found = parse_run_block(lines, question)
if found:
break
time.sleep(3)
if not found:
print(f" → [{FAIL}] no reply within {MEDIUM_TIMEOUT}s")
med_results.append((question, "timeout", None, False))
continue
tier = found.get("tier", "unknown")
light_ok = question in LIGHT_ACCEPTABLE
if tier == "medium":
correct, label, note = True, PASS, "medium ✓"
elif tier == "light":
correct = light_ok
label = PASS if light_ok else WARN
note = "light (acceptable)" if light_ok else "light (should be medium)"
elif tier == "complex":
correct, label, note = False, FAIL, "complex — wrong escalation"
else:
correct, label, note = False, FAIL, f"unknown tier {tier!r}"
print(f" → [{label}] {note} latency={found['reply_total']:.1f}s llm={found['llm']:.1f}s")
med_results.append((question, tier, found["reply_total"], correct))
time.sleep(1)
print(f"\n {'#':<4} {'Tier':<8} {'Latency':>8} {'Question'}")
print(f" {'─'*4} {'─'*8} {'─'*8} {'─'*55}")
for idx, (q, tier, lat, ok) in enumerate(med_results, 1):
lat_str = f"{lat:.1f}s" if lat is not None else "timeout"
ok_str = "✓" if ok else ("~" if tier == "light" else "✗")
print(f" {ok_str} {idx:<3} {tier:<8} {lat_str:>8} {q[:55]!r}")
total_med = len(med_results)
medium_count = sum(1 for _, tier, _, _ in med_results if tier == "medium")
light_count = sum(1 for _, tier, _, _ in med_results if tier == "light")
complex_count = sum(1 for _, tier, _, _ in med_results if tier == "complex")
timeout_count = sum(1 for _, tier, _, _ in med_results if tier == "timeout")
light_misroute = sum(1 for q, tier, _, _ in med_results
if tier == "light" and q not in LIGHT_ACCEPTABLE)
lats = [lat for _, _, lat, _ in med_results if lat is not None]
print(f"\n Breakdown: medium={medium_count} light={light_count} "
f"complex={complex_count} timeout={timeout_count}")
if light_misroute:
print(f" [{WARN}] {light_misroute} question(s) answered via light when medium expected")
if lats:
print(f" Avg latency: {sum(lats)/len(lats):.1f}s min={min(lats):.1f}s max={max(lats):.1f}s")
report(results,
f"Medium questions: no complex escalation ({medium_count + light_count}/{total_med} routed)",
complex_count == 0,
f"medium={medium_count} light={light_count} complex={complex_count} timeout={timeout_count}")
if timeout_count:
report(results, f"Medium questions: all completed within {MEDIUM_TIMEOUT}s", False,
f"{timeout_count} question(s) timed out")
# ── hard benchmark ────────────────────────────────────────────────────────────
if _run_hard:
print(f"\n[{INFO}] Hard routing benchmark")
print(f" {len(BENCHMARK['hard'])} /think questions — must route to 'complex'")
print(f" Acceptable fallback: 'medium' if VRAM eviction timed out")
print(f" Fail condition: tier=light or timeout")
print(f" Chat ID: {CHAT_ID}")
print()
hard_results = []
COMPLEX_TIMEOUT = 300
_VRAM_ENTER = "[vram] enter_complex_mode"
_VRAM_EXIT = "[vram] exit_complex_mode"
for i, question in enumerate(BENCHMARK["hard"], 1):
tag = f"hard-{i:02d}"
short_q = question[len("/think "):].strip()[:60]
print(f" [{tag}] /think {short_q!r}")
t_send = time.monotonic()
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": question, "chat_id": CHAT_ID}, timeout=5)
if status != 202:
print(f" → [{FAIL}] POST returned {status}")
hard_results.append((question, "?", None, False))
continue
except Exception as e:
print(f" → [{FAIL}] POST error: {e}")
hard_results.append((question, "?", None, False))
continue
t_start = time.monotonic()
found = None
while time.monotonic() - t_start < COMPLEX_TIMEOUT:
since = int(time.monotonic() - t_start) + 90
lines = fetch_logs(since_s=since)
found = parse_run_block(lines, question[len("/think "):].strip())
if found:
break
time.sleep(5)
elapsed = time.monotonic() - t_send
if not found:
print(f" → [{FAIL}] no reply within {COMPLEX_TIMEOUT}s")
hard_results.append((question, "timeout", None, False))
continue
tier = found.get("tier", "unknown")
if tier == "complex":
ok, label, note = True, PASS, "complex ✓"
elif tier == "medium":
ok, label, note = True, WARN, "medium (VRAM fallback — check [vram] logs)"
else:
ok, label, note = False, FAIL, f"tier={tier} — unexpected"
lines_block = fetch_logs(since_s=int(elapsed) + 120)
recent = "\n".join(lines_block[-200:])
vram_enter_seen = _VRAM_ENTER in recent
vram_note = ""
if tier == "complex":
vram_note = " [vram:flush✓]" if vram_enter_seen else f" [{WARN}:no vram flush log]"
print(f" → [{label}] {note} latency={found['reply_total']:.1f}s llm={found['llm']:.1f}s{vram_note}")
hard_results.append((question, tier, found["reply_total"], ok))
time.sleep(5)
print(f"\n {'#':<4} {'Tier':<8} {'Latency':>8} {'Question (/think ...)'}")
print(f" {'─'*4} {'─'*8} {'─'*8} {'─'*55}")
for idx, (q, tier, lat, ok) in enumerate(hard_results, 1):
lat_str = f"{lat:.1f}s" if lat is not None else "timeout"
ok_str = "✓" if tier == "complex" else ("~" if tier == "medium" else "✗")
short = q[len("/think "):].strip()[:55]
print(f" {ok_str} {idx:<3} {tier:<8} {lat_str:>8} {short!r}")
total_hard = len(hard_results)
complex_count = sum(1 for _, t, _, _ in hard_results if t == "complex")
medium_fb = sum(1 for _, t, _, _ in hard_results if t == "medium")
light_count = sum(1 for _, t, _, _ in hard_results if t == "light")
timeout_count = sum(1 for _, t, _, _ in hard_results if t == "timeout")
lats = [lat for _, _, lat, _ in hard_results if lat is not None]
print(f"\n Breakdown: complex={complex_count} medium(fallback)={medium_fb} "
f"light={light_count} timeout={timeout_count}")
if medium_fb:
print(f" [{WARN}] {medium_fb} question(s) fell back to medium (VRAM eviction timeout)")
if light_count:
print(f" [{FAIL}] {light_count} question(s) routed to light — /think prefix not detected")
if lats:
print(f" Avg latency: {sum(lats)/len(lats):.1f}s min={min(lats):.1f}s max={max(lats):.1f}s")
report(results,
f"Hard questions routed to complex (not light) ({complex_count + medium_fb}/{total_hard})",
light_count == 0 and timeout_count == 0,
f"complex={complex_count} medium_fallback={medium_fb} light={light_count} timeout={timeout_count}")
# ── summary ───────────────────────────────────────────────────────────────────
print_summary(results)
sys.exit(0 if all(ok for _, ok in results) else 1)

tests/requirements.txt

@@ -0,0 +1,2 @@
pytest>=8.0
pytest-asyncio>=0.23

tests/unit/__init__.py (empty file)

tests/unit/conftest.py

@@ -0,0 +1,80 @@
"""
Stub out all third-party packages that Adolf's source modules import.
This lets the unit tests run without a virtualenv or Docker environment.
Stubs are installed into sys.modules before any test file is collected.
"""
import sys
from unittest.mock import MagicMock
# ── helpers ────────────────────────────────────────────────────────────────────
def _mock(name: str) -> MagicMock:
m = MagicMock(name=name)
sys.modules[name] = m
return m
# ── pydantic: BaseModel must be a real class so `class Foo(BaseModel)` works ──
class _FakeBaseModel:
model_fields: dict = {}
def __init_subclass__(cls, **kwargs):
pass
def __init__(self, **data):
for k, v in data.items():
setattr(self, k, v)
_pydantic = _mock("pydantic")
_pydantic.BaseModel = _FakeBaseModel
# ── httpx: used by channels.py, vram_manager.py, agent.py ────────────────────
_mock("httpx")
# ── fastapi ───────────────────────────────────────────────────────────────────
_fastapi = _mock("fastapi")
_mock("fastapi.responses")
# ── langchain stack ───────────────────────────────────────────────────────────
_mock("langchain_openai")
_lc_core = _mock("langchain_core")
_lc_msgs = _mock("langchain_core.messages")
_mock("langchain_core.tools")
# Provide real-ish message classes so router.py can instantiate them
class _FakeMsg:
def __init__(self, content=""):
self.content = content
class SystemMessage(_FakeMsg):
pass
class HumanMessage(_FakeMsg):
pass
class AIMessage(_FakeMsg):
def __init__(self, content="", tool_calls=None):
super().__init__(content)
self.tool_calls = tool_calls or []
_lc_msgs.SystemMessage = SystemMessage
_lc_msgs.HumanMessage = HumanMessage
_lc_msgs.AIMessage = AIMessage
_mock("langchain_mcp_adapters")
_mock("langchain_mcp_adapters.client")
_mock("langchain_community")
_mock("langchain_community.utilities")
# ── deepagents (agent_factory.py) ─────────────────────────────────────────────
_mock("deepagents")
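The stubbing technique above can be demonstrated in isolation — a minimal sketch, where the package name `fake_sdk` is made up (any uninstalled name works):

```python
# Installing a MagicMock into sys.modules before the first import makes
# Python resolve the name to the stub instead of searching site-packages.
import sys
from unittest.mock import MagicMock

sys.modules["fake_sdk"] = MagicMock(name="fake_sdk")

import fake_sdk  # resolves to the stub — no installation required

client = fake_sdk.Client(api_key="x")  # attribute access yields nested mocks
assert fake_sdk is sys.modules["fake_sdk"]
```

Because stubs are keyed by module name, submodules (`langchain_core.messages`) must be stubbed separately, which is why conftest.py registers each dotted path individually.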


@@ -0,0 +1,198 @@
"""
Unit tests for agent.py helper functions:
- _strip_think(text)
- _extract_final_text(result)
agent.py has heavy FastAPI/LangChain imports; conftest.py stubs them out so
these pure functions can be imported and tested in isolation.
"""
import pytest
# conftest.py has already installed all stubs into sys.modules.
# The FastAPI app is instantiated at module level in agent.py —
# with the mocked fastapi, that just creates a MagicMock() object
# and the route decorators are no-ops.
from agent import _strip_think, _extract_final_text, _extract_urls
# ── _strip_think ───────────────────────────────────────────────────────────────
class TestStripThink:
def test_removes_single_think_block(self):
text = "<think>internal reasoning</think>Final answer."
assert _strip_think(text) == "Final answer."
def test_removes_multiline_think_block(self):
text = "<think>\nLine one.\nLine two.\n</think>\nResult here."
assert _strip_think(text) == "Result here."
def test_no_think_block_unchanged(self):
text = "This is a plain answer with no think block."
assert _strip_think(text) == text
def test_removes_multiple_think_blocks(self):
text = "<think>step 1</think>middle<think>step 2</think>end"
assert _strip_think(text) == "middleend"
def test_strips_surrounding_whitespace(self):
text = " <think>stuff</think> answer "
assert _strip_think(text) == "answer"
def test_empty_think_block(self):
text = "<think></think>Hello."
assert _strip_think(text) == "Hello."
def test_empty_string(self):
assert _strip_think("") == ""
def test_only_think_block_returns_empty(self):
text = "<think>nothing useful</think>"
assert _strip_think(text) == ""
def test_think_block_with_nested_tags(self):
text = "<think>I should use <b>bold</b> here</think>Done."
assert _strip_think(text) == "Done."
def test_preserves_markdown(self):
text = "<think>plan</think>## Report\n\n- Point one\n- Point two"
result = _strip_think(text)
assert result == "## Report\n\n- Point one\n- Point two"
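A regex-based implementation satisfies every case above — a sketch only; the real `_strip_think` in agent.py may differ in detail:

```python
import re

# Non-greedy match so each <think>...</think> block is removed separately;
# DOTALL lets a block span multiple lines.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    return _THINK_RE.sub("", text).strip()

strip_think("<think>step 1</think>middle<think>step 2</think>end")  # → "middleend"
```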
# ── _extract_final_text ────────────────────────────────────────────────────────
class TestExtractFinalText:
def _ai_msg(self, content: str, tool_calls=None):
"""Create a minimal AIMessage-like object."""
class AIMessage:
pass
m = AIMessage()
m.content = content
m.tool_calls = tool_calls or []
return m
def _human_msg(self, content: str):
class HumanMessage:
pass
m = HumanMessage()
m.content = content
return m
def test_returns_last_ai_message_content(self):
result = {
"messages": [
self._human_msg("what is 2+2"),
self._ai_msg("The answer is 4."),
]
}
assert _extract_final_text(result) == "The answer is 4."
def test_returns_last_of_multiple_ai_messages(self):
result = {
"messages": [
self._ai_msg("First response."),
self._human_msg("follow-up"),
self._ai_msg("Final response."),
]
}
assert _extract_final_text(result) == "Final response."
def test_skips_empty_ai_messages(self):
result = {
"messages": [
self._ai_msg("Real answer."),
self._ai_msg(""), # empty — should be skipped
]
}
assert _extract_final_text(result) == "Real answer."
def test_strips_think_tags_from_ai_message(self):
result = {
"messages": [
self._ai_msg("<think>reasoning here</think>Clean reply."),
]
}
assert _extract_final_text(result) == "Clean reply."
def test_falls_back_to_output_field(self):
result = {
"messages": [],
"output": "Fallback output.",
}
assert _extract_final_text(result) == "Fallback output."
def test_strips_think_from_output_field(self):
result = {
"messages": [],
"output": "<think>thoughts</think>Actual output.",
}
assert _extract_final_text(result) == "Actual output."
def test_returns_none_when_no_content(self):
result = {"messages": []}
assert _extract_final_text(result) is None
def test_returns_none_when_no_messages_and_no_output(self):
result = {"messages": [], "output": ""}
# output is falsy → returns None
assert _extract_final_text(result) is None
def test_skips_non_ai_messages(self):
result = {
"messages": [
self._human_msg("user question"),
]
}
assert _extract_final_text(result) is None
def test_handles_ai_message_with_tool_calls_but_no_content(self):
"""AIMessage that only has tool_calls (no content) should be skipped."""
msg = self._ai_msg("", tool_calls=[{"name": "web_search", "args": {}}])
result = {"messages": [msg]}
assert _extract_final_text(result) is None
def test_multiline_think_stripped_correctly(self):
result = {
"messages": [
self._ai_msg("<think>\nLong\nreasoning\nblock\n</think>\n## Report\n\nSome content."),
]
}
assert _extract_final_text(result) == "## Report\n\nSome content."
# ── _extract_urls ──────────────────────────────────────────────────────────────
class TestExtractUrls:
def test_single_url(self):
assert _extract_urls("check this out https://example.com please") == ["https://example.com"]
def test_multiple_urls(self):
urls = _extract_urls("see https://foo.com and https://bar.org/path?q=1")
assert urls == ["https://foo.com", "https://bar.org/path?q=1"]
def test_no_urls(self):
assert _extract_urls("no links here at all") == []
def test_http_and_https(self):
urls = _extract_urls("http://old.site and https://new.site")
assert "http://old.site" in urls
assert "https://new.site" in urls
def test_url_at_start_of_message(self):
assert _extract_urls("https://example.com is interesting") == ["https://example.com"]
def test_url_only(self):
assert _extract_urls("https://example.com/page") == ["https://example.com/page"]
def test_url_with_path_and_query(self):
url = "https://example.com/articles/123?ref=home&page=2"
assert _extract_urls(url) == [url]
def test_empty_string(self):
assert _extract_urls("") == []
def test_does_not_include_surrounding_quotes(self):
# URLs inside quotes should not include the quote character
urls = _extract_urls('visit "https://example.com" today')
assert urls == ["https://example.com"]
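One pattern consistent with these expectations — a sketch, not necessarily the regex `_extract_urls` actually uses:

```python
import re

# Stop at whitespace, quotes, and angle brackets so surrounding punctuation
# such as a closing '"' is not swallowed into the URL.
_URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(text: str) -> list[str]:
    return _URL_RE.findall(text)

extract_urls('visit "https://example.com" today')  # → ['https://example.com']
```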

tests/unit/test_channels.py

@@ -0,0 +1,125 @@
"""Unit tests for channels.py — register, deliver, pending_replies queue."""
import asyncio
import pytest
from unittest.mock import AsyncMock, patch
import channels
@pytest.fixture(autouse=True)
def reset_channels_state():
"""Clear module-level state before and after every test."""
channels._callbacks.clear()
channels.pending_replies.clear()
yield
channels._callbacks.clear()
channels.pending_replies.clear()
# ── register ───────────────────────────────────────────────────────────────────
class TestRegister:
def test_register_stores_callback(self):
cb = AsyncMock()
channels.register("test_channel", cb)
assert channels._callbacks["test_channel"] is cb
def test_register_overwrites_existing(self):
cb1 = AsyncMock()
cb2 = AsyncMock()
channels.register("ch", cb1)
channels.register("ch", cb2)
assert channels._callbacks["ch"] is cb2
def test_register_multiple_channels(self):
cb_a = AsyncMock()
cb_b = AsyncMock()
channels.register("a", cb_a)
channels.register("b", cb_b)
assert channels._callbacks["a"] is cb_a
assert channels._callbacks["b"] is cb_b
# ── deliver ────────────────────────────────────────────────────────────────────
class TestDeliver:
async def test_deliver_enqueues_reply(self):
channels.register("cli", AsyncMock())
await channels.deliver("cli-alvis", "cli", "hello world")
q = channels.pending_replies["cli-alvis"]
assert not q.empty()
assert await q.get() == "hello world"
async def test_deliver_calls_channel_callback(self):
cb = AsyncMock()
channels.register("telegram", cb)
await channels.deliver("tg-123", "telegram", "reply text")
cb.assert_awaited_once_with("tg-123", "reply text")
async def test_deliver_unknown_channel_still_enqueues(self):
"""No registered callback for channel → reply still goes to the queue."""
await channels.deliver("cli-bob", "nonexistent", "fallback reply")
q = channels.pending_replies["cli-bob"]
assert await q.get() == "fallback reply"
async def test_deliver_unknown_channel_does_not_raise(self):
"""Missing callback must not raise an exception."""
await channels.deliver("cli-x", "ghost_channel", "msg")
async def test_deliver_creates_queue_if_absent(self):
channels.register("cli", AsyncMock())
assert "cli-new" not in channels.pending_replies
await channels.deliver("cli-new", "cli", "hi")
assert "cli-new" in channels.pending_replies
async def test_deliver_reuses_existing_queue(self):
"""Second deliver to the same session appends to the same queue."""
channels.register("cli", AsyncMock())
await channels.deliver("cli-alvis", "cli", "first")
await channels.deliver("cli-alvis", "cli", "second")
q = channels.pending_replies["cli-alvis"]
assert await q.get() == "first"
assert await q.get() == "second"
async def test_deliver_telegram_sends_to_callback(self):
sent = []
async def fake_tg(session_id, text):
sent.append((session_id, text))
channels.register("telegram", fake_tg)
await channels.deliver("tg-999", "telegram", "test message")
assert sent == [("tg-999", "test message")]
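The behaviour these deliver tests pin down can be sketched as follows (an assumed shape of channels.py, reconstructed from the tests rather than the source):

```python
import asyncio

_callbacks: dict = {}
pending_replies: dict = {}

def register(channel: str, cb) -> None:
    _callbacks[channel] = cb

async def deliver(session_id: str, channel: str, text: str) -> None:
    # Always enqueue first, creating the per-session queue on demand,
    # so SSE consumers see the reply even with no registered callback.
    q = pending_replies.setdefault(session_id, asyncio.Queue())
    await q.put(text)
    cb = _callbacks.get(channel)
    if cb is not None:  # missing callback is not an error
        await cb(session_id, text)
```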
# ── register_defaults ──────────────────────────────────────────────────────────
class TestRegisterDefaults:
def test_registers_telegram_and_cli(self):
channels.register_defaults()
assert "telegram" in channels._callbacks
assert "cli" in channels._callbacks
async def test_cli_callback_is_noop(self):
"""CLI send callback does nothing (replies are handled via SSE queue)."""
channels.register_defaults()
cb = channels._callbacks["cli"]
# Should not raise and should return None
result = await cb("cli-alvis", "some reply")
assert result is None
async def test_telegram_callback_chunks_long_messages(self):
"""Telegram callback splits messages > 4000 chars into chunks."""
channels.register_defaults()
cb = channels._callbacks["telegram"]
long_text = "x" * 9000 # > 4000 chars → should produce 3 chunks
with patch("channels.httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=False)
mock_client.post = AsyncMock()
mock_client_cls.return_value = mock_client
await cb("tg-123", long_text)
# 9000 chars / 4000 per chunk = 3 POST calls
assert mock_client.post.await_count == 3
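The 3-chunk expectation follows from plain fixed-size slicing — a sketch of the assumed chunking helper:

```python
def chunk_message(text: str, limit: int = 4000) -> list[str]:
    # Telegram rejects messages over ~4096 chars; split into fixed slices.
    return [text[i:i + limit] for i in range(0, len(text), limit)]

len(chunk_message("x" * 9000))  # → 3 (4000 + 4000 + 1000)
```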

tests/unit/test_router.py

@@ -0,0 +1,200 @@
"""Unit tests for router.py — Router, _parse_tier, _format_history, _LIGHT_PATTERNS."""
import pytest
from unittest.mock import AsyncMock, MagicMock, patch
from router import Router, _parse_tier, _format_history, _LIGHT_PATTERNS
# ── _LIGHT_PATTERNS regex ──────────────────────────────────────────────────────
class TestLightPatterns:
@pytest.mark.parametrize("text", [
"hi", "Hi", "HI",
"hello", "hey", "yo", "sup",
"good morning", "good evening", "good night", "good afternoon",
"bye", "goodbye", "see you", "cya", "later", "ttyl",
"thanks", "thank you", "thx", "ty",
"ok", "okay", "k", "cool", "great", "awesome", "perfect",
"sounds good", "got it", "nice", "sure",
"how are you", "how are you?", "how are you doing today?",
"what's up",
"what day comes after Monday?",
"what day follows Friday?",
"what comes after summer?",
"what does NASA stand for?",
"what does AI stand for?",
# with trailing punctuation
"hi!", "hello.", "thanks!",
])
def test_matches(self, text):
assert _LIGHT_PATTERNS.match(text.strip()), f"Expected light match for: {text!r}"
@pytest.mark.parametrize("text", [
"what is the capital of France",
"tell me about bitcoin",
"what is 2+2",
"write me a poem",
"search for news about the election",
"what did we talk about last time",
"what is my name",
"/think compare these frameworks",
"how do I install Python",
"explain machine learning",
"", # empty string doesn't match the pattern
])
def test_no_match(self, text):
assert not _LIGHT_PATTERNS.match(text.strip()), f"Expected NO light match for: {text!r}"
# ── _parse_tier ────────────────────────────────────────────────────────────────
class TestParseTier:
@pytest.mark.parametrize("raw,expected", [
("light", "light"),
("Light", "light"),
("LIGHT\n", "light"),
("medium", "medium"),
("Medium.", "medium"),
("complex", "complex"),
("Complex!", "complex"),
# descriptive words → light
("simplefact", "light"),
("trivial question", "light"),
("basic", "light"),
("easy answer", "light"),
("general knowledge", "light"),
# unknown → medium
("unknown_category", "medium"),
("", "medium"),
("I don't know", "medium"),
# complex only if 'complex' appears in first 60 chars
("this is a complex query requiring search", "complex"),
# _parse_tier checks "complex" before "medium", so complex wins even if medium appears first
("medium complexity, not complex", "complex"),
])
def test_parse_tier(self, raw, expected):
assert _parse_tier(raw) == expected
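One implementation that satisfies every parametrized case above — a sketch; the real `_parse_tier` may differ:

```python
def parse_tier(raw: str) -> str:
    head = raw.strip().lower()[:60]   # only the first 60 chars matter
    if "complex" in head:             # checked first, so "complex" wins
        return "complex"
    if "medium" in head:
        return "medium"
    if "light" in head or any(
        w in head for w in ("simple", "trivial", "basic", "easy", "general")
    ):
        return "light"                # descriptive synonyms map to light
    return "medium"                   # unknown output → safe default

parse_tier("medium complexity, not complex")  # → "complex"
```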
# ── _format_history ────────────────────────────────────────────────────────────
class TestFormatHistory:
def test_empty(self):
assert _format_history([]) == "(none)"
def test_single_user_message(self):
history = [{"role": "user", "content": "hello there"}]
result = _format_history(history)
assert "user: hello there" in result
def test_multiple_turns(self):
history = [
{"role": "user", "content": "What is Python?"},
{"role": "assistant", "content": "Python is a programming language."},
]
result = _format_history(history)
assert "user: What is Python?" in result
assert "assistant: Python is a programming language." in result
def test_truncates_long_content(self):
long_content = "x" * 300
history = [{"role": "user", "content": long_content}]
result = _format_history(history)
# content is truncated to 200 chars in _format_history
assert len(result) < 250
def test_missing_keys_handled(self):
# Should not raise — uses .get() with defaults
history = [{"role": "user"}] # no content key
result = _format_history(history)
assert "user:" in result
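A compact sketch consistent with these tests (an assumed shape of `_format_history`, not the actual source):

```python
def format_history(history: list) -> str:
    if not history:
        return "(none)"
    # Truncate each turn's content to 200 chars to keep the prompt small;
    # .get() with defaults tolerates missing keys.
    return "\n".join(
        f"{m.get('role', '?')}: {m.get('content', '')[:200]}"
        for m in history
    )
```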
# ── Router.route() ─────────────────────────────────────────────────────────────
class TestRouterRoute:
def _make_router(self, classify_response: str, reply_response: str = "Sure!") -> Router:
"""Return a Router with a mock model that returns given classification and reply."""
model = MagicMock()
classify_msg = MagicMock()
classify_msg.content = classify_response
reply_msg = MagicMock()
reply_msg.content = reply_response
# First ainvoke call → classification; second → reply
model.ainvoke = AsyncMock(side_effect=[classify_msg, reply_msg])
return Router(model=model)
async def test_force_complex_bypasses_classification(self):
router = self._make_router("medium")
tier, reply = await router.route("some question", [], force_complex=True)
assert tier == "complex"
assert reply is None
# Model should NOT have been called
router.model.ainvoke.assert_not_called()
async def test_regex_light_skips_llm_classification(self):
# Regex match bypasses classification entirely; the only ainvoke call is the reply.
model = MagicMock()
reply_msg = MagicMock()
reply_msg.content = "I'm doing great!"
model.ainvoke = AsyncMock(return_value=reply_msg)
router = Router(model=model)
tier, reply = await router.route("how are you", [], force_complex=False)
assert tier == "light"
assert reply == "I'm doing great!"
# Exactly one model call — no classification step
assert router.model.ainvoke.call_count == 1
async def test_llm_classifies_medium(self):
router = self._make_router("medium")
tier, reply = await router.route("what is the bitcoin price?", [], force_complex=False)
assert tier == "medium"
assert reply is None
async def test_llm_classifies_light_generates_reply(self):
router = self._make_router("light", "Paris is the capital of France.")
tier, reply = await router.route("what is the capital of France?", [], force_complex=False)
assert tier == "light"
assert reply == "Paris is the capital of France."
async def test_llm_classifies_complex_downgraded_to_medium(self):
# Without /think prefix, complex classification → downgraded to medium
router = self._make_router("complex")
tier, reply = await router.route("compare React and Vue", [], force_complex=False)
assert tier == "medium"
assert reply is None
async def test_llm_error_falls_back_to_medium(self):
model = MagicMock()
model.ainvoke = AsyncMock(side_effect=Exception("connection error"))
router = Router(model=model)
tier, reply = await router.route("some question", [], force_complex=False)
assert tier == "medium"
assert reply is None
async def test_light_reply_empty_falls_back_to_medium(self):
"""If the light reply comes back empty, router returns medium instead."""
router = self._make_router("light", "") # empty reply
tier, reply = await router.route("what is 2+2", [], force_complex=False)
assert tier == "medium"
assert reply is None
async def test_strips_think_tags_from_classification(self):
"""Router strips <think>...</think> from model output before parsing tier."""
model = MagicMock()
classify_msg = MagicMock()
classify_msg.content = "<think>Hmm let me think...</think>medium"
reply_msg = MagicMock()
reply_msg.content = "I'm fine!"
model.ainvoke = AsyncMock(side_effect=[classify_msg, reply_msg])
router = Router(model=model)
tier, _ = await router.route("what is the news?", [], force_complex=False)
assert tier == "medium"
async def test_think_prefix_forces_complex(self):
"""/think prefix is already stripped by agent.py; force_complex=True is passed."""
router = self._make_router("medium")
tier, reply = await router.route("analyse this", [], force_complex=True)
assert tier == "complex"
assert reply is None


@@ -0,0 +1,164 @@
"""Unit tests for vram_manager.py — VRAMManager flush/poll/prewarm logic."""
import asyncio
import pytest
from unittest.mock import AsyncMock, MagicMock, patch
from vram_manager import VRAMManager
BASE_URL = "http://localhost:11434"
def _make_manager() -> VRAMManager:
return VRAMManager(base_url=BASE_URL)
def _mock_client(get_response=None, post_response=None):
"""Return a context-manager mock for httpx.AsyncClient."""
client = AsyncMock()
client.__aenter__ = AsyncMock(return_value=client)
client.__aexit__ = AsyncMock(return_value=False)
if get_response is not None:
client.get = AsyncMock(return_value=get_response)
if post_response is not None:
client.post = AsyncMock(return_value=post_response)
return client
# ── _flush ─────────────────────────────────────────────────────────────────────
class TestFlush:
async def test_sends_keep_alive_zero(self):
client = _mock_client(post_response=MagicMock())
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
await mgr._flush("qwen3:4b")
client.post.assert_awaited_once()
_, kwargs = client.post.await_args
body = kwargs.get("json") or client.post.call_args[1].get("json") or client.post.call_args[0][1]
assert body["model"] == "qwen3:4b"
assert body["keep_alive"] == 0
async def test_posts_to_correct_endpoint(self):
client = _mock_client(post_response=MagicMock())
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
await mgr._flush("qwen3:8b")
url = client.post.call_args[0][0]
assert url == f"{BASE_URL}/api/generate"
async def test_ignores_exceptions_silently(self):
client = AsyncMock()
client.__aenter__ = AsyncMock(return_value=client)
client.__aexit__ = AsyncMock(return_value=False)
client.post = AsyncMock(side_effect=Exception("connection refused"))
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
# Should not raise
await mgr._flush("qwen3:4b")
# ── _prewarm ───────────────────────────────────────────────────────────────────
class TestPrewarm:
async def test_sends_keep_alive_300(self):
client = _mock_client(post_response=MagicMock())
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
await mgr._prewarm("qwen3:4b")
_, kwargs = client.post.await_args
body = kwargs.get("json") or client.post.call_args[1].get("json") or client.post.call_args[0][1]
assert body["keep_alive"] == 300
assert body["model"] == "qwen3:4b"
async def test_ignores_exceptions_silently(self):
client = AsyncMock()
client.__aenter__ = AsyncMock(return_value=client)
client.__aexit__ = AsyncMock(return_value=False)
client.post = AsyncMock(side_effect=Exception("timeout"))
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
await mgr._prewarm("qwen3:4b")
# ── _poll_evicted ──────────────────────────────────────────────────────────────
class TestPollEvicted:
async def test_returns_true_when_models_absent(self):
resp = MagicMock()
resp.json.return_value = {"models": [{"name": "some_other_model"}]}
client = _mock_client(get_response=resp)
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
result = await mgr._poll_evicted(["qwen3:4b", "qwen2.5:1.5b"], timeout=5)
assert result is True
async def test_returns_false_on_timeout_when_model_still_loaded(self):
resp = MagicMock()
resp.json.return_value = {"models": [{"name": "qwen3:4b"}]}
client = _mock_client(get_response=resp)
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
result = await mgr._poll_evicted(["qwen3:4b"], timeout=0.1)
assert result is False
async def test_returns_true_immediately_if_already_empty(self):
resp = MagicMock()
resp.json.return_value = {"models": []}
client = _mock_client(get_response=resp)
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
result = await mgr._poll_evicted(["qwen3:4b"], timeout=5)
assert result is True
async def test_handles_poll_error_and_continues(self):
"""If /api/ps errors, polling continues until timeout."""
client = AsyncMock()
client.__aenter__ = AsyncMock(return_value=client)
client.__aexit__ = AsyncMock(return_value=False)
client.get = AsyncMock(side_effect=Exception("network error"))
with patch("vram_manager.httpx.AsyncClient", return_value=client):
mgr = _make_manager()
result = await mgr._poll_evicted(["qwen3:4b"], timeout=0.2)
assert result is False
# ── enter_complex_mode / exit_complex_mode ─────────────────────────────────────
class TestComplexMode:
async def test_enter_complex_mode_returns_true_on_success(self):
mgr = _make_manager()
mgr._flush = AsyncMock()
mgr._poll_evicted = AsyncMock(return_value=True)
result = await mgr.enter_complex_mode()
assert result is True
async def test_enter_complex_mode_flushes_medium_models(self):
mgr = _make_manager()
mgr._flush = AsyncMock()
mgr._poll_evicted = AsyncMock(return_value=True)
await mgr.enter_complex_mode()
flushed = {call.args[0] for call in mgr._flush.call_args_list}
assert "qwen3:4b" in flushed
assert "qwen2.5:1.5b" in flushed
async def test_enter_complex_mode_returns_false_on_eviction_timeout(self):
mgr = _make_manager()
mgr._flush = AsyncMock()
mgr._poll_evicted = AsyncMock(return_value=False)
result = await mgr.enter_complex_mode()
assert result is False
async def test_exit_complex_mode_flushes_complex_and_prewarms_medium(self):
mgr = _make_manager()
mgr._flush = AsyncMock()
mgr._prewarm = AsyncMock()
await mgr.exit_complex_mode()
# Must flush 8b
flushed = {call.args[0] for call in mgr._flush.call_args_list}
assert "qwen3:8b" in flushed
# Must prewarm medium models
prewarmed = {call.args[0] for call in mgr._prewarm.call_args_list}
assert "qwen3:4b" in prewarmed
assert "qwen2.5:1.5b" in prewarmed
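The mode-switch flow these tests exercise can be sketched as follows. This is inferred from the tests, not taken from vram_manager.py; the model names come from the assertions above and the eviction `timeout=30` is an assumption:

```python
MEDIUM_MODELS = ["qwen3:4b", "qwen2.5:1.5b"]
COMPLEX_MODEL = "qwen3:8b"

async def enter_complex_mode(mgr) -> bool:
    for model in MEDIUM_MODELS:      # ask the backend to evict each model
        await mgr._flush(model)
    # True only once both models have left VRAM within the deadline
    return await mgr._poll_evicted(MEDIUM_MODELS, timeout=30)

async def exit_complex_mode(mgr) -> None:
    await mgr._flush(COMPLEX_MODEL)  # free the big model first
    for model in MEDIUM_MODELS:      # warm the fast tier back up
        await mgr._prewarm(model)
```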


@@ -0,0 +1,41 @@
# Use Case: Apple Pie Research
Verify that a deep research query triggers the complex tier, uses web search and
page fetching, and produces a substantive, well-sourced recipe response.
## Steps
**1. Send the research query** (the `/think` prefix forces complex tier):
```bash
curl -s -X POST http://localhost:8000/message \
-H "Content-Type: application/json" \
-d '{"text": "/think what is the best recipe for an apple pie?", "session_id": "use-case-apple-pie", "channel": "cli", "user_id": "claude"}'
```
**2. Wait for the streaming reply** (complex tier can take up to 5 minutes):
```bash
curl -s -N --max-time 300 "http://localhost:8000/stream/use-case-apple-pie"
```
**3. Confirm tier and tool usage in agent logs:**
```bash
docker compose -f /home/alvis/adolf/docker-compose.yml logs deepagents \
--since=600s | grep -E "tier=complex|web_search|fetch_url|crawl4ai"
```
## Evaluate (use your judgment)
Check each of the following:
- **Tier**: logs show `tier=complex` for this session
- **Tool use**: logs show `web_search` or `fetch_url` calls during the request
- **Ingredients**: response lists specific apple pie ingredients (apples, flour, butter, sugar, etc.)
- **Method**: response includes preparation or baking steps
- **Sources**: response cites real URLs it fetched, not invented links
- **Quality**: response is structured and practical — not a refusal, stub, or generic placeholder
Report PASS only if all six criteria are met. For any failure, state which criterion
failed and quote the relevant part of the response or logs.
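The three content criteria (ingredients, method, sources) can be pre-screened mechanically before the judgment call on quality. A hypothetical helper — the keyword lists are illustrative, not part of the harness:

```python
import re


def check_recipe_response(text: str) -> dict[str, bool]:
    """Cheap first pass over the content criteria; a human (or judge model)
    still decides Quality and whether cited URLs are real."""
    lower = text.lower()
    return {
        "ingredients": any(w in lower for w in ("apple", "flour", "butter", "sugar")),
        "method": any(w in lower for w in ("preheat", "bake", "roll", "mix")),
        "sources": bool(re.search(r"https?://\S+", text)),
    }


sample = "Peel the apples, mix flour and butter, bake at 190C. Source: https://example.com/pie"
print(check_recipe_response(sample))  # all three checks True for this sample
```

A response failing any of these keys can be reported as FAIL immediately, with the missing key as the quoted criterion.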

# Use Case: CLI Startup
Verify the Adolf CLI container starts cleanly, shows the welcome banner,
and exits without error when the user closes input.
## Steps
```bash
echo "" | docker compose --profile tools run --rm -T cli \
python3 cli.py --url http://deepagents:8000 --session use-case-cli-startup
echo "exit code: $?"
```
## Pass if
- Output contains `Adolf CLI`
- Output contains the session name and gateway URL
- Exit code is 0
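The pass criteria above are simple enough to script. A hypothetical check, assuming `output` and the exit code are captured from the `docker compose run` in Steps:

```python
def cli_startup_passed(output: str, exit_code: int,
                       session: str = "use-case-cli-startup",
                       url: str = "http://deepagents:8000") -> bool:
    # Mirrors the three "Pass if" bullets: banner, session + gateway URL, clean exit.
    return ("Adolf CLI" in output
            and session in output
            and url in output
            and exit_code == 0)


banner = "Adolf CLI — session use-case-cli-startup @ http://deepagents:8000"
print(cli_startup_passed(banner, 0))  # True
```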

# Use Case: Current Weather Query
Verify how Adolf handles a real-time information request ("what's the weather now?").
This question requires live data that an LLM cannot answer from training alone.
## Steps
**1. Send the weather query:**
```bash
curl -s -X POST http://localhost:8000/message \
-H "Content-Type: application/json" \
-d '{"text": "whats the weather right now?", "session_id": "use-case-weather", "channel": "cli", "user_id": "claude"}'
```
**2. Stream the reply** (medium tier should respond within 30s):
```bash
curl -s -N --max-time 60 "http://localhost:8000/stream/use-case-weather"
```
**3. Check routing tier and any tool usage in logs:**
```bash
docker compose -f /home/alvis/adolf/docker-compose.yml logs deepagents \
--since=120s | grep -E "tier=|web_search|fetch_url|crawl4ai"
```
## Evaluate (use your judgment)
Check each of the following:
- **Routing**: which tier was selected? Was it appropriate for a real-time query?
- **Tool use**: did the agent use web_search or any external data source?
- **Accuracy**: does the response contain actual current weather data (temperature, conditions) or is it a guess/refusal?
- **Honesty**: if the agent cannot fetch weather, does it say so — or does it hallucinate fake data?
- **Helpfulness**: does the response suggest how the user could get weather info (e.g. check a website, use /think)?
Report PASS only if the response is both honest and helpful. A hallucinated weather
report is a FAIL. An honest "I can't check weather" with guidance is a PASS.