voice benchmark: rename --dry-run → --no-inference, fix log extraction

- --no-inference applies to all tiers (not just complex)
- metadata key: dry_run → no_inference
- extract_tier_from_logs: forward iteration (not reversed), updated regex
- GPU check skipped when --no-inference
- Fix TypeError in misclassified print when actual=None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove benchmark.json from gitignore — dataset is now tracked
2026-03-24 07:58:05 +00:00 · 2026-03-24 07:53:35 +00:00 · 2026-03-24 07:53:01 +00:00 · 2026-03-24 07:40:07 +00:00 · 2026-03-24 07:37:55 +00:00 · 2026-03-24 07:35:13 +00:00
45 changed files with 5250 additions and 1468 deletions
--- a/.claude/rules/agent-pipeline.md
+++ b/.claude/rules/agent-pipeline.md
@@ -0,0 +1,22 @@
 # Agent Pipeline Rules
 ## Tiers
 - Routing is fully automatic: router classifies into light/medium/complex via 3-way embedding similarity.
 - Complex tier is reached automatically for deep research queries — no prefix required.
 - Medium is the default tier. Light is only for trivial static-knowledge queries matched by regex or embedding.
 - Light tier upgrade to medium is automatic when URL content is pre-fetched or a fast tool matches.
 - `tier_override` API parameter still allows callers to force a specific tier (e.g. `adolf-deep` model → complex).
 ## Medium agent
 - `_DirectModel` makes a single `ainvoke()` call with no tool schema. Do not add tools to the medium agent.
 - `qwen3:4b` behaves unreliably when a tool array is present in the request — inject context via system prompt instead.
 ## Memory
 - `add_memory` and `search_memory` are called directly in `run_agent_task()`, outside the agent loop.
 - Never add memory tools to any agent's tool list.
 - Memory storage (`_store_memory`) runs as an asyncio background task after the semaphore is released.
 ## Fast tools
 - `FastToolRunner.run_matching()` runs in the pre-flight `asyncio.gather` alongside URL fetch and memory retrieval.
 - Fast tool results are injected as a system prompt block, not returned to the user directly.
 - When `any_matches()` is true, the router forces medium tier before LLM classification.
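
The tier rules above can be sketched as one small decision function. This is a minimal sketch, not the code in `router.py`: `choose_tier` and its parameters are hypothetical names, and giving `tier_override` precedence over the fast-tool rule is an assumption.

```python
from typing import Optional

TIERS = ("light", "medium", "complex")


def choose_tier(
    classified: str,
    *,
    fast_tool_matched: bool = False,
    url_content_fetched: bool = False,
    tier_override: Optional[str] = None,
) -> str:
    """Hypothetical helper: apply the tier rules to a router classification."""
    if tier_override is not None:
        # Assumption: an explicit override wins over every automatic rule.
        if tier_override not in TIERS:
            raise ValueError(f"unknown tier: {tier_override}")
        return tier_override
    if fast_tool_matched:
        # A fast tool match forces medium tier.
        return "medium"
    if classified == "light" and url_content_fetched:
        # Light is upgraded to medium when URL content was pre-fetched.
        return "medium"
    return classified
```

For example, `choose_tier("light", url_content_fetched=True)` yields `"medium"`, while a `"complex"` classification is left unchanged by a URL fetch.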
--- a/.claude/rules/fast-tools.md
+++ b/.claude/rules/fast-tools.md
@@ -0,0 +1,24 @@
 ---
 paths:
  - "fast_tools.py"
  - "agent.py"
 ---
 # Fast Tools — Extension Guide
 To add a new fast tool:
 1. In `fast_tools.py`, subclass `FastTool` and implement:
   - `name` (str property) — unique identifier, used in logs
   - `matches(message: str) -> bool` — regex or logic; keep it cheap, runs on every message
   - `run(message: str) -> str` — async fetch; return a short context block or `""` on failure; never raise
 2. In `agent.py`, add an instance to the `_fast_tool_runner` list (module level, after env vars are defined).
 3. The router will automatically force medium tier when `matches()` returns true — no router changes needed.
 ## Constraints
 - `run()` must return in under 15s — it runs in the pre-flight gather that blocks routing.
 - Return `""` or a `[tool error: ...]` string on failure — never raise exceptions.
 - Keep returned context under ~1000 chars — larger contexts slow down `qwen3:4b` streaming significantly.
 - The deepagents container has no direct external internet. Use SearXNG (`host.docker.internal:11437`) or internal services.
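
A minimal sketch of a tool that follows the steps and constraints above. The `FastTool` base class here is a stand-in with an assumed shape (the real one lives in `fast_tools.py` and is not shown in this diff), and `ClockTool` is a hypothetical example that needs no network access.

```python
import asyncio
import re
from abc import ABC, abstractmethod
from datetime import datetime, timezone


class FastTool(ABC):
    """Stand-in for the base class in fast_tools.py (assumed shape)."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def matches(self, message: str) -> bool: ...

    @abstractmethod
    async def run(self, message: str) -> str: ...


class ClockTool(FastTool):
    """Hypothetical tool: supplies the current UTC time as context."""

    # Compiled once; matches() runs on every message, so it must stay cheap.
    _PATTERN = re.compile(r"\b(what time|current time)\b", re.IGNORECASE)

    @property
    def name(self) -> str:
        return "clock"

    def matches(self, message: str) -> bool:
        return bool(self._PATTERN.search(message))

    async def run(self, message: str) -> str:
        try:
            now = datetime.now(timezone.utc)
            context = f"[clock] Current UTC time: {now:%Y-%m-%d %H:%M}"
            return context[:1000]  # keep the context block short
        except Exception as exc:
            # Never raise: report the failure as a string instead.
            return f"[tool error: {exc}]"
```

Registering it would then mean adding `ClockTool()` to the `_fast_tool_runner` list in `agent.py`.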
--- a/.claude/rules/llm-inference.md
+++ b/.claude/rules/llm-inference.md
@@ -0,0 +1,8 @@
 # LLM Inference Rules
 - All LLM calls must use `base_url=LITELLM_URL` (points to LiteLLM at `host.docker.internal:4000/v1`). Never call Ollama directly for inference.
 - `_reply_semaphore` (asyncio.Semaphore(1)) serializes all GPU inference. Never bypass it or add a second semaphore.
 - Local Ollama models use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen2.5:1.5b`. Remote models (e.g. OpenRouter) use their full LiteLLM name: `openrouter/deepseek-r1`.
 - Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them.
 - `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional — LiteLLM cannot manage VRAM.
 - Complex tier uses a remote model (`DEEPAGENTS_COMPLEX_MODEL`) — no VRAM management is needed for it.
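
The semaphore and timeout rules above can be illustrated as follows. This is a sketch under stated assumptions: `run_serialized` and `demo` are hypothetical names, and the timeout is enforced with `asyncio.wait_for` here, whereas the diff passes `timeout=` to the model client.

```python
import asyncio
from typing import Awaitable, Callable

# Timeouts in seconds, taken from the rule above.
TIER_TIMEOUTS = {"router": 30, "medium": 180, "complex": 600}


async def run_serialized(
    semaphore: asyncio.Semaphore,
    tier: str,
    infer: Callable[[], Awaitable[str]],
) -> str:
    """Hypothetical wrapper: hold the GPU mutex for one inference call."""
    async with semaphore:
        # Assumption: timeout enforced here rather than in the model client.
        return await asyncio.wait_for(infer(), timeout=TIER_TIMEOUTS[tier])


async def demo() -> tuple[list[str], int]:
    semaphore = asyncio.Semaphore(1)  # mirrors _reply_semaphore
    active = 0
    peak = 0

    async def fake_inference() -> str:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a model call
        active -= 1
        return "ok"

    results = await asyncio.gather(
        run_serialized(semaphore, "medium", fake_inference),
        run_serialized(semaphore, "medium", fake_inference),
    )
    return results, peak
```

Running `asyncio.run(demo())` shows that two concurrent requests never overlap: the peak number of active inferences stays at one.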
--- a/.claude/rules/secrets.md
+++ b/.claude/rules/secrets.md
@@ -0,0 +1,7 @@
 # Secrets and Environment
 - `.env` is required at project root and must never be committed. It is in `.gitignore`.
 - Required keys: `TELEGRAM_BOT_TOKEN`, `ROUTECHECK_TOKEN`, `YANDEX_ROUTING_KEY`.
 - `ROUTECHECK_TOKEN` is a shared secret between `deepagents` and `routecheck` containers — generate once with `python3 -c "import uuid; print(uuid.uuid4())"`.
 - All tokens are stored in Vaultwarden (AI collection). Fetch with `bw get password "<NAME>"` — see `~/.claude/CLAUDE.md` for the full procedure.
 - Do not hardcode tokens, URLs, or credentials anywhere in source code.
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,7 @@
 __pycache__/
 *.pyc
 logs/*.jsonl
 adolf_tuning_data/voice_audio/
 benchmarks/results_latest.json
 benchmarks/voice_results*.json
 benchmarks/voice_audio/
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -1,118 +0,0 @@
 # Adolf
 Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
 ## Architecture
 ```
 ┌─────────────────────────────────────────────────────┐
 │                 CHANNEL ADAPTERS                    │
 │                                                     │
 │  [Telegram/Grammy]   [CLI]   [Voice — future]       │
 │       ↕                ↕            ↕               │
 │       └────────────────┴────────────┘               │
 │                        ↕                            │
 │          ┌─────────────────────────┐                │
 │          │   GATEWAY  (agent.py)   │                │
 │          │   FastAPI  :8000        │                │
 │          │                         │                │
 │          │  POST /message          │  ← all inbound │
 │          │  POST /chat  (legacy)   │                │
 │          │  GET  /reply/{id}  SSE  │  ← CLI polling │
 │          │  GET  /health           │                │
 │          │                         │                │
 │          │  channels.py registry   │                │
 │          │  conversation buffers   │                │
 │          └──────────┬──────────────┘                │
 │                     ↓                               │
 │          ┌──────────────────────┐                   │
 │          │    AGENT CORE        │                   │
 │          │  three-tier routing  │                   │
 │          │  VRAM management     │                   │
 │          └──────────────────────┘                   │
 │                     ↓                               │
 │          channels.deliver(session_id, channel, text)│
 │               ↓                    ↓                │
 │    telegram → POST grammy/send   cli → SSE queue    │
 └─────────────────────────────────────────────────────┘
 ```
 ## Channel Adapters
 | Channel | session_id | Inbound | Outbound |
 |---------|-----------|---------|---------|
 | Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
 | CLI | `cli-<user>` | POST /message directly | GET /reply/{id} SSE stream |
 | Voice | `voice-<device>` | (future) | (future) |
 ## Unified Message Flow
 ```
 1. Channel adapter receives message
 2. POST /message {text, session_id, channel, user_id}
 3. 202 Accepted immediately
 4. Background: run_agent_task(message, session_id, channel)
 5. Route → run agent tier → get reply text
 6. channels.deliver(session_id, channel, reply_text)
   - always puts reply in pending_replies[session_id] queue (for SSE)
   - calls channel-specific send callback
 7. GET /reply/{session_id} SSE clients receive the reply
 ```
 ## Three-Tier Model Routing
 | Tier | Model | VRAM | Trigger | Latency |
 |------|-------|------|---------|---------|
 | Light | qwen2.5:1.5b (router answers) | ~1.2 GB | Router classifies as light | ~2–4s |
 | Medium | qwen3:4b | ~2.5 GB | Default | ~20–40s |
 | Complex | qwen3:8b | ~6.0 GB | `/think` prefix | ~60–120s |
 **`/think` prefix**: forces complex tier, stripped before sending to agent.
 ## VRAM Management
 GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU).
 1. Flush explicitly before loading qwen3:8b (`keep_alive=0`)
 2. Verify eviction via `/api/ps` poll (15s timeout) before proceeding
 3. Fallback: timeout → run medium agent instead
 4. Post-complex: flush 8b, pre-warm 4b + router
 ## Session ID Convention
 - Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
 - CLI: `cli-<username>` (e.g. `cli-alvis`)
 Conversation history is keyed by session_id (5-turn buffer).
 ## Files
 ```
 adolf/
 ├── docker-compose.yml      Services: deepagents, openmemory, grammy
 ├── Dockerfile              deepagents container (Python 3.12)
 ├── agent.py                FastAPI gateway + three-tier routing
 ├── channels.py             Channel registry + deliver() + pending_replies
 ├── router.py               Router class — qwen2.5:1.5b routing
 ├── vram_manager.py         VRAMManager — flush/prewarm/poll Ollama VRAM
 ├── agent_factory.py        build_medium_agent / build_complex_agent
 ├── cli.py                  Interactive CLI REPL client
 ├── wiki_research.py        Batch wiki research pipeline (uses /message + SSE)
 ├── .env                    TELEGRAM_BOT_TOKEN (not committed)
 ├── openmemory/
 │   ├── server.py           FastMCP + mem0 MCP tools
 │   └── Dockerfile
 └── grammy/
    ├── bot.mjs             grammY Telegram bot + POST /send HTTP endpoint
    ├── package.json
    └── Dockerfile
 ```
 ## External Services (from openai/ stack)
 | Service | Host Port | Role |
 |---------|-----------|------|
 | Ollama GPU | 11436 | All reply inference |
 | Ollama CPU | 11435 | Memory embedding (nomic-embed-text) |
 | Qdrant | 6333 | Vector store for memories |
 | SearXNG | 11437 | Web search |
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,41 @@
 # CLAUDE.md
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 ## Commands
 ```bash
 # Start all services
 docker compose up --build
 # Interactive CLI (requires services running)
 docker compose --profile tools run --rm -it cli
 # Integration tests — run from tests/integration/, require all services up
 python3 test_health.py
 python3 test_memory.py [--name-only|--bench-only|--dedup-only]
 python3 test_routing.py [--easy-only|--medium-only|--hard-only]
 # Use case tests — read the .md file and follow its steps as Claude Code
 # example: read tests/use_cases/weather_now.md and execute it
 # Routing benchmark — measures tier classification accuracy across 120 queries
 # Run from benchmarks/ — Adolf must be running. DO NOT run during active use (holds GPU).
 cd benchmarks
 python3 run_benchmark.py                       # full run (120 queries)
 python3 run_benchmark.py --tier light          # light tier only (30 queries)
 python3 run_benchmark.py --tier medium         # medium tier only (50 queries)
 python3 run_benchmark.py --tier complex --dry-run  # complex tier, medium model (no API cost)
 python3 run_benchmark.py --category smart_home_control
 python3 run_benchmark.py --ids 1,2,3
 python3 run_benchmark.py --list-categories
 # Voice benchmark
 python3 run_voice_benchmark.py
 # results_latest.json is gitignored; benchmark.json (dataset) is now tracked
 ```
 ## Architecture
@README.md
--- a/Dockerfile
+++ b/Dockerfile
@@ -2,9 +2,9 @@ FROM python:3.12-slim
 WORKDIR /app
-RUN pip install --no-cache-dir deepagents langchain-ollama langgraph \
+RUN pip install --no-cache-dir deepagents langchain-openai langgraph \
    fastapi uvicorn langchain-mcp-adapters langchain-community httpx
-COPY agent.py channels.py vram_manager.py router.py agent_factory.py hello_world.py .
+COPY agent.py channels.py vram_manager.py router.py agent_factory.py fast_tools.py hello_world.py ./
 CMD ["uvicorn", "agent:app", "--host", "0.0.0.0", "--port", "8000"]
--- a/Dockerfile.cli
+++ b/Dockerfile.cli
@@ -0,0 +1,9 @@
 FROM python:3.12-slim
 WORKDIR /app
 RUN pip install --no-cache-dir rich
 COPY cli.py .
 CMD ["python3", "cli.py"]
--- a/README.md
+++ b/README.md
@@ -0,0 +1,208 @@
 # Adolf
 Autonomous personal assistant with a multi-channel gateway. Three-tier model routing with GPU VRAM management.
 ## Architecture
 ```
 ┌─────────────────────────────────────────────────────┐
 │                 CHANNEL ADAPTERS                    │
 │                                                     │
 │  [Telegram/Grammy]   [CLI]   [Voice — future]       │
 │       ↕                ↕            ↕               │
 │       └────────────────┴────────────┘               │
 │                        ↕                            │
 │          ┌─────────────────────────┐                │
 │          │   GATEWAY  (agent.py)   │                │
 │          │   FastAPI  :8000        │                │
 │          │                         │                │
 │          │  POST /message          │  ← all inbound │
 │          │  POST /chat  (legacy)   │                │
 │          │  GET  /stream/{id} SSE  │  ← token stream│
 │          │  GET  /reply/{id}  SSE  │  ← legacy poll │
 │          │  GET  /health           │                │
 │          │                         │                │
 │          │  channels.py registry   │                │
 │          │  conversation buffers   │                │
 │          └──────────┬──────────────┘                │
 │                     ↓                               │
 │          ┌──────────────────────┐                   │
 │          │    AGENT CORE        │                   │
 │          │  three-tier routing  │                   │
 │          │  VRAM management     │                   │
 │          └──────────────────────┘                   │
 │                     ↓                               │
 │          channels.deliver(session_id, channel, text)│
 │               ↓                    ↓                │
 │    telegram → POST grammy/send   cli → SSE queue    │
 └─────────────────────────────────────────────────────┘
 ```
 ## Channel Adapters
 | Channel | session_id | Inbound | Outbound |
 |---------|-----------|---------|---------|
 | Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
 | CLI | `cli-<user>` | POST /message directly | GET /stream/{id} SSE — Rich Live streaming |
 | Voice | `voice-<device>` | (future) | (future) |
 ## Unified Message Flow
 ```
 1. Channel adapter receives message
 2. POST /message {text, session_id, channel, user_id}
 3. 202 Accepted immediately
 4. Background: run_agent_task(message, session_id, channel)
 5. Parallel IO (asyncio.gather):
   a. _fetch_urls_from_message()       — Crawl4AI fetches any URLs in message
   b. _retrieve_memories()             — openmemory semantic search for context
   c. _fast_tool_runner.run_matching() — FastTools (weather, commute) if pattern matches
 6. router.route() with enriched history (url_context + fast_context + memories)
   - fast tool match → force medium (real-time data, no point routing to light)
   - if URL content fetched and tier=light → upgrade to medium
 7. Invoke agent for tier with url_context + memories in system prompt
 8. Token streaming:
   - medium: astream() pushes per-token chunks to _stream_queues[session_id]; <think> blocks filtered in real time
   - light/complex: full reply pushed as single chunk after completion
   - _end_stream() sends [DONE] sentinel
 9. channels.deliver(session_id, channel, reply_text) — Telegram callback
 10. _store_memory() background task — stores turn in openmemory
 11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown
 ```
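
Step 5 of the flow above can be sketched like this. The three coroutines are stubs standing in for `_fetch_urls_from_message()`, `_retrieve_memories()` and `_fast_tool_runner.run_matching()`; their bodies and the `preflight` name are assumptions for illustration only.

```python
import asyncio


async def fetch_urls(message: str) -> str:
    """Stub for the Crawl4AI URL fetch."""
    await asyncio.sleep(0.01)
    return "url context" if "http" in message else ""


async def retrieve_memories(message: str) -> str:
    """Stub for the openmemory semantic search."""
    await asyncio.sleep(0.01)
    return "memories"


async def run_fast_tools(message: str) -> str:
    """Stub for the fast tool runner; empty when nothing matches."""
    await asyncio.sleep(0.01)
    return ""


async def preflight(message: str) -> dict[str, str]:
    # The three lookups are independent, so they run concurrently.
    # gather() returns results in the order the awaitables were passed.
    url_context, memories, fast_context = await asyncio.gather(
        fetch_urls(message),
        retrieve_memories(message),
        run_fast_tools(message),
    )
    return {
        "url_context": url_context,
        "memories": memories,
        "fast_context": fast_context,
    }
```

Because the calls overlap, the pre-flight phase takes roughly as long as its slowest lookup rather than the sum of all three.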
 ## Tool Handling
 Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime.
 **Complex agent:** `web_search` and `fetch_url` are defined as `langchain_core.tools.Tool` objects and passed to `create_deep_agent()`. The deepagents library runs an agentic loop (LangGraph `create_react_agent` under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch.
 **Medium agent (default):** `_DirectModel` makes a single `model.ainvoke(messages)` call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — `qwen3:4b` behaves unreliably when a tool array is present.
 **Memory tools (out-of-loop):** `add_memory` and `search_memory` are LangChain MCP tool objects (via `langchain_mcp_adapters`) but are excluded from both agents' tool lists. They are called directly — `await _memory_add_tool.ainvoke(...)` — outside the agent loop, before and after each turn.
 ## Three-Tier Model Routing
 | Tier | Model | Agent | Trigger | Latency |
 |------|-------|-------|---------|---------|
 | Light | `qwen2.5:1.5b` (router answers directly) | — | Regex pre-match or 3-way embedding classifies "light" | ~2–4s |
 | Medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | `_DirectModel` — single LLM call, no tools | Default; also forced when message contains URLs | ~10–20s |
 | Complex | `deepseek/deepseek-r1:free` via LiteLLM (`DEEPAGENTS_COMPLEX_MODEL`) | `create_deep_agent` — agentic loop with tools | Auto-classified by embedding similarity | ~30–90s |
 Routing is fully automatic via 3-way cosine similarity over pre-embedded utterance centroids (light / medium / complex). No prefix required. Use `adolf-deep` model name to force complex tier via API.
 Complex tier is reached automatically for deep research queries such as `исследуй` ("research"), `изучи все` ("study everything"), or `напиши подробный` ("write a detailed"), via the regex pre-classifier and embedding similarity.
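
A minimal sketch of 3-way cosine classification against per-tier centroids. The toy 3-dimensional vectors and the `classify` helper are assumptions for illustration; in practice the centroids would be derived from embedded example utterances, and the actual code in `router.py` is not shown in this diff.

```python
import math
from typing import Mapping, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(query: Sequence[float], centroids: Mapping[str, Sequence[float]]) -> str:
    """Pick the tier whose centroid is most similar to the query embedding."""
    return max(centroids, key=lambda tier: cosine(query, centroids[tier]))


# Toy centroids, one per tier (hypothetical values).
CENTROIDS = {
    "light": [1.0, 0.0, 0.0],
    "medium": [0.0, 1.0, 0.0],
    "complex": [0.0, 0.0, 1.0],
}
```

With these centroids, a query vector such as `[0.1, 0.2, 0.9]` lands in the complex tier because it points mostly along the third axis.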
 ## Fast Tools (`fast_tools.py`)
 Pre-flight tools that run concurrently with URL fetch and memory retrieval before any LLM call. Each tool has two methods:
 - `matches(message) → bool` — regex classifier; also used by `Router` to force medium tier
 - `run(message) → str` — async fetch returning a context block injected into system prompt
 `FastToolRunner` holds all tools. `any_matches()` is called by the Router at step 0a; `run_matching()` is called in the pre-flight `asyncio.gather` in `run_agent_task()`.
 | Tool | Pattern | Source | Context returned |
 |------|---------|--------|-----------------|
 | `WeatherTool` | weather/forecast/temperature/snow/rain | SearXNG `"погода Балашиха сейчас"` | Current conditions in °C from Russian weather sites |
 | `CommuteTool` | commute/traffic/arrival/пробки | `routecheck:8090/api/route` (Yandex Routing API) | Drive time with/without traffic, Balashikha→Moscow |
 **To add a new fast tool:** subclass `FastTool` in `fast_tools.py`, implement `name`/`matches`/`run`, add an instance to `_fast_tool_runner` in `agent.py`.
 ## routecheck Service (`routecheck/`)
 Local web service on port 8090. Exists because Yandex Routing API free tier requires a web UI that uses the API.
 **Web UI** (`http://localhost:8090`): PIL-generated arithmetic captcha → lat/lon form → travel time result.
 **Internal API**: `GET /api/route?from=lat,lon&to=lat,lon&token=ROUTECHECK_TOKEN` — bypasses captcha, used by `CommuteTool`. The `ROUTECHECK_TOKEN` shared secret is set in `.env` and passed to both `routecheck` and `deepagents` containers.
 Yandex API calls are routed through the host HTTPS proxy (`host.docker.internal:56928`) since the container has no direct external internet access.
 **Requires** `.env`: `YANDEX_ROUTING_KEY` (free from `developer.tech.yandex.ru`) + `ROUTECHECK_TOKEN`.
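
A sketch of how a caller might build the internal API URL described above. `build_route_url` is a hypothetical helper, and the coordinates and token value in the usage note are placeholders; a real token would come from the environment, never from source code.

```python
from urllib.parse import urlencode


def build_route_url(
    base_url: str,
    origin: tuple[float, float],
    destination: tuple[float, float],
    token: str,
) -> str:
    """Hypothetical helper: format a /api/route request URL."""
    params = {
        "from": f"{origin[0]},{origin[1]}",
        "to": f"{destination[0]},{destination[1]}",
        "token": token,
    }
    # safe="," keeps the lat,lon separator unescaped, as in the documented format.
    return f"{base_url}/api/route?{urlencode(params, safe=',')}"
```

Calling `build_route_url("http://routecheck:8090", (55.8, 37.9), (55.75, 37.62), "example-token")` produces `http://routecheck:8090/api/route?from=55.8,37.9&to=55.75,37.62&token=example-token`.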
 ## Crawl4AI Integration
 Crawl4AI runs as a Docker service (`crawl4ai:11235`) providing JS-rendered, bot-bypass page fetching.
 **Pre-routing fetch (all tiers):**
 - `_URL_RE` detects `https?://` URLs in any incoming message
 - `_crawl4ai_fetch_async()` uses `httpx.AsyncClient` to POST `{urls: [...]}` to `/crawl`
 - Up to 3 URLs fetched concurrently via `asyncio.gather`
 - Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts
 - If fetch succeeds and router returns light → tier upgraded to medium
 **Complex agent tools:**
 - `web_search`: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page text
 - `fetch_url`: Crawl4AI single-URL fetch for any specific URL
 ## Memory Pipeline
 openmemory runs as a FastMCP server (`openmemory:8765`) backed by mem0 + Qdrant + nomic-embed-text.
 **Retrieval (before routing):** `_retrieve_memories()` calls `search_memory` MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit.
 **Storage (after reply):** `_store_memory()` runs as an asyncio background task, calling `add_memory` with `"User: ...\nAssistant: ..."`. The extraction LLM (`qwen2.5:1.5b` on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt.
 Memory tools (`add_memory`, `search_memory`, `get_all_memories`) are excluded from agent tool lists — memory management happens outside the agent loop.
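
The retrieval threshold and the storage format above can be sketched as two small helpers. The result shape (dicts with `memory` and `score` keys) and both function names are assumptions; the actual MCP tool output is not shown in this diff.

```python
from typing import Mapping, Sequence


def filter_memories(
    results: Sequence[Mapping[str, object]], threshold: float = 0.5
) -> list[str]:
    """Keep memory texts whose similarity score meets the threshold."""
    return [
        str(item["memory"])
        for item in results
        if float(item.get("score", 0.0)) >= threshold
    ]


def format_turn(user_text: str, assistant_text: str) -> str:
    """Build the text handed to add_memory for one conversation turn."""
    return f"User: {user_text}\nAssistant: {assistant_text}"
```

Results scoring below 0.5 are dropped before the remaining memories are prepended to the enriched history.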
 ## VRAM Management
 GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU).
 1. Flush explicitly before loading qwen3:8b (`keep_alive=0`)
 2. Verify eviction via `/api/ps` poll (15s timeout) before proceeding
 3. Fallback: timeout → run medium agent instead
 4. Post-complex: flush 8b, pre-warm medium + router
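
Steps 2 and 3 above amount to a bounded poll. In this sketch `list_loaded` is an injected coroutine standing in for a query to Ollama's `/api/ps`, and `wait_for_eviction` is a hypothetical name; the real logic belongs to `VRAMManager`, which is not shown in this diff.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_eviction(
    model: str,
    list_loaded: Callable[[], Awaitable[list[str]]],
    timeout: float = 15.0,
    interval: float = 0.5,
) -> bool:
    """Poll until `model` is no longer loaded; False means the poll timed out."""
    deadline = time.monotonic() + timeout
    while True:
        if model not in await list_loaded():
            return True
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(interval)
```

A caller would treat a `False` result as the signal to fall back to the medium agent instead of loading the larger model.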
 ## Session ID Convention
 - Telegram: `tg-<chat_id>` (e.g. `tg-346967270`)
 - CLI: `cli-<username>` (e.g. `cli-alvis`)
 Conversation history is keyed by session_id (5-turn buffer).
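
A sketch of a 5-turn buffer keyed by session ID. `MAX_HISTORY_TURNS` and `_conversation_buffers` match names in `agent.py`; `append_turn`, the message dict shape, and counting one turn as a user plus assistant pair are assumptions.

```python
MAX_HISTORY_TURNS = 5
_conversation_buffers: dict[str, list] = {}


def append_turn(session_id: str, user_text: str, assistant_text: str) -> list:
    """Hypothetical helper: record one turn and trim the session's history."""
    buffer = _conversation_buffers.setdefault(session_id, [])
    buffer.append({"role": "user", "content": user_text})
    buffer.append({"role": "assistant", "content": assistant_text})
    # Keep only the newest MAX_HISTORY_TURNS turns (two messages per turn).
    del buffer[: -MAX_HISTORY_TURNS * 2]
    return buffer
```

After seven turns for `cli-alvis`, the buffer holds ten messages and the oldest two turns are gone.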
 ## Files
 ```
 adolf/
 ├── docker-compose.yml      Services: deepagents, openmemory, grammy, crawl4ai, routecheck, cli
 ├── Dockerfile              deepagents container (Python 3.12)
 ├── Dockerfile.cli          CLI container (python:3.12-slim + rich)
 ├── agent.py                FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, fast tools, memory pipeline
 ├── fast_tools.py           FastTool base, FastToolRunner, WeatherTool, CommuteTool
 ├── channels.py             Channel registry + deliver() + pending_replies
 ├── router.py               Router class — regex + LLM tier classification, FastToolRunner integration
 ├── vram_manager.py         VRAMManager — flush/prewarm/poll Ollama VRAM
 ├── agent_factory.py        _DirectModel (medium) / create_deep_agent (complex)
 ├── cli.py                  Interactive CLI REPL — Rich Live streaming + Markdown render
 ├── wiki_research.py        Batch wiki research pipeline (uses /message + SSE)
 ├── benchmarks/
 │   ├── run_benchmark.py    Routing accuracy benchmark — 120 queries across 3 tiers
 │   ├── run_voice_benchmark.py  Voice path benchmark
 │   ├── benchmark.json      Query dataset (tracked in git)
 │   └── results_latest.json Last run results (gitignored)
 ├── .env                    TELEGRAM_BOT_TOKEN, ROUTECHECK_TOKEN, YANDEX_ROUTING_KEY (not committed)
 ├── routecheck/
 │   ├── app.py              FastAPI: image captcha + /api/route Yandex proxy
 │   └── Dockerfile
 ├── tests/
 │   ├── integration/        Standalone integration test scripts (common.py + test_*.py)
 │   └── use_cases/          Claude Code skill markdown files — Claude acts as user + evaluator
 ├── openmemory/
 │   ├── server.py           FastMCP + mem0: add_memory, search_memory, get_all_memories
 │   └── Dockerfile
 └── grammy/
    ├── bot.mjs             grammY Telegram bot + POST /send HTTP endpoint
    ├── package.json
    └── Dockerfile
 ```
 ## External Services (host ports, from openai/ stack)
 | Service | Host Port | Role |
 |---------|-----------|------|
 | LiteLLM | 4000 | LLM proxy — all inference goes through here (`LITELLM_URL` env var) |
 | Ollama GPU | 11436 | GPU inference backend + VRAM management (direct) + memory extraction |
 | Ollama CPU | 11435 | nomic-embed-text embeddings for openmemory |
 | Langfuse | 3200 | LLM observability — traces all requests via LiteLLM callbacks |
 | Qdrant | 6333 | Vector store for memories |
 | SearXNG | 11437 | Web search (used by `web_search` tool) |
--- a/agent.py
+++ b/agent.py
@@ -1,7 +1,9 @@
 import asyncio
 import json as _json_module
 import os
 import time
-from contextlib import asynccontextmanager
+from contextlib import asynccontextmanager, nullcontext
 from pathlib import Path
 from fastapi import FastAPI, BackgroundTasks, Request
 from fastapi.responses import JSONResponse, StreamingResponse
@@ -10,7 +12,14 @@ from pydantic import BaseModel
 import re as _re
 import httpx as _httpx
-from langchain_ollama import ChatOllama
+_URL_RE = _re.compile(r'https?://[^\s<>"\']+')
 def _extract_urls(text: str) -> list[str]:
    return _URL_RE.findall(text)
 from openai import AsyncOpenAI
 from langchain_openai import ChatOpenAI
 from langchain_mcp_adapters.client import MultiServerMCPClient
 from langchain_community.utilities import SearxSearchWrapper
 from langchain_core.tools import Tool
@@ -18,23 +27,120 @@ from langchain_core.tools import Tool
 from vram_manager import VRAMManager
 from router import Router
 from agent_factory import build_medium_agent, build_complex_agent
 from fast_tools import FastToolRunner, WeatherTool, CommuteTool
 import channels
 # LiteLLM proxy — all LLM inference goes through here
 LITELLM_URL = os.getenv("LITELLM_URL", "http://host.docker.internal:4000/v1")
 LITELLM_API_KEY = os.getenv("LITELLM_API_KEY", "dummy")
 # Direct Ollama URL — used only by VRAMManager for flush/prewarm/poll
 OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
-ROUTER_MODEL = os.getenv("DEEPAGENTS_ROUTER_MODEL", "qwen2.5:0.5b")
+
 ROUTER_MODEL = os.getenv("DEEPAGENTS_ROUTER_MODEL", "qwen2.5:1.5b")
 MEDIUM_MODEL = os.getenv("DEEPAGENTS_MODEL", "qwen3:4b")
 COMPLEX_MODEL = os.getenv("DEEPAGENTS_COMPLEX_MODEL", "qwen3:8b")
 SEARXNG_URL = os.getenv("SEARXNG_URL", "http://host.docker.internal:11437")
 OPENMEMORY_URL = os.getenv("OPENMEMORY_URL", "http://openmemory:8765")
 CRAWL4AI_URL = os.getenv("CRAWL4AI_URL", "http://crawl4ai:11235")
 ROUTECHECK_URL = os.getenv("ROUTECHECK_URL", "http://routecheck:8090")
 ROUTECHECK_TOKEN = os.getenv("ROUTECHECK_TOKEN", "")
 MAX_HISTORY_TURNS = 5
 _conversation_buffers: dict[str, list] = {}
 # ── Interaction logging (RLHF data collection) ─────────────────────────────────
 _LOG_DIR = Path(os.getenv("ADOLF_LOG_DIR", "/app/logs"))
 _INTERACTIONS_LOG = _LOG_DIR / "interactions.jsonl"
 def _ensure_log_dir() -> None:
    try:
        _LOG_DIR.mkdir(parents=True, exist_ok=True)
    except Exception as e:
        print(f"[log] cannot create log dir {_LOG_DIR}: {e}", flush=True)
 async def _log_interaction(
    session_id: str,
    channel: str,
    tier: str,
    input_text: str,
    response_text: str | None,
    latency_ms: int,
    metadata: dict | None = None,
 ) -> None:
    """Append one interaction record to the JSONL log for future RLHF/finetuning."""
    record = {
        "ts": time.time(),
        "session_id": session_id,
        "channel": channel,
        "tier": tier,
        "input": input_text,
        "output": response_text or "",
        "latency_ms": latency_ms,
    }
    if metadata:
        record["metadata"] = metadata
    try:
        _ensure_log_dir()
        with open(_INTERACTIONS_LOG, "a", encoding="utf-8") as f:
            f.write(_json_module.dumps(record, ensure_ascii=False) + "\n")
    except Exception as e:
        print(f"[log] write error: {e}", flush=True)
 # Per-session streaming queues — filled during inference, read by /stream/{session_id}
 _stream_queues: dict[str, asyncio.Queue] = {}
 async def _push_stream_chunk(session_id: str, chunk: str) -> None:
    q = _stream_queues.setdefault(session_id, asyncio.Queue())
    await q.put(chunk)
 async def _end_stream(session_id: str) -> None:
    q = _stream_queues.setdefault(session_id, asyncio.Queue())
    await q.put("[DONE]")
 async def _crawl4ai_fetch_async(url: str) -> str:
    """Async fetch via Crawl4AI — JS-rendered, bot-bypass, returns clean markdown."""
    try:
        async with _httpx.AsyncClient(timeout=60) as client:
            r = await client.post(f"{CRAWL4AI_URL}/crawl", json={"urls": [url]})
            r.raise_for_status()
            results = r.json().get("results", [])
            if not results or not results[0].get("success"):
                return ""
            md_obj = results[0].get("markdown") or {}
            md = md_obj.get("raw_markdown") if isinstance(md_obj, dict) else str(md_obj)
            return (md or "")[:5000]
    except Exception as e:
        return f"[fetch error: {e}]"
 async def _fetch_urls_from_message(message: str) -> str:
    """If message contains URLs, fetch their content concurrently via Crawl4AI.
    Returns a formatted context block, or '' if no URLs or all fetches fail."""
    urls = _extract_urls(message)
    if not urls:
        return ""
    # Fetch up to 3 URLs concurrently
    results = await asyncio.gather(*[_crawl4ai_fetch_async(u) for u in urls[:3]])
    parts = []
    for url, content in zip(urls[:3], results):
        if content and not content.startswith("[fetch error"):
            parts.append(f"### {url}\n{content[:3000]}")
    if not parts:
        return ""
    return "User's message contains URLs. Fetched content:\n\n" + "\n\n".join(parts)
 # /no_think at the start of the system prompt disables qwen3 chain-of-thought.
 # create_deep_agent prepends our system_prompt before BASE_AGENT_PROMPT, so
 # /no_think lands at position 0 and is respected by qwen3 models via Ollama.
 MEDIUM_SYSTEM_PROMPT = (
-    "You are a helpful AI assistant. "
+    "You are a helpful AI assistant. Reply concisely. "
-    "Use web_search for questions about current events or facts you don't know. "
+    "If asked to remember a fact or name, simply confirm: 'Got it, I'll remember that.'"
    "Reply concisely."
 )
 COMPLEX_SYSTEM_PROMPT = (
@@ -49,11 +155,20 @@ COMPLEX_SYSTEM_PROMPT = (
    "NEVER invent URLs. End with: **Sources checked: N**"
 )
 medium_model = None
 medium_agent = None
 complex_agent = None
 router: Router = None
 vram_manager: VRAMManager = None
 mcp_client = None
 _memory_add_tool = None
 _memory_search_tool = None
 # Fast tools run before the LLM — classifier + context enricher
 _fast_tool_runner = FastToolRunner([
    WeatherTool(),
    CommuteTool(routecheck_url=ROUTECHECK_URL, internal_token=ROUTECHECK_TOKEN),
 ])
 # GPU mutex: one LLM inference at a time
 _reply_semaphore = asyncio.Semaphore(1)
@@ -61,25 +176,37 @@ _reply_semaphore = asyncio.Semaphore(1)
@asynccontextmanager
 async def lifespan(app: FastAPI):
-    global medium_agent, complex_agent, router, vram_manager, mcp_client
+    global medium_model, medium_agent, complex_agent, router, vram_manager, mcp_client, \
        _memory_add_tool, _memory_search_tool
    # Register channel adapters
    channels.register_defaults()
-    # Three model instances
+    # All three models go through the LiteLLM proxy (LITELLM_URL).
-    router_model = ChatOllama(
+    router_model = ChatOpenAI(
-        model=ROUTER_MODEL, base_url=OLLAMA_BASE_URL, think=False, num_ctx=4096,
+        model=f"ollama/{ROUTER_MODEL}",
        base_url=LITELLM_URL,
        api_key=LITELLM_API_KEY,
        temperature=0,
        timeout=30,
    )
-    medium_model = ChatOllama(
+    embedder = AsyncOpenAI(base_url=LITELLM_URL, api_key=LITELLM_API_KEY)
-        model=MEDIUM_MODEL, base_url=OLLAMA_BASE_URL, think=False, num_ctx=8192
+    medium_model = ChatOpenAI(
        model=f"ollama/{MEDIUM_MODEL}",
        base_url=LITELLM_URL,
        api_key=LITELLM_API_KEY,
        timeout=180,
    )
-    complex_model = ChatOllama(
+    complex_model = ChatOpenAI(
-        model=COMPLEX_MODEL, base_url=OLLAMA_BASE_URL, think=True, num_ctx=16384
+        model=COMPLEX_MODEL,  # full model name — may be remote (OpenRouter) or local ollama/*
        base_url=LITELLM_URL,
        api_key=LITELLM_API_KEY,
        timeout=600,
    )
    vram_manager = VRAMManager(base_url=OLLAMA_BASE_URL)
-    router = Router(model=router_model)
+    router = Router(model=router_model, embedder=embedder, fast_tool_runner=_fast_tool_runner)
    await router.initialize()
    mcp_connections = {
        "openmemory": {"transport": "sse", "url": f"{OPENMEMORY_URL}/sse"},
@@ -97,6 +224,13 @@ async def lifespan(app: FastAPI):
    agent_tools = [t for t in mcp_tools if t.name not in ("add_memory", "search_memory", "get_all_memories")]
    # Expose memory tools directly so run_agent_task can call them outside the agent loop
    for t in mcp_tools:
        if t.name == "add_memory":
            _memory_add_tool = t
        elif t.name == "search_memory":
            _memory_search_tool = t
    searx = SearxSearchWrapper(searx_host=SEARXNG_URL)
    def _crawl4ai_fetch(url: str) -> str:
@@ -187,13 +321,15 @@ async def lifespan(app: FastAPI):
    )
    print(
-        f"[agent] three-tier: router={ROUTER_MODEL} | medium={MEDIUM_MODEL} | complex={COMPLEX_MODEL}",
+        f"[agent] litellm={LITELLM_URL} | router=semantic(ollama/{ROUTER_MODEL}+nomic-embed-text) | "
        f"medium=ollama/{MEDIUM_MODEL} | complex={COMPLEX_MODEL}",
        flush=True,
    )
    print(f"[agent] agent tools: {[t.name for t in agent_tools]}", flush=True)
    yield
    medium_model = None
    medium_agent = None
    complex_agent = None
    router = None
@@ -222,13 +358,19 @@ class ChatRequest(BaseModel):
 # ── helpers ────────────────────────────────────────────────────────────────────
 def _strip_think(text: str) -> str:
    """Strip qwen3 chain-of-thought blocks that appear inline in content
    when using Ollama's OpenAI-compatible endpoint (/v1/chat/completions)."""
    return _re.sub(r"<think>.*?</think>", "", text, flags=_re.DOTALL).strip()
 def _extract_final_text(result) -> str | None:
    msgs = result.get("messages", [])
    for m in reversed(msgs):
        if type(m).__name__ == "AIMessage" and getattr(m, "content", ""):
-            return m.content
+            return _strip_think(m.content)
    if isinstance(result, dict) and result.get("output"):
-        return result["output"]
+        return _strip_think(result["output"])
    return None
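A minimal sketch of what `_strip_think` does to typical model output, assuming the same regex as in the hunk above. The name `strip_think` and the sample strings are illustrative only. One caveat worth noting: an unclosed `<think>` block (for example, output cut off mid-reasoning) does not match the pattern and passes through unchanged.

```python
import re


def strip_think(text: str) -> str:
    """Remove complete <think>...</think> blocks, then trim whitespace."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


# A closed block spanning several lines is removed entirely.
closed = strip_think("<think>plan\nsteps</think>\nThe answer is 4.")

# An unclosed block is NOT matched, so the text is returned as-is.
unclosed = strip_think("<think>still reasoning when output was cut off")

# The non-greedy match removes each block separately.
multiple = strip_think("<think>a</think>Hello <think>b</think>world")

print(closed)
print(unclosed)
print(multiple)
```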
@@ -244,60 +386,176 @@ def _log_messages(result):
            print(f"[agent]   {role} → {tc['name']}({tc['args']})", flush=True)
-# ── core task ──────────────────────────────────────────────────────────────────
+# ── memory helpers ─────────────────────────────────────────────────────────────
-async def run_agent_task(message: str, session_id: str, channel: str = "telegram"):
+def _resolve_user_id(session_id: str) -> str:
-    print(f"[agent] queued: {message[:80]!r} chat={session_id}", flush=True)
+    """Map any session_id to a canonical user identity for openmemory.
    All channels share the same memory pool for the single user."""
    return "alvis"
-    force_complex = False
-    clean_message = message
-    if message.startswith("/think "):
-        force_complex = True
-        clean_message = message[len("/think "):]
-        print("[agent] /think prefix → force_complex=True", flush=True)
-    async with _reply_semaphore:
+async def _store_memory(session_id: str, user_msg: str, assistant_reply: str) -> None:
    """Store a conversation turn in openmemory (runs as a background task)."""
    if _memory_add_tool is None:
        return
    t0 = time.monotonic()
    try:
        text = f"User: {user_msg}\nAssistant: {assistant_reply}"
        user_id = _resolve_user_id(session_id)
        await _memory_add_tool.ainvoke({"text": text, "user_id": user_id})
        print(f"[memory] stored in {time.monotonic() - t0:.1f}s", flush=True)
    except Exception as e:
        print(f"[memory] error: {e}", flush=True)
 async def _retrieve_memories(message: str, session_id: str) -> str:
    """Search openmemory for relevant context. Returns formatted string or ''."""
    if _memory_search_tool is None:
        return ""
    try:
        user_id = _resolve_user_id(session_id)
        result = await _memory_search_tool.ainvoke({"query": message, "user_id": user_id})
        if result and result.strip() and result.strip() != "[]":
            return f"Relevant memories:\n{result}"
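    except Exception as e:
        # Best-effort: a failed memory search must not block the reply, but log it
        print(f"[memory] search error: {e}", flush=True)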
    return ""
 # ── core pipeline ──────────────────────────────────────────────────────────────
 from typing import AsyncGenerator
 async def _run_agent_pipeline(
    message: str,
    history: list[dict],
    session_id: str,
    tier_override: str | None = None,
    no_inference: bool = False,
    tier_capture: list | None = None,
 ) -> AsyncGenerator[str, None]:
    """Core pipeline: pre-flight → routing → inference. Yields text chunks.
    tier_override: "light" | "medium" | "complex" | None (auto-route)
    no_inference: if True, routing decision is still made but inference is skipped — yields "I don't know" immediately
    Caller is responsible for scheduling _store_memory after consuming all chunks.
    """
    async with (nullcontext() if no_inference else _reply_semaphore):
        t0 = time.monotonic()
-        history = _conversation_buffers.get(session_id, [])
+        clean_message = message
        print(f"[agent] running: {clean_message[:80]!r}", flush=True)
-        tier, light_reply = await router.route(clean_message, history, force_complex)
+        # Fetch URL content, memories, and fast-tool context concurrently
-        print(f"[agent] tier={tier} message={clean_message[:60]!r}", flush=True)
+        # Skip preflight IO in no_inference mode — only routing decision needed
        if no_inference:
            url_context = memories = fast_context = None
        else:
            url_context, memories, fast_context = await asyncio.gather(
                _fetch_urls_from_message(clean_message),
                _retrieve_memories(clean_message, session_id),
                _fast_tool_runner.run_matching(clean_message),
            )
            if url_context:
                print(f"[agent] crawl4ai: {len(url_context)} chars fetched", flush=True)
            if fast_context:
                names = _fast_tool_runner.matching_names(clean_message)
                print(f"[agent] fast_tools={names}: {len(fast_context)} chars injected", flush=True)
        # Build enriched history
        enriched_history = list(history)
        if url_context:
            enriched_history = [{"role": "system", "content": url_context}] + enriched_history
        if fast_context:
            enriched_history = [{"role": "system", "content": fast_context}] + enriched_history
        if memories:
            enriched_history = [{"role": "system", "content": memories}] + enriched_history
        final_text = None
-        try:
+        llm_elapsed = 0.0
-            if tier == "light":
-                final_text = light_reply
-                llm_elapsed = time.monotonic() - t0
-                print(f"[agent] light path: answered by router", flush=True)
-            elif tier == "medium":
+        try:
-                system_prompt = MEDIUM_SYSTEM_PROMPT
+            # Short-circuit: fast tool already has the answer
-                result = await medium_agent.ainvoke({
+            if fast_context and tier_override is None and not url_context and not no_inference:
-                    "messages": [
+                tier = "fast"
                final_text = fast_context
                llm_elapsed = time.monotonic() - t0
                names = _fast_tool_runner.matching_names(clean_message)
                print(f"[agent] tier=fast tools={names} — delivering directly", flush=True)
                yield final_text
            else:
                # Determine tier
                if tier_override in ("light", "medium", "complex"):
                    tier = tier_override
                    light_reply = None
                    if tier_override == "light":
                        tier, light_reply = await router.route(clean_message, enriched_history, no_inference=no_inference)
                        tier = "light"
                else:
                    tier, light_reply = await router.route(clean_message, enriched_history, no_inference=no_inference)
                    if url_context and tier == "light":
                        tier = "medium"
                        light_reply = None
                        print("[agent] URL in message → upgraded light→medium", flush=True)
                print(f"[agent] tier={tier} message={clean_message[:60]!r}", flush=True)
                if tier_capture is not None:
                    tier_capture.append(tier)
                if no_inference:
                    yield "I don't know"
                    return
                if tier == "light":
                    final_text = light_reply
                    llm_elapsed = time.monotonic() - t0
                    print("[agent] light path: answered by router", flush=True)
                    yield final_text
                elif tier == "medium":
                    system_prompt = MEDIUM_SYSTEM_PROMPT
                    if memories:
                        system_prompt += "\n\n" + memories
                    if url_context:
                        system_prompt += "\n\n" + url_context
                    if fast_context:
                        system_prompt += "\n\nLive web search results (use these to answer):\n\n" + fast_context
                    in_think = False
                    response_parts = []
                    async for chunk in medium_model.astream([
                        {"role": "system", "content": system_prompt},
                        *history,
                        {"role": "user", "content": clean_message},
-                    ]
+                    ]):
-                })
+                        token = chunk.content or ""
-                llm_elapsed = time.monotonic() - t0
+                        if not token:
-                _log_messages(result)
+                            continue
-                final_text = _extract_final_text(result)
+                        if in_think:
                            if "</think>" in token:
                                in_think = False
                                after = token.split("</think>", 1)[1]
                                if after:
                                    yield after
                                    response_parts.append(after)
                        else:
                            if "<think>" in token:
                                in_think = True
                                before = token.split("<think>", 1)[0]
                                if before:
                                    yield before
                                    response_parts.append(before)
                            else:
                                yield token
                                response_parts.append(token)
-            else:  # complex
+                    llm_elapsed = time.monotonic() - t0
-                ok = await vram_manager.enter_complex_mode()
+                    final_text = "".join(response_parts).strip() or None
-                if not ok:
+
-                    print("[agent] complex→medium fallback (eviction timeout)", flush=True)
+                else:  # complex — remote model, no VRAM management needed
-                    tier = "medium"
-                    result = await medium_agent.ainvoke({
-                        "messages": [
-                            {"role": "system", "content": MEDIUM_SYSTEM_PROMPT},
-                            *history,
-                            {"role": "user", "content": clean_message},
-                        ]
-                    })
-                else:
                    system_prompt = COMPLEX_SYSTEM_PROMPT.format(user_id=session_id)
                    if url_context:
                        system_prompt += "\n\n[Pre-fetched URL content from user's message:]\n" + url_context
                    result = await complex_agent.ainvoke({
                        "messages": [
                            {"role": "system", "content": system_prompt},
@@ -305,38 +563,90 @@ async def run_agent_task(message: str, session_id: str, channel: str = "telegram
                            {"role": "user", "content": clean_message},
                        ]
                    })
                    asyncio.create_task(vram_manager.exit_complex_mode())
-                llm_elapsed = time.monotonic() - t0
+                    llm_elapsed = time.monotonic() - t0
-                _log_messages(result)
+                    _log_messages(result)
-                final_text = _extract_final_text(result)
+                    final_text = _extract_final_text(result)
                    if final_text:
                        yield final_text
        except Exception as e:
            import traceback
            llm_elapsed = time.monotonic() - t0
-            print(f"[agent] error after {llm_elapsed:.1f}s for chat {session_id}: {e}", flush=True)
+            print(f"[agent] error after {llm_elapsed:.1f}s for {session_id}: {e}", flush=True)
            traceback.print_exc()
-        # Deliver reply through the originating channel
+        print(f"[agent] pipeline done in {time.monotonic() - t0:.1f}s tier={tier if 'tier' in dir() else '?'}", flush=True)
        # Store memory as side-effect (non-blocking, best-effort)
        if final_text:
-            t1 = time.monotonic()
+            asyncio.create_task(_store_memory(session_id, clean_message, final_text))
-            await channels.deliver(session_id, channel, final_text)
+
-            send_elapsed = time.monotonic() - t1
+
-            print(
+# ── core task (Telegram / Matrix / CLI wrapper) ─────────────────────────────────
-                f"[agent] replied in {time.monotonic() - t0:.1f}s "
+
-                f"(llm={llm_elapsed:.1f}s, send={send_elapsed:.1f}s) tier={tier}",
+async def run_agent_task(
-                flush=True,
+    message: str,
-            )
+    session_id: str,
-            print(f"[agent] reply_text: {final_text}", flush=True)
+    channel: str = "telegram",
    metadata: dict | None = None,
 ):
    print(f"[agent] queued: {message[:80]!r} chat={session_id}", flush=True)
    t0 = time.monotonic()
    meta = metadata or {}
    no_inference = bool(meta.get("no_inference", False))
    is_benchmark = bool(meta.get("benchmark", False))
    history = _conversation_buffers.get(session_id, [])
    final_text = None
    actual_tier = "unknown"
    tier_capture: list = []
    async for chunk in _run_agent_pipeline(message, history, session_id, no_inference=no_inference, tier_capture=tier_capture):
        await _push_stream_chunk(session_id, chunk)
        if final_text is None:
            final_text = chunk
        else:
-            print("[agent] warning: no text reply from agent", flush=True)
+            final_text += chunk
    await _end_stream(session_id)
    actual_tier = tier_capture[0] if tier_capture else "unknown"
    elapsed_ms = int((time.monotonic() - t0) * 1000)
    if final_text:
        final_text = final_text.strip()
        # Skip channel delivery for benchmark sessions (no Telegram spam)
        if not is_benchmark:
            try:
                await channels.deliver(session_id, channel, final_text)
            except Exception as e:
                print(f"[agent] delivery error (non-fatal): {e}", flush=True)
        print(f"[agent] replied in {elapsed_ms / 1000:.1f}s tier={actual_tier}", flush=True)
        print(f"[agent] reply_text: {final_text}", flush=True)
        # Update conversation buffer
-        if final_text:
+        buf = _conversation_buffers.get(session_id, [])
-            buf = _conversation_buffers.get(session_id, [])
+        buf.append({"role": "user", "content": message})
-            buf.append({"role": "user", "content": clean_message})
+        buf.append({"role": "assistant", "content": final_text})
-            buf.append({"role": "assistant", "content": final_text})
+        _conversation_buffers[session_id] = buf[-(MAX_HISTORY_TURNS * 2):]
-            _conversation_buffers[session_id] = buf[-(MAX_HISTORY_TURNS * 2):]
+
        # Log interaction for RLHF data collection (skip benchmark sessions to avoid noise)
        if not is_benchmark:
            asyncio.create_task(_log_interaction(
                session_id=session_id,
                channel=channel,
                tier=actual_tier,
                input_text=message,
                response_text=final_text,
                latency_ms=elapsed_ms,
                metadata=meta if meta else None,
            ))
    else:
        print("[agent] warning: no text reply from agent", flush=True)
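The medium-tier streaming loop in `_run_agent_pipeline` filters think blocks by checking each token for a whole `<think>` or `</think>` tag. If a tag arrives split across two chunks, or both tags arrive in one chunk, that check can miss it. The sketch below shows one possible buffered alternative; `ThinkFilter` and `_partial_suffix_len` are hypothetical names, not part of this change, and whether the model actually splits tags across chunks is an assumption.

```python
def _partial_suffix_len(buf: str, tag: str) -> int:
    """Length of the longest suffix of buf that is a proper prefix of tag."""
    for k in range(min(len(buf), len(tag) - 1), 0, -1):
        if buf.endswith(tag[:k]):
            return k
    return 0


class ThinkFilter:
    """Incrementally drop <think>...</think> spans from a token stream."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self) -> None:
        self._buf = ""
        self._in_think = False

    def feed(self, token: str) -> str:
        """Consume one chunk and return the text that is safe to emit now."""
        self._buf += token
        out: list[str] = []
        while True:
            tag = self.CLOSE if self._in_think else self.OPEN
            idx = self._buf.find(tag)
            if idx != -1:
                if not self._in_think:
                    out.append(self._buf[:idx])
                self._buf = self._buf[idx + len(tag):]
                self._in_think = not self._in_think
                continue
            # No complete tag: hold back a trailing fragment that may be a tag start.
            keep = _partial_suffix_len(self._buf, tag)
            cut = len(self._buf) - keep
            if not self._in_think:
                out.append(self._buf[:cut])
            self._buf = self._buf[cut:]
            return "".join(out)

    def flush(self) -> str:
        """Emit whatever is still buffered once the stream ends."""
        rest = "" if self._in_think else self._buf
        self._buf = ""
        return rest


f = ThinkFilter()
chunks = ["Hi <th", "ink>secret</th", "ink>there", "<think>x</think>!"]
visible = "".join(f.feed(c) for c in chunks) + f.flush()
print(visible)
```

In the pipeline, each non-empty string returned by `feed()` would be yielded and appended to `response_parts`, with `flush()` called once after the `async for` loop ends.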
 # ── endpoints ──────────────────────────────────────────────────────────────────
@@ -348,7 +658,7 @@ async def message(request: InboundMessage, background_tasks: BackgroundTasks):
        return JSONResponse(status_code=503, content={"error": "Agent not ready"})
    session_id = request.session_id
    channel = request.channel
-    background_tasks.add_task(run_agent_task, request.text, session_id, channel)
+    background_tasks.add_task(run_agent_task, request.text, session_id, channel, request.metadata)
    return JSONResponse(status_code=202, content={"status": "accepted"})
@@ -374,13 +684,132 @@ async def reply_stream(session_id: str):
        try:
            text = await asyncio.wait_for(q.get(), timeout=900)
            # Escape newlines so entire reply fits in one SSE data line
-            yield f"data: {text.replace(chr(10), '\\n').replace(chr(13), '')}\n\n"
+            yield f"data: {text.replace(chr(10), chr(92) + 'n').replace(chr(13), '')}\n\n"
        except asyncio.TimeoutError:
            yield "data: [timeout]\n\n"
    return StreamingResponse(event_generator(), media_type="text/event-stream")
@app.get("/stream/{session_id}")
 async def stream_reply(session_id: str):
    """
    SSE endpoint — streams reply tokens as they are generated.
    Each chunk: data: <token>\\n\\n
    Signals completion: data: [DONE]\\n\\n
    Medium tier: real token-by-token streaming (think blocks filtered out).
    Light and complex tiers: full reply delivered as one chunk then [DONE].
    """
    q = _stream_queues.setdefault(session_id, asyncio.Queue())
    async def event_generator():
        try:
            while True:
                chunk = await asyncio.wait_for(q.get(), timeout=900)
                escaped = chunk.replace("\n", "\\n").replace("\r", "")
                yield f"data: {escaped}\n\n"
                if chunk == "[DONE]":
                    break
        except asyncio.TimeoutError:
            yield "data: [DONE]\n\n"
    return StreamingResponse(event_generator(), media_type="text/event-stream")
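Both SSE endpoints escape newlines so that each chunk fits on a single `data:` line. A small sketch of that framing and a matching client-side decode follows; the helper names are hypothetical. Note that the scheme is lossy by design: carriage returns are dropped, and a literal backslash followed by `n` in the original text cannot be told apart from an escaped newline after decoding.

```python
def sse_escape(chunk: str) -> str:
    """Replace newlines with a literal backslash-n and drop carriage returns."""
    return chunk.replace("\n", "\\n").replace("\r", "")


def sse_frame(chunk: str) -> str:
    """Wrap one chunk as a single-line SSE event."""
    return f"data: {sse_escape(chunk)}\n\n"


def sse_unescape(payload: str) -> str:
    """Client side: turn the literal backslash-n back into a newline."""
    return payload.replace("\\n", "\n")


frame = sse_frame("line one\r\nline two")
payload = frame[len("data: "):-2]  # strip the "data: " prefix and the blank-line terminator
decoded = sse_unescape(payload)
print(repr(frame))
print(repr(decoded))
```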
@app.get("/health")
 async def health():
    return {"status": "ok", "agent_ready": medium_agent is not None}
 # ── OpenAI-compatible API (for OpenWebUI and other clients) ────────────────────
 _TIER_MAP = {
    "adolf": None,
    "adolf-light": "light",
    "adolf-medium": "medium",
    "adolf-deep": "complex",
 }
@app.get("/v1/models")
 async def list_models():
    return {
        "object": "list",
        "data": [
            {"id": "adolf", "object": "model", "owned_by": "adolf"},
            {"id": "adolf-light", "object": "model", "owned_by": "adolf"},
            {"id": "adolf-medium", "object": "model", "owned_by": "adolf"},
            {"id": "adolf-deep", "object": "model", "owned_by": "adolf"},
        ],
    }
@app.post("/v1/chat/completions")
 async def chat_completions(request: Request):
    if medium_agent is None:
        return JSONResponse(status_code=503, content={"error": {"message": "Agent not ready", "type": "server_error"}})
    body = await request.json()
    model = body.get("model", "adolf")
    messages = body.get("messages", [])
    stream = body.get("stream", True)
    # Extract current user message and history
    user_messages = [m for m in messages if m.get("role") == "user"]
    if not user_messages:
        return JSONResponse(status_code=400, content={"error": {"message": "No user message", "type": "invalid_request_error"}})
    current_message = user_messages[-1]["content"]
    # History = everything before the last user message (excluding system messages from OpenWebUI)
    last_user_idx = len(messages) - 1 - next(
        i for i, m in enumerate(reversed(messages)) if m.get("role") == "user"
    )
    history = [m for m in messages[:last_user_idx] if m.get("role") in ("user", "assistant")]
    session_id = request.headers.get("X-Session-Id", "owui-default")
    tier_override = _TIER_MAP.get(model)
    import json as _json
    import uuid as _uuid
    response_id = f"chatcmpl-{_uuid.uuid4().hex[:12]}"
    if stream:
        async def event_stream():
            # Opening chunk with role
            opening = {
                "id": response_id, "object": "chat.completion.chunk",
                "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]
            }
            yield f"data: {_json.dumps(opening)}\n\n"
            async for chunk in _run_agent_pipeline(current_message, history, session_id, tier_override):
                data = {
                    "id": response_id, "object": "chat.completion.chunk",
                    "choices": [{"index": 0, "delta": {"content": chunk}, "finish_reason": None}]
                }
                yield f"data: {_json.dumps(data)}\n\n"
            # Final chunk
            final = {
                "id": response_id, "object": "chat.completion.chunk",
                "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]
            }
            yield f"data: {_json.dumps(final)}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(event_stream(), media_type="text/event-stream")
    else:
        # Non-streaming: collect all chunks
        parts = []
        async for chunk in _run_agent_pipeline(current_message, history, session_id, tier_override):
            if chunk:
                parts.append(chunk)
        full_text = "".join(parts).strip()
        return {
            "id": response_id, "object": "chat.completion",
            "choices": [{"index": 0, "message": {"role": "assistant", "content": full_text}, "finish_reason": "stop"}],
            "model": model,
        }
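The message handling in `chat_completions` can be read as a pure function: take the last user message as the current query, and keep only the user and assistant turns before it as history. A sketch under that reading, with `split_messages` as a hypothetical helper name and plain string `content` assumed:

```python
def split_messages(messages: list[dict]) -> tuple[str, list[dict]]:
    """Return (current_message, history) from an OpenAI-style messages list."""
    user_idxs = [i for i, m in enumerate(messages) if m.get("role") == "user"]
    if not user_idxs:
        raise ValueError("No user message")
    last = user_idxs[-1]
    # History excludes system messages and anything after the last user turn.
    history = [m for m in messages[:last] if m.get("role") in ("user", "assistant")]
    return messages[last]["content"], history


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "what is the weather"},
]
current, history = split_messages(messages)
print(current)
print(history)
```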
--- a/agent_factory.py
+++ b/agent_factory.py
@@ -1,13 +1,21 @@
 from deepagents import create_deep_agent
 class _DirectModel:
    """Thin wrapper: single LLM call, no tools, same ainvoke interface as a graph."""
    def __init__(self, model):
        self._model = model
    async def ainvoke(self, input_dict: dict) -> dict:
        messages = input_dict["messages"]
        response = await self._model.ainvoke(messages)
        return {"messages": list(messages) + [response]}
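To illustrate the `ainvoke` contract of `_DirectModel`, here is a small sketch driven by a stand-in model. `EchoModel` is a hypothetical stub used only for this example; a real chat model would return a message object rather than a dict.

```python
import asyncio


class _DirectModel:
    """Thin wrapper: single LLM call, no tools, same ainvoke interface as a graph."""

    def __init__(self, model):
        self._model = model

    async def ainvoke(self, input_dict: dict) -> dict:
        messages = input_dict["messages"]
        response = await self._model.ainvoke(messages)
        return {"messages": list(messages) + [response]}


class EchoModel:
    """Hypothetical stand-in for a chat model: echoes the last message."""

    async def ainvoke(self, messages):
        return {"role": "assistant", "content": f"echo: {messages[-1]['content']}"}


async def main() -> dict:
    agent = _DirectModel(EchoModel())
    return await agent.ainvoke({"messages": [{"role": "user", "content": "ping"}]})


result = asyncio.run(main())
print(result["messages"][-1]["content"])
```

The wrapper returns the input messages plus the model response, which is the shape `_extract_final_text` expects from a graph-style agent.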
 def build_medium_agent(model, agent_tools: list, system_prompt: str):
-    """Medium agent: create_deep_agent with TodoList planning, no subagents."""
+    """Medium agent: single LLM call, no tools — fast ~3s response."""
-    return create_deep_agent(
+    return _DirectModel(model)
-        model=model,
-        tools=agent_tools,
-        system_prompt=system_prompt,
-    )
 def build_complex_agent(model, agent_tools: list, system_prompt: str):
--- a/benchmarks/benchmark.json
+++ b/benchmarks/benchmark.json
@@ -0,0 +1,137 @@
 {
  "description": "Adolf routing benchmark: home scenarios, Alexa/Google-Home style, Russian language",
  "tiers": {
    "light": "Greetings, farewells, confirmations, simple conversational phrases. No search or action required.",
    "medium": "Home control, weather/traffic, timers, reminders, shopping, personal memory, quick questions.",
    "complex": "Deep research, technology comparisons, detailed multi-source guides."
  },
  "queries": [
    {"id": 1,  "tier": "light",   "category": "greetings", "query": "привет"},
    {"id": 2,  "tier": "light",   "category": "greetings", "query": "пока"},
    {"id": 3,  "tier": "light",   "category": "greetings", "query": "спасибо"},
    {"id": 4,  "tier": "light",   "category": "greetings", "query": "привет, как дела?"},
    {"id": 5,  "tier": "light",   "category": "greetings", "query": "окей"},
    {"id": 6,  "tier": "light",   "category": "greetings", "query": "добрый вечер"},
    {"id": 7,  "tier": "light",   "category": "greetings", "query": "доброе утро"},
    {"id": 8,  "tier": "light",   "category": "greetings", "query": "добрый день"},
    {"id": 9,  "tier": "light",   "category": "greetings", "query": "hi"},
    {"id": 10, "tier": "light",   "category": "greetings", "query": "thanks"},
    {"id": 11, "tier": "light",   "category": "greetings", "query": "отлично, спасибо"},
    {"id": 12, "tier": "light",   "category": "greetings", "query": "понятно"},
    {"id": 13, "tier": "light",   "category": "greetings", "query": "ясно"},
    {"id": 14, "tier": "light",   "category": "greetings", "query": "ладно"},
    {"id": 15, "tier": "light",   "category": "greetings", "query": "договорились"},
    {"id": 16, "tier": "light",   "category": "greetings", "query": "good morning"},
    {"id": 17, "tier": "light",   "category": "greetings", "query": "good night"},
    {"id": 18, "tier": "light",   "category": "greetings", "query": "всё понятно"},
    {"id": 19, "tier": "light",   "category": "greetings", "query": "да"},
    {"id": 20, "tier": "light",   "category": "greetings", "query": "нет"},
    {"id": 21, "tier": "light",   "category": "greetings", "query": "не нужно"},
    {"id": 22, "tier": "light",   "category": "greetings", "query": "отмена"},
    {"id": 23, "tier": "light",   "category": "greetings", "query": "стоп"},
    {"id": 24, "tier": "light",   "category": "greetings", "query": "подожди"},
    {"id": 25, "tier": "light",   "category": "greetings", "query": "повтори"},
    {"id": 26, "tier": "light",   "category": "greetings", "query": "ты тут?"},
    {"id": 27, "tier": "light",   "category": "greetings", "query": "слышишь меня?"},
    {"id": 28, "tier": "light",   "category": "greetings", "query": "всё ок"},
    {"id": 29, "tier": "light",   "category": "greetings", "query": "хорошо"},
    {"id": 30, "tier": "light",   "category": "greetings", "query": "пожалуйста"},
    {"id": 31, "tier": "medium",  "category": "weather_commute", "query": "какая сегодня погода в Балашихе"},
    {"id": 32, "tier": "medium",  "category": "weather_commute", "query": "пойдет ли сегодня дождь"},
    {"id": 33, "tier": "medium",  "category": "weather_commute", "query": "какая температура на улице сейчас"},
    {"id": 34, "tier": "medium",  "category": "weather_commute", "query": "будет ли снег сегодня"},
    {"id": 35, "tier": "medium",  "category": "weather_commute", "query": "погода на завтра"},
    {"id": 36, "tier": "medium",  "category": "weather_commute", "query": "сколько ехать до Москвы сейчас"},
    {"id": 37, "tier": "medium",  "category": "weather_commute", "query": "какие пробки на дороге до Москвы"},
    {"id": 38, "tier": "medium",  "category": "weather_commute", "query": "время в пути на работу"},
    {"id": 39, "tier": "medium",  "category": "weather_commute", "query": "есть ли пробки сейчас"},
    {"id": 40, "tier": "medium",  "category": "weather_commute", "query": "стоит ли брать зонтик"},
    {"id": 41, "tier": "medium",  "category": "smart_home_control", "query": "включи свет в гостиной"},
    {"id": 42, "tier": "medium",  "category": "smart_home_control", "query": "выключи свет на кухне"},
    {"id": 43, "tier": "medium",  "category": "smart_home_control", "query": "какая температура дома"},
    {"id": 44, "tier": "medium",  "category": "smart_home_control", "query": "установи температуру 22 градуса"},
    {"id": 45, "tier": "medium",  "category": "smart_home_control", "query": "включи свет в спальне на 50 процентов"},
    {"id": 46, "tier": "medium",  "category": "smart_home_control", "query": "выключи все лампочки"},
    {"id": 47, "tier": "medium",  "category": "smart_home_control", "query": "какие устройства сейчас включены"},
    {"id": 48, "tier": "medium",  "category": "smart_home_control", "query": "закрыты ли все окна"},
    {"id": 49, "tier": "medium",  "category": "smart_home_control", "query": "включи вентилятор в детской"},
    {"id": 50, "tier": "medium",  "category": "smart_home_control", "query": "есть ли кто-нибудь дома"},
    {"id": 51, "tier": "medium",  "category": "smart_home_control", "query": "включи ночной режим"},
    {"id": 52, "tier": "medium",  "category": "smart_home_control", "query": "какое потребление электричества сегодня"},
    {"id": 53, "tier": "medium",  "category": "smart_home_control", "query": "выключи телевизор"},
    {"id": 54, "tier": "medium",  "category": "smart_home_control", "query": "открой шторы в гостиной"},
    {"id": 55, "tier": "medium",  "category": "smart_home_control", "query": "установи будильник на 7 утра"},
    {"id": 56, "tier": "medium",  "category": "smart_home_control", "query": "включи кофемашину"},
    {"id": 57, "tier": "medium",  "category": "smart_home_control", "query": "выключи свет во всём доме"},
    {"id": 58, "tier": "medium",  "category": "smart_home_control", "query": "сколько у нас датчиков движения"},
    {"id": 59, "tier": "medium",  "category": "smart_home_control", "query": "состояние всех дверных замков"},
    {"id": 60, "tier": "medium",  "category": "smart_home_control", "query": "включи режим кино в гостиной"},
    {"id": 61, "tier": "medium",  "category": "smart_home_control", "query": "прибавь яркость в детской"},
    {"id": 62, "tier": "medium",  "category": "smart_home_control", "query": "закрой все шторы"},
    {"id": 63, "tier": "medium",  "category": "smart_home_control", "query": "кто последний открывал входную дверь"},
    {"id": 64, "tier": "medium",  "category": "smart_home_control", "query": "заблокируй входную дверь"},
    {"id": 65, "tier": "medium",  "category": "smart_home_control", "query": "покажи камеру у входа"},
    {"id": 66, "tier": "medium",  "category": "timers_reminders", "query": "поставь таймер на 10 минут"},
    {"id": 67, "tier": "medium",  "category": "timers_reminders", "query": "напомни мне позвонить врачу в 15:00"},
    {"id": 68, "tier": "medium",  "category": "timers_reminders", "query": "поставь будильник на завтра в 6:30"},
    {"id": 69, "tier": "medium",  "category": "timers_reminders", "query": "напомни выключить плиту через 20 минут"},
    {"id": 70, "tier": "medium",  "category": "timers_reminders", "query": "сколько времени осталось на таймере"},
    {"id": 71, "tier": "medium",  "category": "shopping_cooking", "query": "добавь молоко в список покупок"},
    {"id": 72, "tier": "medium",  "category": "shopping_cooking", "query": "что есть в списке покупок"},
    {"id": 73, "tier": "medium",  "category": "shopping_cooking", "query": "добавь хлеб и яйца в список покупок"},
    {"id": 74, "tier": "medium",  "category": "shopping_cooking", "query": "сколько граммов муки нужно для блинов на 4 человека"},
    {"id": 75, "tier": "medium",  "category": "shopping_cooking", "query": "какой рецепт борща ты знаешь"},
    {"id": 76, "tier": "medium",  "category": "personal_memory", "query": "как меня зовут"},
    {"id": 77, "tier": "medium",  "category": "personal_memory", "query": "где я живу"},
    {"id": 78, "tier": "medium",  "category": "personal_memory", "query": "что мы обсуждали в прошлый раз"},
    {"id": 79, "tier": "medium",  "category": "personal_memory", "query": "что ты знаешь о моем домашнем сервере"},
    {"id": 80, "tier": "medium",  "category": "personal_memory", "query": "напомни, какие сервисы я запускаю"},
    {"id": 81, "tier": "medium",  "category": "personal_memory", "query": "что я говорил о своей сети"},
    {"id": 82, "tier": "medium",  "category": "personal_memory", "query": "что я просил тебя запомнить"},
    {"id": 83, "tier": "medium",  "category": "quick_info", "query": "какой сейчас курс биткоина"},
    {"id": 84, "tier": "medium",  "category": "quick_info", "query": "курс доллара к рублю сейчас"},
    {"id": 85, "tier": "medium",  "category": "quick_info", "query": "есть ли проблемы у Cloudflare сегодня"},
    {"id": 86, "tier": "medium",  "category": "quick_info", "query": "какая последняя версия Docker"},
    {"id": 87, "tier": "medium",  "category": "quick_info", "query": "какие новые функции в Home Assistant 2024"},
    {"id": 88, "tier": "medium",  "category": "quick_info", "query": "как проверить использование диска в Linux"},
    {"id": 89, "tier": "medium",  "category": "quick_info", "query": "как перезапустить Docker контейнер"},
    {"id": 90, "tier": "medium",  "category": "quick_info", "query": "как посмотреть логи Docker контейнера"},
    {"id": 91,  "tier": "complex", "category": "infrastructure", "query": "исследуй и сравни Proxmox, Unraid и TrueNAS для домашней лаборатории"},
    {"id": 92,  "tier": "complex", "category": "infrastructure", "query": "напиши подробное руководство по безопасности домашнего сервера, подключенного к интернету"},
    {"id": 93,  "tier": "complex", "category": "infrastructure", "query": "исследуй все доступные дашборды для самохостинга и сравни их функции"},
    {"id": 94,  "tier": "complex", "category": "infrastructure", "query": "исследуй лучший стек мониторинга для самохостинга в 2024 году со всеми вариантами"},
    {"id": 95,  "tier": "complex", "category": "infrastructure", "query": "сравни все системы резервного копирования для Linux: Restic, Borg, Duplicati, Timeshift"},
    {"id": 96,  "tier": "complex", "category": "infrastructure", "query": "напиши полное руководство по настройке обратного прокси Caddy для домашнего сервера с SSL"},
    {"id": 97,  "tier": "complex", "category": "network", "query": "исследуй и сравни WireGuard, OpenVPN и Tailscale для домашней VPN с детальными плюсами и минусами"},
    {"id": 98,  "tier": "complex", "category": "network", "query": "исследуй лучшие практики сегментации домашней сети с VLAN и правилами файрвола"},
    {"id": 99,  "tier": "complex", "category": "network", "query": "изучи все самохостируемые DNS решения и их возможности"},
    {"id": 100, "tier": "complex", "category": "network", "query": "исследуй лучшие самохостируемые системы мониторинга сети: Zabbix, Grafana, Prometheus, Netdata"},
    {"id": 101, "tier": "complex", "category": "home_assistant", "query": "исследуй и сравни все платформы умного дома: Home Assistant, OpenHAB и Domoticz"},
    {"id": 102, "tier": "complex", "category": "home_assistant", "query": "изучи лучшие Zigbee координаторы и их совместимость с Home Assistant в 2024 году"},
    {"id": 103, "tier": "complex", "category": "home_assistant", "query": "напиши детальный отчет о поддержке протокола Matter и совместимых устройствах"},
    {"id": 104, "tier": "complex", "category": "home_assistant", "query": "исследуй все способы интеграции умных ламп с Home Assistant: Zigbee, WiFi, Bluetooth"},
    {"id": 105, "tier": "complex", "category": "home_assistant", "query": "найди и сравни все варианты датчиков движения для умного дома с оценками и ценами"},
    {"id": 106, "tier": "complex", "category": "home_assistant", "query": "напиши подробное руководство по настройке автоматизаций в Home Assistant для умного освещения"},
    {"id": 107, "tier": "complex", "category": "home_assistant", "query": "исследуй все варианты голосового управления умным домом на русском языке, включая локальные решения"},
    {"id": 108, "tier": "complex", "category": "home_assistant", "query": "исследуй все протоколы умного дома и их плюсы и минусы: Zigbee, Z-Wave, WiFi, Thread, Bluetooth"},
    {"id": 109, "tier": "complex", "category": "media_files", "query": "исследуй и сравни все самохостируемые решения для хранения фотографий с детальным сравнением функций"},
    {"id": 110, "tier": "complex", "category": "media_files", "query": "изучи лучшие самохостируемые медиасерверы: Jellyfin, Plex и Emby — с характеристиками и отзывами"},
    {"id": 111, "tier": "complex", "category": "media_files", "query": "сравни все самохостируемые облачные хранилища: Nextcloud, Seafile, Owncloud — производительность и функции"},
    {"id": 112, "tier": "complex", "category": "research", "query": "исследуй последние достижения в локальном LLM инференсе и оборудовании для него"},
    {"id": 113, "tier": "complex", "category": "research", "query": "изучи лучшие опенсорс альтернативы Google сервисов для приватного домашнего окружения"},
    {"id": 114, "tier": "complex", "category": "research", "query": "изучи все варианты локального запуска языковых моделей на видеокарте 8 ГБ VRAM"},
    {"id": 115, "tier": "complex", "category": "research", "query": "найди и сравни все фреймворки для создания локальных AI ассистентов с открытым исходным кодом"},
    {"id": 116, "tier": "complex", "category": "research", "query": "изучи все доступные локальные ассистенты с голосовым управлением на русском языке"},
    {"id": 117, "tier": "complex", "category": "infrastructure", "query": "изучи свежие CVE и уязвимости в популярном самохостируемом ПО: Gitea, Nextcloud, Jellyfin"},
    {"id": 118, "tier": "complex", "category": "infrastructure", "query": "напиши детальное сравнение систем управления конфигурацией: Ansible, Salt, Puppet для домашнего окружения"},
    {"id": 119, "tier": "complex", "category": "network", "query": "исследуй все самохостируемые решения для блокировки рекламы: Pi-hole, AdGuard Home, NextDNS"},
    {"id": 120, "tier": "complex", "category": "research", "query": "напиши подробный отчет о технологиях синтеза речи с открытым исходным кодом на русском языке"}
  ]
 }
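Since the dataset is now tracked, a quick sanity check can catch duplicate IDs or unknown tiers before a benchmark run. The sketch below assumes only the structure visible above: a top-level `queries` list whose entries carry `id`, `tier`, `category`, and `query`. The `validate_queries` helper is hypothetical and not part of this commit, and the second sample entry is deliberately broken, made-up data.

```python
import json

VALID_TIERS = {"light", "medium", "complex"}
REQUIRED_KEYS = {"id", "tier", "category", "query"}


def validate_queries(queries: list) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(queries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
            continue
        if entry["id"] in seen_ids:
            problems.append(f"entry {index}: duplicate id {entry['id']}")
        seen_ids.add(entry["id"])
        if entry["tier"] not in VALID_TIERS:
            problems.append(f"entry {index}: unknown tier {entry['tier']!r}")
        if not entry["query"].strip():
            problems.append(f"entry {index}: empty query")
    return problems


# First entry mirrors id 66 from the dataset; the second is intentionally invalid.
sample = json.loads("""
{"queries": [
  {"id": 66, "tier": "medium", "category": "timers_reminders", "query": "поставь таймер на 10 минут"},
  {"id": 66, "tier": "heavy", "category": "research", "query": "дубликат"}
]}
""")
print(validate_queries(sample["queries"]))
```

Running a check like this against `benchmark.json` before the benchmark would surface dataset mistakes separately from routing mistakes.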
--- a/benchmarks/run_benchmark.py
+++ b/benchmarks/run_benchmark.py
@@ -0,0 +1,316 @@
 #!/usr/bin/env python3
 """
 Adolf routing benchmark.
 Sends each query to Adolf's /message endpoint, waits for the SSE stream to
 finish, then reads the routing decision from docker logs and records the actual tier.
 Usage:
    python3 run_benchmark.py [options]
    python3 run_benchmark.py --tier light|medium|complex
    python3 run_benchmark.py --category <name>
    python3 run_benchmark.py --ids 1,2,3
    python3 run_benchmark.py --list-categories
    python3 run_benchmark.py --no-inference    # skip all LLM inference — routing decisions only, all tiers
 IMPORTANT: The GPU must be free before running. The script checks this automatically; the check is skipped with --no-inference.
 Adolf must be running at http://localhost:8000.
 """
 import argparse
 import asyncio
 import json
 import re
 import subprocess
 import sys
 import time
 from pathlib import Path
 import httpx
 ADOLF_URL = "http://localhost:8000"
 OLLAMA_URL = "http://localhost:11436"  # GPU Ollama
 DATASET = Path(__file__).parent / "benchmark.json"
 RESULTS = Path(__file__).parent / "results_latest.json"
 # Max time to wait for each query to fully complete via SSE stream
 QUERY_TIMEOUT = 300  # seconds — generous to handle GPU semaphore waits
 # Memory thresholds
 MIN_FREE_RAM_MB = 1500   # abort if less than this is free
 MIN_FREE_VRAM_MB = 500   # warn if less than this is free on GPU
 # ── Pre-flight checks ──────────────────────────────────────────────────────────
 def check_ram() -> tuple[bool, str]:
    """Check available system RAM. Returns (ok, message)."""
    try:
        with open("/proc/meminfo") as f:
            info = {}
            for line in f:
                parts = line.split()
                if len(parts) >= 2:
                    info[parts[0].rstrip(":")] = int(parts[1])
        free_mb = (info.get("MemAvailable", 0)) // 1024
        total_mb = info.get("MemTotal", 0) // 1024
        msg = f"RAM: {free_mb} MB free / {total_mb} MB total"
        if free_mb < MIN_FREE_RAM_MB:
            return False, f"CRITICAL: {msg} — need at least {MIN_FREE_RAM_MB} MB free"
        return True, msg
    except Exception as e:
        return True, f"RAM check failed (non-fatal): {e}"
 def check_gpu() -> tuple[bool, str]:
    """Check GPU VRAM via Ollama /api/ps. Returns (ok, message)."""
    try:
        r = httpx.get(f"{OLLAMA_URL}/api/ps", timeout=5)
        r.raise_for_status()
        data = r.json()
        models = data.get("models", [])
        if models:
            names = [m.get("name", "?") for m in models]
            sizes_mb = [m.get("size_vram", 0) // (1024 * 1024) for m in models]
            loaded = ", ".join(f"{n} ({s}MB)" for n, s in zip(names, sizes_mb))
            total_vram = sum(sizes_mb)
            if total_vram > 7000:
                return False, f"GPU BUSY: models loaded = {loaded} — total VRAM used {total_vram}MB. Wait for models to unload."
            return True, f"GPU: models loaded = {loaded} (total {total_vram}MB VRAM)"
        return True, "GPU: idle (no models loaded)"
    except httpx.ConnectError:
        return True, "GPU check skipped (Ollama not reachable at localhost:11436)"
    except Exception as e:
        return True, f"GPU check failed (non-fatal): {e}"
 def preflight_checks(skip_gpu_check: bool = False) -> bool:
    """Run all pre-flight checks. Returns True if safe to proceed."""
    print("\n── Pre-flight checks ──────────────────────────────────────────")
    ram_ok, ram_msg = check_ram()
    print(f"  {'✓' if ram_ok else '✗'} {ram_msg}")
    if not ram_ok:
        print("\nABORTING: not enough RAM. Free up memory before running benchmark.")
        return False
    if not skip_gpu_check:
        gpu_ok, gpu_msg = check_gpu()
        print(f"  {'✓' if gpu_ok else '✗'} {gpu_msg}")
        if not gpu_ok:
            print("\nABORTING: GPU is busy. Wait for current inference to finish, then retry.")
            return False
    print("  All checks passed.\n")
    return True
 # ── Log helpers ────────────────────────────────────────────────────────────────
 def get_log_tail(n: int = 50) -> str:
    result = subprocess.run(
        ["docker", "logs", "deepagents", "--tail", str(n)],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr
 def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
    """Find new tier= lines that appeared after we sent the query."""
    before_lines = set(logs_before.splitlines())
    new_lines = [line for line in logs_after.splitlines() if line not in before_lines]
    for line in new_lines:
        m = re.search(r"tier=(\w+(?:\s*\(no-inference\))?)", line)
        if m:
            tier_raw = m.group(1)
            # Normalise: "complex (no-inference)" → "complex"
            return tier_raw.split()[0]
    return None
 # ── Request helpers ────────────────────────────────────────────────────────────
 async def post_message(
    client: httpx.AsyncClient,
    query_id: int,
    query: str,
    no_inference: bool = False,
 ) -> bool:
    payload = {
        "text": query,
        "session_id": f"benchmark-{query_id}",
        "channel": "cli",
        "user_id": "benchmark",
        "metadata": {"no_inference": no_inference, "benchmark": True},
    }
    try:
        r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
        r.raise_for_status()
        return True
    except Exception as e:
        print(f" POST_ERROR: {e}", end="")
        return False
 # ── Dataset ────────────────────────────────────────────────────────────────────
 def load_dataset() -> list[dict]:
    with open(DATASET) as f:
        return json.load(f)["queries"]
 def filter_queries(queries, tier, category, ids):
    if tier:
        queries = [q for q in queries if q["tier"] == tier]
    if category:
        queries = [q for q in queries if q["category"] == category]
    if ids:
        queries = [q for q in queries if q["id"] in ids]
    return queries
 # ── Main run ───────────────────────────────────────────────────────────────────
 async def run(queries: list[dict], no_inference: bool = False) -> list[dict]:
    results = []
    async with httpx.AsyncClient() as client:
        try:
            r = await client.get(f"{ADOLF_URL}/health", timeout=5)
            r.raise_for_status()
        except Exception as e:
            print(f"ERROR: Adolf not reachable: {e}", file=sys.stderr)
            sys.exit(1)
        total = len(queries)
        correct = 0
        mode_label = " [NO-INFERENCE: routing only]" if no_inference else ""
        print(f"\nRunning {total} queries{mode_label}\n")
        print(f"{'ID':>3}  {'EXPECTED':8}  {'ACTUAL':8}  {'OK':3}  {'TIME':6}  {'CATEGORY':22}  QUERY")
        print("─" * 110)
        for q in queries:
            qid = q["id"]
            expected = q["tier"]
            category = q["category"]
            query_text = q["query"]
            session_id = f"benchmark-{qid}"
            print(f"{qid:>3}  {expected:8}  ", end="", flush=True)
            logs_before = get_log_tail(300)
            t0 = time.monotonic()
            ok_post = await post_message(client, qid, query_text, no_inference=no_inference)
            if not ok_post:
                print(f"{'?':8}  {'ERR':3}  {'?':6}  {category:22}  {query_text[:40]}")
                results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "category": category, "query": query_text})
                continue
            # Wait for query to complete via SSE stream (handles GPU semaphore waits)
            try:
                async with client.stream(
                    "GET", f"{ADOLF_URL}/stream/{session_id}", timeout=QUERY_TIMEOUT
                ) as sse:
                    async for line in sse.aiter_lines():
                        if "data: [DONE]" in line:
                            break
            except Exception:
                pass  # timeout or connection issue — check logs anyway
            # Now the query is done — check logs for tier
            await asyncio.sleep(0.3)
            logs_after = get_log_tail(300)
            actual = extract_tier_from_logs(logs_before, logs_after)
            elapsed = time.monotonic() - t0
            match = actual == expected or (actual == "fast" and expected == "medium")
            if match:
                correct += 1
            mark = "✓" if match else "✗"
            actual_str = actual or "?"
            print(f"{actual_str:8}  {mark:3}  {elapsed:5.1f}s  {category:22}  {query_text[:40]}")
            results.append({
                "id": qid,
                "expected": expected,
                "actual": actual_str,
                "ok": match,
                "elapsed": round(elapsed, 1),
                "category": category,
                "query": query_text,
                "no_inference": no_inference,
            })
        print("─" * 110)
        accuracy = correct / total * 100 if total else 0
        print(f"\nAccuracy: {correct}/{total} ({accuracy:.0f}%)")
        for tier_name in ["light", "medium", "complex"]:
            tier_qs = [r for r in results if r["expected"] == tier_name]
            if tier_qs:
                tier_ok = sum(1 for r in tier_qs if r["ok"])
                print(f"  {tier_name:8}: {tier_ok}/{len(tier_qs)}")
        wrong = [r for r in results if not r["ok"]]
        if wrong:
            print(f"\nMisclassified ({len(wrong)}):")
            for r in wrong:
                actual_label = r["actual"] or "?"  # actual is None when the POST failed
                print(f"  id={r['id']:3}  expected={r['expected']:8}  actual={actual_label:8}  {r.get('query', '')[:60]}")
    with open(RESULTS, "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    print(f"\nResults saved to {RESULTS}")
    return results
 def main():
    parser = argparse.ArgumentParser(
        description="Adolf routing benchmark",
        epilog="IMPORTANT: The GPU must be free before running. This is checked automatically unless --no-inference is given."
    )
    parser.add_argument("--tier", choices=["light", "medium", "complex"])
    parser.add_argument("--category")
    parser.add_argument("--ids", help="Comma-separated IDs")
    parser.add_argument("--list-categories", action="store_true")
    parser.add_argument(
        "--no-inference",
        action="store_true",
        help="Skip LLM inference for all tiers — only routing decisions are tested (no GPU/API cost)",
    )
    parser.add_argument(
        "--skip-gpu-check",
        action="store_true",
        help="Skip GPU availability check (use only if you know GPU is free)",
    )
    args = parser.parse_args()
    queries = load_dataset()
    if args.list_categories:
        cats = sorted(set(q["category"] for q in queries))
        tiers = {t: sum(1 for q in queries if q["tier"] == t) for t in ["light", "medium", "complex"]}
        print(f"Total: {len(queries)} | Tiers: {tiers}")
        print(f"Categories: {cats}")
        return
    # Always check RAM; check the GPU unless inference is disabled or the check is explicitly skipped
    if not preflight_checks(skip_gpu_check=args.no_inference or args.skip_gpu_check):
        sys.exit(1)
    ids = [int(i) for i in args.ids.split(",")] if args.ids else None
    queries = filter_queries(queries, args.tier, args.category, ids)
    if not queries:
        print("No queries match filters.")
        sys.exit(1)
    asyncio.run(run(queries, no_inference=args.no_inference))
 if __name__ == "__main__":
    main()
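The log-diffing approach used by `extract_tier_from_logs` can be exercised without Docker. The sketch below copies the function's logic from this file and feeds it two hand-written log snapshots. The sample log lines are made up for illustration, since the actual `deepagents` log format is not shown in this diff; only the `tier=` pattern is taken from the regex.

```python
import re
from typing import Optional

TIER_RE = re.compile(r"tier=(\w+(?:\s*\(no-inference\))?)")


def extract_tier(logs_before: str, logs_after: str) -> Optional[str]:
    """Return the tier named on the first log line that is new in logs_after."""
    before_lines = set(logs_before.splitlines())
    new_lines = [line for line in logs_after.splitlines() if line not in before_lines]
    for line in new_lines:  # forward iteration: the earliest new routing decision wins
        m = TIER_RE.search(line)
        if m:
            # Normalise "complex (no-inference)" to "complex"
            return m.group(1).split()[0]
    return None


# Hypothetical log snapshots (format assumed, not taken from Adolf).
before = "10:00:01 router tier=light query='привет'\n"
after = before + (
    "10:00:05 router tier=complex (no-inference) query='сравни Proxmox и Unraid'\n"
    "10:00:06 agent tier=medium follow-up\n"
)
print(extract_tier(before, after))   # tier from the first new line
print(extract_tier(before, before))  # no new lines, so None
```

One property worth keeping in mind: the diff is set-based, so a new log line that is textually identical to an older one is treated as already seen and will not be picked up.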
--- a/benchmarks/run_routing_benchmark.py
+++ b/benchmarks/run_routing_benchmark.py
@@ -0,0 +1,218 @@
 #!/usr/bin/env python3
 """
 Adolf routing benchmark — tests routing decisions only, no LLM inference.
 Sends each query with no_inference=True, waits for the routing decision to
 appear in docker logs, and records whether the correct tier was selected.
 Usage:
    python3 run_routing_benchmark.py [options]
    python3 run_routing_benchmark.py --tier light|medium|complex
    python3 run_routing_benchmark.py --category <name>
    python3 run_routing_benchmark.py --ids 1,2,3
    python3 run_routing_benchmark.py --list-categories
 No GPU check needed — inference is disabled for all queries.
 Adolf must be running at http://localhost:8000.
 """
 import argparse
 import asyncio
 import json
 import re
 import subprocess
 import sys
 import time
 from pathlib import Path
 import httpx
 ADOLF_URL = "http://localhost:8000"
 DATASET = Path(__file__).parent / "benchmark.json"
 RESULTS = Path(__file__).parent / "routing_results_latest.json"
 QUERY_TIMEOUT = 1  # seconds; routing should decide within 1s. httpx applies this per operation (connect, read), not as a total deadline
 # ── Log helpers ────────────────────────────────────────────────────────────────
 def get_log_tail(n: int = 50) -> str:
    result = subprocess.run(
        ["docker", "logs", "deepagents", "--tail", str(n)],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr
 def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
    """Find new tier= lines that appeared after we sent the query."""
    before_lines = set(logs_before.splitlines())
    new_lines = [line for line in logs_after.splitlines() if line not in before_lines]
    for line in new_lines:
        m = re.search(r"tier=(\w+(?:\s*\(no-inference\))?)", line)
        if m:
            tier_raw = m.group(1)
            return tier_raw.split()[0]
    return None
 # ── Request helpers ────────────────────────────────────────────────────────────
 async def post_message(client: httpx.AsyncClient, query_id: int, query: str) -> bool:
    payload = {
        "text": query,
        "session_id": f"routing-bench-{query_id}",
        "channel": "cli",
        "user_id": "benchmark",
        "metadata": {"no_inference": True, "benchmark": True},
    }
    try:
        r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
        r.raise_for_status()
        return True
    except Exception as e:
        print(f" POST_ERROR: {e}", end="")
        return False
 # ── Dataset ────────────────────────────────────────────────────────────────────
 def load_dataset() -> list[dict]:
    with open(DATASET) as f:
        return json.load(f)["queries"]
 def filter_queries(queries, tier, category, ids):
    if tier:
        queries = [q for q in queries if q["tier"] == tier]
    if category:
        queries = [q for q in queries if q["category"] == category]
    if ids:
        queries = [q for q in queries if q["id"] in ids]
    return queries
 # ── Main run ───────────────────────────────────────────────────────────────────
 async def run(queries: list[dict]) -> list[dict]:
    results = []
    async with httpx.AsyncClient() as client:
        try:
            r = await client.get(f"{ADOLF_URL}/health", timeout=5)
            r.raise_for_status()
        except Exception as e:
            print(f"ERROR: Adolf not reachable: {e}", file=sys.stderr)
            sys.exit(1)
        total = len(queries)
        correct = 0
        print(f"\nRunning {total} queries [NO-INFERENCE: routing only]\n")
        print(f"{'ID':>3}  {'EXPECTED':8}  {'ACTUAL':8}  {'OK':3}  {'TIME':6}  {'CATEGORY':22}  QUERY")
        print("─" * 110)
        for q in queries:
            qid = q["id"]
            expected = q["tier"]
            category = q["category"]
            query_text = q["query"]
            session_id = f"routing-bench-{qid}"
            print(f"{qid:>3}  {expected:8}  ", end="", flush=True)
            logs_before = get_log_tail(300)
            t0 = time.monotonic()
            ok_post = await post_message(client, qid, query_text)
            if not ok_post:
                print(f"{'?':8}  {'ERR':3}  {'?':6}  {category:22}  {query_text[:40]}")
                results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "category": category, "query": query_text})
                continue
            try:
                async with client.stream(
                    "GET", f"{ADOLF_URL}/stream/{session_id}", timeout=QUERY_TIMEOUT
                ) as sse:
                    async for line in sse.aiter_lines():
                        if "data: [DONE]" in line:
                            break
            except Exception:
                pass  # timeout or connection issue — check logs anyway
            logs_after = get_log_tail(300)
            actual = extract_tier_from_logs(logs_before, logs_after)
            if actual is None:
                actual = "timeout"
            elapsed = time.monotonic() - t0
            match = actual == expected or (actual == "fast" and expected == "medium")
            if match:
                correct += 1
            mark = "✓" if match else "✗"
            actual_str = actual
            print(f"{actual_str:8}  {mark:3}  {elapsed:5.1f}s  {category:22}  {query_text[:40]}")
            results.append({
                "id": qid,
                "expected": expected,
                "actual": actual_str,
                "ok": match,
                "elapsed": round(elapsed, 1),
                "category": category,
                "query": query_text,
            })
        print("─" * 110)
        accuracy = correct / total * 100 if total else 0
        print(f"\nAccuracy: {correct}/{total} ({accuracy:.0f}%)")
        for tier_name in ["light", "medium", "complex"]:
            tier_qs = [r for r in results if r["expected"] == tier_name]
            if tier_qs:
                tier_ok = sum(1 for r in tier_qs if r["ok"])
                print(f"  {tier_name:8}: {tier_ok}/{len(tier_qs)}")
        wrong = [r for r in results if not r["ok"]]
        if wrong:
            print(f"\nMisclassified ({len(wrong)}):")
            for r in wrong:
                actual_label = r["actual"] or "?"  # actual is None when the POST failed
                print(f"  id={r['id']:3}  expected={r['expected']:8}  actual={actual_label:8}  {r.get('query', '')[:60]}")
    with open(RESULTS, "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    print(f"\nResults saved to {RESULTS}")
    return results
 def main():
    parser = argparse.ArgumentParser(
        description="Adolf routing benchmark — routing decisions only, no LLM inference",
    )
    parser.add_argument("--tier", choices=["light", "medium", "complex"])
    parser.add_argument("--category")
    parser.add_argument("--ids", help="Comma-separated IDs")
    parser.add_argument("--list-categories", action="store_true")
    args = parser.parse_args()
    queries = load_dataset()
    if args.list_categories:
        cats = sorted(set(q["category"] for q in queries))
        tiers = {t: sum(1 for q in queries if q["tier"] == t) for t in ["light", "medium", "complex"]}
        print(f"Total: {len(queries)} | Tiers: {tiers}")
        print(f"Categories: {cats}")
        return
    ids = [int(i) for i in args.ids.split(",")] if args.ids else None
    queries = filter_queries(queries, args.tier, args.category, ids)
    if not queries:
        print("No queries match filters.")
        sys.exit(1)
    asyncio.run(run(queries))
 if __name__ == "__main__":
    main()
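The accuracy summary printed at the end of `run()` could be factored into a pure function that is easy to unit-test. This is only a sketch and not part of the commit: `is_match` and `summarize` are hypothetical names, and the code assumes result records shaped like those appended above (`id`, `expected`, `actual`, `ok`).

```python
from collections import Counter


def is_match(expected: str, actual) -> bool:
    """Same rule as the benchmark: a 'fast' route counts as a correct 'medium'."""
    return actual == expected or (actual == "fast" and expected == "medium")


def summarize(results: list) -> dict:
    """Compute overall and per-tier accuracy from benchmark result records."""
    totals = Counter(r["expected"] for r in results)
    hits = Counter(r["expected"] for r in results if r["ok"])
    correct = sum(hits.values())
    return {
        "accuracy_pct": round(100 * correct / len(results)) if results else 0,
        "per_tier": {
            tier: (hits[tier], totals[tier])
            for tier in ("light", "medium", "complex")
            if totals[tier]
        },
        "misclassified": [r["id"] for r in results if not r["ok"]],
    }


# Made-up records for illustration only.
records = [
    {"id": 1, "expected": "medium", "actual": "fast"},
    {"id": 2, "expected": "complex", "actual": "medium"},
    {"id": 3, "expected": "complex", "actual": "complex"},
    {"id": 4, "expected": "light", "actual": None},  # e.g. the POST failed
]
for r in records:
    r["ok"] = is_match(r["expected"], r["actual"])
print(summarize(records))
```

Keeping the matching rule in one place would also let the text, routing, and voice benchmarks share it instead of repeating the `fast`/`medium` special case.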
--- a/benchmarks/run_voice_benchmark.py
+++ b/benchmarks/run_voice_benchmark.py
@@ -0,0 +1,425 @@
 #!/usr/bin/env python3
 """
 Adolf voice benchmark.
 Pipeline for each query:
  1. Synthesize query text → WAV via Silero TTS (localhost:8881)
  2. Transcribe WAV → text via faster-whisper STT (localhost:8880)
  3. Send transcription to Adolf → check routing tier
  4. Report: WER per query, routing accuracy vs text baseline
 Usage:
    python3 run_voice_benchmark.py [options]
    python3 run_voice_benchmark.py --tier light|medium|complex
    python3 run_voice_benchmark.py --ids 1,2,3
    python3 run_voice_benchmark.py --no-inference  # skip LLM inference — routing only, all tiers
 IMPORTANT: The GPU must be free before running. The script checks this automatically; the check is skipped with --no-inference.
 Services required:
  - Adolf:         http://localhost:8000
  - Silero TTS:    http://localhost:8881  (openai/silero-tts container)
  - faster-whisper: http://localhost:8880  (faster-whisper container)
 """
 import argparse
 import asyncio
 import io
 import json
 import re
 import subprocess
 import sys
 import tempfile
 import time
 import unicodedata
 from pathlib import Path
 import httpx
 ADOLF_URL = "http://localhost:8000"
 OLLAMA_URL = "http://localhost:11436"
 TTS_URL = "http://localhost:8881"       # Silero TTS — OpenAI-compatible /v1/audio/speech
 STT_URL = "http://localhost:8880"       # faster-whisper — OpenAI-compatible /v1/audio/transcriptions
 DATASET = Path(__file__).parent / "benchmark.json"
 RESULTS_DIR = Path(__file__).parent
 TIER_WAIT = 15        # seconds to wait for tier= in docker logs
 MIN_FREE_RAM_MB = 1500
 MIN_FREE_VRAM_MB = 500
 # ── Pre-flight ─────────────────────────────────────────────────────────────────
 def check_ram() -> tuple[bool, str]:
    try:
        with open("/proc/meminfo") as f:
            info = {}
            for line in f:
                parts = line.split()
                if len(parts) >= 2:
                    info[parts[0].rstrip(":")] = int(parts[1])
        free_mb = info.get("MemAvailable", 0) // 1024
        total_mb = info.get("MemTotal", 0) // 1024
        msg = f"RAM: {free_mb} MB free / {total_mb} MB total"
        if free_mb < MIN_FREE_RAM_MB:
            return False, f"CRITICAL: {msg} — need at least {MIN_FREE_RAM_MB} MB free"
        return True, msg
    except Exception as e:
        return True, f"RAM check failed (non-fatal): {e}"
 def check_gpu() -> tuple[bool, str]:
    try:
        r = httpx.get(f"{OLLAMA_URL}/api/ps", timeout=5)
        r.raise_for_status()
        data = r.json()
        models = data.get("models", [])
        if models:
            names = [m.get("name", "?") for m in models]
            sizes_mb = [m.get("size_vram", 0) // (1024 * 1024) for m in models]
            loaded = ", ".join(f"{n} ({s}MB)" for n, s in zip(names, sizes_mb))
            total_vram = sum(sizes_mb)
            if total_vram > 7000:
                return False, f"GPU BUSY: {loaded} — {total_vram}MB VRAM used. Wait for models to unload."
            return True, f"GPU: {loaded} ({total_vram}MB VRAM)"
        return True, "GPU: idle"
    except httpx.ConnectError:
        return True, "GPU check skipped (Ollama not reachable)"
    except Exception as e:
        return True, f"GPU check failed (non-fatal): {e}"
 def check_services() -> tuple[bool, str]:
    """Check TTS and STT are reachable."""
    msgs = []
    ok = True
    for name, url, path in [("TTS", TTS_URL, "/"), ("STT", STT_URL, "/")]:
        try:
            r = httpx.get(url + path, timeout=5)
            msgs.append(f"{name}: reachable (HTTP {r.status_code})")
        except Exception as e:
            msgs.append(f"{name}: NOT REACHABLE — {e}")
            ok = False
    return ok, " | ".join(msgs)
 def preflight_checks(skip_gpu_check: bool = False) -> bool:
    print("\n── Pre-flight checks ──────────────────────────────────────────")
    ram_ok, ram_msg = check_ram()
    print(f"  {'✓' if ram_ok else '✗'} {ram_msg}")
    if not ram_ok:
        print("\nABORTING: not enough RAM.")
        return False
    if not skip_gpu_check:
        gpu_ok, gpu_msg = check_gpu()
        print(f"  {'✓' if gpu_ok else '✗'} {gpu_msg}")
        if not gpu_ok:
            print("\nABORTING: GPU is busy.")
            return False
    svc_ok, svc_msg = check_services()
    print(f"  {'✓' if svc_ok else '✗'} {svc_msg}")
    if not svc_ok:
        print("\nABORTING: required voice services not running.")
        print("Start them with: cd /home/alvis/agap_git/openai && docker compose up -d faster-whisper silero-tts")
        return False
    print("  All checks passed.\n")
    return True
 # ── TTS ────────────────────────────────────────────────────────────────────────
 async def synthesize(client: httpx.AsyncClient, text: str) -> bytes | None:
    """Synthesize text to WAV via Silero TTS (OpenAI-compatible /v1/audio/speech)."""
    try:
        r = await client.post(
            f"{TTS_URL}/v1/audio/speech",
            json={"model": "tts-1", "input": text, "voice": "alloy", "response_format": "wav"},
            timeout=30,
        )
        r.raise_for_status()
        return r.content
    except Exception as e:
        print(f"\n  [TTS error: {e}]", end="")
        return None
 # ── STT ────────────────────────────────────────────────────────────────────────
 async def transcribe(client: httpx.AsyncClient, wav_bytes: bytes) -> str | None:
    """Transcribe WAV to text via faster-whisper (OpenAI-compatible /v1/audio/transcriptions)."""
    try:
        files = {"file": ("audio.wav", wav_bytes, "audio/wav")}
        data = {"model": "whisper-1", "language": "ru", "response_format": "json"}
        r = await client.post(
            f"{STT_URL}/v1/audio/transcriptions",
            files=files,
            data=data,
            timeout=60,
        )
        r.raise_for_status()
        result = r.json()
        return result.get("text", "").strip()
    except Exception as e:
        print(f"\n  [STT error: {e}]", end="")
        return None
 # ── WER ────────────────────────────────────────────────────────────────────────
 def normalize(text: str) -> str:
    """Lowercase, strip punctuation, normalize unicode for WER calculation."""
    text = unicodedata.normalize("NFC", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
 def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER between reference and hypothesis."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Dynamic programming edit distance
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(ref)][len(hyp)] / len(ref)
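 # Worked example for word_error_rate (hypothetical transcript, for illustration only):
 #   reference  = "поставь таймер на 10 минут"      -> 5 words after normalize()
 #   hypothesis = "поставь таймер на десять минут"  -> 5 words
 #   One substitution ("10" -> "десять") gives an edit distance of 1,
 #   so WER = 1 / 5 = 0.2.
 # The result is normalised by the reference length only, so a hypothesis
 # with many inserted words can yield a WER above 1.0.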
 # ── Adolf interaction ──────────────────────────────────────────────────────────
 def get_log_tail(n: int = 60) -> str:
    result = subprocess.run(
        ["docker", "logs", "deepagents", "--tail", str(n)],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr
 def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
    before_lines = set(logs_before.splitlines())
    new_lines = [line for line in logs_after.splitlines() if line not in before_lines]
    for line in new_lines:
        # Capture only the tier word; a trailing "(no-inference)" suffix is ignored
        m = re.search(r"tier=(\w+)", line)
        if m:
            return m.group(1)
    return None
 async def post_to_adolf(
    client: httpx.AsyncClient,
    query_id: int,
    text: str,
    no_inference: bool = False,
 ) -> bool:
    payload = {
        "text": text,
        "session_id": f"voice-bench-{query_id}",
        "channel": "cli",
        "user_id": "benchmark",
        "metadata": {"no_inference": no_inference, "benchmark": True, "voice": True},
    }
    try:
        r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
        r.raise_for_status()
        return True
    except Exception as e:
        print(f"\n  [Adolf error: {e}]", end="")
        return False
 # ── Dataset ────────────────────────────────────────────────────────────────────
 def load_dataset() -> list[dict]:
    with open(DATASET) as f:
        return json.load(f)["queries"]
 def filter_queries(queries, tier, category, ids):
    if tier:
        queries = [q for q in queries if q["tier"] == tier]
    if category:
        queries = [q for q in queries if q["category"] == category]
    if ids:
        queries = [q for q in queries if q["id"] in ids]
    return queries
 # ── Main run ───────────────────────────────────────────────────────────────────
 async def run(queries: list[dict], no_inference: bool = False, save_audio: bool = False) -> None:
    async with httpx.AsyncClient() as client:
        # Check Adolf
        try:
            r = await client.get(f"{ADOLF_URL}/health", timeout=5)
            r.raise_for_status()
        except Exception as e:
            print(f"ERROR: Adolf not reachable: {e}", file=sys.stderr)
            sys.exit(1)
        total = len(queries)
        results = []
        mode_label = " [NO-INFERENCE: routing only]" if no_inference else ""
        print(f"Voice benchmark: {total} queries{mode_label}\n")
        print(f"{'ID':>3}  {'EXP':8}  {'ACT':8}  {'OK':3}  {'WER':5}  {'TRANSCRIPT'}")
        print("─" * 100)
        total_wer = 0.0
        wer_count = 0
        correct = 0
        for q in queries:
            qid = q["id"]
            expected = q["tier"]
            original = q["query"]
            print(f"{qid:>3}  {expected:8}  ", end="", flush=True)
            # Step 1: TTS
            wav = await synthesize(client, original)
            if wav is None:
                print(f"{'?':8}  {'ERR':3}  {'?':5}  [TTS failed]")
                results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "wer": None, "error": "tts"})
                continue
            if save_audio:
                audio_path = RESULTS_DIR / "voice_audio" / f"{qid}.wav"
                audio_path.parent.mkdir(parents=True, exist_ok=True)
                audio_path.write_bytes(wav)
            # Step 2: STT
            transcript = await transcribe(client, wav)
            if transcript is None:
                print(f"{'?':8}  {'ERR':3}  {'?':5}  [STT failed]")
                results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "wer": None, "error": "stt"})
                continue
            # Calculate WER
            wer = word_error_rate(original, transcript)
            total_wer += wer
            wer_count += 1
            # Step 3: Send to Adolf
            logs_before = get_log_tail(60)
            t0 = time.monotonic()
            ok_post = await post_to_adolf(client, qid, transcript, no_inference=no_inference)
            if not ok_post:
                print(f"{'?':8}  {'ERR':3}  {wer:4.2f}  {transcript[:50]}")
                results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "wer": wer, "transcript": transcript})
                continue
            # Step 4: Wait for routing decision
            actual = None
            for _ in range(TIER_WAIT * 2):
                await asyncio.sleep(0.5)
                logs_after = get_log_tail(60)
                actual = extract_tier_from_logs(logs_before, logs_after)
                if actual and actual in ("light", "medium", "complex", "fast"):
                    break
            elapsed = time.monotonic() - t0
            match = actual == expected or (actual == "fast" and expected == "medium")
            if match:
                correct += 1
            mark = "✓" if match else "✗"
            actual_str = actual or "?"
            print(f"{actual_str:8}  {mark:3}  {wer:4.2f}  {transcript[:60]}")
            results.append({
                "id": qid,
                "expected": expected,
                "actual": actual_str,
                "ok": match,
                "wer": round(wer, 3),
                "original": original,
                "transcript": transcript,
                "elapsed": round(elapsed, 1),
                "no_inference": no_inference,
            })
            await asyncio.sleep(0.5)
        print("─" * 100)
        # Summary
        accuracy = correct / total * 100 if total else 0
        avg_wer = total_wer / wer_count * 100 if wer_count else 0
        print(f"\nRouting accuracy: {correct}/{total} ({accuracy:.0f}%)")
        print(f"Average WER:      {avg_wer:.1f}%  (lower is better; 0% = perfect transcription)")
        for tier_name in ["light", "medium", "complex"]:
            tier_qs = [r for r in results if r["expected"] == tier_name]
            if tier_qs:
                tier_ok = sum(1 for r in tier_qs if r["ok"])
                tier_wers = [r["wer"] for r in tier_qs if r.get("wer") is not None]
                avg = sum(tier_wers) / len(tier_wers) * 100 if tier_wers else 0
                print(f"  {tier_name:8}: routing {tier_ok}/{len(tier_qs)}  avg WER {avg:.1f}%")
        wrong = [r for r in results if not r["ok"]]
        if wrong:
            print(f"\nMisclassified after voice ({len(wrong)}):")
            for r in wrong:
                print(f"  id={r['id']:3}  expected={r.get('expected') or '?':8}  actual={r.get('actual') or '?':8}  transcript={r.get('transcript','')[:50]}")
        high_wer = [r for r in results if r.get("wer") and r["wer"] > 0.3]
        if high_wer:
            print("\nHigh WER queries (>30%) — transcription quality issues:")
            for r in high_wer:
                print(f"  id={r['id']:3}  WER={r['wer']*100:.0f}%  original: {r.get('original','')[:50]}")
                print(f"          transcript: {r.get('transcript','')[:50]}")
        # Save results
        ts = int(time.time())
        out_path = RESULTS_DIR / f"voice_results_{ts}.json"
        latest_path = RESULTS_DIR / "voice_results_latest.json"
        # Build the payload once and write both the timestamped and "latest" copies
        payload = {"summary": {"accuracy": accuracy, "avg_wer": avg_wer, "total": total}, "results": results}
        for path in (out_path, latest_path):
            with open(path, "w", encoding="utf-8") as f:
                json.dump(payload, f, indent=2, ensure_ascii=False)
        print(f"\nResults saved to {latest_path}")
 def main():
    parser = argparse.ArgumentParser(
        description="Adolf voice benchmark — TTS→STT→routing pipeline",
        epilog="Requires: Silero TTS (port 8881) and faster-whisper (port 8880) running."
    )
    parser.add_argument("--tier", choices=["light", "medium", "complex"])
    parser.add_argument("--category")
    parser.add_argument("--ids", help="Comma-separated IDs")
    parser.add_argument("--no-inference", action="store_true",
                        help="Skip LLM inference for all tiers — routing decisions only (no GPU/API cost)")
    parser.add_argument("--save-audio", action="store_true",
                        help="Save synthesized WAV files to voice_audio/ directory")
    parser.add_argument("--skip-gpu-check", action="store_true")
    args = parser.parse_args()
    if not preflight_checks(skip_gpu_check=args.skip_gpu_check or args.no_inference):
        sys.exit(1)
    queries = load_dataset()
    ids = [int(i) for i in args.ids.split(",")] if args.ids else None
    queries = filter_queries(queries, args.tier, args.category, ids)
    if not queries:
        print("No queries match filters.")
        sys.exit(1)
    asyncio.run(run(queries, no_inference=args.no_inference, save_audio=args.save_audio))
 if __name__ == "__main__":
    main()
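A minimal, self-contained sketch of the word-level edit distance behind the WER column, useful for sanity-checking individual rows. It copies `normalize` from the script above; the `wer` name and the sample sentences are illustrative only, and it keeps two rolling rows instead of the full DP matrix that `word_error_rate` builds.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```

Note that punctuation handling affects the score: "What's" normalizes to the two tokens `what s`, so a transcript containing `whats` costs two edits against that reference.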
--- a/bifrost-config.json
+++ b/bifrost-config.json
@@ -0,0 +1,75 @@
 {
  "auth_config": {
    "is_enabled": true,
    "admin_username": "admin",
    "admin_password": "env.BIFROST_ADMIN_PASSWORD"
  },
  "config_store": {
    "enabled": true,
    "type": "postgres",
    "config": {
      "host": "bifrost-db",
      "port": "5432",
      "user": "bifrost",
      "password": "bifrost",
      "db_name": "bifrost",
      "ssl_mode": "disable"
    }
  },
  "client": {
    "drop_excess_requests": false
  },
  "providers": {
    "ollama": {
      "keys": [
        {
          "name": "ollama-gpu",
          "value": "dummy",
          "models": [
            "qwen2.5:0.5b",
            "qwen2.5:1.5b",
            "qwen3:4b",
            "gemma3:4b",
            "qwen3:8b"
          ],
          "weight": 1.0
        }
      ],
      "network_config": {
        "base_url": "http://host.docker.internal:11436",
        "default_request_timeout_in_seconds": 300,
        "max_retries": 2,
        "retry_backoff_initial_ms": 500,
        "retry_backoff_max_ms": 10000
      }
    },
    "ollama-cpu": {
      "keys": [
        {
          "name": "ollama-cpu-key",
          "value": "dummy",
          "models": [
            "gemma3:1b",
            "qwen2.5:1.5b",
            "qwen2.5:3b"
          ],
          "weight": 1.0
        }
      ],
      "network_config": {
        "base_url": "http://host.docker.internal:11435",
        "default_request_timeout_in_seconds": 120,
        "max_retries": 2,
        "retry_backoff_initial_ms": 500,
        "retry_backoff_max_ms": 10000
      },
      "custom_provider_config": {
        "base_provider_type": "openai",
        "allowed_requests": {
          "chat_completion": true,
          "chat_completion_stream": true
        }
      }
    }
  }
 }
--- a/channels.py
+++ b/channels.py
@@ -49,6 +49,7 @@ async def deliver(session_id: str, channel: str, text: str) -> None:
 # ── built-in channel adapters ─────────────────────────────────────────────────
 GRAMMY_URL = os.getenv("GRAMMY_URL", "http://grammy:3001")
+MATRIX_URL = os.getenv("MATRIX_URL", "http://matrix:3002")
 async def _telegram_send(session_id: str, text: str) -> None:
@@ -64,12 +65,26 @@ async def _telegram_send(session_id: str, text: str) -> None:
            )
+async def _matrix_send(session_id: str, text: str) -> None:
+    """Send reply to Matrix via the matrix adapter POST /send endpoint."""
+    room_id = session_id.removeprefix("mx-")
+    MAX_MTX = 4000
+    chunks = [text[i:i + MAX_MTX] for i in range(0, len(text), MAX_MTX)]
+    async with httpx.AsyncClient(timeout=15) as client:
+        for chunk in chunks:
+            await client.post(
+                f"{MATRIX_URL}/send",
+                json={"room_id": room_id, "text": chunk},
+            )
 async def _cli_send(session_id: str, text: str) -> None:
    """CLI replies are handled entirely through the pending_replies queue — no-op here."""
    pass
 def register_defaults() -> None:
-    """Register the built-in Telegram and CLI channel adapters."""
+    """Register the built-in Telegram, Matrix, and CLI channel adapters."""
    register("telegram", _telegram_send)
+    register("matrix", _matrix_send)
    register("cli", _cli_send)
--- a/cli.py
+++ b/cli.py
@@ -1,9 +1,9 @@
 #!/usr/bin/env python3
 """
-Adolf CLI — interactive REPL for the multi-channel gateway.
+Adolf CLI — interactive REPL with Rich streaming display.
 Usage:
-    python3 cli.py [--url http://localhost:8000] [--session cli-alvis]
+    python3 cli.py [--url http://deepagents:8000] [--session cli-alvis]
 """
 import argparse
@@ -12,7 +12,13 @@ import os
 import sys
 import urllib.request
-GATEWAY = "http://localhost:8000"
+from rich.console import Console
+from rich.live import Live
+from rich.markdown import Markdown
+from rich.text import Text
+GATEWAY = "http://deepagents:8000"
+console = Console()
 def post_message(gateway: str, text: str, session_id: str) -> None:
@@ -20,7 +26,7 @@ def post_message(gateway: str, text: str, session_id: str) -> None:
        "text": text,
        "session_id": session_id,
        "channel": "cli",
-        "user_id": os.getlogin(),
+        "user_id": os.getenv("USER", "user"),
    }).encode()
    req = urllib.request.Request(
        f"{gateway}/message",
@@ -30,33 +36,49 @@ def post_message(gateway: str, text: str, session_id: str) -> None:
    )
    with urllib.request.urlopen(req, timeout=10) as r:
        if r.status != 202:
-            print(f"[error] gateway returned {r.status}", file=sys.stderr)
+            console.print(f"[red][error] gateway returned {r.status}[/red]")
            sys.exit(1)
-def wait_for_reply(gateway: str, session_id: str, timeout: int = 400) -> str:
-    """Open SSE stream and return first data event."""
+def stream_reply(gateway: str, session_id: str, timeout: int = 400) -> str:
+    """
+    Open the /stream/{session_id} SSE endpoint and display tokens live with
+    Rich. Returns the full assembled reply text.
+    """
    req = urllib.request.Request(
-        f"{gateway}/reply/{session_id}",
+        f"{gateway}/stream/{session_id}",
        headers={"Accept": "text/event-stream"},
    )
+    buffer = ""
    with urllib.request.urlopen(req, timeout=timeout + 5) as r:
-        for raw_line in r:
-            line = raw_line.decode("utf-8").rstrip("\n")
-            if line.startswith("data:"):
-                return line[5:].strip().replace("\\n", "\n")
-    return ""
+        with Live(Text(""), console=console, refresh_per_second=20, transient=True) as live:
+            for raw_line in r:
+                line = raw_line.decode("utf-8").rstrip("\n")
+                if not line.startswith("data:"):
+                    continue
+                chunk = line[5:].strip()
+                if chunk == "[DONE]":
+                    break
+                chunk = chunk.replace("\\n", "\n")
+                buffer += chunk
+                live.update(Text(buffer))
+    # Render the complete reply as Markdown once streaming is done
+    console.print(Markdown(buffer))
+    return buffer
 def main():
    parser = argparse.ArgumentParser(description="Adolf CLI")
    parser.add_argument("--url", default=GATEWAY, help="Gateway URL")
-    parser.add_argument("--session", default=f"cli-{os.getlogin()}", help="Session ID")
+    parser.add_argument("--session", default=f"cli-{os.getenv('USER', 'user')}",
+                        help="Session ID")
    parser.add_argument("--timeout", type=int, default=400, help="Reply timeout (seconds)")
    args = parser.parse_args()
-    print(f"Adolf CLI  (session={args.session}, gateway={args.url})")
-    print("Type your message and press Enter. Ctrl+C or Ctrl+D to exit.\n")
+    console.print(f"[bold]Adolf CLI[/bold]  (session=[cyan]{args.session}[/cyan], "
+                  f"gateway=[cyan]{args.url}[/cyan])")
+    console.print("Type your message and press Enter. Ctrl+C or Ctrl+D to exit.\n")
    try:
        while True:
@@ -68,12 +90,11 @@ def main():
                continue
            post_message(args.url, text, args.session)
-            print("...", end="", flush=True)
-            reply = wait_for_reply(args.url, args.session, timeout=args.timeout)
-            print(f"\r{reply}\n")
+            stream_reply(args.url, args.session, timeout=args.timeout)
+            console.print()
    except KeyboardInterrupt:
-        print("\nbye")
+        console.print("\n[dim]bye[/dim]")
 if __name__ == "__main__":
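The chunk-assembly logic of `stream_reply` can be exercised without a gateway or Rich. The sketch below pulls that loop into a hypothetical pure helper, `assemble_sse` (not part of cli.py), which takes any iterable of raw byte lines. It assumes the same wire format the CLI expects: `data:` lines, a literal `\n` escape inside chunks, and a `[DONE]` sentinel.

```python
from typing import Iterable


def assemble_sse(lines: Iterable[bytes]) -> str:
    """Concatenate SSE data chunks until the [DONE] sentinel is seen."""
    buffer = ""
    for raw_line in lines:
        line = raw_line.decode("utf-8").rstrip("\n")
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        chunk = line[5:].strip()
        if chunk == "[DONE]":
            break
        # The server is assumed to escape newlines as a literal backslash-n
        buffer += chunk.replace("\\n", "\n")
    return buffer
```

Because every chunk is stripped before it is appended, whitespace at chunk edges does not survive; this mirrors what the CLI code does.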
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -6,19 +6,29 @@ services:
      - "8000:8000"
    environment:
      - PYTHONUNBUFFERED=1
      # LiteLLM proxy — all LLM inference goes through here
      - LITELLM_URL=http://host.docker.internal:4000/v1
      - LITELLM_API_KEY=sk-fjQC1BxAiGFSMs
      # Direct Ollama GPU URL — used only by VRAMManager for flush/prewarm
      - OLLAMA_BASE_URL=http://host.docker.internal:11436
      - DEEPAGENTS_MODEL=qwen3:4b
-      - DEEPAGENTS_COMPLEX_MODEL=qwen3:8b
+      - DEEPAGENTS_COMPLEX_MODEL=deepseek/deepseek-r1:free
      - DEEPAGENTS_ROUTER_MODEL=qwen2.5:1.5b
      - SEARXNG_URL=http://host.docker.internal:11437
      - GRAMMY_URL=http://grammy:3001
      - MATRIX_URL=http://host.docker.internal:3002
      - CRAWL4AI_URL=http://crawl4ai:11235
      - ROUTECHECK_URL=http://routecheck:8090
      - ROUTECHECK_TOKEN=${ROUTECHECK_TOKEN}
    volumes:
      - ./logs:/app/logs
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      - openmemory
      - grammy
      - crawl4ai
      - routecheck
    restart: unless-stopped
  openmemory:
@@ -27,8 +37,9 @@ services:
    ports:
      - "8765:8765"
    environment:
-      # Extraction LLM (qwen2.5:1.5b) runs on GPU after reply — fast 2-5s extraction
+      # Extraction LLM runs on GPU — qwen2.5:1.5b for speed (~3s)
      - OLLAMA_GPU_URL=http://host.docker.internal:11436
      - OLLAMA_EXTRACTION_MODEL=qwen2.5:1.5b
      # Embedding (nomic-embed-text) runs on CPU — fast enough for search (50-150ms)
      - OLLAMA_CPU_URL=http://host.docker.internal:11435
    extra_hosts:
@@ -45,6 +56,33 @@ services:
      - DEEPAGENTS_URL=http://deepagents:8000
    restart: unless-stopped
  cli:
    build:
      context: .
      dockerfile: Dockerfile.cli
    container_name: cli
    environment:
      - DEEPAGENTS_URL=http://deepagents:8000
    depends_on:
      - deepagents
    stdin_open: true
    tty: true
    profiles:
      - tools
  routecheck:
    build: ./routecheck
    container_name: routecheck
    ports:
      - "8090:8090"
    environment:
      - YANDEX_ROUTING_KEY=${YANDEX_ROUTING_KEY}
      - INTERNAL_TOKEN=${ROUTECHECK_TOKEN}
      - HTTPS_PROXY=http://host.docker.internal:56928
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped
  crawl4ai:
    image: unclecode/crawl4ai:latest
    container_name: crawl4ai
--- a/fast_tools.py
+++ b/fast_tools.py
@@ -0,0 +1,188 @@
 """
 Fast Tools — pre-flight tools invoked by a classifier before the main LLM call.
 Each FastTool has:
  - matches(message) → bool   : regex classifier that determines if this tool applies
  - run(message) → str        : async fetch that returns enrichment context
 FastToolRunner holds a list of FastTools. The Router uses any_matches() to force
 the tier to medium before LLM classification. run_agent_task() calls run_matching()
 to build extra context that is injected into the system prompt.
 To add a new fast tool:
  1. Subclass FastTool, implement name/matches/run
  2. Add an instance to the list passed to FastToolRunner in agent.py
 """
 import asyncio
 import re
 from abc import ABC, abstractmethod
 import httpx
 class FastTool(ABC):
    """Base class for all pre-flight fast tools."""
    @property
    @abstractmethod
    def name(self) -> str: ...
    @abstractmethod
    def matches(self, message: str) -> bool: ...
    @abstractmethod
    async def run(self, message: str) -> str: ...
 _WMO_CODES = {
    0: "clear sky", 1: "mainly clear", 2: "partly cloudy", 3: "overcast",
    45: "fog", 48: "icy fog",
    51: "light drizzle", 53: "drizzle", 55: "heavy drizzle",
    61: "light rain", 63: "rain", 65: "heavy rain",
    71: "light snow", 73: "snow", 75: "heavy snow", 77: "snow grains",
    80: "light showers", 81: "showers", 82: "heavy showers",
    85: "snow showers", 86: "heavy snow showers",
    95: "thunderstorm", 96: "thunderstorm with hail", 99: "thunderstorm with heavy hail",
 }
 class WeatherTool(FastTool):
    """
    Fetches current weather for Balashikha, Moscow region directly from open-meteo.com.
    No API key required. Returns a ready-to-deliver reply — no LLM reformatting needed.
    """
    _PATTERN = re.compile(
        r"\b(weather|forecast|temperature|rain(ing)?|snow(ing)?|humidity|wind\s*speed"
        r"|холодно|тепло|погода|прогноз погоды"
        r"|how (hot|cold|warm) is it|what.?s the (weather|temp)|dress for the weather)\b",
        re.IGNORECASE,
    )
    _URL = (
        "https://api.open-meteo.com/v1/forecast"
        "?latitude=55.7963&longitude=37.9382"
        "&current=temperature_2m,apparent_temperature,relative_humidity_2m"
        ",wind_speed_10m,weather_code"
        "&wind_speed_unit=ms"
    )
    @property
    def name(self) -> str:
        return "weather"
    def matches(self, message: str) -> bool:
        return bool(self._PATTERN.search(message))
    async def run(self, message: str) -> str:
        try:
            async with httpx.AsyncClient(timeout=10) as client:
                r = await client.get(self._URL)
                r.raise_for_status()
                c = r.json()["current"]
        except Exception as e:
            return f"[weather error: {e}]"
        temp = c["temperature_2m"]
        feels = c["apparent_temperature"]
        humidity = c["relative_humidity_2m"]
        wind = c["wind_speed_10m"]
        # A missing weather_code maps to "unknown" rather than defaulting to code 0 (clear sky)
        condition = _WMO_CODES.get(c.get("weather_code"), "unknown")
        return (
            f"Balashikha: {condition}, {temp:+.0f}°C (feels like {feels:+.0f}°C), "
            f"wind {wind:.1f} m/s, humidity {humidity}%."
        )
 class CommuteTool(FastTool):
    """
    Returns real-time driving time from home (Balashikha) to a destination
    using Yandex traffic data via the local routecheck service.
    Triggered by queries about commute time, arrival, or road traffic.
    The routecheck service handles Yandex API auth and the HTTPS proxy.
    """
    _PATTERN = re.compile(
        r"\b(commute|arrival time|how long.{0,20}(drive|get|travel|reach)"
        r"|сколько.{0,20}(ехать|добираться|минут)"
        r"|пробки|traffic|road.{0,10}now|drive to (work|office|center|москва|moscow)"
        r"|when (will i|do i) (arrive|get there|reach))\b",
        re.IGNORECASE,
    )
    # Home: Balashikha. Default destination: Moscow city center.
    _HOME = "55.7963,37.9382"
    _DEST = "55.7558,37.6173"
    def __init__(self, routecheck_url: str, internal_token: str):
        self._url = routecheck_url.rstrip("/")
        self._token = internal_token
    @property
    def name(self) -> str:
        return "commute"
    def matches(self, message: str) -> bool:
        return bool(self._PATTERN.search(message))
    async def run(self, message: str) -> str:
        if not self._token:
            return "[commute: ROUTECHECK_TOKEN not configured]"
        try:
            async with httpx.AsyncClient(timeout=15) as client:
                r = await client.get(
                    f"{self._url}/api/route",
                    params={"from": self._HOME, "to": self._DEST, "token": self._token},
                )
                r.raise_for_status()
                d = r.json()
        except Exception as e:
            return f"[commute error: {e}]"
        traffic = d["duration_traffic_min"]
        normal  = d["duration_min"]
        dist    = d["distance_km"]
        delay   = traffic - normal
        lines = [
            "Current drive time from Balashikha to Moscow center:",
            f"  With traffic: {traffic} min",
            f"  Without traffic: {normal} min",
            f"  Distance: {dist} km",
        ]
        if delay > 5:
            lines.append(f"  Traffic delay: +{delay} min")
        return "\n".join(lines)
 class FastToolRunner:
    """
    Classifier + executor for fast tools.
    Used in two places:
      - Router.route(): any_matches() forces medium tier before LLM classification
      - run_agent_task(): run_matching() builds enrichment context in the pre-flight gather
    """
    def __init__(self, tools: list[FastTool]):
        self._tools = tools
    def any_matches(self, message: str) -> bool:
        """True if any fast tool applies to this message."""
        return any(t.matches(message) for t in self._tools)
    def matching_names(self, message: str) -> list[str]:
        """Names of tools that match this message (for logging)."""
        return [t.name for t in self._tools if t.matches(message)]
    async def run_matching(self, message: str) -> str:
        """Run all matching tools concurrently and return combined context."""
        matching = [t for t in self._tools if t.matches(message)]
        if not matching:
            return ""
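        # return_exceptions=True keeps one failing tool from cancelling the whole gather
        results = await asyncio.gather(*[t.run(message) for t in matching], return_exceptions=True)
        # Drop raised exceptions and bracketed "[... error ...]" markers
        parts = [r for r in results if isinstance(r, str) and r and not r.startswith("[")]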
        return "\n\n".join(parts)
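The "To add a new fast tool" steps in the module docstring can be sketched as follows. `ClockTool` is a hypothetical example, not part of this PR, and the `FastTool` base class is repeated in trimmed form only so the snippet runs on its own.

```python
import asyncio
import re
from abc import ABC, abstractmethod
from datetime import datetime, timezone


class FastTool(ABC):
    """Trimmed copy of the base class from fast_tools.py."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def matches(self, message: str) -> bool: ...

    @abstractmethod
    async def run(self, message: str) -> str: ...


class ClockTool(FastTool):
    """Hypothetical fast tool: supplies the current UTC time as context."""

    _PATTERN = re.compile(r"\b(what time is it|current time)\b", re.IGNORECASE)

    @property
    def name(self) -> str:
        return "clock"

    def matches(self, message: str) -> bool:
        return bool(self._PATTERN.search(message))

    async def run(self, message: str) -> str:
        now = datetime.now(timezone.utc)
        return f"Current time (UTC): {now:%H:%M}"
```

Per the docstring, registering it would mean adding a `ClockTool()` instance to the list passed to `FastToolRunner` in agent.py. The returned string must not start with `[`, since `run_matching()` treats bracketed results as errors and drops them.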
--- a/openmemory/CLAUDE.md
+++ b/openmemory/CLAUDE.md
@@ -0,0 +1,26 @@
 # openmemory
 FastMCP server wrapping mem0 for persistent per-session memory, backed by Qdrant + nomic-embed-text.
 ## Tools exposed (MCP)
 - `add_memory(text, user_id)` — extract facts from a conversation turn and store in Qdrant
 - `search_memory(query, user_id)` — semantic search, returns results with score ≥ 0.5
 - `get_all_memories(user_id)` — dump all stored memories for a session
 These are called directly by `agent.py` (outside the agent loop), never exposed to the LLM as tools.
 ## Two Ollama instances
 - **GPU** (`OLLAMA_GPU_URL`, port 11436) — extraction model (`qwen2.5:1.5b`): pulls facts from conversation text
 - **CPU** (`OLLAMA_CPU_URL`, port 11435) — embedding model (`nomic-embed-text`): 50–150 ms per query
 ## Prompts
 Custom `EXTRACTION_PROMPT` starts with `/no_think` to suppress qwen3 chain-of-thought and get clean JSON output. Custom `UPDATE_MEMORY_PROMPT` handles deduplication — mem0 merges new facts with existing ones rather than creating duplicates.
 ## Notes
 - Qdrant collection is created automatically on first use
 - Memory is keyed by `user_id` which equals `session_id` in `agent.py`
 - Extraction runs after the reply is sent (background task) — GPU contention with medium model is avoided since the semaphore is released before `_store_memory()` is scheduled
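The ordering described in the last note (reply first, semaphore released, then extraction scheduled in the background) can be sketched with plain asyncio. Only the names `run_agent_task` and `_store_memory` come from the project; the bodies, signatures, and the `events` list are hypothetical stand-ins for illustration.

```python
import asyncio


async def _store_memory(events: list[str], session_id: str, turn: str) -> None:
    # Stand-in for the real extraction + add_memory call (body is hypothetical).
    await asyncio.sleep(0)
    events.append(f"stored:{session_id}")


async def run_agent_task(sem: asyncio.Semaphore, events: list[str],
                         session_id: str, text: str) -> asyncio.Task:
    async with sem:
        # Inference and reply delivery happen while the GPU slot is held.
        events.append("reply")
    # The semaphore is released here; only now is extraction scheduled.
    return asyncio.create_task(_store_memory(events, session_id, text))


async def main() -> list[str]:
    events: list[str] = []
    sem = asyncio.Semaphore(1)
    task = await run_agent_task(sem, events, "cli-demo", "hello")
    await task  # keep a reference so the background task is not garbage-collected
    return events
```

Running `asyncio.run(main())` records the reply before the store, which is the property the note relies on to avoid GPU contention.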
--- a/openmemory/server.py
+++ b/openmemory/server.py
@@ -6,6 +6,7 @@ from mem0 import Memory
 # Extraction LLM — GPU Ollama (qwen3:4b, same model as medium agent)
 # Runs after reply when GPU is idle; spin-wait in agent.py prevents contention
 OLLAMA_GPU_URL = os.getenv("OLLAMA_GPU_URL", "http://host.docker.internal:11436")
+EXTRACTION_MODEL = os.getenv("OLLAMA_EXTRACTION_MODEL", "qwen2.5:1.5b")
 # Embedding — CPU Ollama (nomic-embed-text, 137 MB RAM)
 # Used for both search (50-150ms, acceptable) and store-time embedding
@@ -94,7 +95,7 @@ config = {
    "llm": {
        "provider": "ollama",
        "config": {
-            "model": "qwen3:4b",
+            "model": EXTRACTION_MODEL,
            "ollama_base_url": OLLAMA_GPU_URL,
            "temperature": 0.1,  # consistent JSON output
        },
--- a/pytest.ini
+++ b/pytest.ini
@@ -0,0 +1,4 @@
 [pytest]
 testpaths = tests/unit
 pythonpath = .
 asyncio_mode = auto
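With `asyncio_mode = auto`, pytest-asyncio collects plain `async def` tests without a `@pytest.mark.asyncio` marker. A minimal sketch of what a test under `tests/unit` could look like; `fetch_context` is a hypothetical helper, not something from this repository.

```python
import asyncio


async def fetch_context(message: str) -> str:
    """Hypothetical async helper standing in for code under test."""
    await asyncio.sleep(0)
    return message.upper()


async def test_fetch_context_uppercases():
    # No @pytest.mark.asyncio needed when asyncio_mode = auto is set.
    assert await fetch_context("hi") == "HI"
```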
--- a/routecheck/CLAUDE.md
+++ b/routecheck/CLAUDE.md
@@ -0,0 +1,25 @@
 # routecheck
 FastAPI service providing a Yandex Routing API proxy behind an image captcha.
 ## Purpose
 Yandex Routing API free tier requires a website that uses the API. This service is that website.
 It also exposes an internal endpoint (`/api/route`) used by `CommuteTool` in `fast_tools.py`.
 ## Two access paths
 - **Web UI** (`/`): solve PIL arithmetic captcha → get a token → query any two lat/lon points
 - **Internal API**: `GET /api/route?from=lat,lon&to=lat,lon&token=$ROUTECHECK_TOKEN` — no captcha
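A small sketch of how a caller might build the internal API URL described above. The `route_url` helper and the sample token are hypothetical; the path and the `from`/`to`/`token` parameter names are taken from this file.

```python
from urllib.parse import urlencode


def route_url(base_url: str, origin: tuple[float, float],
              dest: tuple[float, float], token: str) -> str:
    """Build GET /api/route with lat,lon pairs; commas are left unescaped."""
    params = {
        "from": f"{origin[0]},{origin[1]}",
        "to": f"{dest[0]},{dest[1]}",
        "token": token,
    }
    return f"{base_url.rstrip('/')}/api/route?{urlencode(params, safe=',')}"
```

In practice an HTTP client such as httpx can take the same dictionary via its `params` argument, which is what `CommuteTool` does.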
 ## Key env vars
 - `YANDEX_ROUTING_KEY` — from developer.tech.yandex.ru, Router API, free tier
 - `INTERNAL_TOKEN` — equals `ROUTECHECK_TOKEN` from root `.env`; shared with deepagents
 - `HTTPS_PROXY` — set to `http://host.docker.internal:56928`; container has no direct external internet
 ## Notes
 - Captchas expire after 5 min, route tokens after 1 hour, both stored in-memory (restart clears them)
 - Yandex API expects `lon,lat` order (not `lat,lon`) — `app.py` swaps automatically
 - Captcha image endpoint: `GET /captcha/image/{id}` — regenerates on each call with random noise
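The coordinate swap mentioned above is easy to get wrong, so here is one plausible shape for it. How `app.py` actually performs the swap is not shown in this section; `to_yandex_order` is a hypothetical helper for illustration.

```python
def to_yandex_order(lat_lon: str) -> str:
    """Convert a 'lat,lon' string into the 'lon,lat' order Yandex expects."""
    parts = lat_lon.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected 'lat,lon', got {lat_lon!r}")
    lat, lon = (float(p) for p in parts)  # float() also rejects non-numeric input
    return f"{lon},{lat}"
```

For the Balashikha coordinates used by `CommuteTool`, `to_yandex_order("55.7963,37.9382")` yields `"37.9382,55.7963"`.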
--- a/routecheck/Dockerfile
+++ b/routecheck/Dockerfile
@@ -0,0 +1,6 @@
 FROM python:3.12-slim
 WORKDIR /app
 RUN apt-get update && apt-get install -y --no-install-recommends fonts-dejavu-core && rm -rf /var/lib/apt/lists/*
 RUN pip install --no-cache-dir fastapi uvicorn pillow httpx
 COPY app.py .
 CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8090"]
--- a/routecheck/app.py
+++ b/routecheck/app.py
@@ -0,0 +1,377 @@
 """
 RouteCheck — local routing web service with image captcha.
 Endpoints:
  GET  /                          — web UI
  GET  /captcha/image/{id}        — PNG captcha image
  POST /api/captcha/new           — create captcha, return {id}
  POST /api/captcha/solve         — {id, answer} → {token} or 400
  GET  /api/route                 — ?from=lat,lon&to=lat,lon&token=...
                                    token = solved captcha token OR INTERNAL_TOKEN env var
 """
 import io
 import math
 import os
 import random
 import string
 import time
 import uuid
 from typing import Optional
 import httpx
 from fastapi import FastAPI, HTTPException, Query
 from fastapi.responses import HTMLResponse, JSONResponse, StreamingResponse
 from PIL import Image, ImageDraw, ImageFilter, ImageFont
 from pydantic import BaseModel
 app = FastAPI(title="RouteCheck")
 # ── Config ─────────────────────────────────────────────────────────────────────
 YANDEX_KEY = os.getenv("YANDEX_ROUTING_KEY", "")
 INTERNAL_TOKEN = os.getenv("INTERNAL_TOKEN", "")
 HTTPS_PROXY = os.getenv("HTTPS_PROXY", "")
 CAPTCHA_TTL = 300          # seconds a captcha is valid
 TOKEN_TTL   = 3600         # seconds a solved token is valid
 # ── In-memory captcha store ────────────────────────────────────────────────────
 _captchas: dict[str, dict] = {}   # id → {problem, answer, expires}
 _tokens:   dict[str, float] = {}  # token → expires
 def _purge():
    now = time.time()
    for k in list(_captchas.keys()):
        if _captchas[k]["expires"] < now:
            del _captchas[k]
    for k in list(_tokens.keys()):
        if _tokens[k] < now:
            del _tokens[k]
 # ── Captcha image generation ───────────────────────────────────────────────────
 def _rand_color(dark=False):
    if dark:
        return tuple(random.randint(0, 100) for _ in range(3))
    return tuple(random.randint(140, 255) for _ in range(3))
 def _make_captcha_image(text: str) -> bytes:
    W, H = 220, 80
    img = Image.new("RGB", (W, H), color=_rand_color())
    draw = ImageDraw.Draw(img)
    # Background noise: random lines
    for _ in range(8):
        x1, y1 = random.randint(0, W), random.randint(0, H)
        x2, y2 = random.randint(0, W), random.randint(0, H)
        draw.line([(x1, y1), (x2, y2)], fill=_rand_color(dark=True), width=2)
    # Background noise: random dots
    for _ in range(300):
        x, y = random.randint(0, W), random.randint(0, H)
        draw.point((x, y), fill=_rand_color(dark=True))
    # Draw each character with a slight random offset
    try:
        font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 36)
    except Exception:
        font = ImageFont.load_default()
    char_w = W // (len(text) + 2)
    for i, ch in enumerate(text):
        x = char_w + i * char_w + random.randint(-4, 4)
        y = (H - 40) // 2 + random.randint(-6, 6)
        # Draw shadow
        draw.text((x + 2, y + 2), ch, font=font, fill=_rand_color(dark=True))
        draw.text((x, y), ch, font=font, fill=_rand_color(dark=True))
    # Wavy distortion via pixel manipulation
    pixels = img.load()
    for x in range(W):
        shift = int(4 * math.sin(x / 15.0))
        col = [pixels[x, y] for y in range(H)]
        for y in range(H):
            pixels[x, y] = col[(y - shift) % H]
    img = img.filter(ImageFilter.SMOOTH)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()
 def _generate_problem() -> tuple[str, int]:
    """Return (display_text, answer)."""
    ops = [
        lambda a, b: (f"{a} + {b} = ?", a + b),
        lambda a, b: (f"{a} × {b} = ?", a * b),
        lambda a, b: (f"{max(a,b)} − {min(a,b)} = ?", max(a, b) - min(a, b)),
    ]
    op = random.choice(ops)
    a, b = random.randint(2, 9), random.randint(2, 9)
    text, answer = op(a, b)
    return text, answer
 # ── Routes ─────────────────────────────────────────────────────────────────────
@app.get("/", response_class=HTMLResponse)
 async def index():
    return HTML_PAGE
@app.get("/captcha/image/{captcha_id}")
 async def captcha_image(captcha_id: str):
    _purge()
    entry = _captchas.get(captcha_id)
    if not entry:
        raise HTTPException(404, "Captcha not found or expired")
    png = _make_captcha_image(entry["problem"])
    return StreamingResponse(io.BytesIO(png), media_type="image/png",
                             headers={"Cache-Control": "no-store"})
 class CaptchaNewResponse(BaseModel):
    id: str
@app.post("/api/captcha/new")
 async def captcha_new():
    _purge()
    problem_text, answer = _generate_problem()
    cid = str(uuid.uuid4())
    _captchas[cid] = {
        "problem": problem_text,
        "answer": answer,
        "expires": time.time() + CAPTCHA_TTL,
    }
    return {"id": cid}
 class SolveRequest(BaseModel):
    id: str
    answer: int
@app.post("/api/captcha/solve")
 async def captcha_solve(req: SolveRequest):
    _purge()
    entry = _captchas.get(req.id)
    if not entry:
        raise HTTPException(400, "Captcha expired or not found")
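    if entry["answer"] != req.answer:
        # Invalidate the challenge on failure: answers span only a few dozen
        # values, so unlimited retries on one id would allow brute-forcing.
        del _captchas[req.id]
        raise HTTPException(400, "Wrong answer")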
    token = str(uuid.uuid4())
    _tokens[token] = time.time() + TOKEN_TTL
    del _captchas[req.id]
    return {"token": token}
@app.get("/api/route")
 async def route(
    from_coords: str = Query(..., alias="from", description="lat,lon"),
    to_coords:   str = Query(..., alias="to",   description="lat,lon"),
    token: str = Query(...),
 ):
    _purge()
    # Auth: internal service token or valid captcha token
    if token != INTERNAL_TOKEN:
        if token not in _tokens:
            raise HTTPException(401, "Invalid or expired token — solve captcha first")
    if not YANDEX_KEY:
        raise HTTPException(503, "YANDEX_ROUTING_KEY not configured")
    # Parse coords
    try:
        from_lat, from_lon = map(float, from_coords.split(","))
        to_lat,   to_lon   = map(float, to_coords.split(","))
    except ValueError:
        raise HTTPException(400, "coords must be lat,lon")
    # Yandex Routing API expects lon,lat order
    waypoints = f"{from_lon},{from_lat}|{to_lon},{to_lat}"
    transport = httpx.AsyncHTTPTransport(proxy=HTTPS_PROXY) if HTTPS_PROXY else None
    async with httpx.AsyncClient(timeout=15, transport=transport) as client:
        try:
            r = await client.get(
                "https://api.routing.yandex.net/v2/route",
                params={"apikey": YANDEX_KEY, "waypoints": waypoints, "mode": "driving"},
            )
        except Exception as e:
            raise HTTPException(502, f"Yandex API unreachable: {e}")
    if r.status_code != 200:
        raise HTTPException(502, f"Yandex API error {r.status_code}: {r.text[:200]}")
    data = r.json()
    try:
        leg = data["route"]["legs"][0]
        duration_s         = leg["duration"]
        duration_traffic_s = leg.get("duration_in_traffic", duration_s)
        distance_m         = leg["distance"]
    except (KeyError, IndexError) as e:
        raise HTTPException(502, f"Unexpected Yandex response: {e} — {str(data)[:200]}")
    return {
        "duration_min":         round(duration_s / 60),
        "duration_traffic_min": round(duration_traffic_s / 60),
        "distance_km":          round(distance_m / 1000, 1),
    }
 # ── HTML ───────────────────────────────────────────────────────────────────────
 HTML_PAGE = """<!DOCTYPE html>
 <html lang="en">
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <title>RouteCheck</title>
 <style>
  * { box-sizing: border-box; margin: 0; padding: 0; }
  body { font-family: system-ui, sans-serif; background: #0f172a; color: #e2e8f0; min-height: 100vh;
         display: flex; align-items: center; justify-content: center; }
  .card { background: #1e293b; border-radius: 12px; padding: 2rem; width: 420px;
          box-shadow: 0 20px 60px rgba(0,0,0,.5); }
  h1 { font-size: 1.4rem; font-weight: 700; color: #38bdf8; margin-bottom: .3rem; }
  .sub { color: #94a3b8; font-size: .85rem; margin-bottom: 1.5rem; }
  label { display: block; font-size: .8rem; color: #94a3b8; margin-bottom: .3rem; margin-top: 1rem; }
  input { width: 100%; background: #0f172a; border: 1px solid #334155; border-radius: 6px;
          color: #e2e8f0; padding: .55rem .75rem; font-size: .95rem; outline: none; }
  input:focus { border-color: #38bdf8; }
  button { width: 100%; margin-top: 1.2rem; padding: .7rem; background: #0ea5e9;
           border: none; border-radius: 6px; color: #fff; font-size: 1rem;
           font-weight: 600; cursor: pointer; transition: background .2s; }
  button:hover { background: #0284c7; }
  button:disabled { background: #334155; cursor: default; }
  .captcha-row { display: flex; gap: .75rem; align-items: center; margin-top: 1rem; }
  .captcha-row img { border-radius: 6px; border: 1px solid #334155; cursor: pointer; }
  .captcha-row input { flex: 1; }
  .result { margin-top: 1.2rem; background: #0f172a; border-radius: 8px; padding: 1rem;
            border-left: 3px solid #38bdf8; display: none; }
  .result .big { font-size: 1.6rem; font-weight: 700; color: #38bdf8; }
  .result .label { font-size: .8rem; color: #64748b; margin-top: .2rem; }
  .result .row { display: flex; gap: 1.5rem; margin-top: .8rem; }
  .result .metric { flex: 1; }
  .result .metric .val { font-size: 1.1rem; font-weight: 600; }
  .error { color: #f87171; margin-top: .8rem; font-size: .85rem; display: none; }
  .step { display: none; }
  .step.active { display: block; }
  a.refresh { font-size: .75rem; color: #38bdf8; text-decoration: none; display: block;
              margin-top: .4rem; }
  a.refresh:hover { text-decoration: underline; }
 </style>
 </head>
 <body>
 <div class="card">
  <h1>RouteCheck</h1>
  <p class="sub">Real-time driving time with Yandex traffic data</p>
  <!-- Step 1: captcha -->
  <div class="step active" id="step-captcha">
    <label>Prove you are human</label>
    <div class="captcha-row">
      <img id="captcha-img" src="" alt="captcha" width="160" height="60"
           title="Click to refresh" onclick="loadCaptcha()">
      <input id="captcha-ans" type="number" placeholder="Answer" min="0" max="999">
    </div>
    <a class="refresh" href="#" onclick="loadCaptcha();return false;">↻ New challenge</a>
    <div class="error" id="captcha-err">Wrong answer, try again.</div>
    <button id="captcha-btn" onclick="solveCaptcha()">Verify →</button>
  </div>
  <!-- Step 2: route query -->
  <div class="step" id="step-route">
    <label>From (lat, lon)</label>
    <input id="from" type="text" placeholder="55.7963, 37.9382" value="55.7963, 37.9382">
    <label>To (lat, lon)</label>
    <input id="to" type="text" placeholder="55.7558, 37.6173" value="55.7558, 37.6173">
    <button id="route-btn" onclick="queryRoute()">Get travel time</button>
    <div class="error" id="route-err"></div>
    <div class="result" id="result">
      <div class="big" id="res-traffic"></div>
      <div class="label">with current traffic</div>
      <div class="row">
        <div class="metric"><div class="val" id="res-normal"></div>
          <div class="label">without traffic</div></div>
        <div class="metric"><div class="val" id="res-dist"></div>
          <div class="label">distance</div></div>
      </div>
    </div>
  </div>
 </div>
 <script>
 let captchaId = null;
 let routeToken = null;
 async function loadCaptcha() {
  const r = await fetch('/api/captcha/new', {method: 'POST'});
  const d = await r.json();
  captchaId = d.id;
  document.getElementById('captcha-img').src = '/captcha/image/' + captchaId + '?t=' + Date.now();
  document.getElementById('captcha-ans').value = '';
  document.getElementById('captcha-err').style.display = 'none';
 }
 async function solveCaptcha() {
  const ans = parseInt(document.getElementById('captcha-ans').value, 10);
  if (isNaN(ans)) return;
  const btn = document.getElementById('captcha-btn');
  btn.disabled = true;
  const r = await fetch('/api/captcha/solve', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({id: captchaId, answer: ans})
  });
  if (r.ok) {
    const d = await r.json();
    routeToken = d.token;
    document.getElementById('step-captcha').classList.remove('active');
    document.getElementById('step-route').classList.add('active');
  } else {
    document.getElementById('captcha-err').style.display = 'block';
    loadCaptcha();
  }
  btn.disabled = false;
 }
 async function queryRoute() {
  const from = document.getElementById('from').value.trim();
  const to   = document.getElementById('to').value.trim();
  const btn  = document.getElementById('route-btn');
  const err  = document.getElementById('route-err');
  err.style.display = 'none';
  document.getElementById('result').style.display = 'none';
  btn.disabled = true;
  btn.textContent = 'Fetching…';
  const r = await fetch(`/api/route?from=${encodeURIComponent(from)}&to=${encodeURIComponent(to)}&token=${routeToken}`);
  btn.disabled = false;
  btn.textContent = 'Get travel time';
  if (!r.ok) {
    // Error bodies may not be JSON (e.g. a proxy error page); fall back to {}.
    const d = await r.json().catch(() => ({}));
    err.textContent = d.detail || 'Error';
    err.style.display = 'block';
    return;
  }
  const d = await r.json();
  document.getElementById('res-traffic').textContent = d.duration_traffic_min + ' min';
  document.getElementById('res-normal').textContent  = d.duration_min + ' min';
  document.getElementById('res-dist').textContent    = d.distance_km + ' km';
  document.getElementById('result').style.display = 'block';
 }
 loadCaptcha();
 document.getElementById('captcha-ans').addEventListener('keydown', e => {
  if (e.key === 'Enter') solveCaptcha();
 });
 </script>
 </body>
 </html>
 """
--- a/router.py
+++ b/router.py
@@ -1,10 +1,38 @@
 import asyncio
 import re
 import math
 from typing import Optional
 from openai import AsyncOpenAI
 from langchain_core.messages import SystemMessage, HumanMessage
 from fast_tools import FastToolRunner
-# ── Regex pre-classifier ──────────────────────────────────────────────────────
+# ── Regex pre-classifiers ─────────────────────────────────────────────────────
-# Catches obvious light-tier patterns before calling the LLM.
+
-# Keyed by regex → compiled pattern.
+# Complex: keyword triggers that reliably signal deep multi-source research
 _COMPLEX_PATTERNS = re.compile(
    r"(?:^|\s)("
    r"research|investigate|deep.dive|think carefully"
    r"|write a (?:detailed|comprehensive|full|thorough|complete)"
    r"|compare all|find and (?:compare|summarize|analyze)"
    r"|in[- ]depth analysis|comprehensive guide"
    r"|detailed (?:report|analysis|comparison|breakdown|overview)"
    r"|everything about|all (?:major|available|self-hosted|open.source)"
    r"|pros and cons|with (?:sources|citations|references)"
    # Russian complex research keywords (no trailing \b — stems like подробн match подробное/подробный)
    r"|исследуй|изучи все|сравни все|найди и сравни|найди и опиши"
    r"|напиши подробн|напиши детальн|напиши полн"
    r"|подробный отчет|детальн\w+ (?:анализ|сравнение|отчет)"
    r"|подробное (?:руководство|сравнение)|полное руководство"
    r"|все варианты|все способы|все доступные|все самохостируемые|все платформы"
    r"|лучшие практики|все инструменты|все решения|все протоколы"
    r"|найди детальн|найди и кратко опиши"
    r"|изучи свежие|изучи лучши|изучи все"
    r"|сравни все\b"
    r")",
    re.IGNORECASE,
 )
 # Light: trivial queries that need no tools or memory
 _LIGHT_PATTERNS = re.compile(
    r"^("
    # Greetings / farewells
@@ -14,35 +42,316 @@ _LIGHT_PATTERNS = re.compile(
    r"|thanks?|thank you|thx|ty|ok|okay|k|cool|great|awesome|perfect|sounds good|got it|nice|sure"
    r"|how are you|how are you\?|how are you doing(\s+today)?[?!.]*"
    r"|what.?s up"
-    # Calendar facts: "what day comes after X?" / "what comes after X?"
+    # Calendar facts
    r"|what\s+day\s+(comes\s+after|follows|is\s+after)\s+\w+[?!.]*"
    r"|what\s+comes\s+after\s+\w+[?!.]*"
-    # Acronym expansions: "what does X stand for?"
+    # Acronym expansions
    r"|what\s+does\s+\w+\s+stand\s+for[?!.]*"
    # Russian greetings / farewells / acknowledgements
    r"|привет|пока|спасибо|здравствуй|здравствуйте|добрый день|добрый вечер|доброе утро"
    r"|окей|хорошо|отлично|понятно|ок|ладно|договорились|спс|благодарю"
    r"|пожалуйста|не за что|всё понятно|ясно"
    r"|как дела|как ты|как жизнь|всё хорошо|всё ок"
    # Assistant control words / confirmations
    r"|да|нет|стоп|отмена|отменить|подожди|повтори|повторить|не нужно|не надо"
    r"|слышишь\s+меня|ты\s+тут|отлично[,!]?\s+спасибо"
    r"|yes|no|stop|cancel|wait|repeat"
    # Russian tech definitions — static knowledge (no tools needed)
    r"|что\s+такое\s+\S+"
    r"|что\s+означает\s+\S+"
    r"|сколько\s+(?:бит|байт|байтов|мегабайт|мегабайтов|гигабайт|гигабайтов)(?:\s+\w+)*"
    # Compound Russian greetings
    r"|привет[,!]?\s+как\s+дела"
    r"|добрый\s+(?:день|вечер|утро)[,!]?\s+как\s+дела"
    r")[\s!.?]*$",
    re.IGNORECASE,
 )
-# ── LLM classification prompt ─────────────────────────────────────────────────
+# ── Semantic router utterances ────────────────────────────────────────────────
-CLASSIFY_PROMPT = """Classify the message. Output ONLY one word: light, medium, or complex.
+# These are embedded at startup. New messages are classified by cosine
 # similarity — whichever tier's centroid is closest wins.
 _LIGHT_UTTERANCES = [
    # General facts (English)
    "what is 2+2",
    "what is the capital of France",
    "name the three primary colors",
    "tell me a short joke",
    "is the sky blue",
    "is water wet",
    "how many days in a week",
    "what is the speed of light",
    "what is the boiling point of water",
    "spell the word beautiful",
    "what color is the ocean",
    "how many inches in a foot",
    "who wrote hamlet",
    "what is pi",
    "what year did world war two end",
    "what is the largest planet",
    "how many continents are there",
    "what does DNA stand for",
    "what language do they speak in Brazil",
    "what is the square root of 144",
    # Tech definitions — static knowledge (English)
    "what is Docker",
    "what is a VPN",
    "what is SSH",
    "what is a reverse proxy",
    "what is an API",
    "what is a firewall",
    "what is a container",
    "what is DNS",
    "what is HTTPS",
    "what is a load balancer",
    "what is Kubernetes",
    "what is Git",
    "what is a network port",
    "what is an IP address",
    "what is a subnet mask",
    "what is the OSI model",
    "how many bits in a byte",
    "how many bytes in a gigabyte",
    "what is TCP",
    "what is a REST API",
    # Russian — static facts and definitions
    "что такое IP-адрес",
    "что такое VPN",
    "что такое Docker",
    "что такое DNS",
    "что такое SSH",
    "что означает API",
    "сколько байт в гигабайте",
    "сколько бит в байте",
    "что такое Zigbee",
    "что такое Z-Wave",
    "что такое брандмауэр",
    "что такое виртуальная машина",
    "что такое обратный прокси",
    "привет",
    "пока",
    "спасибо",
    "как дела",
    "что такое Matter протокол",
    "сколько планет в солнечной системе",
    "чему равно число Пи",
    # Russian — more static definitions
    "что такое TCP/IP",
    "что такое подсеть",
    "скорость света",
    "сколько дней в году",
    "что такое Kubernetes",
    "что такое Git",
    "что такое REST API",
    "что такое TCP",
    "что такое UDP",
    "что такое VLAN",
    "сколько мегабайт в гигабайте",
    "что такое процессор",
    "что такое оперативная память",
    "что такое виртуализация",
    "что такое Linux",
    "что такое умный дом",
    "что такое Home Assistant",
    "что такое Matter",
 ]
-LIGHT = answerable from general knowledge, no internet needed:
+_MEDIUM_UTTERANCES = [
-  what is 2+2 / what is the capital of France / name the three primary colors
+    # English — current data, memory, actions
-  tell me a short joke / is the sky blue / is water wet
+    "what is the weather today",
    "what is the bitcoin price right now",
    "what are the latest news",
    "what did we talk about last time",
    "what is my name",
    "where do I live",
    "what do you know about me",
    "what did I tell you before",
    "what is the current temperature outside",
    "remind me what I said about my project",
    "search for the latest iPhone release",
    "find me a restaurant nearby",
    "turn on the lights in the living room",
    "turn off all lights",
    "set temperature to 22 degrees",
    "what is the current traffic to Moscow",
    "check if anyone is home",
    "what devices are currently on",
    "look up my public IP address",
    "show me recent news about Proxmox",
    # Russian — weather and commute
    "какая сегодня погода в Балашихе",
    "пойдет ли сегодня дождь",
    "какая температура на улице сейчас",
    "погода на завтра",
    "будет ли снег сегодня",
    "сколько ехать до Москвы сейчас",
    "какие пробки на дороге до Москвы",
    "время в пути на работу",
    "есть ли пробки сейчас",
    "стоит ли брать зонтик",
    # Russian — smart home control
    "включи свет в гостиной",
    "выключи свет на кухне",
    "какая температура дома",
    "установи температуру 22 градуса",
    "выключи все лампочки",
    "какие устройства сейчас включены",
    "включи ночной режим",
    "открой шторы в гостиной",
    "включи свет в спальне на 50 процентов",
    "выключи свет во всём доме",
    "включи вентилятор в детской",
    "закрыты ли все окна",
    "выключи телевизор",
    "какое потребление электричества сегодня",
    "включи кофемашину",
    "сколько у нас датчиков движения",
    "состояние всех дверных замков",
    "есть ли кто-нибудь дома",
    "установи будильник на 7 утра",
    # Russian — personal memory
    "как меня зовут",
    "где я живу",
    "что мы обсуждали в прошлый раз",
    "что ты знаешь о моем домашнем сервере",
    "напомни, какие сервисы я запускаю",
    "что я просил тебя запомнить",
    "что я говорил о своей сети",
    # Russian — current info lookups requiring network/tools
    "какой сейчас курс биткоина",
    "курс доллара к рублю сейчас",
    "какая последняя версия Docker",
    "как перезапустить Docker контейнер",
    "как посмотреть логи Docker контейнера",
    "какие новые функции в Home Assistant 2024",
    "есть ли проблемы у Cloudflare сегодня",
    "какие новые Zigbee устройства вышли в 2024 году",
    "найди хороший опенсорс менеджер фотографий",
    "последние новости Proxmox",
    "напиши bash команду для поиска больших файлов",
    "как вывести список всех запущенных контейнеров",
    "как проверить использование диска в Linux",
 ]
-MEDIUM = requires web search or the user's stored memories:
+_COMPLEX_UTTERANCES = [
-  current weather / today's news / Bitcoin price / what did we talk about
+    # English
-  what is my name / where do I live / what is my job / do I have any pets
+    "research everything about Elon Musk's recent projects and investments",
-  what do you know about me / what are my preferences / what did I tell you
+    "write a detailed report on climate change solutions with sources",
    "investigate the history and current state of quantum computing",
    "find and summarize the latest academic papers on transformer architectures",
    "analyze in depth the pros and cons of nuclear energy with citations",
    "research the background and controversies around this person",
    "compare all major cloud providers with detailed pricing and features",
    "write a comprehensive biography of this historical figure",
    "investigate what caused the 2008 financial crisis with multiple sources",
    "research the best programming languages in 2024 with detailed comparison",
    "find everything published about this medical condition and treatments",
    "do a deep dive into the latest developments in artificial general intelligence",
    "research and compare all options for starting a business in Europe",
    "investigate recent news and controversies around this company",
    "write a thorough analysis of geopolitical tensions in the Middle East",
    "find detailed information on the side effects and studies for this medication",
    "research the top 10 JavaScript frameworks with benchmarks and community data",
    "investigate who is funding AI research and what their goals are",
    "write a detailed market analysis for the electric vehicle industry",
    "research everything you can find about this startup or technology",
    # Russian — deep research
    "исследуй и сравни все варианты умного домашнего освещения",
    "напиши подробный отчет о протоколах умного дома",
    "изучи все самохостируемые медиасерверы и сравни их",
    "исследуй лучшие практики безопасности домашнего сервера",
    "сравни все системы резервного копирования для Linux",
    "напиши детальное сравнение WireGuard и OpenVPN",
    "исследуй все варианты голосового управления на русском языке",
    "изучи все опенсорс альтернативы Google сервисам",
    "напиши подробный анализ локальных языковых моделей",
    "исследуй лучшие инструменты мониторинга для домашнего сервера",
    # Russian — more deep research queries matching benchmark
    "исследуй и сравни Proxmox, Unraid и TrueNAS для домашней лаборатории",
    "напиши подробное руководство по безопасности домашнего сервера",
    "исследуй все доступные дашборды для самохостинга и сравни их",
    "найди детальные бенчмарки ARM одноплатных компьютеров для домашней лаборатории",
    "исследуй лучший стек мониторинга для самохостинга в 2024 году",
    "исследуй и сравни WireGuard, OpenVPN и Tailscale для домашней сети",
    "исследуй лучшие практики сегментации домашней сети с VLAN",
    "изучи все самохостируемые DNS решения и их возможности",
    "исследуй и сравни все платформы умного дома: Home Assistant и другие",
    "изучи лучшие Zigbee координаторы и их совместимость с Home Assistant",
    "напиши детальный отчет о поддержке протокола Matter и совместимости устройств",
    "исследуй все способы интеграции умных ламп с Home Assistant",
    "найди и сравни все варианты датчиков движения для умного дома",
    "исследуй и сравни все самохостируемые решения для хранения фотографий",
    "изучи лучшие самохостируемые медиасерверы: Jellyfin, Plex и Emby",
    "исследуй последние достижения в локальном LLM инференсе и обзор моделей",
    "изучи лучшие опенсорс альтернативы Google сервисов для приватности",
    "найди и кратко опиши все крупные самохостируемые менеджеры паролей",
    "напиши детальный анализ текущего состояния AI ассистентов для самохостинга",
    "исследуй и сравни все инструменты оркестрации контейнеров для домашней лаборатории",
    "изучи лучшие подходы к автоматическому резервному копированию в Linux",
    "исследуй и сравни все самохостируемые инструменты личных финансов",
    "изучи свежие CVE и уязвимости в популярном самохостируемом ПО",
    "напиши подробное руководство по настройке автоматизаций в Home Assistant",
    "исследуй все варианты голосового управления умным домом на русском языке",
    "сравни все системы резервного копирования для Linux: Restic, BorgBackup и другие",
    "исследуй лучшие самохостируемые системы мониторинга сети: Zabbix, Grafana",
    "изучи все варианты локального запуска языковых моделей на видеокарте",
    "напиши подробный отчет о технологиях синтеза речи с открытым исходным кодом",
    "исследуй все способы интеграции умных розеток с мониторингом потребления",
    "напиши полное руководство по настройке обратного прокси Caddy",
    "исследуй лучшие практики написания Docker Compose файлов для продакшена",
    "сравни все самохостируемые облачные хранилища: Nextcloud, Seafile и другие",
    "изучи все доступные локальные ассистенты с голосовым управлением",
    "исследуй все самохостируемые решения для блокировки рекламы: Pi-hole, AdGuard",
    "напиши детальное сравнение систем управления конфигурацией: Ansible, Puppet",
    "исследуй все протоколы умного дома и их плюсы и минусы: Zigbee, Z-Wave, Matter",
    "найди и сравни все фреймворки для создания локальных AI ассистентов",
    "исследуй лучшие решения для автоматического управления медиатекой",
    "изучи все варианты самохостируемых систем учёта расходов с возможностью импорта",
    "напиши сравнение всех вариантов самохостинга для хранения и синхронизации файлов",
    "исследуй все открытые протоколы для умного дома и их экосистемы",
    "изучи лучшие инструменты для автоматизации домашней инфраструктуры",
 ]
-COMPLEX = /think prefix only:
+# Medium: queries that require tools, actions, or real-time data (not static knowledge)
-  /think compare frameworks / /think plan a trip
+_MEDIUM_PATTERNS = re.compile(
-
+    r"(?:"
-Message: {message}
+    # Russian smart home commands — always need HA integration
-Output (one word only — light, medium, or complex):"""
+    r"(?:включи|выключи|открой|закрой|установи|поставь|убавь|прибавь|переключи)\s"
    r"|(?:какая|какой|какое|каково)\s+(?:температура|влажность|потребление|состояние|статус)\s"
    r"|(?:сколько|есть ли)\s.*(?:датчик|устройств|замк)"
    # Russian memory queries
    r"|как меня зовут|где я живу|что мы обсуждали|что я говорил|что я просил"
    r"|напомни\b|что ты знаешь обо мне"
    # Russian current info
    r"|курс (?:доллара|биткоина|евро|рубл)"
    r"|(?:последние |свежие )?новости\b"
    r"|(?:погода|температура)\s+(?:на завтра|на неделю)"
    # Smart home commands that don't use verb-first pattern
    r"|(?:свет|лампочк|освещени)\w*\s+(?:включ|выключ|убавь|прибавь)"
    r"|(?:дома|в доме|по всему дому)\s+(?:свет|лампочк)"
    r"|(?:режим|сцена)\s+(?:ночной|утренний|вечерний|кинотеатр)"
    r")",
    re.IGNORECASE,
 )
 LIGHT_REPLY_PROMPT = """You are a helpful Telegram assistant. Answer briefly and naturally (1-3 sentences). Be friendly."""
 _EMBED_MODEL = "ollama/nomic-embed-text"
 def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
 def _centroid(embeddings: list[list[float]]) -> list[float]:
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in range(n)) / n for d in range(dim)]
 def _format_history(history: list[dict]) -> str:
    if not history:
@@ -55,64 +364,97 @@ def _format_history(history: list[dict]) -> str:
    return "\n".join(lines)
 def _parse_tier(text: str) -> str:
    """Extract tier from raw model output. Default to medium."""
    t = text.strip().lower()
    snippet = t[:60]
    if "complex" in snippet:
        return "complex"
    if "medium" in snippet:
        return "medium"
    if "light" in snippet:
        return "light"
    # Model invented a descriptive category (e.g. "simplefact", "trivial", "basic") →
    # treat as light since it recognised the question doesn't need tools
    if any(w in snippet for w in ("simple", "fact", "trivial", "basic", "easy", "general")):
        return "light"
    return "medium"  # safe default
 class Router:
-    def __init__(self, model):
+    def __init__(
-        self.model = model
+        self,
        model,
        embedder: AsyncOpenAI,
        fast_tool_runner: FastToolRunner | None = None,
    ):
        self.model = model  # qwen2.5:1.5b — used only for generating light replies
        self._embedder = embedder
        self._fast_tool_runner = fast_tool_runner
        self._light_centroid: list[float] | None = None
        self._medium_centroid: list[float] | None = None
        self._complex_centroid: list[float] | None = None
    async def initialize(self) -> None:
        """Pre-compute utterance embeddings. Call once at startup. Retries until LiteLLM is ready."""
        print("[router] embedding utterances for semantic classifier...", flush=True)
        texts = _LIGHT_UTTERANCES + _MEDIUM_UTTERANCES + _COMPLEX_UTTERANCES
        for attempt in range(10):
            try:
                resp = await self._embedder.embeddings.create(model=_EMBED_MODEL, input=texts)
                embeddings = [item.embedding for item in resp.data]
                n_light = len(_LIGHT_UTTERANCES)
                n_medium = len(_MEDIUM_UTTERANCES)
                self._light_centroid = _centroid(embeddings[:n_light])
                self._medium_centroid = _centroid(embeddings[n_light:n_light + n_medium])
                self._complex_centroid = _centroid(embeddings[n_light + n_medium:])
                print("[router] semantic classifier ready (3-tier)", flush=True)
                return
            except Exception as e:
                print(f"[router] embedding attempt {attempt+1}/10 failed: {e}", flush=True)
                await asyncio.sleep(3)
        print("[router] WARNING: could not initialize semantic classifier — will default to medium", flush=True)
    async def _classify_by_embedding(self, message: str) -> str:
        """Embed message and return 'light', 'medium', or 'complex' based on centroid similarity."""
        if self._light_centroid is None or self._medium_centroid is None or self._complex_centroid is None:
            return "medium"
        try:
            resp = await self._embedder.embeddings.create(model=_EMBED_MODEL, input=[message])
            emb = resp.data[0].embedding
            score_light = _cosine(emb, self._light_centroid)
            score_medium = _cosine(emb, self._medium_centroid)
            score_complex = _cosine(emb, self._complex_centroid)
            tier = max(
                [("light", score_light), ("medium", score_medium), ("complex", score_complex)],
                key=lambda x: x[1],
            )[0]
            print(
                f"[router] semantic: light={score_light:.3f} medium={score_medium:.3f} "
                f"complex={score_complex:.3f} → {tier}",
                flush=True,
            )
            return tier
        except Exception as e:
            print(f"[router] embedding classify error, defaulting to medium: {e}", flush=True)
            return "medium"
    async def route(
        self,
        message: str,
        history: list[dict],
-        force_complex: bool = False,
+        no_inference: bool = False,
    ) -> tuple[str, Optional[str]]:
        """
        Returns (tier, reply_or_None).
-        For light tier: also generates the reply with a second call.
+        For light tier: also generates the reply inline (unless no_inference=True).
        For medium/complex: reply is None.
        """
-        if force_complex:
+        if self._fast_tool_runner and self._fast_tool_runner.any_matches(message.strip()):
-            return "complex", None
+            names = self._fast_tool_runner.matching_names(message.strip())
-
+            print(f"[router] fast_tool_match={names} → medium", flush=True)
+            return "medium", None
-        # Step 0: regex pre-classification for obvious light patterns
-        if _LIGHT_PATTERNS.match(message.strip()):
-            print(f"[router] regex→light", flush=True)
-            return await self._generate_light_reply(message, history)
-        # Step 1: LLM classification with raw text output
-        try:
-            classify_response = await self.model.ainvoke([
-                HumanMessage(content=CLASSIFY_PROMPT.format(message=message)),
-            ])
-            raw = classify_response.content or ""
-            raw = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
-            tier = _parse_tier(raw)
-            if tier == "complex" and not message.startswith("/think"):
-                tier = "medium"
-            print(f"[router] raw={raw[:30]!r} → tier={tier}", flush=True)
-        except Exception as e:
-            print(f"[router] classify error, defaulting to medium: {e}", flush=True)
-            return "medium", None
-        if tier != "light":
+        if _LIGHT_PATTERNS.match(message.strip()):
            print("[router] regex→light", flush=True)
            if no_inference:
                return "light", None
            return await self._generate_light_reply(message, history)
        if _COMPLEX_PATTERNS.search(message.strip()):
            print("[router] regex→complex", flush=True)
            return "complex", None
        if _MEDIUM_PATTERNS.search(message.strip()):
            print("[router] regex→medium", flush=True)
            return "medium", None
        tier = await self._classify_by_embedding(message)
        if tier != "light" or no_inference:
            return tier, None
        return await self._generate_light_reply(message, history)
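The precedence that the new `route()` body applies (regex light, then regex complex, then regex medium, then embedding fallback, with `no_inference` suppressing the light reply) can be sketched in isolation. This is a simplified, synchronous illustration and not the project's implementation: the three patterns, `pick_tier`, and the `embed_classify` callback are hypothetical stand-ins, and the fast-tool branch is omitted.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical patterns: the real _LIGHT_PATTERNS, _COMPLEX_PATTERNS and
# _MEDIUM_PATTERNS are not shown in this diff.
LIGHT = re.compile(r"^(hi|hello|thanks!?)$", re.IGNORECASE)
COMPLEX = re.compile(r"\b(research|compare|analy[sz]e)\b", re.IGNORECASE)
MEDIUM = re.compile(r"\b(weather|news|price)\b", re.IGNORECASE)


def pick_tier(message: str, embed_classify: Callable[[str], str]) -> str:
    """Apply the regex checks in order, then fall back to the embedding classifier."""
    text = message.strip()
    if LIGHT.match(text):
        return "light"
    if COMPLEX.search(text):
        return "complex"
    if MEDIUM.search(text):
        return "medium"
    return embed_classify(text)


def route(
    message: str,
    embed_classify: Callable[[str], str],
    no_inference: bool = False,
) -> Tuple[str, Optional[str]]:
    """Return (tier, reply_or_None); only the light tier produces a reply."""
    tier = pick_tier(message, embed_classify)
    if tier != "light" or no_inference:
        return tier, None
    # Placeholder for the model call made by _generate_light_reply.
    return tier, f"(light reply to {message.strip()!r})"


if __name__ == "__main__":
    always_medium = lambda _text: "medium"
    print(route("hi", always_medium))
    print(route("hi", always_medium, no_inference=True))
    print(route("compare Django and Flask", always_medium))
```

With `no_inference=True` the tier is still reported for every message, but no reply is generated, which is what lets a benchmark measure classification without running a model.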
@@ -120,7 +462,7 @@ class Router:
    async def _generate_light_reply(
        self, message: str, history: list[dict]
    ) -> tuple[str, Optional[str]]:
-        """Generate a short reply using the router model for light-tier messages."""
+        """Generate a short reply using qwen2.5:1.5b for light-tier messages."""
        history_text = _format_history(history)
        context = f"\nConversation history:\n{history_text}" if history else ""
        try:
--- a/test_pipeline.py
+++ b/test_pipeline.py
--- a/tests/__init__.py
+++ b/tests/__init__.py
--- a/tests/integration/__init__.py
+++ b/tests/integration/__init__.py
--- a/tests/integration/common.py
+++ b/tests/integration/common.py
@@ -0,0 +1,259 @@
 """
 Shared config, helpers, and utilities for Adolf integration tests.
 """
 import http.client
 import json
 import re
 import subprocess
 import time
 import urllib.request
 # ── config ────────────────────────────────────────────────────────────────────
 DEEPAGENTS   = "http://localhost:8000"
 LITELLM      = "http://localhost:4000"
 OPENMEMORY   = "http://localhost:8765"
 GRAMMY_HOST  = "localhost"
 GRAMMY_PORT  = 3001
 OLLAMA_GPU   = "http://localhost:11436"
 OLLAMA_CPU   = "http://localhost:11435"
 QDRANT       = "http://localhost:6333"
 SEARXNG      = "http://localhost:11437"
 COMPOSE_FILE = "/home/alvis/adolf/docker-compose.yml"
 DEFAULT_CHAT_ID = "346967270"
 NAMES = [
    "Maximilian", "Cornelius", "Zephyr", "Archibald", "Balthazar",
    "Ignatius", "Lysander", "Octavian", "Reginald", "Sylvester",
 ]
 BENCHMARK = {
    "easy": [
        "hi",
        "what is 2+2?",
        "what is the capital of France?",
        "tell me a short joke",
        "how are you doing today?",
        "thanks!",
        "what day comes after Wednesday?",
        "name the three primary colors",
        "is the sky blue?",
        "what does CPU stand for?",
    ],
    "medium": [
        "what is the current weather in Berlin?",
        "find the latest news about artificial intelligence",
        "what is the current price of Bitcoin?",
        "search for a good pasta carbonara recipe",
        "what movies are in theaters this week?",
        "find Python tutorials for beginners",
        "who won the last FIFA World Cup?",
        "do you remember what we talked about before?",
        "search for the best coffee shops in Tokyo",
        "what is happening in the tech industry this week?",
        "what's the weather like today?",
    ],
    "hard": [
        "/think compare the top 3 Python web frameworks (Django, FastAPI, Flask) and recommend one for a production REST API",
        "/think research the history of artificial intelligence and create a timeline of key milestones",
        "/think plan a 7-day trip to Japan with daily itinerary, accommodation suggestions, and budget breakdown",
        "/think analyze microservices vs monolithic architecture: pros, cons, and when to choose each",
        "/think write a Python script that reads a CSV file, cleans the data, and generates summary statistics",
        "/think research quantum computing: explain the key concepts and how it differs from classical computing",
        "/think compare PostgreSQL, MongoDB, and Redis — when to use each and what are the trade-offs?",
        "/think create a comprehensive Docker deployment guide covering best practices for production",
        "/think research climate change: summarize the latest IPCC findings and key data points",
        "/think design a REST API with authentication, rate limiting, and proper error handling — provide architecture and code outline",
    ],
 }
 # ── terminal colours ──────────────────────────────────────────────────────────
 PASS = "\033[32mPASS\033[0m"
 FAIL = "\033[31mFAIL\033[0m"
 INFO = "\033[36mINFO\033[0m"
 WARN = "\033[33mWARN\033[0m"
 # ── result helpers ────────────────────────────────────────────────────────────
 def report(results: list, name: str, ok: bool, detail: str = ""):
    tag = PASS if ok else FAIL
    print(f"  [{tag}] {name}" + (f" — {detail}" if detail else ""))
    results.append((name, ok))
 def print_summary(results: list):
    print(f"\n{'─'*55}")
    total  = len(results)
    passed = sum(1 for _, ok in results if ok)
    failed = total - passed
    print(f"Results: {passed}/{total} passed", end="")
    if failed:
        print(f"  ({failed} failed)\n")
        print("Failed checks:")
        for name, ok in results:
            if not ok:
                print(f"  - {name}")
    else:
        print(" — all good")
    print()
 def tf(v):
    """Format timing value."""
    return f"{v:6.2f}s" if v is not None else "   n/a"
 # ── HTTP helpers ──────────────────────────────────────────────────────────────
 def get(url, timeout=5):
    with urllib.request.urlopen(urllib.request.Request(url), timeout=timeout) as r:
        return r.status, r.read().decode()
 def post_json(url, payload, timeout=10):
    data = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return r.status, json.loads(r.read().decode())
 def check_sse(host, port, path):
    try:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", path, headers={"Accept": "text/event-stream"})
        r = conn.getresponse()
        conn.close()
        return r.status == 200, f"HTTP {r.status}"
    except Exception as e:
        return False, str(e)
 def qdrant_count():
    try:
        _, body = get(f"{QDRANT}/collections/adolf_memories")
        return json.loads(body).get("result", {}).get("points_count", 0)
    except Exception:
        return 0
 # ── log helpers ───────────────────────────────────────────────────────────────
 def fetch_logs(since_s=600):
    """Return deepagents log lines from the last since_s seconds."""
    try:
        r = subprocess.run(
            ["docker", "compose", "-f", COMPOSE_FILE, "logs", "deepagents",
             f"--since={int(since_s)}s", "--no-log-prefix"],
            capture_output=True, text=True, timeout=15,
        )
        return r.stdout.splitlines()
    except Exception:
        return []
 def parse_run_block(lines, msg_prefix):
    """
    Scan log lines for the LAST '[agent] running: <msg_prefix>' block.
    Extracts reply timing, tier, and memory timing from that block.
    Returns dict or None if the reply has not appeared in logs yet.
    Dict keys:
      reply_total, llm, send, tier, reply_text  — from "[agent] replied in ..."
      memory_s                                  — from "[memory] stored in ..."
      memory_error                              — True if "[memory] error" found
    Note: llm and send are placeholders and are always None in this version;
    only reply_total and tier are parsed from the "replied in" line.
    """
    search = msg_prefix[:50]
    start_idx = None
    for i, line in enumerate(lines):
        if "[agent] running:" in line and search in line:
            start_idx = i  # keep updating — we want the LAST occurrence
    if start_idx is None:
        return None
    block = lines[start_idx:]
    last_ai_text = None
    reply_data = None
    for j, line in enumerate(block):
        if "AIMessage:" in line and "→" not in line:
            txt = line.split("AIMessage:", 1)[-1].strip()
            if txt:
                last_ai_text = txt
        m = re.search(r"replied in ([\d.]+)s(?:\s+tier=(\w+))?", line)
        if m:
            tier = m.group(2) if m.group(2) else "unknown"
            reply_data = {
                "reply_total": float(m.group(1)),
                "llm":         None,
                "send":        None,
                "tier":        tier,
                "reply_text":  last_ai_text,
                "memory_s":    None,
                "memory_error": False,
                "_j": j,
            }
            break
    if reply_data is not None:
        next_lines = block[reply_data["_j"] + 1: reply_data["_j"] + 3]
        for line in next_lines:
            if line.startswith("[agent] reply_text:"):
                reply_data["reply_text"] = line[len("[agent] reply_text:"):].strip()
                break
    if reply_data is None:
        return None
    for line in block[reply_data["_j"] + 1:]:
        mm = re.search(r"\[memory\] stored in ([\d.]+)s", line)
        if mm:
            reply_data["memory_s"] = float(mm.group(1))
            break
        if "[memory] error" in line:
            reply_data["memory_error"] = True
            break
    return reply_data
 def wait_for(label, msg_prefix, timeout_s=200, need_memory=True):
    """
    Poll deepagents logs until the message is fully processed.
    Shows a live progress line. Returns timing dict or None on timeout.
    """
    t_start = time.monotonic()
    deadline = t_start + timeout_s
    tick = 0
    last_result = None
    while time.monotonic() < deadline:
        since = int(time.monotonic() - t_start) + 90
        lines = fetch_logs(since_s=since)
        result = parse_run_block(lines, msg_prefix)
        if result:
            last_result = result
            has_mem = result["memory_s"] is not None or result["memory_error"]
            if (not need_memory) or has_mem:
                elapsed = time.monotonic() - t_start
                print(f"\r  [{label}] done after {elapsed:.0f}s{' ' * 30}")
                return result
        time.sleep(4)
        tick += 1
        rem = int(deadline - time.monotonic())
        if last_result:
            phase = "waiting for memory..." if need_memory else "done"
        else:
            phase = "waiting for LLM reply..."
        print(f"\r  [{label}] {tick*4}s elapsed, {rem}s left — {phase}  ", end="", flush=True)
    print(f"\r  [{label}] TIMEOUT after {timeout_s}s{' ' * 30}")
    return None
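The reply detection in `parse_run_block` rests on a single regular expression in which the duration is required and the tier is optional. A minimal sketch of that behaviour follows; the `extract_reply` helper and the sample log lines are hypothetical and only illustrate the pattern, they are not copied from real deepagents logs.

```python
import re

# Same pattern as in parse_run_block: seconds are required, "tier=" is optional.
REPLY_RE = re.compile(r"replied in ([\d.]+)s(?:\s+tier=(\w+))?")


def extract_reply(line):
    """Return (seconds, tier) for a reply log line, or None if the line does not match."""
    m = REPLY_RE.search(line)
    if not m:
        return None
    # A missing tier group falls back to "unknown", mirroring parse_run_block.
    return float(m.group(1)), m.group(2) or "unknown"


# Hypothetical log lines, for illustration only.
with_tier = "[agent] replied in 12.4s tier=medium"
without_tier = "[agent] replied in 3.0s"

if __name__ == "__main__":
    print(extract_reply(with_tier))
    print(extract_reply(without_tier))
    print(extract_reply("[memory] stored in 1.2s"))
```

Because the tier group is optional, older log lines without `tier=` still yield a timing instead of being dropped.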
--- a/tests/integration/test_health.py
+++ b/tests/integration/test_health.py
@@ -0,0 +1,214 @@
 #!/usr/bin/env python3
 """
 Adolf service health integration tests.
 Checks:
  1.  deepagents /health — agent_ready
  1b. openmemory /sse reachable
  1c. grammy /sse reachable
  2.  Bifrost /health, /v1/models, direct inference, deepagents startup log
  3.  GPU Ollama — reachable, qwen3:8b present
  4.  CPU Ollama — reachable, nomic-embed-text present
  5.  Qdrant — reachable, adolf_memories collection, vector dims=768
  6.  SearXNG — reachable, JSON results, latency < 5s
 Usage:
    python3 test_health.py
 """
 import json
 import sys
 import time
 import urllib.request
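 from common import (
    DEEPAGENTS, GRAMMY_HOST, GRAMMY_PORT,
    OLLAMA_GPU, OLLAMA_CPU, QDRANT, SEARXNG, COMPOSE_FILE,
    INFO, FAIL,
    report, print_summary, tf,
    get, post_json, check_sse, fetch_logs,
 )
 # common.py does not export BIFROST, so importing it would raise ImportError.
 # The URL below is an assumption based on the "port 8080" label in section 2.
 BIFROST = "http://localhost:8080"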
 results = []
 timings = {}
 # ── 1. Service health ─────────────────────────────────────────────────────────
 print(f"\n[{INFO}] 1. Service health")
 t0 = time.monotonic()
 try:
    status, body = get(f"{DEEPAGENTS}/health")
    data = json.loads(body)
    ok = status == 200 and data.get("agent_ready") is True
    report(results, "deepagents /health — agent_ready", ok,
           f"agent_ready={data.get('agent_ready')}")
 except Exception as e:
    report(results, "deepagents /health", False, str(e))
 ok, detail = check_sse("localhost", 8765, "/sse")
 report(results, "openmemory /sse reachable", ok, detail)
 ok, detail = check_sse(GRAMMY_HOST, GRAMMY_PORT, "/sse")
 report(results, "grammy /sse reachable", ok, detail)
 timings["health_check"] = time.monotonic() - t0
 # ── 2. Bifrost gateway ────────────────────────────────────────────────────────
 print(f"\n[{INFO}] 2. Bifrost gateway (port 8080)")
 t0 = time.monotonic()
 try:
    status, body = get(f"{BIFROST}/health", timeout=5)
    report(results, "Bifrost /health reachable", status == 200, f"HTTP {status}")
 except Exception as e:
    report(results, "Bifrost /health reachable", False, str(e))
 try:
    status, body = get(f"{BIFROST}/v1/models", timeout=5)
    data = json.loads(body)
    model_ids = [m.get("id", "") for m in data.get("data", [])]
    gpu_models = [m for m in model_ids if m.startswith("ollama/")]
    report(results, "Bifrost lists ollama GPU models", len(gpu_models) > 0,
           f"found: {gpu_models}")
    for expected in ["ollama/qwen3:4b", "ollama/qwen3:8b", "ollama/qwen2.5:1.5b"]:
        report(results, f"  model {expected} listed", expected in model_ids)
 except Exception as e:
    report(results, "Bifrost /v1/models", False, str(e))
 print(f"  [bifrost-infer] POST /v1/chat/completions → ollama/qwen2.5:0.5b ...")
 t_infer = time.monotonic()
 try:
    infer_payload = {
        "model": "ollama/qwen2.5:0.5b",
        "messages": [{"role": "user", "content": "Reply with exactly one word: pong"}],
        "max_tokens": 16,
    }
    data = json.dumps(infer_payload).encode()
    req = urllib.request.Request(
        f"{BIFROST}/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as r:
        infer_status = r.status
        infer_body = json.loads(r.read().decode())
    infer_elapsed = time.monotonic() - t_infer
    reply_content = infer_body.get("choices", [{}])[0].get("message", {}).get("content", "")
    used_model = infer_body.get("model", "")
    report(results, "Bifrost → Ollama GPU inference succeeds",
           infer_status == 200 and bool(reply_content),
           f"{infer_elapsed:.1f}s  model={used_model!r}  reply={reply_content[:60]!r}")
    timings["bifrost_direct_infer"] = infer_elapsed
 except Exception as e:
    report(results, "Bifrost → Ollama GPU inference succeeds", False, str(e))
    timings["bifrost_direct_infer"] = None
 try:
    import subprocess
    r = subprocess.run(
        ["docker", "compose", "-f", COMPOSE_FILE, "logs", "deepagents",
         "--since=3600s", "--no-log-prefix"],
        capture_output=True, text=True, timeout=10,
    )
    log_lines = r.stdout.splitlines()
    bifrost_line = next(
        (l for l in log_lines if "[agent] bifrost=" in l and "bifrost:8080" in l),
        None,
    )
    report(results, "deepagents startup log confirms bifrost URL",
           bifrost_line is not None,
           bifrost_line.strip() if bifrost_line else "line not found in logs")
    if bifrost_line:
        has_prefix = "router=ollama/" in bifrost_line and "medium=ollama/" in bifrost_line
        report(results, "deepagents model names use ollama/ prefix", has_prefix,
               bifrost_line.strip())
 except Exception as e:
    report(results, "deepagents startup log check", False, str(e))
 timings["bifrost_check"] = time.monotonic() - t0
 # ── 3. GPU Ollama ─────────────────────────────────────────────────────────────
 print(f"\n[{INFO}] 3. GPU Ollama (port 11436)")
 t0 = time.monotonic()
 try:
    status, body = get(f"{OLLAMA_GPU}/api/tags")
    models = [m["name"] for m in json.loads(body).get("models", [])]
    has_qwen = any("qwen3:8b" in m for m in models)  # match the 8b tag named in the check
    report(results, "GPU Ollama reachable", True, f"models: {models}")
    report(results, "qwen3:8b present", has_qwen)
 except Exception as e:
    report(results, "GPU Ollama reachable", False, str(e))
    report(results, "qwen3:8b present", False, "skipped")
 timings["gpu_ollama_ping"] = time.monotonic() - t0
 # ── 4. CPU Ollama ─────────────────────────────────────────────────────────────
 print(f"\n[{INFO}] 4. CPU Ollama (port 11435)")
 t0 = time.monotonic()
 try:
    status, body = get(f"{OLLAMA_CPU}/api/tags")
    models = [m["name"] for m in json.loads(body).get("models", [])]
    has_embed = any("nomic-embed-text" in m for m in models)
    report(results, "CPU Ollama reachable", True, f"models: {models}")
    report(results, "nomic-embed-text present", has_embed)
 except Exception as e:
    report(results, "CPU Ollama reachable", False, str(e))
    report(results, "nomic-embed-text present", False, "skipped")
 timings["cpu_ollama_ping"] = time.monotonic() - t0
 # ── 5. Qdrant ─────────────────────────────────────────────────────────────────
 print(f"\n[{INFO}] 5. Qdrant (port 6333)")
 t0 = time.monotonic()
 try:
    status, body = get(f"{QDRANT}/collections")
    cols = [c["name"] for c in json.loads(body).get("result", {}).get("collections", [])]
    report(results, "Qdrant reachable", True, f"collections: {cols}")
    report(results, "adolf_memories collection exists", "adolf_memories" in cols)
 except Exception as e:
    report(results, "Qdrant reachable", False, str(e))
    report(results, "adolf_memories collection exists", False, "skipped")
 try:
    status, body = get(f"{QDRANT}/collections/adolf_memories")
    info = json.loads(body).get("result", {})
    dims = info.get("config", {}).get("params", {}).get("vectors", {}).get("size")
    report(results, "vector dims = 768", dims == 768, f"got {dims}")
 except Exception as e:
    report(results, "adolf_memories collection info", False, str(e))
 timings["qdrant_ping"] = time.monotonic() - t0
 # ── 6. SearXNG ────────────────────────────────────────────────────────────────
 print(f"\n[{INFO}] 6. SearXNG (port 11437)")
 t0 = time.monotonic()
 try:
    status, body = get(f"{SEARXNG}/search?q=test&format=json", timeout=15)
    elapsed = time.monotonic() - t0
    n = len(json.loads(body).get("results", []))
    report(results, "SearXNG reachable + JSON results", status == 200 and n > 0,
           f"{n} results in {elapsed:.1f}s")
    report(results, "SearXNG response < 5s", elapsed < 5, f"{elapsed:.2f}s")
    timings["searxng_latency"] = elapsed
 except Exception as e:
    report(results, "SearXNG reachable", False, str(e))
    report(results, "SearXNG response < 5s", False, "skipped")
    timings["searxng_latency"] = None
 timings["searxng_check"] = time.monotonic() - t0
 # ── summary ───────────────────────────────────────────────────────────────────
 print_summary(results)
 sys.exit(0 if all(ok for _, ok in results) else 1)
--- a/tests/integration/test_memory.py
+++ b/tests/integration/test_memory.py
@@ -0,0 +1,437 @@
 #!/usr/bin/env python3
 """
 Adolf memory integration tests.
 Tests:
  1. Name store   — POST "remember that your name is <RandomName>"
  2. Qdrant point — verifies a new vector was written after store
  3. Name recall  — POST "what is your name?" → reply must contain <RandomName>
  4. LiteLLM      — verifies LiteLLM proxy is reachable (replaced Bifrost)
  5. Timing profile — breakdown of store and recall latencies
  6. Memory benchmark — store 5 personal facts, recall with 10 questions
  7. Dedup test   — same fact stored twice must not grow Qdrant by 2 points
 Usage:
    python3 test_memory.py [--chat-id CHAT_ID] [--name-only] [--bench-only] [--dedup-only]
 """
 import argparse
 import json
 import random
 import subprocess
 import sys
 import time
 import urllib.request
 from common import (
    DEEPAGENTS, LITELLM, QDRANT, COMPOSE_FILE, DEFAULT_CHAT_ID,
    NAMES,
    INFO, PASS, FAIL, WARN,
    report, print_summary, tf,
    get, post_json, qdrant_count, fetch_logs,
    parse_run_block, wait_for,
 )
 # ── args ──────────────────────────────────────────────────────────────────────
 parser = argparse.ArgumentParser(description="Adolf memory integration tests")
 parser.add_argument("--chat-id", default=DEFAULT_CHAT_ID)
 parser.add_argument("--name-only",  action="store_true", help="Run only the name store/recall test")
 parser.add_argument("--bench-only", action="store_true", help="Run only the memory benchmark")
 parser.add_argument("--dedup-only", action="store_true", help="Run only the deduplication test")
 args = parser.parse_args()
 CHAT_ID = args.chat_id
 _only = args.name_only or args.bench_only or args.dedup_only
 _run_name  = not _only or args.name_only
 _run_bench = not _only or args.bench_only
 _run_dedup = not _only or args.dedup_only
 results = []
 timings = {}
 random_name = random.choice(NAMES)
 TEST_CHAT_ID = f"{CHAT_ID}-{random_name.lower()}"
 if _run_name:
    print(f"\n  Test name : \033[1m{random_name}\033[0m")
    print(f"  Chat ID   : {TEST_CHAT_ID}")
 # ── 1–4. Name store / recall pipeline ────────────────────────────────────────
 if _run_name:
    print(f"\n[{INFO}] 1. Name store / recall pipeline")
    store_msg  = f"remember that your name is {random_name}"
    recall_msg = "what is your name?"
    # Clear memories so each run starts clean
    try:
        post_json(f"{QDRANT}/collections/adolf_memories/points/delete",
                  {"filter": {}}, timeout=5)
    except Exception:
        pass
    pts_before = qdrant_count()
    print(f"  Qdrant points before: {pts_before}")
    # ── 1. Store ──────────────────────────────────────────────────────────────
    print(f"\n  [store] '{store_msg}'")
    t_store = time.monotonic()
    try:
        status, _ = post_json(f"{DEEPAGENTS}/chat",
                              {"message": store_msg, "chat_id": TEST_CHAT_ID}, timeout=5)
        t_accept = time.monotonic() - t_store
        report(results, "POST /chat (store) returns 202 immediately",
               status == 202 and t_accept < 1, f"status={status}, t={t_accept:.3f}s")
        timings["store_http_accept"] = t_accept
    except Exception as e:
        report(results, "POST /chat (store)", False, str(e))
        print_summary(results)
        sys.exit(1)
    store = wait_for("store", store_msg, timeout_s=220, need_memory=True)
    if store:
        timings.update({
            "store_llm":    store["llm"],
            "store_send":   store["send"],
            "store_reply":  store["reply_total"],
            "store_memory": store["memory_s"],
        })
        # llm and send may be None, so format them with tf() instead of ":.1f"
        report(results, "Agent replied to store message", True,
               f"{store['reply_total']:.1f}s total  llm={tf(store['llm']).strip()}  "
               f"send={tf(store['send']).strip()}  tier={store['tier']}")
        if store["memory_s"] is not None:
            report(results, "Memory stored without error", True, f"{store['memory_s']:.1f}s")
        elif store["memory_error"]:
            report(results, "Memory stored without error", False, "error in [memory] log")
        else:
            report(results, "Memory stored without error", False, "not found in logs")
        print(f"    Store reply: {store['reply_text']!r}")
    else:
        report(results, "Agent replied to store message", False, "timeout")
        report(results, "Memory stored without error", False, "timeout")
        print_summary(results)
        sys.exit(1)
    # ── 2. Qdrant point check ─────────────────────────────────────────────────
    pts_after = qdrant_count()
    new_pts = pts_after - pts_before
    report(results, "New memory point(s) added to Qdrant", new_pts > 0,
           f"{pts_before} → {pts_after} (+{new_pts})")
    timings["qdrant_new_points"] = new_pts
    # ── 3. Recall ─────────────────────────────────────────────────────────────
    print(f"\n  [recall] '{recall_msg}'")
    t_recall = time.monotonic()
    try:
        status, _ = post_json(f"{DEEPAGENTS}/chat",
                              {"message": recall_msg, "chat_id": TEST_CHAT_ID}, timeout=5)
        t_accept2 = time.monotonic() - t_recall
        report(results, "POST /chat (recall) returns 202 immediately",
               status == 202 and t_accept2 < 1, f"status={status}, t={t_accept2:.3f}s")
        timings["recall_http_accept"] = t_accept2
    except Exception as e:
        report(results, "POST /chat (recall)", False, str(e))
    recall = wait_for("recall", recall_msg, timeout_s=160, need_memory=False)
    if recall:
        timings.update({
            "recall_llm":   recall["llm"],
            "recall_send":  recall["send"],
            "recall_reply": recall["reply_total"],
        })
        # llm and send may be None, so format them with tf() instead of ":.1f"
        report(results, "Agent replied to recall message", True,
               f"{recall['reply_total']:.1f}s total  llm={tf(recall['llm']).strip()}  "
               f"send={tf(recall['send']).strip()}  tier={recall['tier']}")
        reply_text = recall["reply_text"] or ""
        name_in_reply = random_name.lower() in reply_text.lower()
        report(results, f"Reply contains '{random_name}'", name_in_reply,
               f"reply: {reply_text[:120]!r}")
    else:
        report(results, "Agent replied to recall message", False, "timeout")
        report(results, f"Reply contains '{random_name}'", False, "no reply")
    # ── 4. LiteLLM proxy reachable (replaced Bifrost) ─────────────────────────
    try:
        status, _ = get(f"{LITELLM}/health", timeout=5)
        litellm_ok = status == 200
    except Exception:
        litellm_ok = False
    report(results, "LiteLLM proxy reachable", litellm_ok)
    # ── 5. Timing profile ─────────────────────────────────────────────────────
    print(f"\n[{INFO}] 5. Timing profile")
    W = 36
    print(f"\n  {'Stage':<{W}}  {'Time':>8}")
    print(f"  {'─'*W}  {'─'*8}")
    for label, key in [
        ("[GPU] HTTP accept — store turn",        "store_http_accept"),
        ("[GPU] qwen3:Xb inference — store turn", "store_llm"),
        ("[GPU] Telegram send — store turn",      "store_send"),
        ("[GPU] Total reply latency — store",     "store_reply"),
        ("[GPU] qwen2.5:1.5b+embed — async mem",  "store_memory"),
    ]:
        print(f"  {label:<{W}}  {tf(timings.get(key)):>8}")
    print(f"  {'─'*W}  {'─'*8}")
    for label, key in [
        ("[GPU] HTTP accept — recall turn",    "recall_http_accept"),
        ("[GPU] qwen3:Xb inference — recall",  "recall_llm"),
        ("[GPU] Telegram send — recall turn",  "recall_send"),
        ("[GPU] Total reply latency — recall", "recall_reply"),
    ]:
        print(f"  {label:<{W}}  {tf(timings.get(key)):>8}")
    print(f"\n  Bottleneck analysis (each █ ≈ 5s):")
    print(f"  {'─'*(W+12)}")
    candidates = [
        ("[GPU] qwen3:Xb — store reply ", timings.get("store_llm")   or 0),
        ("[GPU] qwen3:Xb — recall reply", timings.get("recall_llm")  or 0),
        ("[GPU] qwen2.5:1.5b+embed (async)", timings.get("store_memory") or 0),
    ]
    candidates.sort(key=lambda x: x[1], reverse=True)
    for label, t in candidates:
        bar = "█" * min(int(t / 5), 24)
        total_pipeline = (timings.get("store_reply") or 0) + (timings.get("store_memory") or 0)
        pct = f"  {t/total_pipeline*100:4.0f}%" if total_pipeline > 0 else ""
        print(f"  {label}  {t:6.1f}s  {bar}{pct}")
    print()
 # ── 6. Memory benchmark ───────────────────────────────────────────────────────
 if _run_bench:
    _mem_name     = random.choice(["Alice", "Bruno", "Camille", "Diego", "Elena",
                                   "Farid", "Greta", "Hiroshi", "Irina", "Jonas"])
    _mem_city     = random.choice(["Tokyo", "Berlin", "Cairo", "Sydney", "Oslo",
                                   "Nairobi", "Lisbon", "Seoul", "Montreal", "Bangkok"])
    _mem_allergy  = random.choice(["nuts", "gluten", "dairy", "shellfish", "eggs"])
    _mem_job      = random.choice([
        ("software engineer", "startup"),
        ("data scientist", "research lab"),
        ("product manager", "tech company"),
        ("DevOps engineer", "cloud provider"),
    ])
    _mem_lang     = random.choice(["Python", "Rust", "Go", "TypeScript", "Kotlin"])
    _mem_pet_name = random.choice(["Whiskers", "Biscuit", "Mango", "Pebble", "Shadow",
                                   "Noodle", "Cheddar", "Cosmo", "Pippin", "Ziggy"])
    print(f"\n[{INFO}] 6. Memory benchmark")
    print(f"  name={_mem_name}  city={_mem_city}  allergy={_mem_allergy}  "
          f"job={_mem_job[0]}@{_mem_job[1]}  lang={_mem_lang}  pet={_mem_pet_name}")
    print(f"  Storing 5 facts, then querying with 10 recall questions")
    print(f"  Chat ID: {CHAT_ID}")
    print()
    # Wipe collection and restart openmemory for a clean slate
    try:
        req = urllib.request.Request(f"{QDRANT}/collections/adolf_memories", method="DELETE")
        with urllib.request.urlopen(req, timeout=5):
            pass
        print(f"  [{INFO}] Wiped adolf_memories collection")
    except Exception as e:
        print(f"  [{WARN}] Could not wipe collection: {e}")
    try:
        subprocess.run(
            ["docker", "compose", "-f", COMPOSE_FILE, "restart", "openmemory"],
            capture_output=True, timeout=30,
        )
        time.sleep(6)
        print(f"  [{INFO}] Restarted openmemory — fresh collection ready")
    except Exception as e:
        print(f"  [{WARN}] Could not restart openmemory: {e}")
    MEMORY_FACTS = [
        f"My name is {_mem_name} and I live in {_mem_city}",
        f"I prefer vegetarian food and I'm allergic to {_mem_allergy}",
        f"I work as a {_mem_job[0]} at a {_mem_job[1]}",
        f"My favorite programming language is {_mem_lang}",
        f"I have a cat named {_mem_pet_name}",
    ]
    MEMORY_RECALLS = [
        ("What is my name?",                       [_mem_name.lower()]),
        ("Where do I live?",                       [_mem_city.lower()]),
        ("Do I have any food allergies?",          [_mem_allergy.lower()]),
        ("What is my job?",                        [_mem_job[0].split()[0].lower()]),
        ("What programming language do I prefer?", [_mem_lang.lower()]),
        ("Do I have any pets?",                    [_mem_pet_name.lower()]),
        ("Am I vegetarian or do I eat meat?",      ["vegetarian"]),
        ("What city am I in?",                     [_mem_city.lower()]),
        ("Tell me what you know about me",         [_mem_name.lower(), _mem_city.lower()]),
        ("What's the name of my pet?",             [_mem_pet_name.lower()]),
    ]
    STORE_TIMEOUT  = 180
    RECALL_TIMEOUT = 180
    print(f"  Storing {len(MEMORY_FACTS)} facts...")
    store_ok = 0
    for i, fact in enumerate(MEMORY_FACTS, 1):
        print(f"  [mem-store-{i:02d}] {fact!r}")
        try:
            status, _ = post_json(f"{DEEPAGENTS}/chat",
                                  {"message": fact, "chat_id": CHAT_ID}, timeout=5)
            if status != 202:
                print(f"              → [{FAIL}] POST returned {status}")
                continue
        except Exception as e:
            print(f"              → [{FAIL}] POST error: {e}")
            continue
        found = wait_for(f"mem-store-{i:02d}", fact, timeout_s=STORE_TIMEOUT, need_memory=True)
        if found:
            store_ok += 1
            print(f"              → [{PASS}] stored  tier={found['tier']}  mem={found['memory_s']}s")
        else:
            print(f"              → [{FAIL}] timeout")
    report(results, f"All memory facts stored ({store_ok}/{len(MEMORY_FACTS)})",
           store_ok == len(MEMORY_FACTS))
    # Wait for async extraction to settle
    print(f"\n  Waiting for memory extraction to settle (up to 60s)...")
    _prev_count = -1
    _stable_ticks = 0
    _cur_count = 0
    for _ in range(30):
        time.sleep(2)
        try:
            _, body = get(f"{QDRANT}/collections/adolf_memories")
            _cur_count = json.loads(body).get("result", {}).get("points_count", 0)
        except Exception:
            _cur_count = _prev_count
        if _cur_count == _prev_count:
            _stable_ticks += 1
            if _stable_ticks >= 3:
                break
        else:
            _stable_ticks = 0
        _prev_count = _cur_count
    print(f"  Memory settled: {_cur_count} points in Qdrant")
    print(f"\n  Querying with {len(MEMORY_RECALLS)} recall questions...")
    recall_results = []
    for i, (question, keywords) in enumerate(MEMORY_RECALLS, 1):
        print(f"  [mem-recall-{i:02d}] {question!r}")
        try:
            status, _ = post_json(f"{DEEPAGENTS}/chat",
                                  {"message": question, "chat_id": CHAT_ID}, timeout=5)
            if status != 202:
                print(f"               → [{FAIL}] POST returned {status}")
                recall_results.append((question, keywords, None, False))
                continue
        except Exception as e:
            print(f"               → [{FAIL}] POST error: {e}")
            recall_results.append((question, keywords, None, False))
            continue
        t_start = time.monotonic()
        found = None
        while time.monotonic() - t_start < RECALL_TIMEOUT:
            since = int(time.monotonic() - t_start) + 30
            lines = fetch_logs(since_s=since)
            found = parse_run_block(lines, question)
            if found:
                break
            time.sleep(2)
        if not found:
            print(f"               → [{FAIL}] timeout")
            recall_results.append((question, keywords, None, False))
            continue
        reply_text = (found.get("reply_text") or "").lower()
        hit_keywords = [kw for kw in keywords if kw.lower() in reply_text]
        passed = len(hit_keywords) == len(keywords)
        tag_str = PASS if passed else WARN
        missing = [kw for kw in keywords if kw.lower() not in reply_text]
        detail = f"tier={found['tier']}  lat={found['reply_total']:.1f}s"
        if missing:
            detail += f"  missing keywords: {missing}"
        print(f"               → [{tag_str}] {detail}")
        recall_results.append((question, keywords, found.get("reply_text"), passed))
        time.sleep(1)
    print(f"\n  {'#':<5}  {'Pass':<5}  {'Question':<45}  {'Keywords'}")
    print(f"  {'─'*5}  {'─'*5}  {'─'*45}  {'─'*30}")
    for idx, (q, kws, reply, ok) in enumerate(recall_results, 1):
        ok_str = "✓" if ok else "✗"
        print(f"  {ok_str} {idx:<3}  {'yes' if ok else 'no':<5}  {q[:45]:<45}  {kws}")
    recall_pass = sum(1 for _, _, _, ok in recall_results if ok)
    total_recall = len(recall_results)
    print(f"\n  Memory recall score: {recall_pass}/{total_recall}")
    report(results, f"Memory recall ({recall_pass}/{total_recall} questions passed)",
           recall_pass == total_recall,
           f"{recall_pass}/{total_recall} questions had all expected keywords in reply")
 # ── 7. Deduplication test ─────────────────────────────────────────────────────
 if _run_dedup:
    print(f"\n[{INFO}] 7. Memory deduplication test")
    print(f"  Sends the same fact twice — Qdrant point count must not increase by 2")
    print(f"  Chat ID: {CHAT_ID}")
    print()
    DEDUP_TIMEOUT = 120
    _dedup_fact = f"My lucky number is {random.randint(1000, 9999)}"
    print(f"  Fact: {_dedup_fact!r}")
    pts_before = qdrant_count()
    print(f"  Qdrant points before: {pts_before}")
    print(f"  [dedup-1] sending fact (first time)")
    found1 = None
    try:
        status, _ = post_json(f"{DEEPAGENTS}/chat",
                              {"message": _dedup_fact, "chat_id": CHAT_ID}, timeout=5)
        if status != 202:
            report(results, "Dedup: first POST accepted", False, f"status={status}")
        else:
            found1 = wait_for("dedup-1", _dedup_fact, timeout_s=DEDUP_TIMEOUT, need_memory=True)
            if found1:
                print(f"  [dedup-1] stored  tier={found1['tier']}  mem={found1['memory_s']}s")
            else:
                print(f"  [dedup-1] timeout")
    except Exception as e:
        report(results, "Dedup: first POST accepted", False, str(e))
    pts_after_first = qdrant_count()
    new_first = pts_after_first - pts_before
    print(f"  Qdrant after first send: {pts_before} → {pts_after_first} (+{new_first})")
    print(f"  [dedup-2] sending same fact (second time)")
    try:
        status, _ = post_json(f"{DEEPAGENTS}/chat",
                              {"message": _dedup_fact, "chat_id": CHAT_ID}, timeout=5)
        if status != 202:
            report(results, "Dedup: second POST accepted", False, f"status={status}")
        else:
            found2 = wait_for("dedup-2", _dedup_fact, timeout_s=DEDUP_TIMEOUT, need_memory=True)
            if found2:
                print(f"  [dedup-2] stored  tier={found2['tier']}  mem={found2['memory_s']}s")
            else:
                print(f"  [dedup-2] timeout")
    except Exception as e:
        report(results, "Dedup: second POST accepted", False, str(e))
    pts_after_second = qdrant_count()
    new_second = pts_after_second - pts_after_first
    print(f"  Qdrant after second send: {pts_after_first} → {pts_after_second} (+{new_second})")
    dedup_ok = new_second <= new_first
    report(results, "Deduplication: second identical fact not added to Qdrant", dedup_ok,
           f"first send: +{new_first} pts, second send: +{new_second} pts (want second ≤ first)")
 # ── summary ───────────────────────────────────────────────────────────────────
 print_summary(results)
 sys.exit(0 if all(ok for _, ok in results) else 1)
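The settle loop in the memory benchmark above waits until the Qdrant point count stops changing for three consecutive polls. A minimal sketch of that pattern as a reusable function follows; the name `wait_until_stable`, its parameters, and the injectable `sleep` are assumptions made for illustration and are not part of the test suite.

```python
import time
from typing import Callable


def wait_until_stable(get_count: Callable[[], int],
                      stable_ticks: int = 3,
                      max_polls: int = 30,
                      interval_s: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll get_count() until it repeats stable_ticks times in a row."""
    prev = -1   # sentinel: no reading yet
    cur = 0
    ticks = 0
    for _ in range(max_polls):
        sleep(interval_s)
        try:
            cur = get_count()
        except Exception:
            cur = prev  # a failed read counts as "no change", as in the test above
        if cur == prev:
            ticks += 1
            if ticks >= stable_ticks:
                break
        else:
            ticks = 0
        prev = cur
    return cur


# Hypothetical usage: a fake counter that grows once and then stops changing.
readings = iter([3, 5, 5, 5, 5, 5])
final = wait_until_stable(lambda: next(readings), sleep=lambda _s: None)
```

Injecting `sleep` keeps the sketch testable without real delays; the benchmark itself calls `time.sleep(2)` directly.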
--- a/tests/integration/test_routing.py
+++ b/tests/integration/test_routing.py
@@ -0,0 +1,317 @@
 #!/usr/bin/env python3
 """
 Adolf tier routing benchmark.
 Tests:
  easy   — 10 questions that must route to 'light' tier
  medium — 11 questions that must route to 'medium' (light acceptable for some; complex = fail)
  hard   — 10 /think questions that must route to 'complex' (medium fallback acceptable)
 Usage:
    python3 test_routing.py [--chat-id CHAT_ID]
                            [--easy-only]    # only easy benchmark
                            [--medium-only]  # only medium benchmark
                            [--hard-only]    # only hard benchmark
 """
 import argparse
 import sys
 import time
 from common import (
    DEEPAGENTS, DEFAULT_CHAT_ID,
    BENCHMARK,
    INFO, PASS, FAIL, WARN,
    report, print_summary,
    post_json, fetch_logs,
    parse_run_block,
 )
 # ── args ──────────────────────────────────────────────────────────────────────
 parser = argparse.ArgumentParser(description="Adolf routing benchmark")
 parser.add_argument("--chat-id",     default=DEFAULT_CHAT_ID)
 parser.add_argument("--easy-only",   action="store_true")
 parser.add_argument("--medium-only", action="store_true")
 parser.add_argument("--hard-only",   action="store_true")
 args = parser.parse_args()
 CHAT_ID = args.chat_id
 _only = args.easy_only or args.medium_only or args.hard_only
 _run_easy   = not _only or args.easy_only
 _run_medium = not _only or args.medium_only
 _run_hard   = not _only or args.hard_only
 results = []
 # ── easy benchmark ────────────────────────────────────────────────────────────
 if _run_easy:
    print(f"\n[{INFO}] Easy routing benchmark")
    print(f"  {len(BENCHMARK['easy'])} questions — all must route to 'light'")
    print(f"  Chat ID: {CHAT_ID}")
    print()
    bench_results = []
    LIGHT_TIMEOUT = 60
    for i, question in enumerate(BENCHMARK["easy"], 1):
        tag = f"easy-{i:02d}"
        print(f"  [{tag}] {question[:55]!r}")
        t_send = time.monotonic()
        try:
            status, _ = post_json(f"{DEEPAGENTS}/chat",
                                  {"message": question, "chat_id": CHAT_ID}, timeout=5)
            if status != 202:
                print(f"          → [{FAIL}] POST returned {status}")
                bench_results.append((question, "?", None, False))
                continue
        except Exception as e:
            print(f"          → [{FAIL}] POST error: {e}")
            bench_results.append((question, "?", None, False))
            continue
        t_start = time.monotonic()
        found = None
        while time.monotonic() - t_start < LIGHT_TIMEOUT:
            since = int(time.monotonic() - t_start) + 30
            lines = fetch_logs(since_s=since)
            found = parse_run_block(lines, question)
            if found:
                break
            time.sleep(1)
        if not found:
            print(f"          → [{FAIL}] no reply within {LIGHT_TIMEOUT}s")
            bench_results.append((question, "timeout", None, False))
            continue
        tier = found.get("tier", "unknown")
        is_light = (tier == "light")
        tag_str = PASS if is_light else FAIL
        print(f"          → [{tag_str}] tier={tier}  latency={found['reply_total']:.1f}s  llm={found['llm']:.1f}s")
        bench_results.append((question, tier, found["reply_total"], is_light))
        time.sleep(1)
    print(f"\n  {'#':<5}  {'Tier':<8}  {'Latency':>8}  {'Question'}")
    print(f"  {'─'*5}  {'─'*8}  {'─'*8}  {'─'*50}")
    for idx, (q, tier, lat, ok) in enumerate(bench_results, 1):
        lat_str = f"{lat:.1f}s" if lat is not None else "timeout"
        ok_str = "✓" if ok else "✗"
        print(f"  {ok_str} {idx:<3}  {tier:<8}  {lat_str:>8}  {q[:50]!r}")
    light_count = sum(1 for _, _, _, ok in bench_results if ok)
    total_bench = len(bench_results)
    lats = [lat for _, _, lat, ok in bench_results if ok and lat is not None]
    avg_lat = sum(lats) / len(lats) if lats else 0
    print(f"\n  Light-path score: {light_count}/{total_bench}")
    if lats:
        print(f"  Avg latency (light): {avg_lat:.1f}s  min={min(lats):.1f}s  max={max(lats):.1f}s")
    report(results, f"All easy questions routed to light ({light_count}/{total_bench})",
           light_count == total_bench,
           f"{light_count}/{total_bench} via light path, avg {avg_lat:.1f}s")
 # ── medium benchmark ──────────────────────────────────────────────────────────
 if _run_medium:
    print(f"\n[{INFO}] Medium routing benchmark")
    print(f"  {len(BENCHMARK['medium'])} questions — must route to medium (light ok for some; complex = fail)")
    print(f"  Chat ID: {CHAT_ID}")
    print()
    LIGHT_ACCEPTABLE = {
        "who won the last FIFA World Cup?",
        "search for a good pasta carbonara recipe",
        "find Python tutorials for beginners",
        "search for the best coffee shops in Tokyo",
    }
    med_results = []
    MEDIUM_TIMEOUT = 120
    for i, question in enumerate(BENCHMARK["medium"], 1):
        tag = f"med-{i:02d}"
        print(f"  [{tag}] {question[:60]!r}")
        t_send = time.monotonic()
        try:
            status, _ = post_json(f"{DEEPAGENTS}/chat",
                                  {"message": question, "chat_id": CHAT_ID}, timeout=5)
            if status != 202:
                print(f"          → [{FAIL}] POST returned {status}")
                med_results.append((question, "?", None, False))
                continue
        except Exception as e:
            print(f"          → [{FAIL}] POST error: {e}")
            med_results.append((question, "?", None, False))
            continue
        t_start = time.monotonic()
        found = None
        while time.monotonic() - t_start < MEDIUM_TIMEOUT:
            since = int(time.monotonic() - t_start) + 60
            lines = fetch_logs(since_s=since)
            found = parse_run_block(lines, question)
            if found:
                break
            time.sleep(3)
        if not found:
            print(f"          → [{FAIL}] no reply within {MEDIUM_TIMEOUT}s")
            med_results.append((question, "timeout", None, False))
            continue
        tier = found.get("tier", "unknown")
        light_ok = question in LIGHT_ACCEPTABLE
        if tier == "medium":
            correct, label, note = True, PASS, "medium ✓"
        elif tier == "light":
            correct = light_ok
            label = PASS if light_ok else WARN
            note = "light (acceptable)" if light_ok else "light (should be medium)"
        elif tier == "complex":
            correct, label, note = False, FAIL, "complex — wrong escalation"
        else:
            correct, label, note = False, FAIL, f"unknown tier {tier!r}"
        print(f"          → [{label}] {note}  latency={found['reply_total']:.1f}s  llm={found['llm']:.1f}s")
        med_results.append((question, tier, found["reply_total"], correct))
        time.sleep(1)
    print(f"\n  {'#':<5}  {'Tier':<8}  {'Latency':>8}  {'Question'}")
    print(f"  {'─'*5}  {'─'*8}  {'─'*8}  {'─'*55}")
    for idx, (q, tier, lat, ok) in enumerate(med_results, 1):
        lat_str = f"{lat:.1f}s" if lat is not None else "timeout"
        ok_str = "✓" if ok else ("~" if tier == "light" else "✗")
        print(f"  {ok_str} {idx:<3}  {tier:<8}  {lat_str:>8}  {q[:55]!r}")
    total_med     = len(med_results)
    medium_count  = sum(1 for _, tier, _, _ in med_results if tier == "medium")
    light_count   = sum(1 for _, tier, _, _ in med_results if tier == "light")
    complex_count = sum(1 for _, tier, _, _ in med_results if tier == "complex")
    timeout_count = sum(1 for _, tier, _, _ in med_results if tier == "timeout")
    light_misroute = sum(1 for q, tier, _, _ in med_results
                         if tier == "light" and q not in LIGHT_ACCEPTABLE)
    lats = [lat for _, _, lat, _ in med_results if lat is not None]
    print(f"\n  Breakdown: medium={medium_count}  light={light_count}  "
          f"complex={complex_count}  timeout={timeout_count}")
    if light_misroute:
        print(f"  [{WARN}] {light_misroute} question(s) answered via light when medium expected")
    if lats:
        print(f"  Avg latency: {sum(lats)/len(lats):.1f}s  min={min(lats):.1f}s  max={max(lats):.1f}s")
    report(results,
           f"Medium questions: no complex escalation ({medium_count + light_count}/{total_med} routed)",
           complex_count == 0,
           f"medium={medium_count} light={light_count} complex={complex_count} timeout={timeout_count}")
    if timeout_count:
        report(results, f"Medium questions: all completed within {MEDIUM_TIMEOUT}s", False,
               f"{timeout_count} question(s) timed out")
 # ── hard benchmark ────────────────────────────────────────────────────────────
 if _run_hard:
    print(f"\n[{INFO}] Hard routing benchmark")
    print(f"  {len(BENCHMARK['hard'])} /think questions — must route to 'complex'")
    print(f"  Acceptable fallback: 'medium' if VRAM eviction timed out")
    print(f"  Fail condition: tier=light or timeout")
    print(f"  Chat ID: {CHAT_ID}")
    print()
    hard_results  = []
    COMPLEX_TIMEOUT = 300
    _VRAM_ENTER = "[vram] enter_complex_mode"
    _VRAM_EXIT  = "[vram] exit_complex_mode"
    for i, question in enumerate(BENCHMARK["hard"], 1):
        tag = f"hard-{i:02d}"
        short_q = question[len("/think "):].strip()[:60]
        print(f"  [{tag}] /think {short_q!r}")
        t_send = time.monotonic()
        try:
            status, _ = post_json(f"{DEEPAGENTS}/chat",
                                  {"message": question, "chat_id": CHAT_ID}, timeout=5)
            if status != 202:
                print(f"          → [{FAIL}] POST returned {status}")
                hard_results.append((question, "?", None, False))
                continue
        except Exception as e:
            print(f"          → [{FAIL}] POST error: {e}")
            hard_results.append((question, "?", None, False))
            continue
        t_start = time.monotonic()
        found = None
        while time.monotonic() - t_start < COMPLEX_TIMEOUT:
            since = int(time.monotonic() - t_start) + 90
            lines = fetch_logs(since_s=since)
            found = parse_run_block(lines, question[len("/think "):].strip())
            if found:
                break
            time.sleep(5)
        elapsed = time.monotonic() - t_send
        if not found:
            print(f"          → [{FAIL}] no reply within {COMPLEX_TIMEOUT}s")
            hard_results.append((question, "timeout", None, False))
            continue
        tier = found.get("tier", "unknown")
        if tier == "complex":
            ok, label, note = True, PASS, "complex ✓"
        elif tier == "medium":
            ok, label, note = True, WARN, "medium (VRAM fallback — check [vram] logs)"
        else:
            ok, label, note = False, FAIL, f"tier={tier} — unexpected"
        lines_block = fetch_logs(since_s=int(elapsed) + 120)
        recent = "\n".join(lines_block[-200:])
        vram_enter_seen = _VRAM_ENTER in recent
        vram_note = ""
        if tier == "complex":
            vram_note = " [vram:flush✓]" if vram_enter_seen else f" [{WARN}:no vram flush log]"
        print(f"          → [{label}] {note}  latency={found['reply_total']:.1f}s  llm={found['llm']:.1f}s{vram_note}")
        hard_results.append((question, tier, found["reply_total"], ok))
        time.sleep(5)
    print(f"\n  {'#':<5}  {'Tier':<8}  {'Latency':>8}  {'Question (/think ...)'}")
    print(f"  {'─'*5}  {'─'*8}  {'─'*8}  {'─'*55}")
    for idx, (q, tier, lat, ok) in enumerate(hard_results, 1):
        lat_str = f"{lat:.1f}s" if lat is not None else "timeout"
        ok_str = "✓" if tier == "complex" else ("~" if tier == "medium" else "✗")
        short = q[len("/think "):].strip()[:55]
        print(f"  {ok_str} {idx:<3}  {tier:<8}  {lat_str:>8}  {short!r}")
    total_hard    = len(hard_results)
    complex_count = sum(1 for _, t, _, _ in hard_results if t == "complex")
    medium_fb     = sum(1 for _, t, _, _ in hard_results if t == "medium")
    light_count   = sum(1 for _, t, _, _ in hard_results if t == "light")
    timeout_count = sum(1 for _, t, _, _ in hard_results if t == "timeout")
    lats = [lat for _, _, lat, _ in hard_results if lat is not None]
    print(f"\n  Breakdown: complex={complex_count}  medium(fallback)={medium_fb}  "
          f"light={light_count}  timeout={timeout_count}")
    if medium_fb:
        print(f"  [{WARN}] {medium_fb} question(s) fell back to medium (VRAM eviction timeout)")
    if light_count:
        print(f"  [{FAIL}] {light_count} question(s) routed to light — /think prefix not detected")
    if lats:
        print(f"  Avg latency: {sum(lats)/len(lats):.1f}s  min={min(lats):.1f}s  max={max(lats):.1f}s")
    report(results,
           f"Hard questions routed to complex (not light) ({complex_count + medium_fb}/{total_hard})",
           light_count == 0 and timeout_count == 0,
           f"complex={complex_count} medium_fallback={medium_fb} light={light_count} timeout={timeout_count}")
 # ── summary ───────────────────────────────────────────────────────────────────
 print_summary(results)
 sys.exit(0 if all(ok for _, ok in results) else 1)
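All three routing benchmarks above repeat the same loop: fetch recent logs, try to parse a run block, sleep, retry until a timeout. The log window grows with elapsed time (`elapsed + pad`), so the start of the request stays inside the window however long the reply takes. A sketch of that loop factored into one helper follows; `poll_logs`, `FakeClock`, and the injectable `clock` and `sleep` parameters are hypothetical and do not exist in `common.py`.

```python
import time
from typing import Callable, List, Optional


def poll_logs(fetch_logs: Callable[[int], List[str]],
              parse: Callable[[List[str]], Optional[dict]],
              timeout_s: float,
              pad_s: int = 30,
              interval_s: float = 1.0,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Return the first parsed run block, or None once timeout_s has elapsed."""
    t_start = clock()
    while clock() - t_start < timeout_s:
        # Widen the window as time passes so the request start stays visible.
        since = int(clock() - t_start) + pad_s
        found = parse(fetch_logs(since))
        if found:
            return found
        sleep(interval_s)
    return None


class FakeClock:
    """Deterministic stand-in for time.monotonic/time.sleep (illustration only)."""
    def __init__(self) -> None:
        self.now = 0.0

    def monotonic(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Hypothetical usage: the reply shows up in the logs after three seconds.
fake = FakeClock()
windows: List[int] = []


def fake_fetch(since_s: int) -> List[str]:
    windows.append(since_s)
    return ["reply tier=light"] if fake.now >= 3 else []


def fake_parse(lines: List[str]) -> Optional[dict]:
    return {"tier": "light"} if lines else None


found = poll_logs(fake_fetch, fake_parse, timeout_s=60,
                  clock=fake.monotonic, sleep=fake.sleep)
```

The easy, medium, and hard loops would then differ only in `timeout_s`, `pad_s` (30, 60, 90), and `interval_s` (1, 3, 5).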
--- a/tests/requirements.txt
+++ b/tests/requirements.txt
@@ -0,0 +1,2 @@
 pytest>=8.0
 pytest-asyncio>=0.23
--- a/tests/unit/__init__.py
+++ b/tests/unit/__init__.py
--- a/tests/unit/conftest.py
+++ b/tests/unit/conftest.py
@@ -0,0 +1,80 @@
 """
 Stub out all third-party packages that Adolf's source modules import.
 This lets the unit tests run without a virtualenv or Docker environment.
 Stubs are installed into sys.modules before any test file is collected.
 """
 import sys
 from unittest.mock import MagicMock
 # ── helpers ────────────────────────────────────────────────────────────────────
 def _mock(name: str) -> MagicMock:
    m = MagicMock(name=name)
    sys.modules[name] = m
    return m
 # ── pydantic: BaseModel must be a real class so `class Foo(BaseModel)` works ──
 class _FakeBaseModel:
    model_fields: dict = {}
    def __init_subclass__(cls, **kwargs):
        pass
    def __init__(self, **data):
        for k, v in data.items():
            setattr(self, k, v)
 _pydantic = _mock("pydantic")
 _pydantic.BaseModel = _FakeBaseModel
 # ── httpx: used by channels.py, vram_manager.py, agent.py ────────────────────
 _mock("httpx")
 # ── fastapi ───────────────────────────────────────────────────────────────────
 _fastapi = _mock("fastapi")
 _mock("fastapi.responses")
 # ── langchain stack ───────────────────────────────────────────────────────────
 _mock("langchain_openai")
 _lc_core = _mock("langchain_core")
 _lc_msgs = _mock("langchain_core.messages")
 _mock("langchain_core.tools")
 # Provide real-ish message classes so router.py can instantiate them
 class _FakeMsg:
    def __init__(self, content=""):
        self.content = content
 class SystemMessage(_FakeMsg):
    pass
 class HumanMessage(_FakeMsg):
    pass
 class AIMessage(_FakeMsg):
    def __init__(self, content="", tool_calls=None):
        super().__init__(content)
        self.tool_calls = tool_calls or []
 _lc_msgs.SystemMessage = SystemMessage
 _lc_msgs.HumanMessage = HumanMessage
 _lc_msgs.AIMessage = AIMessage
 _mock("langchain_mcp_adapters")
 _mock("langchain_mcp_adapters.client")
 _mock("langchain_community")
 _mock("langchain_community.utilities")
 # ── deepagents (agent_factory.py) ─────────────────────────────────────────────
 _mock("deepagents")
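The stubbing approach in `conftest.py` relies on Python's import system returning whatever object is already registered in `sys.modules`. A small self-contained sketch of the idea follows; the package name `heavy_sdk_example` and the classes `FakeBase` and `Point` are invented for the demonstration.

```python
import sys
from unittest.mock import MagicMock


def stub_module(name: str) -> MagicMock:
    """Register a MagicMock under `name` so later imports resolve to it."""
    m = MagicMock(name=name)
    sys.modules[name] = m
    return m


# "heavy_sdk_example" is a made-up package that is not installed anywhere.
stub = stub_module("heavy_sdk_example")
stub_module("heavy_sdk_example.client")

import heavy_sdk_example                        # resolves to the stub
from heavy_sdk_example.client import Session    # attribute auto-created by MagicMock


class FakeBase:
    """A real class, so subclassing behaves normally (a mock instance would not)."""
    def __init__(self, **data):
        for key, value in data.items():
            setattr(self, key, value)


# Attach the real class to the stub before importing it by name.
stub.BaseModel = FakeBase
from heavy_sdk_example import BaseModel


class Point(BaseModel):
    pass


p = Point(x=1)
```

Submodules need their own `sys.modules` entries, which is why the conftest registers names such as `fastapi.responses` separately from `fastapi`.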
--- a/tests/unit/test_agent_helpers.py
+++ b/tests/unit/test_agent_helpers.py
@@ -0,0 +1,198 @@
 """
 Unit tests for agent.py helper functions:
  - _strip_think(text)
  - _extract_final_text(result)
  - _extract_urls(text)
 conftest.py stubs out agent.py's heavy FastAPI/LangChain imports so these pure functions can be tested in isolation.
 """
 import pytest
 # conftest.py has already installed all stubs into sys.modules.
 # The FastAPI app is instantiated at module level in agent.py;
 # with the mocked fastapi, that just creates a MagicMock object, and the
 # route decorators return mocks, leaving the undecorated helpers untouched.
 from agent import _strip_think, _extract_final_text, _extract_urls
 # ── _strip_think ───────────────────────────────────────────────────────────────
 class TestStripThink:
    def test_removes_single_think_block(self):
        text = "<think>internal reasoning</think>Final answer."
        assert _strip_think(text) == "Final answer."
    def test_removes_multiline_think_block(self):
        text = "<think>\nLine one.\nLine two.\n</think>\nResult here."
        assert _strip_think(text) == "Result here."
    def test_no_think_block_unchanged(self):
        text = "This is a plain answer with no think block."
        assert _strip_think(text) == text
    def test_removes_multiple_think_blocks(self):
        text = "<think>step 1</think>middle<think>step 2</think>end"
        assert _strip_think(text) == "middleend"
    def test_strips_surrounding_whitespace(self):
        text = "  <think>stuff</think>  answer  "
        assert _strip_think(text) == "answer"
    def test_empty_think_block(self):
        text = "<think></think>Hello."
        assert _strip_think(text) == "Hello."
    def test_empty_string(self):
        assert _strip_think("") == ""
    def test_only_think_block_returns_empty(self):
        text = "<think>nothing useful</think>"
        assert _strip_think(text) == ""
    def test_think_block_with_nested_tags(self):
        text = "<think>I should use <b>bold</b> here</think>Done."
        assert _strip_think(text) == "Done."
    def test_preserves_markdown(self):
        text = "<think>plan</think>## Report\n\n- Point one\n- Point two"
        result = _strip_think(text)
        assert result == "## Report\n\n- Point one\n- Point two"
 # ── _extract_final_text ────────────────────────────────────────────────────────
 class TestExtractFinalText:
    def _ai_msg(self, content: str, tool_calls=None):
        """Create a minimal AIMessage-like object."""
        class AIMessage:
            pass
        m = AIMessage()
        m.content = content
        m.tool_calls = tool_calls or []
        return m
    def _human_msg(self, content: str):
        class HumanMessage:
            pass
        m = HumanMessage()
        m.content = content
        return m
    def test_returns_last_ai_message_content(self):
        result = {
            "messages": [
                self._human_msg("what is 2+2"),
                self._ai_msg("The answer is 4."),
            ]
        }
        assert _extract_final_text(result) == "The answer is 4."
    def test_returns_last_of_multiple_ai_messages(self):
        result = {
            "messages": [
                self._ai_msg("First response."),
                self._human_msg("follow-up"),
                self._ai_msg("Final response."),
            ]
        }
        assert _extract_final_text(result) == "Final response."
    def test_skips_empty_ai_messages(self):
        result = {
            "messages": [
                self._ai_msg("Real answer."),
                self._ai_msg(""),  # empty — should be skipped
            ]
        }
        assert _extract_final_text(result) == "Real answer."
    def test_strips_think_tags_from_ai_message(self):
        result = {
            "messages": [
                self._ai_msg("<think>reasoning here</think>Clean reply."),
            ]
        }
        assert _extract_final_text(result) == "Clean reply."
    def test_falls_back_to_output_field(self):
        result = {
            "messages": [],
            "output": "Fallback output.",
        }
        assert _extract_final_text(result) == "Fallback output."
    def test_strips_think_from_output_field(self):
        result = {
            "messages": [],
            "output": "<think>thoughts</think>Actual output.",
        }
        assert _extract_final_text(result) == "Actual output."
    def test_returns_none_when_no_content(self):
        result = {"messages": []}
        assert _extract_final_text(result) is None
    def test_returns_none_when_no_messages_and_no_output(self):
        result = {"messages": [], "output": ""}
        # output is falsy → returns None
        assert _extract_final_text(result) is None
    def test_skips_non_ai_messages(self):
        result = {
            "messages": [
                self._human_msg("user question"),
            ]
        }
        assert _extract_final_text(result) is None
    def test_handles_ai_message_with_tool_calls_but_no_content(self):
        """AIMessage that only has tool_calls (no content) should be skipped."""
        msg = self._ai_msg("", tool_calls=[{"name": "web_search", "args": {}}])
        result = {"messages": [msg]}
        assert _extract_final_text(result) is None
    def test_multiline_think_stripped_correctly(self):
        result = {
            "messages": [
                self._ai_msg("<think>\nLong\nreasoning\nblock\n</think>\n## Report\n\nSome content."),
            ]
        }
        assert _extract_final_text(result) == "## Report\n\nSome content."
 # ── _extract_urls ──────────────────────────────────────────────────────────────
 class TestExtractUrls:
    def test_single_url(self):
        assert _extract_urls("check this out https://example.com please") == ["https://example.com"]
    def test_multiple_urls(self):
        urls = _extract_urls("see https://foo.com and https://bar.org/path?q=1")
        assert urls == ["https://foo.com", "https://bar.org/path?q=1"]
    def test_no_urls(self):
        assert _extract_urls("no links here at all") == []
    def test_http_and_https(self):
        urls = _extract_urls("http://old.site and https://new.site")
        assert "http://old.site" in urls
        assert "https://new.site" in urls
    def test_url_at_start_of_message(self):
        assert _extract_urls("https://example.com is interesting") == ["https://example.com"]
    def test_url_only(self):
        assert _extract_urls("https://example.com/page") == ["https://example.com/page"]
    def test_url_with_path_and_query(self):
        url = "https://example.com/articles/123?ref=home&page=2"
        assert _extract_urls(url) == [url]
    def test_empty_string(self):
        assert _extract_urls("") == []
    def test_does_not_include_surrounding_quotes(self):
        # URLs inside quotes should not include the quote character
        urls = _extract_urls('visit "https://example.com" today')
        assert urls == ["https://example.com"]
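The `TestStripThink` cases above pin down the expected behaviour closely enough to sketch one implementation that would satisfy them. The real `_strip_think` in `agent.py` is not shown in this diff, so the regex below is an assumption and not the project's code; `strip_think` is a stand-in name.

```python
import re

# Non-greedy match so each <think>...</think> pair is removed separately;
# DOTALL lets "." span newlines inside multi-line reasoning blocks.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove every <think> block and trim surrounding whitespace."""
    return _THINK_RE.sub("", text).strip()


# Hypothetical usage mirroring a few of the test inputs above.
single = strip_think("<think>internal reasoning</think>Final answer.")
multi = strip_think("<think>step 1</think>middle<think>step 2</think>end")
nested = strip_think("<think>I should use <b>bold</b> here</think>Done.")
```

A greedy `.*` would swallow `middle` in the two-block case, which is why the non-greedy form matters for `test_removes_multiple_think_blocks`.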
--- a/tests/unit/test_channels.py
+++ b/tests/unit/test_channels.py
@@ -0,0 +1,125 @@
 """Unit tests for channels.py — register, deliver, pending_replies queue."""
 import asyncio
 import pytest
 from unittest.mock import AsyncMock, patch
 import channels
 @pytest.fixture(autouse=True)
 def reset_channels_state():
    """Clear module-level state before and after every test."""
    channels._callbacks.clear()
    channels.pending_replies.clear()
    yield
    channels._callbacks.clear()
    channels.pending_replies.clear()
 # ── register ───────────────────────────────────────────────────────────────────
 class TestRegister:
    def test_register_stores_callback(self):
        cb = AsyncMock()
        channels.register("test_channel", cb)
        assert channels._callbacks["test_channel"] is cb
    def test_register_overwrites_existing(self):
        cb1 = AsyncMock()
        cb2 = AsyncMock()
        channels.register("ch", cb1)
        channels.register("ch", cb2)
        assert channels._callbacks["ch"] is cb2
    def test_register_multiple_channels(self):
        cb_a = AsyncMock()
        cb_b = AsyncMock()
        channels.register("a", cb_a)
        channels.register("b", cb_b)
        assert channels._callbacks["a"] is cb_a
        assert channels._callbacks["b"] is cb_b
 # ── deliver ────────────────────────────────────────────────────────────────────
 class TestDeliver:
    async def test_deliver_enqueues_reply(self):
        channels.register("cli", AsyncMock())
        await channels.deliver("cli-alvis", "cli", "hello world")
        q = channels.pending_replies["cli-alvis"]
        assert not q.empty()
        assert await q.get() == "hello world"
    async def test_deliver_calls_channel_callback(self):
        cb = AsyncMock()
        channels.register("telegram", cb)
        await channels.deliver("tg-123", "telegram", "reply text")
        cb.assert_awaited_once_with("tg-123", "reply text")
    async def test_deliver_unknown_channel_still_enqueues(self):
        """No registered callback for channel → reply still goes to the queue."""
        await channels.deliver("cli-bob", "nonexistent", "fallback reply")
        q = channels.pending_replies["cli-bob"]
        assert await q.get() == "fallback reply"
    async def test_deliver_unknown_channel_does_not_raise(self):
        """Missing callback must not raise an exception."""
        await channels.deliver("cli-x", "ghost_channel", "msg")
    async def test_deliver_creates_queue_if_absent(self):
        channels.register("cli", AsyncMock())
        assert "cli-new" not in channels.pending_replies
        await channels.deliver("cli-new", "cli", "hi")
        assert "cli-new" in channels.pending_replies
    async def test_deliver_reuses_existing_queue(self):
        """Second deliver to the same session appends to the same queue."""
        channels.register("cli", AsyncMock())
        await channels.deliver("cli-alvis", "cli", "first")
        await channels.deliver("cli-alvis", "cli", "second")
        q = channels.pending_replies["cli-alvis"]
        assert await q.get() == "first"
        assert await q.get() == "second"
    async def test_deliver_telegram_sends_to_callback(self):
        sent = []
        async def fake_tg(session_id, text):
            sent.append((session_id, text))
        channels.register("telegram", fake_tg)
        await channels.deliver("tg-999", "telegram", "test message")
        assert sent == [("tg-999", "test message")]
 # ── register_defaults ──────────────────────────────────────────────────────────
 class TestRegisterDefaults:
    def test_registers_telegram_and_cli(self):
        channels.register_defaults()
        assert "telegram" in channels._callbacks
        assert "cli" in channels._callbacks
    async def test_cli_callback_is_noop(self):
        """CLI send callback does nothing (replies are handled via SSE queue)."""
        channels.register_defaults()
        cb = channels._callbacks["cli"]
        # Should not raise and should return None
        result = await cb("cli-alvis", "some reply")
        assert result is None
    async def test_telegram_callback_chunks_long_messages(self):
        """Telegram callback splits messages > 4000 chars into chunks."""
        channels.register_defaults()
        cb = channels._callbacks["telegram"]
        long_text = "x" * 9000  # > 4000 chars → should produce 3 chunks
        with patch("channels.httpx.AsyncClient") as mock_client_cls:
            mock_client = AsyncMock()
            mock_client.__aenter__ = AsyncMock(return_value=mock_client)
            mock_client.__aexit__ = AsyncMock(return_value=False)
            mock_client.post = AsyncMock()
            mock_client_cls.return_value = mock_client
            await cb("tg-123", long_text)
            # ceil(9000 / 4000) = 3 POST calls (4000 + 4000 + 1000 chars)
            assert mock_client.post.await_count == 3
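
The chunking behaviour that `test_telegram_callback_chunks_long_messages` checks can be illustrated with a minimal sketch. The helper name `chunk_text` and its `limit` parameter are hypothetical and are not taken from `channels.py`. The sketch only assumes that a message is cut into consecutive slices of at most 4000 characters.

```python
def chunk_text(text: str, limit: int = 4000) -> list[str]:
    """Split text into consecutive slices of at most `limit` characters.

    Hypothetical helper for illustration; the real Telegram callback may
    split on different boundaries.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # range() steps through the start offset of each slice.
    return [text[i:i + limit] for i in range(0, len(text), limit)]


chunks = chunk_text("x" * 9000)
print([len(c) for c in chunks])
```

Under that assumption a 9000-character message yields three slices, which is why the test expects three POST calls.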
--- a/tests/unit/test_router.py
+++ b/tests/unit/test_router.py
@@ -0,0 +1,200 @@
 """Unit tests for router.py — Router, _parse_tier, _format_history, _LIGHT_PATTERNS."""
 import pytest
 from unittest.mock import AsyncMock, MagicMock, patch
 from router import Router, _parse_tier, _format_history, _LIGHT_PATTERNS
 # ── _LIGHT_PATTERNS regex ──────────────────────────────────────────────────────
 class TestLightPatterns:
    @pytest.mark.parametrize("text", [
        "hi", "Hi", "HI",
        "hello", "hey", "yo", "sup",
        "good morning", "good evening", "good night", "good afternoon",
        "bye", "goodbye", "see you", "cya", "later", "ttyl",
        "thanks", "thank you", "thx", "ty",
        "ok", "okay", "k", "cool", "great", "awesome", "perfect",
        "sounds good", "got it", "nice", "sure",
        "how are you", "how are you?", "how are you doing today?",
        "what's up",
        "what day comes after Monday?",
        "what day follows Friday?",
        "what comes after summer?",
        "what does NASA stand for?",
        "what does AI stand for?",
        # with trailing punctuation
        "hi!", "hello.", "thanks!",
    ])
    def test_matches(self, text):
        assert _LIGHT_PATTERNS.match(text.strip()), f"Expected light match for: {text!r}"
    @pytest.mark.parametrize("text", [
        "what is the capital of France",
        "tell me about bitcoin",
        "what is 2+2",
        "write me a poem",
        "search for news about the election",
        "what did we talk about last time",
        "what is my name",
        "/think compare these frameworks",
        "how do I install Python",
        "explain machine learning",
        "",  # empty string doesn't match the pattern
    ])
    def test_no_match(self, text):
        assert not _LIGHT_PATTERNS.match(text.strip()), f"Expected NO light match for: {text!r}"
 # ── _parse_tier ────────────────────────────────────────────────────────────────
 class TestParseTier:
    @pytest.mark.parametrize("raw,expected", [
        ("light", "light"),
        ("Light", "light"),
        ("LIGHT\n", "light"),
        ("medium", "medium"),
        ("Medium.", "medium"),
        ("complex", "complex"),
        ("Complex!", "complex"),
        # descriptive words → light
        ("simplefact", "light"),
        ("trivial question", "light"),
        ("basic", "light"),
        ("easy answer", "light"),
        ("general knowledge", "light"),
        # unknown → medium
        ("unknown_category", "medium"),
        ("", "medium"),
        ("I don't know", "medium"),
        # complex only if 'complex' appears in first 60 chars
        ("this is a complex query requiring search", "complex"),
        # _parse_tier checks "complex" before "medium", so complex wins even if medium appears first
        ("medium complexity, not complex", "complex"),
    ])
    def test_parse_tier(self, raw, expected):
        assert _parse_tier(raw) == expected
 # ── _format_history ────────────────────────────────────────────────────────────
 class TestFormatHistory:
    def test_empty(self):
        assert _format_history([]) == "(none)"
    def test_single_user_message(self):
        history = [{"role": "user", "content": "hello there"}]
        result = _format_history(history)
        assert "user: hello there" in result
    def test_multiple_turns(self):
        history = [
            {"role": "user", "content": "What is Python?"},
            {"role": "assistant", "content": "Python is a programming language."},
        ]
        result = _format_history(history)
        assert "user: What is Python?" in result
        assert "assistant: Python is a programming language." in result
    def test_truncates_long_content(self):
        long_content = "x" * 300
        history = [{"role": "user", "content": long_content}]
        result = _format_history(history)
        # content is truncated to 200 chars in _format_history
        assert len(result) < 250
    def test_missing_keys_handled(self):
        # Should not raise — uses .get() with defaults
        history = [{"role": "user"}]  # no content key
        result = _format_history(history)
        assert "user:" in result
 # ── Router.route() ─────────────────────────────────────────────────────────────
 class TestRouterRoute:
    def _make_router(self, classify_response: str, reply_response: str = "Sure!") -> Router:
        """Return a Router with a mock model that returns given classification and reply."""
        model = MagicMock()
        classify_msg = MagicMock()
        classify_msg.content = classify_response
        reply_msg = MagicMock()
        reply_msg.content = reply_response
        # First ainvoke call → classification; second → reply
        model.ainvoke = AsyncMock(side_effect=[classify_msg, reply_msg])
        return Router(model=model)
    async def test_force_complex_bypasses_classification(self):
        router = self._make_router("medium")
        tier, reply = await router.route("some question", [], force_complex=True)
        assert tier == "complex"
        assert reply is None
        # Model should NOT have been called
        router.model.ainvoke.assert_not_called()
    async def test_regex_light_skips_llm_classification(self):
        # Regex match bypasses classification entirely; the only ainvoke call is the reply.
        model = MagicMock()
        reply_msg = MagicMock()
        reply_msg.content = "I'm doing great!"
        model.ainvoke = AsyncMock(return_value=reply_msg)
        router = Router(model=model)
        tier, reply = await router.route("how are you", [], force_complex=False)
        assert tier == "light"
        assert reply == "I'm doing great!"
        # Exactly one model call — no classification step
        assert router.model.ainvoke.call_count == 1
    async def test_llm_classifies_medium(self):
        router = self._make_router("medium")
        tier, reply = await router.route("what is the bitcoin price?", [], force_complex=False)
        assert tier == "medium"
        assert reply is None
    async def test_llm_classifies_light_generates_reply(self):
        router = self._make_router("light", "Paris is the capital of France.")
        tier, reply = await router.route("what is the capital of France?", [], force_complex=False)
        assert tier == "light"
        assert reply == "Paris is the capital of France."
    async def test_llm_classifies_complex_downgraded_to_medium(self):
        # Without /think prefix, complex classification → downgraded to medium
        router = self._make_router("complex")
        tier, reply = await router.route("compare React and Vue", [], force_complex=False)
        assert tier == "medium"
        assert reply is None
    async def test_llm_error_falls_back_to_medium(self):
        model = MagicMock()
        model.ainvoke = AsyncMock(side_effect=Exception("connection error"))
        router = Router(model=model)
        tier, reply = await router.route("some question", [], force_complex=False)
        assert tier == "medium"
        assert reply is None
    async def test_light_reply_empty_falls_back_to_medium(self):
        """If the light reply comes back empty, router returns medium instead."""
        router = self._make_router("light", "")  # empty reply
        tier, reply = await router.route("what is 2+2", [], force_complex=False)
        assert tier == "medium"
        assert reply is None
    async def test_strips_think_tags_from_classification(self):
        """Router strips <think>...</think> from model output before parsing tier."""
        model = MagicMock()
        classify_msg = MagicMock()
        classify_msg.content = "<think>Hmm let me think...</think>medium"
        reply_msg = MagicMock()
        reply_msg.content = "I'm fine!"
        model.ainvoke = AsyncMock(side_effect=[classify_msg, reply_msg])
        router = Router(model=model)
        tier, _ = await router.route("what is the news?", [], force_complex=False)
        assert tier == "medium"
    async def test_think_prefix_forces_complex(self):
        """/think prefix is already stripped by agent.py; force_complex=True is passed."""
        router = self._make_router("medium")
        tier, reply = await router.route("analyse this", [], force_complex=True)
        assert tier == "complex"
        assert reply is None
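
The expectations in `TestParseTier` pin down a precedence order: `complex` is checked before `medium`, descriptive words such as "trivial" or "basic" map to light, and anything unrecognised falls back to medium. The sketch below is one way to satisfy those cases. It is not the code in `router.py`; the name `parse_tier_sketch` and the hint tuple are assumptions made for illustration.

```python
# Hypothetical hint words, inferred only from the parametrized test cases.
_LIGHT_HINTS = ("light", "simple", "trivial", "basic", "easy", "general")


def parse_tier_sketch(raw: str) -> str:
    """Map raw classifier output to 'light', 'medium', or 'complex'."""
    # Only the first 60 characters are inspected, as the test comment states.
    head = raw.strip().lower()[:60]
    if "complex" in head:  # checked first, so it wins over "medium"
        return "complex"
    if "medium" in head:
        return "medium"
    if any(hint in head for hint in _LIGHT_HINTS):
        return "light"
    return "medium"  # unknown or empty output defaults to medium


print(parse_tier_sketch("medium complexity, not complex"))
```

Checking `complex` first is what makes the last test case resolve to complex even though "medium" appears earlier in the string.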
--- a/tests/unit/test_vram_manager.py
+++ b/tests/unit/test_vram_manager.py
@@ -0,0 +1,164 @@
 """Unit tests for vram_manager.py — VRAMManager flush/poll/prewarm logic."""
 import asyncio
 import pytest
 from unittest.mock import AsyncMock, MagicMock, patch
 from vram_manager import VRAMManager
 BASE_URL = "http://localhost:11434"
 def _make_manager() -> VRAMManager:
    return VRAMManager(base_url=BASE_URL)
 def _mock_client(get_response=None, post_response=None):
    """Return a context-manager mock for httpx.AsyncClient."""
    client = AsyncMock()
    client.__aenter__ = AsyncMock(return_value=client)
    client.__aexit__ = AsyncMock(return_value=False)
    if get_response is not None:
        client.get = AsyncMock(return_value=get_response)
    if post_response is not None:
        client.post = AsyncMock(return_value=post_response)
    return client
 # ── _flush ─────────────────────────────────────────────────────────────────────
 class TestFlush:
    async def test_sends_keep_alive_zero(self):
        client = _mock_client(post_response=MagicMock())
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            await mgr._flush("qwen3:4b")
        client.post.assert_awaited_once()
        _, kwargs = client.post.await_args
        body = kwargs.get("json") or client.post.call_args[1].get("json") or client.post.call_args[0][1]
        assert body["model"] == "qwen3:4b"
        assert body["keep_alive"] == 0
    async def test_posts_to_correct_endpoint(self):
        client = _mock_client(post_response=MagicMock())
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            await mgr._flush("qwen3:8b")
        url = client.post.call_args[0][0]
        assert url == f"{BASE_URL}/api/generate"
    async def test_ignores_exceptions_silently(self):
        client = AsyncMock()
        client.__aenter__ = AsyncMock(return_value=client)
        client.__aexit__ = AsyncMock(return_value=False)
        client.post = AsyncMock(side_effect=Exception("connection refused"))
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            # Should not raise
            await mgr._flush("qwen3:4b")
 # ── _prewarm ───────────────────────────────────────────────────────────────────
 class TestPrewarm:
    async def test_sends_keep_alive_300(self):
        client = _mock_client(post_response=MagicMock())
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            await mgr._prewarm("qwen3:4b")
        _, kwargs = client.post.await_args
        body = kwargs.get("json") or client.post.call_args[1].get("json") or client.post.call_args[0][1]
        assert body["keep_alive"] == 300
        assert body["model"] == "qwen3:4b"
    async def test_ignores_exceptions_silently(self):
        client = AsyncMock()
        client.__aenter__ = AsyncMock(return_value=client)
        client.__aexit__ = AsyncMock(return_value=False)
        client.post = AsyncMock(side_effect=Exception("timeout"))
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            await mgr._prewarm("qwen3:4b")
 # ── _poll_evicted ──────────────────────────────────────────────────────────────
 class TestPollEvicted:
    async def test_returns_true_when_models_absent(self):
        resp = MagicMock()
        resp.json.return_value = {"models": [{"name": "some_other_model"}]}
        client = _mock_client(get_response=resp)
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            result = await mgr._poll_evicted(["qwen3:4b", "qwen2.5:1.5b"], timeout=5)
        assert result is True
    async def test_returns_false_on_timeout_when_model_still_loaded(self):
        resp = MagicMock()
        resp.json.return_value = {"models": [{"name": "qwen3:4b"}]}
        client = _mock_client(get_response=resp)
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            result = await mgr._poll_evicted(["qwen3:4b"], timeout=0.1)
        assert result is False
    async def test_returns_true_immediately_if_already_empty(self):
        resp = MagicMock()
        resp.json.return_value = {"models": []}
        client = _mock_client(get_response=resp)
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            result = await mgr._poll_evicted(["qwen3:4b"], timeout=5)
        assert result is True
    async def test_handles_poll_error_and_continues(self):
        """If /api/ps errors, polling continues until timeout."""
        client = AsyncMock()
        client.__aenter__ = AsyncMock(return_value=client)
        client.__aexit__ = AsyncMock(return_value=False)
        client.get = AsyncMock(side_effect=Exception("network error"))
        with patch("vram_manager.httpx.AsyncClient", return_value=client):
            mgr = _make_manager()
            result = await mgr._poll_evicted(["qwen3:4b"], timeout=0.2)
        assert result is False
 # ── enter_complex_mode / exit_complex_mode ─────────────────────────────────────
 class TestComplexMode:
    async def test_enter_complex_mode_returns_true_on_success(self):
        mgr = _make_manager()
        mgr._flush = AsyncMock()
        mgr._poll_evicted = AsyncMock(return_value=True)
        result = await mgr.enter_complex_mode()
        assert result is True
    async def test_enter_complex_mode_flushes_medium_models(self):
        mgr = _make_manager()
        mgr._flush = AsyncMock()
        mgr._poll_evicted = AsyncMock(return_value=True)
        await mgr.enter_complex_mode()
        flushed = {call.args[0] for call in mgr._flush.call_args_list}
        assert "qwen3:4b" in flushed
        assert "qwen2.5:1.5b" in flushed
    async def test_enter_complex_mode_returns_false_on_eviction_timeout(self):
        mgr = _make_manager()
        mgr._flush = AsyncMock()
        mgr._poll_evicted = AsyncMock(return_value=False)
        result = await mgr.enter_complex_mode()
        assert result is False
    async def test_exit_complex_mode_flushes_complex_and_prewarms_medium(self):
        mgr = _make_manager()
        mgr._flush = AsyncMock()
        mgr._prewarm = AsyncMock()
        await mgr.exit_complex_mode()
        # Must flush 8b
        flushed = {call.args[0] for call in mgr._flush.call_args_list}
        assert "qwen3:8b" in flushed
        # Must prewarm medium models
        prewarmed = {call.args[0] for call in mgr._prewarm.call_args_list}
        assert "qwen3:4b" in prewarmed
        assert "qwen2.5:1.5b" in prewarmed
--- a/tests/use_cases/apple_pie_research.md
+++ b/tests/use_cases/apple_pie_research.md
@@ -0,0 +1,41 @@
 # Use Case: Apple Pie Research
 Verify that a deep research query triggers the complex tier, uses web search and
 page fetching, and produces a substantive, well-sourced recipe response.
 ## Steps
 **1. Send the research query** (the `/think` prefix forces complex tier):
 ```bash
 curl -s -X POST http://localhost:8000/message \
  -H "Content-Type: application/json" \
  -d '{"text": "/think what is the best recipe for an apple pie?", "session_id": "use-case-apple-pie", "channel": "cli", "user_id": "claude"}'
 ```
 **2. Wait for the streaming reply** (complex tier can take up to 5 minutes):
 ```bash
 curl -s -N --max-time 300 "http://localhost:8000/stream/use-case-apple-pie"
 ```
 **3. Confirm tier and tool usage in agent logs:**
 ```bash
 docker compose -f /home/alvis/adolf/docker-compose.yml logs deepagents \
  --since=600s | grep -E "tier=complex|web_search|fetch_url|crawl4ai"
 ```
 ## Evaluate (use your judgment)
 Check each of the following:
 - **Tier**: logs show `tier=complex` for this session
 - **Tool use**: logs show `web_search` or `fetch_url` calls during the request
 - **Ingredients**: response lists specific apple pie ingredients (apples, flour, butter, sugar, etc.)
 - **Method**: response includes preparation or baking steps
 - **Sources**: response cites real URLs it fetched, not invented links
 - **Quality**: response is structured and practical — not a refusal, stub, or generic placeholder
 Report PASS only if all six criteria are met. For any failure, state which criterion
 failed and quote the relevant part of the response or logs.
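
The first two criteria (tier and tool use) can also be checked mechanically from captured log lines. The sketch below reuses the patterns from the `grep -E` command in step 3. The function name `summarize_logs` and the sample log lines are invented for illustration and do not reproduce the real log format.

```python
import re
from typing import Dict, Iterable, List, Union

TIER_RE = re.compile(r"tier=(\w+)")
TOOL_RE = re.compile(r"web_search|fetch_url|crawl4ai")


def summarize_logs(lines: Iterable[str]) -> Dict[str, Union[bool, List[str]]]:
    """Report whether tier=complex appears and which tool names were logged."""
    lines = list(lines)
    tiers = [m.group(1) for line in lines for m in TIER_RE.finditer(line)]
    tools = sorted({m.group(0) for line in lines for m in TOOL_RE.finditer(line)})
    return {"complex_seen": "complex" in tiers, "tools": tools}


# Invented sample lines; real deepagents logs may look different.
sample = [
    "routing decided tier=complex",
    "tool call: web_search query='apple pie recipe'",
    "tool call: fetch_url https://example.com/recipe",
]
print(summarize_logs(sample))
```

The remaining four criteria concern the content of the response and still need the judgment described above.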
--- a/tests/use_cases/cli_startup.md
+++ b/tests/use_cases/cli_startup.md
@@ -0,0 +1,18 @@
 # Use Case: CLI Startup
 Verify the Adolf CLI container starts cleanly, shows the welcome banner,
 and exits without error when the user closes input.
 ## Steps
 ```bash
 echo "" | docker compose --profile tools run --rm -T cli \
  python3 cli.py --url http://deepagents:8000 --session use-case-cli-startup
 echo "exit code: $?"
 ```
 ## Pass if
 - Output contains `Adolf CLI`
 - Output contains the session name and gateway URL
 - Exit code is 0
--- a/tests/use_cases/weather_now.md
+++ b/tests/use_cases/weather_now.md
@@ -0,0 +1,40 @@
 # Use Case: Current Weather Query
 Verify how Adolf handles a real-time information request ("what's the weather now?").
 This question requires live data that an LLM cannot answer from training alone.
 ## Steps
 **1. Send the weather query:**
 ```bash
 curl -s -X POST http://localhost:8000/message \
  -H "Content-Type: application/json" \
  -d '{"text": "whats the weather right now?", "session_id": "use-case-weather", "channel": "cli", "user_id": "claude"}'
 ```
 **2. Stream the reply** (medium tier should respond within 30s):
 ```bash
 curl -s -N --max-time 60 "http://localhost:8000/stream/use-case-weather"
 ```
 **3. Check routing tier and any tool usage in logs:**
 ```bash
 docker compose -f /home/alvis/adolf/docker-compose.yml logs deepagents \
  --since=120s | grep -E "tier=|web_search|fetch_url|crawl4ai"
 ```
 ## Evaluate (use your judgment)
 Check each of the following:
 - **Routing**: which tier was selected? Was it appropriate for a real-time query?
 - **Tool use**: did the agent use web_search or any external data source?
 - **Accuracy**: does the response contain actual current weather data (temperature, conditions) or is it a guess/refusal?
 - **Honesty**: if the agent cannot fetch weather, does it say so — or does it hallucinate fake data?
 - **Helpfulness**: does the response suggest how the user could get weather info (e.g. check a website, use /think)?
 Report PASS only if the response is both honest and helpful. A hallucinated weather
 report is a FAIL. An honest "I can't check weather" with guidance is a PASS.
Author	SHA1	Message	Date
alvis	887d4b8d90	voice benchmark: rename --dry-run → --no-inference, fix log extraction - --no-inference applies to all tiers (not just complex) - metadata key: dry_run → no_inference - extract_tier_from_logs: forward iteration (not reversed), updated regex - GPU check skipped when --no-inference - Fix TypeError in misclassified print when actual=None Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 07:58:05 +00:00
alvis	4e6d3090c2	Remove benchmark.json from gitignore — dataset is now tracked Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 07:53:35 +00:00
alvis	5b09a99a7f	Routing: 100% accuracy on realistic home assistant dataset - router.py: skip light reply generation when no_inference=True; add control words (да/нет/стоп/отмена/повтори/подожди/etc.) to _LIGHT_PATTERNS - agent.py: pass no_inference to router.route(); skip preflight IO in no_inference mode - benchmarks/benchmark.json: replace definition-heavy queries with realistic Alexa/Google-Home style queries (greetings, smart home, timers, shopping, weather, personal memory, cooking) — 30 light / 60 medium / 30 complex Routing benchmark: 120/120 (100%), all under 0.1s per query Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 07:53:01 +00:00
alvis	3fb90ae083	Skip _reply_semaphore in no_inference mode No GPU inference happens in this mode, so serialization is not needed. Without this, timed-out routing benchmark queries hold the semaphore and cascade-block all subsequent queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 07:40:07 +00:00
alvis	4d37ac65b2	Skip preflight IO (memory/URL/fast-tools) when no_inference=True In no_inference mode only the routing decision matters — fetching memories and URLs adds latency without affecting the classification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 07:37:55 +00:00
alvis	b7d5896076	routing benchmark: 1s strict deadline per query QUERY_TIMEOUT=1s — classification and routing must complete within 1 second or the query is recorded as 'timeout'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 07:35:13 +00:00
alvis	fc53632c7b	Merge pull request 'feat: rename dry_run to no_inference for all tiers' (#17) from worktree-agent-afc013ce into main Reviewed-on: #17	2026-03-24 07:27:04 +00:00
alvis	47a1166be6	Merge pull request 'feat: rename --dry-run to --no-inference in run_benchmark.py' (#18) from feat/no-inference-benchmark into main Reviewed-on: #18	2026-03-24 07:26:44 +00:00
alvis	74e5b1758d	Merge pull request 'feat: add run_routing_benchmark.py — routing-only benchmark' (#19) from feat/routing-benchmark into main Reviewed-on: #19	2026-03-24 07:26:31 +00:00
alvis	0fbdbf3a5e	Add run_routing_benchmark.py — dedicated routing-only benchmark Tests routing accuracy for all tiers with no_inference=True hardcoded. Fast (QUERY_TIMEOUT=30s), no GPU check, shares benchmark.json dataset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 07:25:16 +00:00
alvis	77db739819	Rename --dry-run to --no-inference, apply to all tiers in run_benchmark.py No-inference mode now skips LLM for all tiers (not just complex), GPU check is auto-skipped, and the metadata key matches agent.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 03:49:09 +00:00
alvis	9c2f27eed4	Rename dry_run → no_inference, extend to all tiers in agent.py When no_inference=True, routing decision is captured but all LLM inference is skipped — yields constant "I don't know" immediately. Also disables fast-tool short-circuit so routing path always runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 03:43:42 +00:00
alvis	a363347ae5	Merge pull request 'Fix routing: add Russian tech def patterns to light, strengthen medium smart home' (#13) from fix/routing-accuracy into main Reviewed-on: #13	2026-03-24 02:51:17 +00:00
alvis	1d2787766e	Merge pull request 'Remove Bifrost: replace test 4 with LiteLLM health check' (#14) from fix/remove-bifrost into main Reviewed-on: #14	2026-03-24 02:48:40 +00:00
alvis	abf792a2ec	Remove Bifrost: replace test 4 with LiteLLM health check - Remove BIFROST constant and fetch_bifrost_logs() from common.py - Add LITELLM constant (localhost:4000) - Replace test_memory.py test 4 (Bifrost pass-through) with LiteLLM health check Fixes #5 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 02:46:01 +00:00
alvis	537e927146	Fix routing: add Russian tech def patterns to light, strengthen medium smart home - _LIGHT_PATTERNS: add что\s+такое, что\s+означает, сколько бит/байт, compound greetings (привет, как дела) — these fell through to embedding which sometimes misclassified short Russian phrases as medium - _MEDIUM_PATTERNS: add non-verb-first smart home patterns (свет/лампочка as subject, режим/сцена commands) for benchmark queries with different phrasing Fixes #8, #9 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 02:45:42 +00:00
alvis	186e16284b	Merge pull request 'Fix tier logging: capture actual_tier, fix parse_run_block regex, remove reply_text truncation' (#11) from fix/tier-logging into main Reviewed-on: #11	2026-03-24 02:44:35 +00:00
alvis	0b428e4ada	Merge pull request 'Fix benchmark log extraction: first tier match, increase log tail to 300' (#12) from fix/benchmark-log-extraction into main Reviewed-on: #12	2026-03-24 02:43:26 +00:00
alvis	98095679be	Fix benchmark log extraction: first tier match, increase log tail to 300 - Remove reversed() from extract_tier_from_logs: first match = routing decision (dry-run complex logs tier=complex early, then overwrites with tier=medium at done) - Increase log tail from 80→300 to handle concurrent log activity Fixes #7, #10 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 02:42:27 +00:00
alvis	8ef4897869	Fix tier logging: capture actual_tier, fix parse_run_block regex, remove reply_text truncation - Add tier_capture param to _run_agent_pipeline; append tier after determination - Capture actual_tier in run_agent_task from tier_capture list - Log tier in replied-in line: [agent] replied in Xs tier=Y - Remove reply_text[:200] truncation (was breaking benchmark keyword matching) - Update parse_run_block regex to match new log format; llm/send fields now None Fixes #1, #3, #4 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 02:41:59 +00:00
Alvis	1f5e272600	Switch from Bifrost to LiteLLM; add Matrix channel; update rules Infrastructure: - docker-compose.yml: replace bifrost container with LiteLLM proxy (host.docker.internal:4000); complex model → deepseek-r1:free via OpenRouter; add Matrix URL env var; mount logs volume - bifrost-config.json: add auth_config + postgres config_store (archived) Routing: - router.py: full semantic 3-tier classifier rewrite — nomic-embed-text centroids for light/medium/complex; regex pre-classifiers for all tiers; Russian utterance sets expanded - agent.py: wire LiteLLM URL; add dry_run support; add Matrix channel Channels: - channels.py: add Matrix adapter (_matrix_send via mx- session prefix) Rules / docs: - agent-pipeline.md: remove /think prefix requirement; document automatic complex tier classification - llm-inference.md: update BIFROST_URL → LITELLM_URL references; add remote model note for complex tier - ARCHITECTURE.md: deleted (superseded by README.md) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 02:14:13 +00:00
Alvis	54cb940279	Update docs: add benchmarks/ section, fix complex tier description - CLAUDE.md: add benchmark commands (run_benchmark.py flags, dry-run, categories, voice benchmark) - README.md: add benchmarks/ to Files tree; fix incorrect claim that complex tier requires /think prefix — it is auto-classified via regex and embedding similarity; fix "Complex agent (/think prefix)" heading Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 02:13:14 +00:00
Alvis	bd951f943f	Move benchmark scripts into benchmarks/ subdir - benchmarks/run_benchmark.py (was run_benchmark.py) - benchmarks/run_voice_benchmark.py (was run_voice_benchmark.py) - Scripts use Path(__file__).parent so paths resolve correctly in subdir - .gitignore updated: ignore benchmarks/benchmark.json, results_latest.json, voice_results*.json, voice_audio/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 02:02:46 +00:00
Alvis	ab68bba935	Add routing benchmark scripts; gitignore dataset and results - run_benchmark.py: sends queries to /message, extracts tier= from docker logs, reports per-tier accuracy, saves results_latest.json - run_voice_benchmark.py: voice path benchmark - .gitignore: ignore benchmark.json (dataset) and results_latest.json (runtime output); benchmark scripts are tracked, data files are not Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-24 02:00:17 +00:00
Alvis	3ae1cefbd4	WeatherTool: fetch open-meteo directly, skip LLM for fast tool replies - Replace SearXNG search with direct open-meteo.com API call (no key needed) - WeatherTool now returns a ready-to-deliver reply string - agent.py: short-circuit router+LLM when fast tools return a result (tier=fast) - router.py: fast tool match no longer triggers light reply generation Weather latency: 105-190s → ~1s Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 09:42:55 +00:00
Alvis	957360f6ce	Restructure CLAUDE.md per official Claude Code recommendations CLAUDE.md: 178→25 lines — commands + @ARCHITECTURE.md import only Rules split into .claude/rules/ (load at startup, topic-scoped): llm-inference.md — Bifrost-only, semaphore, model name format, timeouts agent-pipeline.md — tier rules, no tools in medium, memory outside loop fast-tools.md — extension guide (path-scoped: fast_tools.py + agent.py) secrets.md — .env keys, Vaultwarden, no hardcoding Path-scoped rule: fast-tools.md only loads when editing fast_tools.py or agent.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-13 07:19:09 +00:00
Alvis	3ed47b45da	Split CLAUDE.md per official Claude Code recommendations CLAUDE.md: lean — commands, key conventions, fast tool guide, @ARCHITECTURE.md import routecheck/CLAUDE.md: purpose, access paths, env vars, gotchas openmemory/CLAUDE.md: tools, two Ollama instances, prompts, notes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-13 07:15:51 +00:00
Alvis	eba805f787	Update docs: fast tools, routecheck service, commute tool - Request flow: add fast_tool_runner.run_matching() to pre-flight gather - New Fast Tools section: WeatherTool + CommuteTool table, extension guide - New routecheck section: captcha UI, internal API, proxy requirements - Services table: add routecheck:8090 - Files tree: add fast_tools.py, routecheck/, updated .env note Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-13 07:10:30 +00:00
Alvis	32089ed596	Add routecheck service and CommuteTool fast tool routecheck/ — FastAPI service (port 8090): - Image captcha (PIL: arithmetic problem, noise, wave distortion) - POST /api/captcha/new + /api/captcha/solve → short-lived token - GET /api/route?from=lat,lon&to=lat,lon&token=... → Yandex Routing API - Internal bypass via INTERNAL_TOKEN env var (for CommuteTool) - HTTPS proxy forwarded to reach Yandex API from container CommuteTool (fast_tools.py): - Matches commute/traffic/arrival time queries - Calls routecheck /api/route with ROUTECHECK_TOKEN - Hardcoded route: Balashikha home → Moscow center - Returns traffic-adjusted travel time + delay annotation Needs: YANDEX_ROUTING_KEY + ROUTECHECK_TOKEN in .env Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 07:08:48 +00:00
Alvis	d2ca1926f8	WeatherTool: use Russian query for Celsius sources 'погода Балашиха сейчас' returns Russian weather sites (gismeteo, meteotrend) that report in °C, vs English queries which return Fahrenheit snippets that the model misreads as Celsius. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 06:25:53 +00:00
Alvis	af181ba7ec	Rename RealTimeSearchTool → WeatherTool, fetch Balashikha weather via SearXNG WeatherTool queries SearXNG with a fixed 'weather Balashikha Moscow now' query instead of passing the user message as-is. SearXNG has external internet access and returns snippets with actual current conditions. Direct wttr.in fetch not possible — deepagents container has no external internet routing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 05:40:10 +00:00
Alvis	f5fc2e9bfb	Introduce FastTools: pre-flight classifier + context enrichment New fast_tools.py module: - FastTool base class (matches + run interface) - RealTimeSearchTool: SearXNG search for weather/news/prices/scores - FastToolRunner: classifier that checks all tools, runs matching ones concurrently and returns combined context Router accepts FastToolRunner; any_matches() forces medium tier before LLM classification (replaces _MEDIUM_FORCE_PATTERNS regex). agent.py: _REALTIME_RE and _searxng_search_async removed; pre-flight gather now includes fast_tool_runner.run_matching() alongside URL fetch and memory retrieval. To add a new fast tool: subclass FastTool, add to the list in agent.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 05:18:44 +00:00
Alvis	436299f7e2	Add real-time query handling: pre-search enrichment + routing fix - router.py: add _MEDIUM_FORCE_PATTERNS to block weather/news/price queries from light tier regardless of LLM classification - agent.py: add _REALTIME_RE and _searxng_search_async(); real-time queries now run SearXNG search concurrently with URL fetch + memory retrieval, injecting snippets into medium system prompt - tests/use_cases/weather_now.md: use case test for weather queries Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 05:08:08 +00:00
Alvis	8cd41940f0	Update docs: streaming, CLI container, use_cases tests - /stream/{session_id} SSE endpoint replaces /reply/ for CLI - Medium tier streams per-token via astream() with in_think filtering - CLI now runs as Docker container (Dockerfile.cli, profile:tools) - Correct medium model to qwen3:4b with real-time think block filtering - Add use_cases/ test category to commands section - Update files tree and services table Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 17:31:36 +00:00
Alvis	b04e8a0925	Add Rich token streaming: server SSE + CLI live display + CLI container Server (agent.py): - _stream_queues: per-session asyncio.Queue for token chunks - _push_stream_chunk() / _end_stream() helpers - Medium tier: astream() with <think> block filtering — real token streaming - Light tier: full reply pushed as single chunk then [DONE] - Complex tier: full reply pushed after agent completes then [DONE] - GET /stream/{session_id} SSE endpoint (data: <chunk>\n\n, data: [DONE]\n\n) - medium_model promoted to module-level global for astream() access CLI (cli.py): - stream_reply(): reads /stream/ SSE, renders tokens live with Rich Live (transient) - Final reply rendered as Markdown after stream completes - os.getlogin() replaced with os.getenv("USER") for container compatibility Dockerfile.cli + docker-compose cli service (profiles: tools): - Run: docker compose --profile tools run --rm -it cli Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 17:26:52 +00:00
Alvis	edc9a96f7a	Add use_cases test category as Claude Code skill instructions Use cases are markdown files that Claude Code reads, executes step by step using its tools, and evaluates with its own judgment — not assertion scripts. - cli_startup.md: pipe EOF into cli.py, verify banner and exit code 0 - apple_pie_research.md: /think query → complex tier → web_search + fetch → evaluate recipe quality, sources, and structure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 17:01:13 +00:00
Alvis	a35ba83db7	Add use_cases test category with CLI startup test. tests/use_cases/ holds scenario-driven tests run by the Claude Code agent, which acts as both the test runner and mock user. Each test prints a structured transcript; Claude evaluates correctness. First test: test_cli_startup.py, which spawns cli.py in a subprocess, reads the welcome banner, sends EOF, and verifies exit code 0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 16:10:04 +00:00
Alvis	021104f510	Split monolithic test_pipeline.py into focused integration test scripts - common.py: shared config, URL constants, benchmark questions, all helpers (get, post_json, check_sse, qdrant_count, fetch_logs, parse_run_block, wait_for, etc.) - test_health.py: service health checks (deepagents, bifrost, GPU/CPU Ollama, Qdrant, SearXNG) - test_memory.py: name store/recall pipeline, memory benchmark (5 facts + 10 recalls), dedup test - test_routing.py: easy/medium/hard tier routing benchmarks with --easy/medium/hard-only flags - Removed test_pipeline.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 16:02:57 +00:00
Alvis	50097d6092	Embed Crawl4AI at all tiers, restore qwen3:4b medium, update docs - Pre-routing URL fetch: any message with URLs gets content fetched async (httpx.AsyncClient) before routing via _fetch_urls_from_message() - URL context and memories gathered concurrently with asyncio.gather - Light tier upgraded to medium when URL content is present - url_context injected into system prompt for medium and complex agents - Complex agent retains web_search/fetch_url tools + receives pre-fetched content - Medium model restored to qwen3:4b (was temporarily qwen2.5:1.5b) - Unit tests added for _extract_urls - ARCHITECTURE.md: added Tool Handling, Crawl4AI Integration, Memory Pipeline sections - CLAUDE.md: updated request flow and Crawl4AI integration docs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 15:49:34 +00:00
Alvis	f9618a9bbf	Integrate Bifrost LLM gateway, add test suite, implement memory pipeline - Add Bifrost (maximhq/bifrost) as LLM gateway: all inference routes through bifrost:8080/v1 with retry logic and observability; VRAMManager keeps direct Ollama access for VRAM flush/prewarm operations - Switch medium model from qwen3:4b to qwen2.5:1.5b (direct call, no tools) via _DirectModel wrapper; complex keeps create_deep_agent with qwen3:8b - Implement out-of-agent memory pipeline: _retrieve_memories pre-fetches relevant context (injected into all tiers), _store_memory runs as background task after each reply writing to openmemory/Qdrant - Add tests/unit/ with 133 tests covering router, channels, vram_manager, agent helpers; move integration test to tests/integration/ - Add bifrost-config.json with GPU Ollama (qwen2.5:0.5b/1.5b, qwen3:4b/8b, gemma3:4b) and CPU Ollama providers - Integration test 28/29 pass (only grammy fails — no TELEGRAM_BOT_TOKEN) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 13:50:12 +00:00
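
The FastTools commit in this log (f5fc2e9bfb) describes a `FastTool` base class with a matches + run interface and a `FastToolRunner` that checks all tools, runs the matching ones concurrently, and returns combined context. The following is a minimal sketch of that shape, not the code in `fast_tools.py`. Only the names `FastTool`, `FastToolRunner`, `any_matches()` and `run_matching()` come from the commit messages; the method signatures, the `StubWeatherTool` class, its regex, and the `[name]` block format are assumptions made for illustration.

```python
import asyncio
import re
from abc import ABC, abstractmethod


class FastTool(ABC):
    """Hypothetical base class: a cheap matcher plus an async enrichment step."""

    name = "fast_tool"

    @abstractmethod
    def matches(self, message: str) -> bool:
        """Return True if this tool should run for the message."""

    @abstractmethod
    async def run(self, message: str) -> str:
        """Return a context snippet for the system prompt."""


class StubWeatherTool(FastTool):
    """Illustrative stand-in; a real tool would query a search backend."""

    name = "weather"
    _pattern = re.compile(r"\b(weather|forecast|temperature)\b", re.IGNORECASE)

    def matches(self, message: str) -> bool:
        return bool(self._pattern.search(message))

    async def run(self, message: str) -> str:
        await asyncio.sleep(0)  # placeholder for network I/O
        return "stub weather snippet"


class FastToolRunner:
    def __init__(self, tools):
        self.tools = list(tools)

    def any_matches(self, message: str) -> bool:
        # A router could use this to force the medium tier before classification.
        return any(tool.matches(message) for tool in self.tools)

    async def run_matching(self, message: str) -> str:
        matching = [tool for tool in self.tools if tool.matches(message)]
        if not matching:
            return ""
        # Run all matching tools concurrently; a failing tool must not sink the rest.
        results = await asyncio.gather(
            *(tool.run(message) for tool in matching), return_exceptions=True
        )
        blocks = [
            f"[{tool.name}]\n{result}"
            for tool, result in zip(matching, results)
            if isinstance(result, str) and result
        ]
        return "\n\n".join(blocks)
```

Under these assumptions, adding a tool means subclassing `FastTool` and appending an instance to the list passed to `FastToolRunner`, which mirrors the extension note in the commit message. The combined string would be injected into the system prompt rather than returned to the user.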