Update Adolf wiki: benchmarking section, fix complex tier docs, add benchmarks/ files

2026-03-24 02:13:06 +00:00
parent 9a7c64d902
commit e3242f4ec2

@@ -19,10 +19,10 @@ Pre-flight (asyncio.gather — all parallel):
Fast tool matched? → deliver reply directly (no LLM)
↓ (if no fast tool)
Router (qwen2.5:1.5b)
Router (qwen2.5:1.5b + nomic-embed-text)
- light: simple/conversational → router answers directly (~2–4s)
- medium: default → qwen3:4b single call (~10–20s)
- complex: /think prefix → qwen3:8b + web_search + fetch_url (~60–120s)
- complex: deep research queries → remote model + web_search + fetch_url (~60–120s)
channels.deliver() → Telegram / CLI SSE stream
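The pre-flight short-circuit above can be sketched as follows; the tool and function names here are illustrative stand-ins, not the real `agent.py` API:

```python
import asyncio

async def check_weather(text: str):
    # Illustrative fast tool: answers only when its trigger pattern matches.
    return "Sunny, 21C" if "weather" in text.lower() else None

async def check_commute(text: str):
    return "Traffic is light" if "commute" in text.lower() else None

async def route_to_llm(text: str):
    # Stand-in for the light/medium/complex routing path.
    return f"[routed] {text}"

async def handle_message(text: str) -> str:
    # Pre-flight: every fast tool runs concurrently (asyncio.gather).
    replies = await asyncio.gather(check_weather(text), check_commute(text))
    for reply in replies:
        if reply is not None:          # fast tool matched: deliver directly, no LLM
            return reply
    return await route_to_llm(text)    # no match: fall through to the router
```

If any fast tool returns a reply, no model is ever invoked for that turn, which is what keeps the Fast tier at ~1s.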
@@ -34,11 +34,11 @@ asyncio.create_task(_store_memory()) — background
| Tier | Model | Trigger | Latency |
|------|-------|---------|---------|
| Fast | — (no LLM) | Fast tool matched (weather, commute) | ~1s |
| Light | qwen2.5:1.5b (router) | Regex or LLM classifies "light" | ~2–4s |
| Light | qwen2.5:1.5b (router) | Regex or embedding classifies "light" | ~2–4s |
| Medium | qwen3:4b | Default | ~10–20s |
| Complex | qwen3:8b | `/think` prefix only | ~60–120s |
| Complex | deepseek-r1 (remote via LiteLLM) | Regex pre-classifier or embedding similarity | ~60–120s |
Complex tier is locked behind `/think` — LLM classification of "complex" is downgraded to medium.
Routing is automatic — no prefix needed. Complex tier triggers on Russian research keywords (`исследуй` "research", `изучи все` "study everything", `напиши подробный` "write a detailed", etc.) and embedding similarity. Force complex tier via the `adolf-deep` model name in the OpenAI-compatible API.
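A minimal sketch of that two-stage decision: the regex pre-classifier fires on explicit research phrasing, and everything else is routed by 3-way cosine similarity against per-tier anchor embeddings. The `embed` callable stands in for nomic-embed-text, and the anchor vectors are assumptions:

```python
import math
import re

# Regex pre-classifier for explicit deep-research phrasing (the same
# keywords the wiki lists above).
COMPLEX_RE = re.compile(r"исследуй|изучи все|напиши подробный", re.IGNORECASE)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def route(text, embed, anchors):
    # anchors: tier name -> reference embedding for that tier.
    if COMPLEX_RE.search(text):
        return "complex"              # regex pre-classifier short-circuits
    vec = embed(text)                 # nomic-embed-text in the real router
    return max(anchors, key=lambda tier: cosine(vec, anchors[tier]))
```

Embedding similarity catches research-style queries the keyword list misses, while the regex keeps obvious cases cheap and deterministic.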
## Fast Tools
@@ -46,7 +46,7 @@ Pre-flight tools run concurrently before any LLM call. If matched, the result is
| Tool | Pattern | Source | Latency |
|------|---------|--------|---------|
| `WeatherTool` | weather/forecast/temperature/... | open-meteo.com API (Balashikha, no key) | ~200ms |
| `WeatherTool` | weather/forecast/temperature/... | SearXNG → Russian weather sites | ~1s |
| `CommuteTool` | commute/traffic/пробки/... | routecheck:8090 → Yandex Routing API | ~1–2s |
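The dispatch behind this table could look roughly like the following; class and method names are assumptions rather than the real `fast_tools.py` interface:

```python
import re

class WeatherTool:
    # Trigger pattern from the table above.
    pattern = re.compile(r"weather|forecast|temperature", re.I)
    def run(self, text):
        return "weather: SearXNG lookup"

class CommuteTool:
    pattern = re.compile(r"commute|traffic|пробки", re.I)
    def run(self, text):
        return "commute: routecheck lookup"

def match_fast_tool(text):
    # First tool whose trigger pattern matches handles the message.
    for tool in (WeatherTool(), CommuteTool()):
        if tool.pattern.search(text):
            return tool.run(text)
    return None   # nothing matched: continue to the router
```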
## Memory Pipeline
@@ -64,15 +64,31 @@ GTX 1070 (8 GB). Flush qwen3:4b before loading qwen3:8b for complex tier.
3. Fallback to medium on timeout
4. After complex reply: flush 8b, pre-warm medium + router
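The four steps above, sketched against Ollama's `keep_alive` knob (a request with `keep_alive: 0` unloads the model immediately; any generate call loads one). The HTTP poster is injected so the sketch stays network-free; the function names are hypothetical:

```python
def flush(post, model):
    # keep_alive=0 asks Ollama to unload the model now, freeing VRAM.
    post("/api/generate", {"model": model, "keep_alive": 0})

def prewarm(post, model):
    # A bare generate call loads the model into VRAM without producing output.
    post("/api/generate", {"model": model})

def complex_turn(post, ask):
    flush(post, "qwen3:4b")        # 1. free VRAM held by the medium model
    prewarm(post, "qwen3:8b")      # 2. load the complex model
    reply = ask("qwen3:8b")        # 3. (timeout fallback to medium elided)
    flush(post, "qwen3:8b")        # 4. after the reply: flush 8b,
    prewarm(post, "qwen3:4b")      #    then pre-warm medium + router
    prewarm(post, "qwen2.5:1.5b")
    return reply
```

In production `post` would be `requests.post` against the Ollama endpoint.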
## Benchmarking
Routing accuracy benchmark: `benchmarks/run_benchmark.py`
120 queries across 3 tiers and 10 categories (Russian + English). Sends each query to `/message`, waits for SSE `[DONE]`, extracts `tier=` from `docker logs deepagents`, compares to expected tier.
```bash
cd ~/adolf/benchmarks
python3 run_benchmark.py # full run
python3 run_benchmark.py --tier light # light only
python3 run_benchmark.py --tier complex --dry-run # complex routing, no API cost
python3 run_benchmark.py --list-categories
```
Latest known results (open issues #7–#10 in Gitea):
- Light: 11/30 (37%) — tech definition queries mis-routed to medium
- Medium: 13/50 (26%) — smart home commands mis-routed to light; many timeouts
- Complex: 0/40 (0%) — log extraction failures + pattern gaps
Dataset (`benchmark.json`) and results (`results_latest.json`) are gitignored.
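The log-scraping step described above might reduce to something like this; only the `tier=<name>` log format comes from the text, everything else is illustrative:

```python
import re

def extract_tier(log_text):
    # Router logs lines containing "tier=<name>"; take the newest match,
    # since older routing decisions may still be in the log window.
    hits = re.findall(r"tier=(\w+)", log_text)
    return hits[-1] if hits else None

def score(cases, routed_tier):
    # cases: list of (query, expected_tier); routed_tier(query) -> actual tier.
    correct = sum(1 for query, want in cases if routed_tier(query) == want)
    return correct, len(cases)
```

Returning `None` when no `tier=` line is found makes log-extraction failures visible as mismatches, which is the failure mode the complex-tier results above point at.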
## SearXNG
Port 11437. Used by `web_search` tool in complex tier.
Disabled slow/broken engines: **startpage** (3s timeout), **google news** (timeout), **qwant news/images/videos** (access denied).
Fast enabled engines: bing, duckduckgo, brave, google, yahoo (~300–1000ms).
Config: `/mnt/ssd/ai/searxng/config/settings.yml`
## Compose Stack
Repo: `~/adolf/` → `http://localhost:3000/alvis/adolf`
@@ -91,12 +107,15 @@ Requires `~/adolf/.env`: `TELEGRAM_BOT_TOKEN`, `ROUTECHECK_TOKEN`, `YANDEX_ROUTI
~/adolf/
├── docker-compose.yml Services: bifrost, deepagents, openmemory, grammy, crawl4ai, routecheck, cli
├── agent.py FastAPI gateway, run_agent_task, fast tool short-circuit, memory pipeline
├── fast_tools.py WeatherTool (open-meteo), CommuteTool (routecheck), FastToolRunner
├── router.py Router — regex + qwen2.5:1.5b classification
├── fast_tools.py WeatherTool, CommuteTool, FastToolRunner
├── router.py Router — regex pre-classifiers + nomic-embed-text 3-way cosine similarity
├── channels.py Channel registry + deliver()
├── vram_manager.py VRAMManager — flush/poll/prewarm Ollama VRAM
├── agent_factory.py _DirectModel (medium) / create_deep_agent (complex)
├── cli.py Rich Live streaming REPL
├── benchmarks/
│ ├── run_benchmark.py Routing accuracy benchmark (120 queries, 3 tiers)
│ └── run_voice_benchmark.py Voice path benchmark
├── routecheck/ Yandex Routing API proxy (port 8090)
├── openmemory/ FastMCP + mem0 MCP server (port 8765)
└── grammy/ grammY Telegram bot (port 3001)