18 Commits

Author SHA1 Message Date
887d4b8d90 voice benchmark: rename --dry-run → --no-inference, fix log extraction
- --no-inference applies to all tiers (not just complex)
- metadata key: dry_run → no_inference
- extract_tier_from_logs: forward iteration (not reversed), updated regex
- GPU check skipped when --no-inference
- Fix TypeError in misclassified print when actual=None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:58:05 +00:00
4e6d3090c2 Remove benchmark.json from gitignore — dataset is now tracked
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:53:35 +00:00
5b09a99a7f Routing: 100% accuracy on realistic home assistant dataset
- router.py: skip light reply generation when no_inference=True;
  add control words (да/нет/стоп/отмена/повтори/подожди/etc.) to _LIGHT_PATTERNS
- agent.py: pass no_inference to router.route(); skip preflight IO in no_inference mode
- benchmarks/benchmark.json: replace definition-heavy queries with realistic
  Alexa/Google-Home style queries (greetings, smart home, timers, shopping,
  weather, personal memory, cooking) — 30 light / 60 medium / 30 complex

Routing benchmark: 120/120 (100%), all under 0.1s per query

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:53:01 +00:00
3fb90ae083 Skip _reply_semaphore in no_inference mode
No GPU inference happens in this mode, so serialization is not needed.
Without this, timed-out routing benchmark queries hold the semaphore
and cascade-block all subsequent queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:40:07 +00:00
4d37ac65b2 Skip preflight IO (memory/URL/fast-tools) when no_inference=True
In no_inference mode only the routing decision matters — fetching
memories and URLs adds latency without affecting the classification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:37:55 +00:00
b7d5896076 routing benchmark: 1s strict deadline per query
QUERY_TIMEOUT=1s — classification and routing must complete within
1 second or the query is recorded as 'timeout'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:35:13 +00:00
fc53632c7b Merge pull request 'feat: rename dry_run to no_inference for all tiers' (#17) from worktree-agent-afc013ce into main
Reviewed-on: #17
2026-03-24 07:27:04 +00:00
47a1166be6 Merge pull request 'feat: rename --dry-run to --no-inference in run_benchmark.py' (#18) from feat/no-inference-benchmark into main
Reviewed-on: #18
2026-03-24 07:26:44 +00:00
74e5b1758d Merge pull request 'feat: add run_routing_benchmark.py — routing-only benchmark' (#19) from feat/routing-benchmark into main
Reviewed-on: #19
2026-03-24 07:26:31 +00:00
0fbdbf3a5e Add run_routing_benchmark.py — dedicated routing-only benchmark
Tests routing accuracy for all tiers with no_inference=True hardcoded.
Fast (QUERY_TIMEOUT=30s), no GPU check, shares benchmark.json dataset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 07:25:16 +00:00
77db739819 Rename --dry-run to --no-inference, apply to all tiers in run_benchmark.py
No-inference mode now skips LLM for all tiers (not just complex),
GPU check is auto-skipped, and the metadata key matches agent.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 03:49:09 +00:00
9c2f27eed4 Rename dry_run → no_inference, extend to all tiers in agent.py
When no_inference=True, routing decision is captured but all LLM
inference is skipped — yields constant "I don't know" immediately.
Also disables fast-tool short-circuit so routing path always runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 03:43:42 +00:00
a363347ae5 Merge pull request 'Fix routing: add Russian tech def patterns to light, strengthen medium smart home' (#13) from fix/routing-accuracy into main
Reviewed-on: #13
2026-03-24 02:51:17 +00:00
1d2787766e Merge pull request 'Remove Bifrost: replace test 4 with LiteLLM health check' (#14) from fix/remove-bifrost into main
Reviewed-on: #14
2026-03-24 02:48:40 +00:00
abf792a2ec Remove Bifrost: replace test 4 with LiteLLM health check
- Remove BIFROST constant and fetch_bifrost_logs() from common.py
- Add LITELLM constant (localhost:4000)
- Replace test_memory.py test 4 (Bifrost pass-through) with LiteLLM health check

Fixes #5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:46:01 +00:00
537e927146 Fix routing: add Russian tech def patterns to light, strengthen medium smart home
- _LIGHT_PATTERNS: add что\s+такое, что\s+означает, сколько бит/байт,
  compound greetings (привет, как дела) — these fell through to embedding
  which sometimes misclassified short Russian phrases as medium
- _MEDIUM_PATTERNS: add non-verb-first smart home patterns (свет/лампочка
  as subject, режим/сцена commands) for benchmark queries with different phrasing

Fixes #8, #9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:45:42 +00:00
186e16284b Merge pull request 'Fix tier logging: capture actual_tier, fix parse_run_block regex, remove reply_text truncation' (#11) from fix/tier-logging into main
Reviewed-on: #11
2026-03-24 02:44:35 +00:00
8ef4897869 Fix tier logging: capture actual_tier, fix parse_run_block regex, remove reply_text truncation
- Add tier_capture param to _run_agent_pipeline; append tier after determination
- Capture actual_tier in run_agent_task from tier_capture list
- Log tier in replied-in line: [agent] replied in Xs tier=Y
- Remove reply_text[:200] truncation (was breaking benchmark keyword matching)
- Update parse_run_block regex to match new log format; llm/send fields now None

Fixes #1, #3, #4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:41:59 +00:00
9 changed files with 453 additions and 95 deletions

1
.gitignore vendored
View File

@@ -2,7 +2,6 @@ __pycache__/
*.pyc
logs/*.jsonl
adolf_tuning_data/voice_audio/
benchmarks/benchmark.json
benchmarks/results_latest.json
benchmarks/voice_results*.json
benchmarks/voice_audio/

View File

@@ -2,7 +2,7 @@ import asyncio
import json as _json_module
import os
import time
from contextlib import asynccontextmanager
from contextlib import asynccontextmanager, nullcontext
from pathlib import Path
from fastapi import FastAPI, BackgroundTasks, Request
@@ -431,20 +431,25 @@ async def _run_agent_pipeline(
history: list[dict],
session_id: str,
tier_override: str | None = None,
dry_run: bool = False,
no_inference: bool = False,
tier_capture: list | None = None,
) -> AsyncGenerator[str, None]:
"""Core pipeline: pre-flight → routing → inference. Yields text chunks.
tier_override: "light" | "medium" | "complex" | None (auto-route)
dry_run: if True and tier=complex, log tier=complex but use medium model (avoids API cost)
no_inference: if True, routing decision is still made but inference is skipped — yields "I don't know" immediately
Caller is responsible for scheduling _store_memory after consuming all chunks.
"""
async with _reply_semaphore:
async with (nullcontext() if no_inference else _reply_semaphore):
t0 = time.monotonic()
clean_message = message
print(f"[agent] running: {clean_message[:80]!r}", flush=True)
# Fetch URL content, memories, and fast-tool context concurrently
# Skip preflight IO in no_inference mode — only routing decision needed
if no_inference:
url_context = memories = fast_context = None
else:
url_context, memories, fast_context = await asyncio.gather(
_fetch_urls_from_message(clean_message),
_retrieve_memories(clean_message, session_id),
@@ -470,7 +475,7 @@ async def _run_agent_pipeline(
try:
# Short-circuit: fast tool already has the answer
if fast_context and tier_override is None and not url_context:
if fast_context and tier_override is None and not url_context and not no_inference:
tier = "fast"
final_text = fast_context
llm_elapsed = time.monotonic() - t0
@@ -484,23 +489,22 @@ async def _run_agent_pipeline(
tier = tier_override
light_reply = None
if tier_override == "light":
tier, light_reply = await router.route(clean_message, enriched_history)
tier, light_reply = await router.route(clean_message, enriched_history, no_inference=no_inference)
tier = "light"
else:
tier, light_reply = await router.route(clean_message, enriched_history)
tier, light_reply = await router.route(clean_message, enriched_history, no_inference=no_inference)
if url_context and tier == "light":
tier = "medium"
light_reply = None
print("[agent] URL in message → upgraded light→medium", flush=True)
# Dry-run: log as complex but infer with medium (no remote API call)
effective_tier = tier
if dry_run and tier == "complex":
effective_tier = "medium"
print(f"[agent] tier=complex (dry-run) → using medium model, message={clean_message[:60]!r}", flush=True)
else:
print(f"[agent] tier={tier} message={clean_message[:60]!r}", flush=True)
tier = effective_tier
if tier_capture is not None:
tier_capture.append(tier)
if no_inference:
yield "I don't know"
return
if tier == "light":
final_text = light_reply
@@ -591,16 +595,15 @@ async def run_agent_task(
t0 = time.monotonic()
meta = metadata or {}
dry_run = bool(meta.get("dry_run", False))
no_inference = bool(meta.get("no_inference", False))
is_benchmark = bool(meta.get("benchmark", False))
history = _conversation_buffers.get(session_id, [])
final_text = None
actual_tier = "unknown"
tier_capture: list = []
# Patch pipeline to capture tier for logging
# We read it from logs post-hoc; capture via a wrapper
async for chunk in _run_agent_pipeline(message, history, session_id, dry_run=dry_run):
async for chunk in _run_agent_pipeline(message, history, session_id, no_inference=no_inference, tier_capture=tier_capture):
await _push_stream_chunk(session_id, chunk)
if final_text is None:
final_text = chunk
@@ -608,6 +611,7 @@ async def run_agent_task(
final_text += chunk
await _end_stream(session_id)
actual_tier = tier_capture[0] if tier_capture else "unknown"
elapsed_ms = int((time.monotonic() - t0) * 1000)
@@ -621,8 +625,8 @@ async def run_agent_task(
except Exception as e:
print(f"[agent] delivery error (non-fatal): {e}", flush=True)
print(f"[agent] replied in {elapsed_ms / 1000:.1f}s", flush=True)
print(f"[agent] reply_text: {final_text[:200]}", flush=True)
print(f"[agent] replied in {elapsed_ms / 1000:.1f}s tier={actual_tier}", flush=True)
print(f"[agent] reply_text: {final_text}", flush=True)
# Update conversation buffer
buf = _conversation_buffers.get(session_id, [])

137
benchmarks/benchmark.json Normal file
View File

@@ -0,0 +1,137 @@
{
"description": "Adolf routing benchmark — домашние сценарии, Alexa/Google-Home стиль, русский язык",
"tiers": {
"light": "Приветствия, прощания, подтверждения, простые разговорные фразы. Не требуют поиска или действий.",
"medium": "Управление домом, погода/пробки, таймеры, напоминания, покупки, личная память, быстрые вопросы.",
"complex": "Глубокое исследование, сравнение технологий, подробные руководства с несколькими источниками."
},
"queries": [
{"id": 1, "tier": "light", "category": "greetings", "query": "привет"},
{"id": 2, "tier": "light", "category": "greetings", "query": "пока"},
{"id": 3, "tier": "light", "category": "greetings", "query": "спасибо"},
{"id": 4, "tier": "light", "category": "greetings", "query": "привет, как дела?"},
{"id": 5, "tier": "light", "category": "greetings", "query": "окей"},
{"id": 6, "tier": "light", "category": "greetings", "query": "добрый вечер"},
{"id": 7, "tier": "light", "category": "greetings", "query": "доброе утро"},
{"id": 8, "tier": "light", "category": "greetings", "query": "добрый день"},
{"id": 9, "tier": "light", "category": "greetings", "query": "hi"},
{"id": 10, "tier": "light", "category": "greetings", "query": "thanks"},
{"id": 11, "tier": "light", "category": "greetings", "query": "отлично, спасибо"},
{"id": 12, "tier": "light", "category": "greetings", "query": "понятно"},
{"id": 13, "tier": "light", "category": "greetings", "query": "ясно"},
{"id": 14, "tier": "light", "category": "greetings", "query": "ладно"},
{"id": 15, "tier": "light", "category": "greetings", "query": "договорились"},
{"id": 16, "tier": "light", "category": "greetings", "query": "good morning"},
{"id": 17, "tier": "light", "category": "greetings", "query": "good night"},
{"id": 18, "tier": "light", "category": "greetings", "query": "всё понятно"},
{"id": 19, "tier": "light", "category": "greetings", "query": "да"},
{"id": 20, "tier": "light", "category": "greetings", "query": "нет"},
{"id": 21, "tier": "light", "category": "greetings", "query": "не нужно"},
{"id": 22, "tier": "light", "category": "greetings", "query": "отмена"},
{"id": 23, "tier": "light", "category": "greetings", "query": "стоп"},
{"id": 24, "tier": "light", "category": "greetings", "query": "подожди"},
{"id": 25, "tier": "light", "category": "greetings", "query": "повтори"},
{"id": 26, "tier": "light", "category": "greetings", "query": "ты тут?"},
{"id": 27, "tier": "light", "category": "greetings", "query": "слышишь меня?"},
{"id": 28, "tier": "light", "category": "greetings", "query": "всё ок"},
{"id": 29, "tier": "light", "category": "greetings", "query": "хорошо"},
{"id": 30, "tier": "light", "category": "greetings", "query": "пожалуйста"},
{"id": 31, "tier": "medium", "category": "weather_commute", "query": "какая сегодня погода в Балашихе"},
{"id": 32, "tier": "medium", "category": "weather_commute", "query": "пойдет ли сегодня дождь"},
{"id": 33, "tier": "medium", "category": "weather_commute", "query": "какая температура на улице сейчас"},
{"id": 34, "tier": "medium", "category": "weather_commute", "query": "будет ли снег сегодня"},
{"id": 35, "tier": "medium", "category": "weather_commute", "query": "погода на завтра"},
{"id": 36, "tier": "medium", "category": "weather_commute", "query": "сколько ехать до Москвы сейчас"},
{"id": 37, "tier": "medium", "category": "weather_commute", "query": "какие пробки на дороге до Москвы"},
{"id": 38, "tier": "medium", "category": "weather_commute", "query": "время в пути на работу"},
{"id": 39, "tier": "medium", "category": "weather_commute", "query": "есть ли пробки сейчас"},
{"id": 40, "tier": "medium", "category": "weather_commute", "query": "стоит ли брать зонтик"},
{"id": 41, "tier": "medium", "category": "smart_home_control", "query": "включи свет в гостиной"},
{"id": 42, "tier": "medium", "category": "smart_home_control", "query": "выключи свет на кухне"},
{"id": 43, "tier": "medium", "category": "smart_home_control", "query": "какая температура дома"},
{"id": 44, "tier": "medium", "category": "smart_home_control", "query": "установи температуру 22 градуса"},
{"id": 45, "tier": "medium", "category": "smart_home_control", "query": "включи свет в спальне на 50 процентов"},
{"id": 46, "tier": "medium", "category": "smart_home_control", "query": "выключи все лампочки"},
{"id": 47, "tier": "medium", "category": "smart_home_control", "query": "какие устройства сейчас включены"},
{"id": 48, "tier": "medium", "category": "smart_home_control", "query": "закрыты ли все окна"},
{"id": 49, "tier": "medium", "category": "smart_home_control", "query": "включи вентилятор в детской"},
{"id": 50, "tier": "medium", "category": "smart_home_control", "query": "есть ли кто-нибудь дома"},
{"id": 51, "tier": "medium", "category": "smart_home_control", "query": "включи ночной режим"},
{"id": 52, "tier": "medium", "category": "smart_home_control", "query": "какое потребление электричества сегодня"},
{"id": 53, "tier": "medium", "category": "smart_home_control", "query": "выключи телевизор"},
{"id": 54, "tier": "medium", "category": "smart_home_control", "query": "открой шторы в гостиной"},
{"id": 55, "tier": "medium", "category": "smart_home_control", "query": "установи будильник на 7 утра"},
{"id": 56, "tier": "medium", "category": "smart_home_control", "query": "включи кофемашину"},
{"id": 57, "tier": "medium", "category": "smart_home_control", "query": "выключи свет во всём доме"},
{"id": 58, "tier": "medium", "category": "smart_home_control", "query": "сколько у нас датчиков движения"},
{"id": 59, "tier": "medium", "category": "smart_home_control", "query": "состояние всех дверных замков"},
{"id": 60, "tier": "medium", "category": "smart_home_control", "query": "включи режим кино в гостиной"},
{"id": 61, "tier": "medium", "category": "smart_home_control", "query": "прибавь яркость в детской"},
{"id": 62, "tier": "medium", "category": "smart_home_control", "query": "закрой все шторы"},
{"id": 63, "tier": "medium", "category": "smart_home_control", "query": "кто последний открывал входную дверь"},
{"id": 64, "tier": "medium", "category": "smart_home_control", "query": "заблокируй входную дверь"},
{"id": 65, "tier": "medium", "category": "smart_home_control", "query": "покажи камеру у входа"},
{"id": 66, "tier": "medium", "category": "timers_reminders", "query": "поставь таймер на 10 минут"},
{"id": 67, "tier": "medium", "category": "timers_reminders", "query": "напомни мне позвонить врачу в 15:00"},
{"id": 68, "tier": "medium", "category": "timers_reminders", "query": "поставь будильник на завтра в 6:30"},
{"id": 69, "tier": "medium", "category": "timers_reminders", "query": "напомни выключить плиту через 20 минут"},
{"id": 70, "tier": "medium", "category": "timers_reminders", "query": "сколько времени осталось на таймере"},
{"id": 71, "tier": "medium", "category": "shopping_cooking", "query": "добавь молоко в список покупок"},
{"id": 72, "tier": "medium", "category": "shopping_cooking", "query": "что есть в списке покупок"},
{"id": 73, "tier": "medium", "category": "shopping_cooking", "query": "добавь хлеб и яйца в список покупок"},
{"id": 74, "tier": "medium", "category": "shopping_cooking", "query": "сколько граммов муки нужно для блинов на 4 человека"},
{"id": 75, "tier": "medium", "category": "shopping_cooking", "query": "какой рецепт борща ты знаешь"},
{"id": 76, "tier": "medium", "category": "personal_memory", "query": "как меня зовут"},
{"id": 77, "tier": "medium", "category": "personal_memory", "query": "где я живу"},
{"id": 78, "tier": "medium", "category": "personal_memory", "query": "что мы обсуждали в прошлый раз"},
{"id": 79, "tier": "medium", "category": "personal_memory", "query": "что ты знаешь о моем домашнем сервере"},
{"id": 80, "tier": "medium", "category": "personal_memory", "query": "напомни, какие сервисы я запускаю"},
{"id": 81, "tier": "medium", "category": "personal_memory", "query": "что я говорил о своей сети"},
{"id": 82, "tier": "medium", "category": "personal_memory", "query": "что я просил тебя запомнить"},
{"id": 83, "tier": "medium", "category": "quick_info", "query": "какой сейчас курс биткоина"},
{"id": 84, "tier": "medium", "category": "quick_info", "query": "курс доллара к рублю сейчас"},
{"id": 85, "tier": "medium", "category": "quick_info", "query": "есть ли проблемы у Cloudflare сегодня"},
{"id": 86, "tier": "medium", "category": "quick_info", "query": "какая последняя версия Docker"},
{"id": 87, "tier": "medium", "category": "quick_info", "query": "какие новые функции в Home Assistant 2024"},
{"id": 88, "tier": "medium", "category": "quick_info", "query": "как проверить использование диска в Linux"},
{"id": 89, "tier": "medium", "category": "quick_info", "query": "как перезапустить Docker контейнер"},
{"id": 90, "tier": "medium", "category": "quick_info", "query": "как посмотреть логи Docker контейнера"},
{"id": 91, "tier": "complex", "category": "infrastructure", "query": "исследуй и сравни Proxmox, Unraid и TrueNAS для домашней лаборатории"},
{"id": 92, "tier": "complex", "category": "infrastructure", "query": "напиши подробное руководство по безопасности домашнего сервера, подключенного к интернету"},
{"id": 93, "tier": "complex", "category": "infrastructure", "query": "исследуй все доступные дашборды для самохостинга и сравни их функции"},
{"id": 94, "tier": "complex", "category": "infrastructure", "query": "исследуй лучший стек мониторинга для самохостинга в 2024 году со всеми вариантами"},
{"id": 95, "tier": "complex", "category": "infrastructure", "query": "сравни все системы резервного копирования для Linux: Restic, Borg, Duplicati, Timeshift"},
{"id": 96, "tier": "complex", "category": "infrastructure", "query": "напиши полное руководство по настройке обратного прокси Caddy для домашнего сервера с SSL"},
{"id": 97, "tier": "complex", "category": "network", "query": "исследуй и сравни WireGuard, OpenVPN и Tailscale для домашней VPN с детальными плюсами и минусами"},
{"id": 98, "tier": "complex", "category": "network", "query": "исследуй лучшие практики сегментации домашней сети с VLAN и правилами файрвола"},
{"id": 99, "tier": "complex", "category": "network", "query": "изучи все самохостируемые DNS решения и их возможности"},
{"id": 100, "tier": "complex", "category": "network", "query": "исследуй лучшие самохостируемые системы мониторинга сети: Zabbix, Grafana, Prometheus, Netdata"},
{"id": 101, "tier": "complex", "category": "home_assistant", "query": "исследуй и сравни все платформы умного дома: Home Assistant, OpenHAB и Domoticz"},
{"id": 102, "tier": "complex", "category": "home_assistant", "query": "изучи лучшие Zigbee координаторы и их совместимость с Home Assistant в 2024 году"},
{"id": 103, "tier": "complex", "category": "home_assistant", "query": "напиши детальный отчет о поддержке протокола Matter и совместимых устройствах"},
{"id": 104, "tier": "complex", "category": "home_assistant", "query": "исследуй все способы интеграции умных ламп с Home Assistant: Zigbee, WiFi, Bluetooth"},
{"id": 105, "tier": "complex", "category": "home_assistant", "query": "найди и сравни все варианты датчиков движения для умного дома с оценками и ценами"},
{"id": 106, "tier": "complex", "category": "home_assistant", "query": "напиши подробное руководство по настройке автоматизаций в Home Assistant для умного освещения"},
{"id": 107, "tier": "complex", "category": "home_assistant", "query": "исследуй все варианты голосового управления умным домом на русском языке, включая локальные решения"},
{"id": 108, "tier": "complex", "category": "home_assistant", "query": "исследуй все протоколы умного дома и их плюсы и минусы: Zigbee, Z-Wave, WiFi, Thread, Bluetooth"},
{"id": 109, "tier": "complex", "category": "media_files", "query": "исследуй и сравни все самохостируемые решения для хранения фотографий с детальным сравнением функций"},
{"id": 110, "tier": "complex", "category": "media_files", "query": "изучи лучшие самохостируемые медиасерверы: Jellyfin, Plex и Emby — с характеристиками и отзывами"},
{"id": 111, "tier": "complex", "category": "media_files", "query": "сравни все самохостируемые облачные хранилища: Nextcloud, Seafile, Owncloud — производительность и функции"},
{"id": 112, "tier": "complex", "category": "research", "query": "исследуй последние достижения в локальном LLM инференсе и оборудовании для него"},
{"id": 113, "tier": "complex", "category": "research", "query": "изучи лучшие опенсорс альтернативы Google сервисов для приватного домашнего окружения"},
{"id": 114, "tier": "complex", "category": "research", "query": "изучи все варианты локального запуска языковых моделей на видеокарте 8 ГБ VRAM"},
{"id": 115, "tier": "complex", "category": "research", "query": "найди и сравни все фреймворки для создания локальных AI ассистентов с открытым исходным кодом"},
{"id": 116, "tier": "complex", "category": "research", "query": "изучи все доступные локальные ассистенты с голосовым управлением на русском языке"},
{"id": 117, "tier": "complex", "category": "infrastructure", "query": "изучи свежие CVE и уязвимости в популярном самохостируемом ПО: Gitea, Nextcloud, Jellyfin"},
{"id": 118, "tier": "complex", "category": "infrastructure", "query": "напиши детальное сравнение систем управления конфигурацией: Ansible, Salt, Puppet для домашнего окружения"},
{"id": 119, "tier": "complex", "category": "network", "query": "исследуй все самохостируемые решения для блокировки рекламы: Pi-hole, AdGuard Home, NextDNS"},
{"id": 120, "tier": "complex", "category": "research", "query": "напиши подробный отчет о технологиях синтеза речи с открытым исходным кодом на русском языке"}
]
}

View File

@@ -11,7 +11,7 @@ Usage:
python3 run_benchmark.py --category <name>
python3 run_benchmark.py --ids 1,2,3
python3 run_benchmark.py --list-categories
python3 run_benchmark.py --dry-run # complex queries use medium model (no API cost)
python3 run_benchmark.py --no-inference # skip all LLM inference — routing decisions only, all tiers
IMPORTANT: Always check GPU is free before running. This script does it automatically.
@@ -121,10 +121,10 @@ def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
before_lines = set(logs_before.splitlines())
new_lines = [l for l in logs_after.splitlines() if l not in before_lines]
for line in new_lines:
m = re.search(r"tier=(\w+(?:\s*\(dry-run\))?)", line)
m = re.search(r"tier=(\w+(?:\s*\(no-inference\))?)", line)
if m:
tier_raw = m.group(1)
# Normalise: "complex (dry-run)" → "complex"
# Normalise: "complex (no-inference)" → "complex"
return tier_raw.split()[0]
return None
@@ -135,14 +135,14 @@ async def post_message(
client: httpx.AsyncClient,
query_id: int,
query: str,
dry_run: bool = False,
no_inference: bool = False,
) -> bool:
payload = {
"text": query,
"session_id": f"benchmark-{query_id}",
"channel": "cli",
"user_id": "benchmark",
"metadata": {"dry_run": dry_run, "benchmark": True},
"metadata": {"no_inference": no_inference, "benchmark": True},
}
try:
r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
@@ -172,7 +172,7 @@ def filter_queries(queries, tier, category, ids):
# ── Main run ───────────────────────────────────────────────────────────────────
async def run(queries: list[dict], dry_run: bool = False) -> list[dict]:
async def run(queries: list[dict], no_inference: bool = False) -> list[dict]:
results = []
async with httpx.AsyncClient() as client:
@@ -186,7 +186,7 @@ async def run(queries: list[dict], dry_run: bool = False) -> list[dict]:
total = len(queries)
correct = 0
dry_label = " [DRY-RUN: complex→medium]" if dry_run else ""
dry_label = " [NO-INFERENCE: routing only]" if no_inference else ""
print(f"\nRunning {total} queries{dry_label}\n")
print(f"{'ID':>3} {'EXPECTED':8} {'ACTUAL':8} {'OK':3} {'TIME':6} {'CATEGORY':22} QUERY")
print("" * 110)
@@ -197,8 +197,6 @@ async def run(queries: list[dict], dry_run: bool = False) -> list[dict]:
category = q["category"]
query_text = q["query"]
# In dry-run, complex queries still use complex classification (logged), but medium infers
send_dry = dry_run and expected == "complex"
session_id = f"benchmark-{qid}"
print(f"{qid:>3} {expected:8} ", end="", flush=True)
@@ -206,7 +204,7 @@ async def run(queries: list[dict], dry_run: bool = False) -> list[dict]:
logs_before = get_log_tail(300)
t0 = time.monotonic()
ok_post = await post_message(client, qid, query_text, dry_run=send_dry)
ok_post = await post_message(client, qid, query_text, no_inference=no_inference)
if not ok_post:
print(f"{'?':8} {'ERR':3} {'?':6} {category:22} {query_text[:40]}")
results.append({"id": qid, "expected": expected, "actual": None, "ok": False})
@@ -245,7 +243,7 @@ async def run(queries: list[dict], dry_run: bool = False) -> list[dict]:
"elapsed": round(elapsed, 1),
"category": category,
"query": query_text,
"dry_run": send_dry,
"no_inference": no_inference,
})
print("" * 110)
@@ -281,9 +279,9 @@ def main():
parser.add_argument("--ids", help="Comma-separated IDs")
parser.add_argument("--list-categories", action="store_true")
parser.add_argument(
"--dry-run",
"--no-inference",
action="store_true",
help="For complex queries: route classification is tested but medium model is used for inference (no API cost)",
help="Skip LLM inference for all tiers — only routing decisions are tested (no GPU/API cost)",
)
parser.add_argument(
"--skip-gpu-check",
@@ -302,7 +300,7 @@ def main():
return
# ALWAYS check GPU and RAM before running
if not preflight_checks(skip_gpu_check=args.skip_gpu_check):
if not preflight_checks(skip_gpu_check=args.no_inference):
sys.exit(1)
ids = [int(i) for i in args.ids.split(",")] if args.ids else None
@@ -311,7 +309,7 @@ def main():
print("No queries match filters.")
sys.exit(1)
asyncio.run(run(queries, dry_run=args.dry_run))
asyncio.run(run(queries, no_inference=args.no_inference))
if __name__ == "__main__":

View File

@@ -0,0 +1,218 @@
#!/usr/bin/env python3
"""
Adolf routing benchmark — tests routing decisions only, no LLM inference.
Sends each query with no_inference=True, waits for the routing decision to
appear in docker logs, and records whether the correct tier was selected.
Usage:
python3 run_routing_benchmark.py [options]
python3 run_routing_benchmark.py --tier light|medium|complex
python3 run_routing_benchmark.py --category <name>
python3 run_routing_benchmark.py --ids 1,2,3
python3 run_routing_benchmark.py --list-categories
No GPU check needed — inference is disabled for all queries.
Adolf must be running at http://localhost:8000.
"""
import argparse
import asyncio
import json
import re
import subprocess
import sys
import time
from pathlib import Path
import httpx
ADOLF_URL = "http://localhost:8000"
DATASET = Path(__file__).parent / "benchmark.json"
RESULTS = Path(__file__).parent / "routing_results_latest.json"
QUERY_TIMEOUT = 1 # 1s strict deadline — routing must decide within 1 second
# ── Log helpers ────────────────────────────────────────────────────────────────
def get_log_tail(n: int = 50) -> str:
result = subprocess.run(
["docker", "logs", "deepagents", "--tail", str(n)],
capture_output=True, text=True,
)
return result.stdout + result.stderr
def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
"""Find new tier= lines that appeared after we sent the query."""
before_lines = set(logs_before.splitlines())
new_lines = [line for line in logs_after.splitlines() if line not in before_lines]
for line in new_lines:
m = re.search(r"tier=(\w+(?:\s*\(no-inference\))?)", line)
if m:
tier_raw = m.group(1)
return tier_raw.split()[0]
return None
# ── Request helpers ────────────────────────────────────────────────────────────
async def post_message(client: httpx.AsyncClient, query_id: int, query: str) -> bool:
payload = {
"text": query,
"session_id": f"routing-bench-{query_id}",
"channel": "cli",
"user_id": "benchmark",
"metadata": {"no_inference": True, "benchmark": True},
}
try:
r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
r.raise_for_status()
return True
except Exception as e:
print(f" POST_ERROR: {e}", end="")
return False
# ── Dataset ────────────────────────────────────────────────────────────────────
def load_dataset() -> list[dict]:
with open(DATASET) as f:
return json.load(f)["queries"]
def filter_queries(queries, tier, category, ids):
if tier:
queries = [q for q in queries if q["tier"] == tier]
if category:
queries = [q for q in queries if q["category"] == category]
if ids:
queries = [q for q in queries if q["id"] in ids]
return queries
# ── Main run ───────────────────────────────────────────────────────────────────
async def run(queries: list[dict]) -> list[dict]:
results = []
async with httpx.AsyncClient() as client:
try:
r = await client.get(f"{ADOLF_URL}/health", timeout=5)
r.raise_for_status()
except Exception as e:
print(f"ERROR: Adolf not reachable: {e}", file=sys.stderr)
sys.exit(1)
total = len(queries)
correct = 0
print(f"\nRunning {total} queries [NO-INFERENCE: routing only]\n")
print(f"{'ID':>3} {'EXPECTED':8} {'ACTUAL':8} {'OK':3} {'TIME':6} {'CATEGORY':22} QUERY")
print("" * 110)
for q in queries:
qid = q["id"]
expected = q["tier"]
category = q["category"]
query_text = q["query"]
session_id = f"routing-bench-{qid}"
print(f"{qid:>3} {expected:8} ", end="", flush=True)
logs_before = get_log_tail(300)
t0 = time.monotonic()
ok_post = await post_message(client, qid, query_text)
if not ok_post:
print(f"{'?':8} {'ERR':3} {'?':6} {category:22} {query_text[:40]}")
results.append({"id": qid, "expected": expected, "actual": None, "ok": False})
continue
try:
async with client.stream(
"GET", f"{ADOLF_URL}/stream/{session_id}", timeout=QUERY_TIMEOUT
) as sse:
async for line in sse.aiter_lines():
if "data: [DONE]" in line:
break
except Exception:
pass # timeout or connection issue — check logs anyway
logs_after = get_log_tail(300)
actual = extract_tier_from_logs(logs_before, logs_after)
if actual is None:
actual = "timeout"
elapsed = time.monotonic() - t0
match = actual == expected or (actual == "fast" and expected == "medium")
if match:
correct += 1
mark = "" if match else ""
actual_str = actual
print(f"{actual_str:8} {mark:3} {elapsed:5.1f}s {category:22} {query_text[:40]}")
results.append({
"id": qid,
"expected": expected,
"actual": actual_str,
"ok": match,
"elapsed": round(elapsed, 1),
"category": category,
"query": query_text,
})
print("" * 110)
accuracy = correct / total * 100 if total else 0
print(f"\nAccuracy: {correct}/{total} ({accuracy:.0f}%)")
for tier_name in ["light", "medium", "complex"]:
tier_qs = [r for r in results if r["expected"] == tier_name]
if tier_qs:
tier_ok = sum(1 for r in tier_qs if r["ok"])
print(f" {tier_name:8}: {tier_ok}/{len(tier_qs)}")
wrong = [r for r in results if not r["ok"]]
if wrong:
print(f"\nMisclassified ({len(wrong)}):")
for r in wrong:
print(f" id={r['id']:3} expected={r['expected']:8} actual={r['actual']:8} {r['query'][:60]}")
with open(RESULTS, "w") as f:
json.dump(results, f, indent=2, ensure_ascii=False)
print(f"\nResults saved to {RESULTS}")
return results
def main():
parser = argparse.ArgumentParser(
description="Adolf routing benchmark — routing decisions only, no LLM inference",
)
parser.add_argument("--tier", choices=["light", "medium", "complex"])
parser.add_argument("--category")
parser.add_argument("--ids", help="Comma-separated IDs")
parser.add_argument("--list-categories", action="store_true")
args = parser.parse_args()
queries = load_dataset()
if args.list_categories:
cats = sorted(set(q["category"] for q in queries))
tiers = {t: sum(1 for q in queries if q["tier"] == t) for t in ["light", "medium", "complex"]}
print(f"Total: {len(queries)} | Tiers: {tiers}")
print(f"Categories: {cats}")
return
ids = [int(i) for i in args.ids.split(",")] if args.ids else None
queries = filter_queries(queries, args.tier, args.category, ids)
if not queries:
print("No queries match filters.")
sys.exit(1)
asyncio.run(run(queries))
if __name__ == "__main__":
main()

View File

@@ -12,7 +12,7 @@ Usage:
python3 run_voice_benchmark.py [options]
python3 run_voice_benchmark.py --tier light|medium|complex
python3 run_voice_benchmark.py --ids 1,2,3
python3 run_voice_benchmark.py --dry-run # complex queries use medium model
python3 run_voice_benchmark.py --no-inference # skip LLM inference — routing only, all tiers
IMPORTANT: Always check GPU is free before running. Done automatically.
@@ -210,9 +210,9 @@ def get_log_tail(n: int = 60) -> str:
def extract_tier_from_logs(logs_before: str, logs_after: str) -> str | None:
before_lines = set(logs_before.splitlines())
new_lines = [l for l in logs_after.splitlines() if l not in before_lines]
for line in reversed(new_lines):
m = re.search(r"tier=(\w+(?:\s*\(dry-run\))?)", line)
new_lines = [line for line in logs_after.splitlines() if line not in before_lines]
for line in new_lines:
m = re.search(r"tier=(\w+(?:\s*\(no-inference\))?)", line)
if m:
return m.group(1).split()[0]
return None
@@ -222,14 +222,14 @@ async def post_to_adolf(
client: httpx.AsyncClient,
query_id: int,
text: str,
dry_run: bool = False,
no_inference: bool = False,
) -> bool:
payload = {
"text": text,
"session_id": f"voice-bench-{query_id}",
"channel": "cli",
"user_id": "benchmark",
"metadata": {"dry_run": dry_run, "benchmark": True, "voice": True},
"metadata": {"no_inference": no_inference, "benchmark": True, "voice": True},
}
try:
r = await client.post(f"{ADOLF_URL}/message", json=payload, timeout=10)
@@ -259,7 +259,7 @@ def filter_queries(queries, tier, category, ids):
# ── Main run ───────────────────────────────────────────────────────────────────
async def run(queries: list[dict], dry_run: bool = False, save_audio: bool = False) -> None:
async def run(queries: list[dict], no_inference: bool = False, save_audio: bool = False) -> None:
async with httpx.AsyncClient() as client:
# Check Adolf
try:
@@ -272,7 +272,7 @@ async def run(queries: list[dict], dry_run: bool = False, save_audio: bool = Fal
total = len(queries)
results = []
dry_label = " [DRY-RUN]" if dry_run else ""
dry_label = " [NO-INFERENCE: routing only]" if no_inference else ""
print(f"Voice benchmark: {total} queries{dry_label}\n")
print(f"{'ID':>3} {'EXP':8} {'ACT':8} {'OK':3} {'WER':5} {'TRANSCRIPT'}")
print("" * 100)
@@ -312,11 +312,10 @@ async def run(queries: list[dict], dry_run: bool = False, save_audio: bool = Fal
wer_count += 1
# Step 3: Send to Adolf
send_dry = dry_run and expected == "complex"
logs_before = get_log_tail(60)
t0 = time.monotonic()
ok_post = await post_to_adolf(client, qid, transcript, dry_run=send_dry)
ok_post = await post_to_adolf(client, qid, transcript, no_inference=no_inference)
if not ok_post:
print(f"{'?':8} {'ERR':3} {wer:4.2f} {transcript[:50]}")
results.append({"id": qid, "expected": expected, "actual": None, "ok": False, "wer": wer, "transcript": transcript})
@@ -349,7 +348,7 @@ async def run(queries: list[dict], dry_run: bool = False, save_audio: bool = Fal
"original": original,
"transcript": transcript,
"elapsed": round(elapsed, 1),
"dry_run": send_dry,
"no_inference": no_inference,
})
await asyncio.sleep(0.5)
@@ -374,7 +373,7 @@ async def run(queries: list[dict], dry_run: bool = False, save_audio: bool = Fal
if wrong:
print(f"\nMisclassified after voice ({len(wrong)}):")
for r in wrong:
print(f" id={r['id']:3} expected={r.get('expected','?'):8} actual={r.get('actual','?'):8} transcript={r.get('transcript','')[:50]}")
print(f" id={r['id']:3} expected={r.get('expected') or '?':8} actual={r.get('actual') or '?':8} transcript={r.get('transcript','')[:50]}")
high_wer = [r for r in results if r.get("wer") and r["wer"] > 0.3]
if high_wer:
@@ -402,14 +401,14 @@ def main():
parser.add_argument("--tier", choices=["light", "medium", "complex"])
parser.add_argument("--category")
parser.add_argument("--ids", help="Comma-separated IDs")
parser.add_argument("--dry-run", action="store_true",
help="Complex queries use medium model for inference (no API cost)")
parser.add_argument("--no-inference", action="store_true",
help="Skip LLM inference for all tiers — routing decisions only (no GPU/API cost)")
parser.add_argument("--save-audio", action="store_true",
help="Save synthesized WAV files to voice_audio/ directory")
parser.add_argument("--skip-gpu-check", action="store_true")
args = parser.parse_args()
if not preflight_checks(skip_gpu_check=args.skip_gpu_check):
if not preflight_checks(skip_gpu_check=args.skip_gpu_check or args.no_inference):
sys.exit(1)
queries = load_dataset()
@@ -419,7 +418,7 @@ def main():
print("No queries match filters.")
sys.exit(1)
asyncio.run(run(queries, dry_run=args.dry_run, save_audio=args.save_audio))
asyncio.run(run(queries, no_inference=args.no_inference, save_audio=args.save_audio))
if __name__ == "__main__":

View File

@@ -52,6 +52,17 @@ _LIGHT_PATTERNS = re.compile(
r"|окей|хорошо|отлично|понятно|ок|ладно|договорились|спс|благодарю"
r"|пожалуйста|не за что|всё понятно|ясно"
r"|как дела|как ты|как жизнь|всё хорошо|всё ок"
# Assistant control words / confirmations
r"|да|нет|стоп|отмена|отменить|подожди|повтори|повторить|не нужно|не надо"
r"|слышишь\s+меня|ты\s+тут|отлично[,!]?\s+спасибо"
r"|yes|no|stop|cancel|wait|repeat"
# Russian tech definitions — static knowledge (no tools needed)
r"|что\s+такое\s+\S+"
r"|что\s+означает\s+\S+"
r"|сколько\s+(?:бит|байт|байтов|мегабайт|мегабайтов|гигабайт|гигабайтов)(?:\s+\w+)*"
# Compound Russian greetings
r"|привет[,!]?\s+как\s+дела"
r"|добрый\s+(?:день|вечер|утро)[,!]?\s+как\s+дела"
r")[\s!.?]*$",
re.IGNORECASE,
)
@@ -314,6 +325,10 @@ _MEDIUM_PATTERNS = re.compile(
r"|курс (?:доллара|биткоина|евро|рубл)"
r"|(?:последние |свежие )?новости\b"
r"|(?:погода|температура)\s+(?:на завтра|на неделю)"
# Smart home commands that don't use verb-first pattern
r"|(?:свет|лампочк|освещени)\w*\s+(?:включ|выключ|убавь|прибавь)"
r"|(?:дома|в доме|по всему дому)\s+(?:свет|лампочк)"
r"|(?:режим|сцена)\s+(?:ночной|утренний|вечерний|кинотеатр)"
r")",
re.IGNORECASE,
)
@@ -411,10 +426,11 @@ class Router:
self,
message: str,
history: list[dict],
no_inference: bool = False,
) -> tuple[str, Optional[str]]:
"""
Returns (tier, reply_or_None).
For light tier: also generates the reply inline.
For light tier: also generates the reply inline (unless no_inference=True).
For medium/complex: reply is None.
"""
if self._fast_tool_runner and self._fast_tool_runner.any_matches(message.strip()):
@@ -424,6 +440,8 @@ class Router:
if _LIGHT_PATTERNS.match(message.strip()):
print("[router] regex→light", flush=True)
if no_inference:
return "light", None
return await self._generate_light_reply(message, history)
if _COMPLEX_PATTERNS.search(message.strip()):
@@ -436,7 +454,7 @@ class Router:
tier = await self._classify_by_embedding(message)
if tier != "light":
if tier != "light" or no_inference:
return tier, None
return await self._generate_light_reply(message, history)

View File

@@ -11,7 +11,7 @@ import urllib.request
# ── config ────────────────────────────────────────────────────────────────────
DEEPAGENTS = "http://localhost:8000"
BIFROST = "http://localhost:8080"
LITELLM = "http://localhost:4000"
OPENMEMORY = "http://localhost:8765"
GRAMMY_HOST = "localhost"
GRAMMY_PORT = 3001
@@ -156,19 +156,6 @@ def fetch_logs(since_s=600):
return []
def fetch_bifrost_logs(since_s=120):
"""Return bifrost container log lines from the last since_s seconds."""
try:
r = subprocess.run(
["docker", "compose", "-f", COMPOSE_FILE, "logs", "bifrost",
f"--since={int(since_s)}s", "--no-log-prefix"],
capture_output=True, text=True, timeout=10,
)
return r.stdout.splitlines()
except Exception:
return []
def parse_run_block(lines, msg_prefix):
"""
Scan log lines for the LAST '[agent] running: <msg_prefix>' block.
@@ -199,14 +186,13 @@ def parse_run_block(lines, msg_prefix):
if txt:
last_ai_text = txt
m = re.search(r"replied in ([\d.]+)s \(llm=([\d.]+)s, send=([\d.]+)s\)", line)
m = re.search(r"replied in ([\d.]+)s(?:\s+tier=(\w+))?", line)
if m:
tier_m = re.search(r"\btier=(\w+)", line)
tier = tier_m.group(1) if tier_m else "unknown"
tier = m.group(2) if m.group(2) else "unknown"
reply_data = {
"reply_total": float(m.group(1)),
"llm": float(m.group(2)),
"send": float(m.group(3)),
"llm": None,
"send": None,
"tier": tier,
"reply_text": last_ai_text,
"memory_s": None,

View File

@@ -6,7 +6,7 @@ Tests:
1. Name store — POST "remember that your name is <RandomName>"
2. Qdrant point — verifies a new vector was written after store
3. Name recall — POST "what is your name?" → reply must contain <RandomName>
4. Bifrost — verifies store/recall requests passed through Bifrost
4. LiteLLM — verifies LiteLLM proxy is reachable (replaced Bifrost)
5. Timing profile — breakdown of store and recall latencies
6. Memory benchmark — store 5 personal facts, recall with 10 questions
7. Dedup test — same fact stored twice must not grow Qdrant by 2 points
@@ -24,11 +24,11 @@ import time
import urllib.request
from common import (
DEEPAGENTS, QDRANT, COMPOSE_FILE, DEFAULT_CHAT_ID,
DEEPAGENTS, LITELLM, QDRANT, COMPOSE_FILE, DEFAULT_CHAT_ID,
NAMES,
INFO, PASS, FAIL, WARN,
report, print_summary, tf,
get, post_json, qdrant_count, fetch_logs, fetch_bifrost_logs,
get, post_json, qdrant_count, fetch_logs,
parse_run_block, wait_for,
)
@@ -155,14 +155,13 @@ if _run_name:
report(results, "Agent replied to recall message", False, "timeout")
report(results, f"Reply contains '{random_name}'", False, "no reply")
# ── 4. Bifrost pass-through check ─────────────────────────────────────────
bifrost_lines = fetch_bifrost_logs(since_s=300)
report(results, "Bifrost container has log output (requests forwarded)",
len(bifrost_lines) > 0, f"{len(bifrost_lines)} lines in bifrost logs")
bifrost_raw = "\n".join(bifrost_lines)
report(results, " Bifrost log shows AsyncOpenAI agent requests",
"AsyncOpenAI" in bifrost_raw,
f"{'found' if 'AsyncOpenAI' in bifrost_raw else 'NOT found'} in bifrost logs")
# ── 4. LiteLLM proxy reachable (replaced Bifrost) ─────────────────────────
try:
status, _ = get(f"{LITELLM}/health", timeout=5)
litellm_ok = status == 200
except Exception:
litellm_ok = False
report(results, "LiteLLM proxy reachable", litellm_ok)
# ── 5. Timing profile ─────────────────────────────────────────────────────
print(f"\n[{INFO}] 5. Timing profile")