Benchmark: complex tier never triggered — 0% accuracy (40 queries) #10

New Issue

alvis · 2026-03-24T01:58:46Z

alvis commented

2026-03-24 01:58:46 +00:00

Problem

All 40 complex-tier queries in the benchmark returned either medium or ? (timeout). 0 complex queries were correctly routed. All complex benchmark queries used dry_run=true.

Root Cause

1. dry_run log format mismatch

In agent.py (_run_agent_pipeline), when dry_run=True and tier=complex, the log line is:

[agent] tier=complex (dry-run) → using medium model

But extract_tier_from_logs regex is tier=(\w+(?:\s*\(dry-run\))?) — it should match complex (dry-run), then normalise to complex via .split()[0]. However if the log line is slightly different or the tail window misses it, extraction fails.

2. _COMPLEX_PATTERNS regex not matching

The complex regex requires (?:^|\s) before the keyword. Example failures:

"исследуй лучший стек мониторинга" → medium ("исследуй" should match but may not)
"изучи лучшие Zigbee координаторы" → medium ("изучи" alone without "все" not in patterns)
"напиши подробный отчет" → medium

The regex uses re.search() so anchoring is not the issue. Likely изучи лучши (without "все") is not in _COMPLEX_PATTERNS.

Fix

Test each complex pattern regex against the actual failing queries
Add изучи лучши (without "все") as a standalone complex trigger
Fix benchmark dry-run logging: emit tier=complex on a separate line so extraction is unambiguous
Add benchmark-specific log marker: [benchmark] session={id} tier={tier} for reliable grep

Impact

0/40 complex queries correctly classified — the entire complex tier is effectively broken in current routing.

## Problem All 40 complex-tier queries in the benchmark returned either `medium` or `?` (timeout). **0 complex queries were correctly routed.** All complex benchmark queries used `dry_run=true`. ## Root Cause ### 1. dry_run log format mismatch In `agent.py` (`_run_agent_pipeline`), when `dry_run=True` and `tier=complex`, the log line is: ``` [agent] tier=complex (dry-run) → using medium model ``` But `extract_tier_from_logs` regex is `tier=(\w+(?:\s*\(dry-run\))?)` — it should match `complex (dry-run)`, then normalise to `complex` via `.split()[0]`. However if the log line is slightly different or the tail window misses it, extraction fails. ### 2. _COMPLEX_PATTERNS regex not matching The complex regex requires `(?:^|\s)` before the keyword. Example failures: - "исследуй лучший стек мониторинга" → medium ("исследуй" should match but may not) - "изучи лучшие Zigbee координаторы" → medium ("изучи" alone without "все" not in patterns) - "напиши подробный отчет" → medium The regex uses `re.search()` so anchoring is not the issue. Likely `изучи лучши` (without "все") is not in `_COMPLEX_PATTERNS`. ## Fix - Test each complex pattern regex against the actual failing queries - Add `изучи лучши` (without "все") as a standalone complex trigger - Fix benchmark dry-run logging: emit `tier=complex` on a separate line so extraction is unambiguous - Add benchmark-specific log marker: `[benchmark] session={id} tier={tier}` for reliable grep ## Impact 0/40 complex queries correctly classified — the entire complex tier is effectively broken in current routing.

alvis referenced this issue

2026-03-24 02:42:27 +00:00

Fix benchmark log extraction: first tier match, increase log tail to 300 #12

alvis referenced this issue from a commit

2026-03-24 02:42:29 +00:00

Fix benchmark log extraction: first tier match, increase log tail to 300

alvis referenced this issue

2026-03-24 02:47:54 +00:00

Fix benchmark log extraction: first tier match, increase log tail to 300 #16

alvis closed this issue

2026-03-24 02:52:44 +00:00

Sign in to join this conversation.