Benchmark: complex tier never triggered — 0% accuracy (40 queries) #10

Closed
opened 2026-03-24 01:58:46 +00:00 by alvis · 0 comments
Owner

Problem

All 40 complex-tier queries in the benchmark returned either medium or ? (timeout). 0 complex queries were correctly routed. All complex benchmark queries used dry_run=true.

Root Cause

1. dry_run log format mismatch

In agent.py (_run_agent_pipeline), when dry_run=True and tier=complex, the log line is:

[agent] tier=complex (dry-run) → using medium model

But extract_tier_from_logs regex is tier=(\w+(?:\s*\(dry-run\))?) — it should match complex (dry-run), then normalise to complex via .split()[0]. However if the log line is slightly different or the tail window misses it, extraction fails.

2. _COMPLEX_PATTERNS regex not matching

The complex regex requires (?:^|\s) before the keyword. Example failures:

  • "исследуй лучший стек мониторинга" → medium ("исследуй" should match but may not)
  • "изучи лучшие Zigbee координаторы" → medium ("изучи" alone without "все" not in patterns)
  • "напиши подробный отчет" → medium

The regex uses re.search() so anchoring is not the issue. Likely изучи лучши (without "все") is not in _COMPLEX_PATTERNS.

Fix

  • Test each complex pattern regex against the actual failing queries
  • Add изучи лучши (without "все") as a standalone complex trigger
  • Fix benchmark dry-run logging: emit tier=complex on a separate line so extraction is unambiguous
  • Add benchmark-specific log marker: [benchmark] session={id} tier={tier} for reliable grep

Impact

0/40 complex queries correctly classified — the entire complex tier is effectively broken in current routing.

## Problem All 40 complex-tier queries in the benchmark returned either `medium` or `?` (timeout). **0 complex queries were correctly routed.** All complex benchmark queries used `dry_run=true`. ## Root Cause ### 1. dry_run log format mismatch In `agent.py` (`_run_agent_pipeline`), when `dry_run=True` and `tier=complex`, the log line is: ``` [agent] tier=complex (dry-run) → using medium model ``` But `extract_tier_from_logs` regex is `tier=(\w+(?:\s*\(dry-run\))?)` — it should match `complex (dry-run)`, then normalise to `complex` via `.split()[0]`. However if the log line is slightly different or the tail window misses it, extraction fails. ### 2. _COMPLEX_PATTERNS regex not matching The complex regex requires `(?:^|\s)` before the keyword. Example failures: - "исследуй лучший стек мониторинга" → medium ("исследуй" should match but may not) - "изучи лучшие Zigbee координаторы" → medium ("изучи" alone without "все" not in patterns) - "напиши подробный отчет" → medium The regex uses `re.search()` so anchoring is not the issue. Likely `изучи лучши` (without "все") is not in `_COMPLEX_PATTERNS`. ## Fix - Test each complex pattern regex against the actual failing queries - Add `изучи лучши` (without "все") as a standalone complex trigger - Fix benchmark dry-run logging: emit `tier=complex` on a separate line so extraction is unambiguous - Add benchmark-specific log marker: `[benchmark] session={id} tier={tier}` for reliable grep ## Impact 0/40 complex queries correctly classified — the entire complex tier is effectively broken in current routing.
alvis closed this issue 2026-03-24 02:52:44 +00:00
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: alvis/adolf#10