Benchmark: ~50 queries return "?" due to tier= log extraction timeout #7
Problem
In run_benchmark.py, extract_tier_from_logs() uses an 80-line log tail diff to find tier=<value> log lines. About 50 out of 120 queries return actual: "?" with elapsed=15.4s (the QUERY_TIMEOUT).

Root Cause
The benchmark waits for SSE [DONE] (up to 300s), then reads docker logs deepagents --tail 80. If the container produced many lines during inference (tool calls, memory ops, etc.), the relevant [router] tier=... or [agent] tier=... line scrolls out of the 80-line window.

Also, extract_tier_from_logs compares before_lines as a set, so duplicate log lines from different queries will be filtered out incorrectly.

Fix
- Increase --tail from 80 to 200+
- Log tier=<value> with a unique session_id=benchmark-{id} marker so extraction can grep by session

Impact
50/120 queries unclassifiable in latest benchmark run (results_latest.json)
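
For reference, a minimal sketch of the two extraction strategies discussed above. This is not the code in run_benchmark.py: the function names extract_tier_set_diff and extract_tier_by_session are hypothetical, and the log line layout is assumed. Only the tier=<value> token and the proposed session_id=benchmark-{id} marker come from this issue. The first function mimics a set-based diff to show how a repeated line is lost. The second looks lines up by session marker, so it does not depend on what the previous query logged.

```python
import re
from typing import List, Optional

# Matches the tier value in lines such as "[router] tier=medium" (layout assumed).
TIER_RE = re.compile(r"\btier=(\S+)")


def extract_tier_set_diff(before: List[str], after: List[str]) -> Optional[str]:
    """Set-based diff: any line whose text was already seen is discarded."""
    seen = set(before)
    for line in after:
        if line in seen:
            continue  # an identical line from an earlier query hides this one
        match = TIER_RE.search(line)
        if match:
            return match.group(1)
    return None


def extract_tier_by_session(log_text: str, session_id: str) -> Optional[str]:
    """Return the tier from the last log line tagged with this session marker."""
    # (?!\S) keeps "benchmark-1" from also matching "benchmark-12".
    marker = re.compile(r"\bsession_id=" + re.escape(session_id) + r"(?!\S)")
    tier = None
    for line in log_text.splitlines():
        if marker.search(line):
            match = TIER_RE.search(line)
            if match:
                tier = match.group(1)
    return tier
```

With before = ["[router] tier=medium"] and a second query that logs the same line again, the set-based version finds no new tier line and returns None, which the benchmark would report as "?". The session-based version would take the full text of docker logs deepagents as log_text, so a larger --tail only needs to cover one query's own output.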