Benchmark: ~50 queries return "?" due to tier= log extraction timeout #7

alvis · 2026-03-24T01:57:56Z

Problem

In run_benchmark.py, extract_tier_from_logs() diffs an 80-line log tail to find tier=<value> log lines. About 50 of 120 queries return actual: "?" with elapsed=15.4s (the QUERY_TIMEOUT).

Root Cause

The benchmark waits for SSE [DONE] (up to 300s), then reads docker logs deepagents --tail 80. If the container produced many lines during inference (tool calls, memory ops, etc.), the relevant [router] tier=... or [agent] tier=... line scrolls out of the 80-line window.
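The scroll-out effect can be reproduced without Docker. The following is a minimal sketch under assumptions: the tier value `medium`, the `tool_call step=` lines, and the count of 100 follow-up lines are illustrative and not taken from real container output.

```python
def tail(lines, n):
    """Return the last n lines, mimicking what a --tail n window would show."""
    return lines[-n:] if n > 0 else []


def has_tier(lines):
    """True if any line in the window carries a tier= field."""
    return any("tier=" in line for line in lines)


# Hypothetical output for one query: the tier line is logged first,
# followed by 100 lines of tool-call noise (illustrative format).
query_log = ["[router] tier=medium"] + [
    f"[agent] tool_call step={i}" for i in range(100)
]

print(has_tier(tail(query_log, 80)))   # tier line is outside the window
print(has_tier(tail(query_log, 300)))  # whole query fits in the window
```

A larger window only moves the threshold: a query that logs more lines than the window holds would hit the same problem.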

Also, extract_tier_from_logs compares before_lines as a set, so a log line from a later query that is textually identical to one already in before_lines is treated as already seen and dropped.
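A small reproduction of the set problem, assuming the diff works roughly like `set_diff` below (the real body of extract_tier_from_logs is not shown in this issue). `multiset_diff` is a hypothetical alternative that counts occurrences instead; the tier value `light` is illustrative.

```python
from collections import Counter


def set_diff(before_lines, after_lines):
    """Set-based diff: keep lines whose text was never seen before."""
    seen = set(before_lines)
    return [line for line in after_lines if line not in seen]


def multiset_diff(before_lines, after_lines):
    """Count-based diff: keep occurrences beyond those already present."""
    remaining = Counter(before_lines)
    new_lines = []
    for line in after_lines:
        if remaining[line] > 0:
            remaining[line] -= 1  # this occurrence existed in the snapshot
        else:
            new_lines.append(line)
    return new_lines


before = ["[router] tier=light"]                        # logged by query 1
after = ["[router] tier=light", "[router] tier=light"]  # query 2 logs same text

print(set_diff(before, after))       # the new line is lost
print(multiset_diff(before, after))  # the new line survives
```

The count-based diff still depends on both snapshots covering the same lines, so it does not help once the tail window has scrolled.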

Fix

- Increase --tail from 80 to 200+
- Use timestamps or sequence numbers to isolate lines, not a set diff
- Consider logging tier=<value> with a unique session_id=benchmark-{id} marker so extraction can grep by session
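The session-marker idea could look something like the sketch below. It assumes each tier line carries a `session_id=benchmark-<id>` field on the same line; that log format, the helper name `extract_tier_for_session`, and the sample tier values are hypothetical, not the current behavior of the container.

```python
import re

# Matches the tier field anywhere on a line, e.g. "tier=light".
TIER_RE = re.compile(r"\btier=(\S+)")


def extract_tier_for_session(log_text, query_id):
    """Return the first tier logged for this benchmark query, or "?"."""
    # The lookahead keeps id 1 from matching inside benchmark-12.
    marker = re.compile(
        rf"\bsession_id=benchmark-{re.escape(str(query_id))}(?![\w-])"
    )
    for line in log_text.splitlines():
        if marker.search(line):
            match = TIER_RE.search(line)
            if match:
                return match.group(1)  # first tier match wins
    return "?"


# Hypothetical log excerpt with interleaved queries.
log = """\
[router] session_id=benchmark-12 tier=complex
[agent] session_id=benchmark-1 tier=light
[agent] session_id=benchmark-1 tool_call=search
"""

print(extract_tier_for_session(log, 1))
print(extract_tier_for_session(log, 12))
print(extract_tier_for_session(log, 3))
```

Because the lookup is keyed by session rather than by line position, it does not depend on how many other lines were logged between queries, as long as the marked line is still within the fetched log range.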

Impact

50/120 queries unclassifiable in latest benchmark run (results_latest.json)

alvis referenced a pull request that will close this issue

2026-03-24 02:42:27 +00:00

Fix benchmark log extraction: first tier match, increase log tail to 300 #12

alvis referenced this issue from a commit

2026-03-24 02:42:29 +00:00

Fix benchmark log extraction: first tier match, increase log tail to 300

alvis closed this issue

2026-03-24 02:43:26 +00:00

alvis referenced a pull request that will close this issue

2026-03-24 02:47:54 +00:00

Fix benchmark log extraction: first tier match, increase log tail to 300 #16
