Benchmark: ~50 queries return "?" due to tier= log extraction timeout #7

Closed
opened 2026-03-24 01:57:56 +00:00 by alvis · 0 comments
Owner

Problem

In run_benchmark.py, extract_tier_from_logs() uses an 80-line log tail diff to find tier=<value> log lines. About 50 out of 120 queries return actual: "?" with elapsed=15.4s (the QUERY_TIMEOUT).

Root Cause

The benchmark waits for SSE [DONE] (up to 300s), then reads docker logs deepagents --tail 80. If the container produced many lines during inference (tool calls, memory ops, etc.), the relevant [router] tier=... or [agent] tier=... line scrolls out of the 80-line window.

Also, extract_tier_from_logs() compares against before_lines as a set, so a log line from the current query that is textually identical to a line from an earlier query is treated as already seen and filtered out incorrectly.
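
The set-diff problem can be reproduced in isolation. The sketch below is not the code from run_benchmark.py; the helper name new_lines_by_set_diff and the log lines are hypothetical, and it assumes the log lines carry no timestamp or other per-query prefix.

```python
def new_lines_by_set_diff(before_lines, after_lines):
    """Return lines in after_lines whose text does not occur in before_lines."""
    seen = set(before_lines)
    return [line for line in after_lines if line not in seen]


# Hypothetical log tails: the second query logs exactly the same text
# as the first one did.
before = ["[router] tier=medium", "[agent] reply sent"]
after = before + ["[router] tier=medium", "[agent] reply sent"]

# Every line of the second query is already in the set, so nothing is left
# to extract a tier from.
print(new_lines_by_set_diff(before, after))
```

With these inputs the diff is empty, which is the situation that would leave the tier unclassified.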

Fix

  • Increase --tail from 80 to 200+
  • Use timestamps or sequence numbers to isolate lines, not a set diff
  • Consider logging tier=<value> with a unique session_id=benchmark-{id} marker so extraction can grep by session
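
A minimal sketch of the session-marker idea, assuming the container logs the tier and the marker on the same line. The log line layout, the helper name extract_tier_for_session, and the choice to return the last matching tier are all assumptions, not the existing implementation.

```python
import re

# Assumed format of the tier field: tier=<word>
TIER_RE = re.compile(r"tier=(\w+)")


def extract_tier_for_session(log_lines, session_id):
    """Return the last tier logged for session_id, or "?" if none is found."""
    # The lookahead keeps "benchmark-1" from matching "benchmark-12".
    marker = re.compile(rf"session_id={re.escape(session_id)}(?![\w-])")
    tier = "?"
    for line in log_lines:
        if not marker.search(line):
            continue
        match = TIER_RE.search(line)
        if match:
            tier = match.group(1)
    return tier


# Hypothetical log lines for illustration only.
logs = [
    "[router] tier=light session_id=benchmark-1",
    "[agent] tool call finished session_id=benchmark-1",
    "[router] tier=complex session_id=benchmark-12",
    "[router] tier=light session_id=benchmark-2",
]

print(extract_tier_for_session(logs, "benchmark-12"))
```

Because the lookup is keyed on the session marker rather than on line position or line text, it does not depend on the size of the tail window (as long as the marked line is still in the fetched logs) and is unaffected by identical lines from other queries.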

Impact

50/120 queries are unclassifiable in the latest benchmark run (results_latest.json).

alvis closed this issue 2026-03-24 02:43:26 +00:00

Reference: alvis/adolf#7