Benchmark: light tier over-classified as medium (tech definition queries) #8

Closed
opened 2026-03-24 01:58:05 +00:00 by alvis · 0 comments
Owner

Problem

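Queries that should be light (mostly static tech definitions) are routed to medium:

  • "что такое API" ("what is an API") → medium
  • "сколько бит в байте" ("how many bits are in a byte") → medium
  • "привет, как дела?" ("hi, how are you?") → medium
  • "что такое брандмауэр" ("what is a firewall") → medium
  • "что такое Z-Wave" ("what is Z-Wave") → medium
  • "что такое Matter протокол" ("what is the Matter protocol") → medium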

Root Cause

The _LIGHT_PATTERNS regex only matches exact greetings/acks. Everything else falls through to the semantic embedder. The embedder centroids overlap for short Russian tech definitions: these queries embed close to medium-tier queries because both are short question forms.

Also, _LIGHT_PATTERNS is anchored with ^...$, but the input text can include trailing punctuation (?, !) that the [\s!.?]*$ suffix may not always match.
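
A minimal sketch of how a fully anchored greeting pattern behaves. The pattern below is a hypothetical stand-in (the real _LIGHT_PATTERNS is not shown in this issue), and the names GREETING_ONLY and matches_light are made up for illustration:

```python
import re

# Hypothetical stand-in for _LIGHT_PATTERNS: an exact greeting/ack,
# followed only by whitespace or the characters ! . ?
GREETING_ONLY = re.compile(r"^(?:привет|спасибо|ок)[\s!.?]*$", re.IGNORECASE)


def matches_light(text: str) -> bool:
    """Return True if the whole text is a bare greeting/ack."""
    return GREETING_ONLY.match(text.strip()) is not None


# "привет!" matches; "привет, как дела?" does not, because the comma and
# the following words are outside the allowed suffix; "привет…" does not
# either, because the ellipsis character is not in [\s!.?].
for query in ["привет!", "привет, как дела?", "привет…", "что такое API"]:
    print(query, matches_light(query))
```

Under this assumption, anything beyond a bare greeting falls through to the embedder, which is consistent with "привет, как дела?" ending up in medium.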

Fix

  • Strengthen _LIGHT_PATTERNS to catch что такое <term> ("what is <term>") and что означает <term> ("what does <term> mean") patterns directly
  • Add more Russian tech-definition utterances to _LIGHT_UTTERANCES to pull the light centroid closer to these queries
  • Consider a separate regex pass for сколько <unit> в <unit> ("how many <unit> in <unit>") patterns
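
The regex part of this fix could look roughly like the sketch below. The pattern names, the helper is_light_by_regex, and the three-token cap on <term> are assumptions for illustration, not the project's actual code:

```python
import re

# Hypothetical patterns, not the project's actual _LIGHT_PATTERNS.
# The term is capped at three whitespace-separated tokens so that long,
# open-ended questions still fall through to the embedder.
DEFINITION_RE = re.compile(
    r"^\s*что\s+(?:такое|означает)\s+(?:\S+\s+){0,2}\S+[\s!.?]*$",
    re.IGNORECASE,
)
# "сколько <unit> в <unit>": exactly one token on each side of "в".
UNIT_RE = re.compile(r"^\s*сколько\s+\S+\s+в\s+\S+[\s!.?]*$", re.IGNORECASE)


def is_light_by_regex(text: str) -> bool:
    """Return True if the text looks like a short definition or unit query."""
    return bool(DEFINITION_RE.match(text) or UNIT_RE.match(text))


# The definition and unit queries listed in this issue.
for query in [
    "что такое API",
    "сколько бит в байте",
    "что такое брандмауэр",
    "что такое Z-Wave",
    "что такое Matter протокол",
]:
    print(query, is_light_by_regex(query))
```

The unit pattern is purely structural, so it would also accept non-unit questions of the same shape; whether that is acceptable for the light tier would need checking against the benchmark set. The greeting case "привет, как дела?" is not covered by this sketch.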

Impact

About 10 light queries were misclassified as medium in the latest benchmark run.

alvis closed this issue 2026-03-24 02:51:17 +00:00

Reference: alvis/adolf#8