Update docs: add benchmarks/ section, fix complex tier description

- CLAUDE.md: add benchmark commands (run_benchmark.py flags, dry-run, categories, voice benchmark)
- README.md: add benchmarks/ to Files tree; fix incorrect claim that complex tier requires /think prefix (it is auto-classified via regex and embedding similarity); fix "Complex agent (/think prefix)" heading

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:13:14 +00:00
parent bd951f943f
commit 54cb940279
2 changed files with 225 additions and 1 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -18,8 +18,24 @@ python3 test_routing.py [--easy-only|--medium-only|--hard-only]

 # Use case tests — read the .md file and follow its steps as Claude Code
 # example: read tests/use_cases/weather_now.md and execute it
+
+# Routing benchmark — measures tier classification accuracy across 120 queries
+# Run from benchmarks/ — Adolf must be running. DO NOT run during active use (holds GPU).
+cd benchmarks
+python3 run_benchmark.py                       # full run (120 queries)
+python3 run_benchmark.py --tier light          # light tier only (30 queries)
+python3 run_benchmark.py --tier medium         # medium tier only (50 queries)
+python3 run_benchmark.py --tier complex --dry-run  # complex tier, medium model (no API cost)
+python3 run_benchmark.py --category smart_home_control
+python3 run_benchmark.py --ids 1,2,3
+python3 run_benchmark.py --list-categories
+
+# Voice benchmark
+python3 run_voice_benchmark.py
+
+# benchmark.json (dataset) and results_latest.json are gitignored — not committed
 ```

 ## Architecture

-@ARCHITECTURE.md
+@README.md