Update docs: add benchmarks/ section, fix complex tier description
- CLAUDE.md: add benchmark commands (run_benchmark.py flags, dry-run, categories, voice benchmark)
- README.md: add benchmarks/ to Files tree; fix incorrect claim that complex tier requires /think prefix — it is auto-classified via regex and embedding similarity; fix "Complex agent (/think prefix)" heading

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
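The commit message says the complex tier is auto-classified via regex and embedding similarity rather than by a /think prefix. The sketch below shows one way such a two-stage check could look; it is not the project's actual classifier. The patterns, exemplar sentences, threshold, and the names `embed`, `cosine`, and `classify` are all hypothetical, and a bag-of-words vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical patterns and exemplars, for illustration only.
COMPLEX_PATTERNS = [
    re.compile(r"\b(compare|analy[sz]e|step[- ]by[- ]step|research)\b", re.IGNORECASE),
]
COMPLEX_EXEMPLARS = [
    "compare these two options and explain the trade-offs in detail",
    "research this topic and write a detailed report",
]
SIMILARITY_THRESHOLD = 0.5  # assumed cut-off, not taken from the project


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(query):
    """Return "complex" on a regex hit or a close exemplar match, else "other"."""
    # Stage 1: cheap regex check.
    if any(p.search(query) for p in COMPLEX_PATTERNS):
        return "complex"
    # Stage 2: similarity against known complex-tier exemplars.
    q = embed(query)
    best = max((cosine(q, embed(e)) for e in COMPLEX_EXEMPLARS), default=0.0)
    return "complex" if best >= SIMILARITY_THRESHOLD else "other"
```

The sketch only separates complex from everything else; how the light and medium tiers are told apart is not described in this commit.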
18
CLAUDE.md
@@ -18,8 +18,24 @@ python3 test_routing.py [--easy-only|--medium-only|--hard-only]
# Use case tests — read the .md file and follow its steps as Claude Code
# example: read tests/use_cases/weather_now.md and execute it

# Routing benchmark — measures tier classification accuracy across 120 queries
# Run from benchmarks/ — Adolf must be running. DO NOT run during active use (holds GPU).
cd benchmarks
python3 run_benchmark.py # full run (120 queries)
python3 run_benchmark.py --tier light # light tier only (30 queries)
python3 run_benchmark.py --tier medium # medium tier only (50 queries)
python3 run_benchmark.py --tier complex --dry-run # complex tier, medium model (no API cost)
python3 run_benchmark.py --category smart_home_control
python3 run_benchmark.py --ids 1,2,3
python3 run_benchmark.py --list-categories

# Voice benchmark
python3 run_voice_benchmark.py

# benchmark.json (dataset) and results_latest.json are gitignored — not committed
```

## Architecture

@ARCHITECTURE.md
@README.md