# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands
```bash
# Start all services
docker compose up --build

# Interactive CLI (requires services running)
docker compose --profile tools run --rm -it cli
```
```bash
# Integration tests — run from tests/integration/, require all services up
python3 test_health.py
python3 test_memory.py [--name-only|--bench-only|--dedup-only]
python3 test_routing.py [--easy-only|--medium-only|--hard-only]
```
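test_health.py itself is not shown here. As a hedged illustration only, a health-check script of this kind typically polls each service's HTTP endpoint before the other integration tests run; every service name, port, and path below is an assumption, not this repository's actual API:

```python
import urllib.request
import urllib.error

# Hypothetical service map — placeholder names, ports, and paths,
# NOT the project's real endpoints.
SERVICES = {
    "router": "http://localhost:8000/health",
    "memory": "http://localhost:8001/health",
}

def check_health(services, timeout=5):
    """Return a dict of {service_name: error} for every unhealthy service."""
    failures = {}
    for name, url in services.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status != 200:
                    failures[name] = f"HTTP {resp.status}"
        except (urllib.error.URLError, OSError) as exc:
            failures[name] = str(exc)
    return failures

# An empty failures dict means all services answered 200.
```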
```bash
# Use case tests — read the .md file and follow its steps as Claude Code
# example: read tests/use_cases/weather_now.md and execute it
```
```bash
# Routing benchmark — measures tier classification accuracy across 120 queries
# Run from benchmarks/ — Adolf must be running. DO NOT run during active use (holds GPU).
cd benchmarks
python3 run_benchmark.py                            # full run (120 queries)
python3 run_benchmark.py --tier light               # light tier only (30 queries)
python3 run_benchmark.py --tier medium              # medium tier only (50 queries)
python3 run_benchmark.py --tier complex --dry-run   # complex tier, medium model (no API cost)
python3 run_benchmark.py --category smart_home_control  # single category
python3 run_benchmark.py --ids 1,2,3                    # specific query IDs
python3 run_benchmark.py --list-categories              # list available categories
```
```bash
# Voice benchmark
python3 run_voice_benchmark.py

# benchmark.json (dataset) and results_latest.json are gitignored — not committed
```
## Architecture
@README.md