# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands
```bash
# Start all services
docker compose up --build

# Interactive CLI (requires services running)
docker compose --profile tools run --rm -it cli
```
```bash
# Integration tests — run from tests/integration/, require all services up
python3 test_health.py
python3 test_memory.py [--name-only|--bench-only|--dedup-only]
python3 test_routing.py [--easy-only|--medium-only|--hard-only]
```
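test_health.py itself is not shown here. As a hedged illustration only, a health-check script of this kind typically polls each service's HTTP endpoint before the other integration tests run; every service name, port, and path below is an assumption, not this repository's actual API:

```python
import urllib.request
import urllib.error

# Hypothetical service map — placeholder names, ports, and paths,
# NOT the project's real endpoints.
SERVICES = {
    "router": "http://localhost:8000/health",
    "memory": "http://localhost:8001/health",
}

def check_health(services, timeout=5):
    """Return a dict of {service_name: error} for every unhealthy service."""
    failures = {}
    for name, url in services.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status != 200:
                    failures[name] = f"HTTP {resp.status}"
        except (urllib.error.URLError, OSError) as exc:
            failures[name] = str(exc)
    return failures

# An empty failures dict means all services answered 200.
```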
```bash
# Use case tests — read the .md file and follow its steps as Claude Code
# example: read tests/use_cases/weather_now.md and execute it
```
```bash
# Routing benchmark — measures tier classification accuracy across 120 queries
# Run from benchmarks/ — Adolf must be running. DO NOT run during active use (holds GPU).
cd benchmarks
python3 run_benchmark.py                            # full run (120 queries)
python3 run_benchmark.py --tier light               # light tier only (30 queries)
python3 run_benchmark.py --tier medium              # medium tier only (50 queries)
python3 run_benchmark.py --tier complex --dry-run   # complex tier, medium model (no API cost)
python3 run_benchmark.py --category smart_home_control  # single category
python3 run_benchmark.py --ids 1,2,3                    # specific query IDs
python3 run_benchmark.py --list-categories              # list available categories
```
```bash
# Voice benchmark
python3 run_voice_benchmark.py

# benchmark.json (dataset) and results_latest.json are gitignored — not committed
```
## Architecture
@README.md