From 957360f6ceddd8bf6fc2b30948a1437d3579fedc Mon Sep 17 00:00:00 2001
From: Alvis
Date: Fri, 13 Mar 2026 07:19:09 +0000
Subject: [PATCH] Restructure CLAUDE.md per official Claude Code recommendations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CLAUDE.md: 178→25 lines — commands + @ARCHITECTURE.md import only

Rules split into .claude/rules/ (load at startup, topic-scoped):
- llm-inference.md — Bifrost-only, semaphore, model name format, timeouts
- agent-pipeline.md — tier rules, no tools in medium, memory outside loop
- fast-tools.md — extension guide (path-scoped: fast_tools.py + agent.py)
- secrets.md — .env keys, Vaultwarden, no hardcoding

Path-scoped rule: fast-tools.md only loads when editing fast_tools.py or agent.py

Co-Authored-By: Claude Sonnet 4.6
---
 .claude/rules/agent-pipeline.md | 20 ++++++++++++++++++++
 .claude/rules/fast-tools.md     | 24 ++++++++++++++++++++++++
 .claude/rules/llm-inference.md  |  7 +++++++
 .claude/rules/secrets.md        |  7 +++++++
 CLAUDE.md                       | 20 ++------------------
 5 files changed, 60 insertions(+), 18 deletions(-)
 create mode 100644 .claude/rules/agent-pipeline.md
 create mode 100644 .claude/rules/fast-tools.md
 create mode 100644 .claude/rules/llm-inference.md
 create mode 100644 .claude/rules/secrets.md

diff --git a/.claude/rules/agent-pipeline.md b/.claude/rules/agent-pipeline.md
new file mode 100644
index 0000000..9001d39
--- /dev/null
+++ b/.claude/rules/agent-pipeline.md
@@ -0,0 +1,20 @@
+# Agent Pipeline Rules
+
+## Tiers
+- Complex tier requires `/think <message>` prefix. Any LLM classification of "complex" is downgraded to medium. Do not change this.
+- Medium is the default tier. Light is only for trivial static-knowledge queries matched by regex or LLM.
+- Light tier upgrade to medium is automatic when URL content is pre-fetched or a fast tool matches.
+
+## Medium agent
+- `_DirectModel` makes a single `ainvoke()` call with no tool schema. Do not add tools to the medium agent.
+- `qwen3:4b` behaves unreliably when a tool array is present in the request — inject context via system prompt instead.
+
+## Memory
+- `add_memory` and `search_memory` are called directly in `run_agent_task()`, outside the agent loop.
+- Never add memory tools to any agent's tool list.
+- Memory storage (`_store_memory`) runs as an asyncio background task after the semaphore is released.
+
+## Fast tools
+- `FastToolRunner.run_matching()` runs in the pre-flight `asyncio.gather` alongside URL fetch and memory retrieval.
+- Fast tool results are injected as a system prompt block, not returned to the user directly.
+- When `any_matches()` is true, the router forces medium tier before LLM classification.
diff --git a/.claude/rules/fast-tools.md b/.claude/rules/fast-tools.md
new file mode 100644
index 0000000..9a7c062
--- /dev/null
+++ b/.claude/rules/fast-tools.md
@@ -0,0 +1,24 @@
+---
+paths:
+  - "fast_tools.py"
+  - "agent.py"
+---
+
+# Fast Tools — Extension Guide
+
+To add a new fast tool:
+
+1. In `fast_tools.py`, subclass `FastTool` and implement:
+   - `name` (str property) — unique identifier, used in logs
+   - `matches(message: str) -> bool` — regex or logic; keep it cheap, runs on every message
+   - `run(message: str) -> str` — async fetch; return a short context block or `""` on failure; never raise
+
+2. In `agent.py`, add an instance to the `_fast_tool_runner` list (module level, after env vars are defined).
+
+3. The router will automatically force medium tier when `matches()` returns true — no router changes needed.
+
+## Constraints
+- `run()` must return in under 15s — it runs in the pre-flight gather that blocks routing.
+- Return `""` or a `[tool error: ...]` string on failure — never raise exceptions.
+- Keep returned context under ~1000 chars — larger contexts slow down `qwen3:4b` streaming significantly.
+- The deepagents container has no direct external internet. Use SearXNG (`host.docker.internal:11437`) or internal services.
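The extension contract described in the fast-tools.md rule above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions: the `FastTool` base shape and the `WeatherTool` subclass are hypothetical, not the repo's actual `fast_tools.py` code.

```python
import re


class FastTool:
    # Assumed base shape; the real class in fast_tools.py may differ.
    name: str = "base"

    def matches(self, message: str) -> bool:
        raise NotImplementedError

    async def run(self, message: str) -> str:
        raise NotImplementedError


class WeatherTool(FastTool):
    """Hypothetical tool: cheap regex match, failure-safe async run."""

    name = "weather"
    _pattern = re.compile(r"\bweather\b", re.IGNORECASE)

    def matches(self, message: str) -> bool:
        # matches() runs on every incoming message, so keep it one cheap regex
        return bool(self._pattern.search(message))

    async def run(self, message: str) -> str:
        try:
            # A real tool would query an internal service (e.g. SearXNG at
            # host.docker.internal:11437); stubbed here for illustration.
            return "[weather] stub: sunny, 21C"  # keep under ~1000 chars
        except Exception:
            return ""  # never raise out of run(); "" signals failure
```

An instance of such a class would then be appended to the `_fast_tool_runner` list in `agent.py`, as step 2 of the guide describes.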
diff --git a/.claude/rules/llm-inference.md b/.claude/rules/llm-inference.md
new file mode 100644
index 0000000..e75bfa0
--- /dev/null
+++ b/.claude/rules/llm-inference.md
@@ -0,0 +1,7 @@
+# LLM Inference Rules
+
+- All LLM calls must use `base_url=BIFROST_URL` with model name `ollama/<model>`. Never call Ollama directly for inference.
+- `_reply_semaphore` (asyncio.Semaphore(1)) serializes all GPU inference. Never bypass it or add a second semaphore.
+- Model names in code always use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen3:8b`, `ollama/qwen2.5:1.5b`.
+- Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them — GPU inference under load is slow.
+- `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional — Bifrost cannot manage VRAM.
diff --git a/.claude/rules/secrets.md b/.claude/rules/secrets.md
new file mode 100644
index 0000000..0f46710
--- /dev/null
+++ b/.claude/rules/secrets.md
@@ -0,0 +1,7 @@
+# Secrets and Environment
+
+- `.env` is required at project root and must never be committed. It is in `.gitignore`.
+- Required keys: `TELEGRAM_BOT_TOKEN`, `ROUTECHECK_TOKEN`, `YANDEX_ROUTING_KEY`.
+- `ROUTECHECK_TOKEN` is a shared secret between `deepagents` and `routecheck` containers — generate once with `python3 -c "import uuid; print(uuid.uuid4())"`.
+- All tokens are stored in Vaultwarden (AI collection). Fetch with `bw get password "<item>"` — see `~/.claude/CLAUDE.md` for the full procedure.
+- Do not hardcode tokens, URLs, or credentials anywhere in source code.
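The semaphore and timeout rules in llm-inference.md above amount to the following shape. This is a minimal sketch, assuming an async client behind `BIFROST_URL`; `_call_bifrost`, the URL value, and `infer` are illustrative stand-ins, not the repo's actual code.

```python
import asyncio

# Illustrative value; in the real service BIFROST_URL comes from the environment.
BIFROST_URL = "http://bifrost:8080/v1"
_reply_semaphore = asyncio.Semaphore(1)  # single GPU: one inference at a time

async def _call_bifrost(model: str, prompt: str) -> str:
    # Stand-in for the real OpenAI-compatible client pointed at BIFROST_URL.
    await asyncio.sleep(0)
    return f"[{model}] ok"

async def infer(prompt: str, model: str = "ollama/qwen3:4b",
                timeout: float = 180.0) -> str:
    # Every inference path acquires the one shared semaphore, so GPU calls
    # are strictly serialized; the tier timeout (30/180/600s) wraps the call.
    async with _reply_semaphore:
        return await asyncio.wait_for(_call_bifrost(model, prompt), timeout)
```

Note the model name carries the `ollama/` prefix even though the request goes to Bifrost, matching the naming rule above.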
diff --git a/CLAUDE.md b/CLAUDE.md
index 8d24ba4..e0b51d7 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -11,31 +11,15 @@ docker compose up --build
 
 # Interactive CLI (requires services running)
 docker compose --profile tools run --rm -it cli
 
-# Integration tests (run from tests/integration/, requires all services)
+# Integration tests — run from tests/integration/, require all services up
 python3 test_health.py
 python3 test_memory.py [--name-only|--bench-only|--dedup-only]
 python3 test_routing.py [--easy-only|--medium-only|--hard-only]
 
 # Use case tests — read the .md file and follow its steps as Claude Code
-# e.g.: read tests/use_cases/weather_now.md and execute it
+# example: read tests/use_cases/weather_now.md and execute it
 ```
 
-## Key Conventions
-
-- **Models via Bifrost only** — all LLM calls use `base_url=BIFROST_URL` with `ollama/` prefix. Never call Ollama directly for inference.
-- **One inference at a time** — `_reply_semaphore` serializes GPU use. Do not bypass it.
-- **No tools in medium agent** — `_DirectModel` is a plain `ainvoke()` call. Context is injected via system prompt. `qwen3:4b` is unreliable with tool schemas.
-- **Fast tools are pre-flight** — `FastToolRunner` runs before routing and before any LLM call. Results are injected as context, not returned to the user directly.
-- **Memory outside agent loop** — `add_memory`/`search_memory` are called directly, never passed to agent tool lists.
-- **Complex tier is opt-in** — `/think <message>` prefix only. LLM classification of "complex" is always downgraded to medium.
-- **`.env` is required** — `TELEGRAM_BOT_TOKEN`, `ROUTECHECK_TOKEN`, `YANDEX_ROUTING_KEY`. Never commit it.
-
-## Adding a Fast Tool
-
-1. Subclass `FastTool` in `fast_tools.py` — implement `name`, `matches(message) → bool`, `run(message) → str`
-2. Add instance to `_fast_tool_runner` list in `agent.py`
-3. The router will automatically force medium tier when `matches()` returns true
-
 ## Architecture
 @ARCHITECTURE.md
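The "no tools in medium agent" convention removed from CLAUDE.md here (it now lives in .claude/rules/agent-pipeline.md) can be sketched as below. A hypothetical reconstruction: `run_medium` and `_ainvoke` are illustrative names, and the real `_DirectModel` call is stubbed.

```python
import asyncio

async def _ainvoke(messages: list[dict]) -> str:
    # Stand-in for the real single model call routed through Bifrost.
    await asyncio.sleep(0)
    return f"answered with {len(messages)} messages"

async def run_medium(user_message: str, fast_tool_context: str) -> str:
    system = "You are the medium-tier agent."
    if fast_tool_context:
        # Pre-flight fast-tool output becomes a system prompt block,
        # never a tools array — qwen3:4b is unreliable with tool schemas.
        system += f"\n\n[context]\n{fast_tool_context}"
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]
    return await _ainvoke(messages)  # one call, no tool schema attached
```

The key property is that context gathered before routing is folded into the prompt, so the medium tier never needs a tool-calling loop.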