Restructure CLAUDE.md per official Claude Code recommendations

CLAUDE.md: 178→25 lines — commands + @ARCHITECTURE.md import only

Rules split into .claude/rules/ (load at startup, topic-scoped):
  llm-inference.md  — Bifrost-only, semaphore, model name format, timeouts
  agent-pipeline.md — tier rules, no tools in medium, memory outside loop
  fast-tools.md     — extension guide (path-scoped: fast_tools.py + agent.py)
  secrets.md        — .env keys, Vaultwarden, no hardcoding

Path-scoped rule: fast-tools.md only loads when editing fast_tools.py or agent.py
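For reference, the trimmed CLAUDE.md shape this produces might look like the sketch below. Only the @ARCHITECTURE.md import is taken from this commit; the heading and command entries are placeholders, not the project's actual commands.

```markdown
# Project

@ARCHITECTURE.md

## Commands
<!-- placeholder entries; real commands live in the actual CLAUDE.md -->
- `make test` — run the test suite
- `make lint` — lint and type-check
```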

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author: Alvis
Date: 2026-03-13 07:19:09 +00:00
Parent: 3ed47b45da
Commit: 957360f6ce
5 changed files with 60 additions and 18 deletions


.claude/rules/llm-inference.md
@@ -0,0 +1,7 @@
# LLM Inference Rules
- All LLM calls must use `base_url=BIFROST_URL` with model name `ollama/<model>`. Never call Ollama directly for inference.
- `_reply_semaphore` (asyncio.Semaphore(1)) serializes all GPU inference. Never bypass it or add a second semaphore.
- Model names in code always use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen3:8b`, `ollama/qwen2.5:1.5b`.
- Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them — GPU inference under load is slow.
- `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional — Bifrost cannot manage VRAM.
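The rules above can be sketched as a small helper. This is a minimal illustration, not the project's actual code: `BIFROST_URL`, `qualify`, and `infer` are hypothetical names, and `client` is assumed to be an OpenAI-compatible async client (e.g. `openai.AsyncOpenAI(base_url=BIFROST_URL)`); only the `ollama/` prefix, the single `Semaphore(1)`, and the per-tier timeout values come from the rules file.

```python
import asyncio

# Assumption: Bifrost exposes an OpenAI-compatible endpoint at this URL.
BIFROST_URL = "http://localhost:8080/v1"

# Per-tier timeouts (seconds) from the rules above. Do not reduce them.
TIER_TIMEOUTS = {"router": 30, "medium": 180, "complex": 600}

# The single semaphore that serializes all GPU inference.
_reply_semaphore = asyncio.Semaphore(1)


def qualify(model: str) -> str:
    """Ensure the `ollama/` prefix required to route through Bifrost."""
    return model if model.startswith("ollama/") else f"ollama/{model}"


async def infer(client, tier: str, model: str, messages: list[dict]) -> str:
    """Serialized inference call through Bifrost (hypothetical helper)."""
    async with _reply_semaphore:  # never bypass: one GPU job at a time
        resp = await client.chat.completions.create(
            model=qualify(model),        # e.g. "ollama/qwen3:4b"
            messages=messages,
            timeout=TIER_TIMEOUTS[tier],
        )
        return resp.choices[0].message.content
```

Note that `infer` never talks to Ollama directly; per the last rule, only `VRAMManager` does, and only for flush/prewarm/poll.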