CLAUDE.md: 178→25 lines (commands + @ARCHITECTURE.md import only)

Rules split into .claude/rules/ (load at startup, topic-scoped):
- llm-inference.md: Bifrost-only, semaphore, model name format, timeouts
- agent-pipeline.md: tier rules, no tools in medium, memory outside loop
- fast-tools.md: extension guide (path-scoped: fast_tools.py + agent.py)
- secrets.md: .env keys, Vaultwarden, no hardcoding

Path-scoped rule: fast-tools.md only loads when editing fast_tools.py or agent.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# LLM Inference Rules
- All LLM calls must use `base_url=BIFROST_URL` with model name `ollama/<model>`. Never call Ollama directly for inference.
- `_reply_semaphore` (`asyncio.Semaphore(1)`) serializes all GPU inference. Never bypass it or add a second semaphore.
- Model names in code always use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen3:8b`, `ollama/qwen2.5:1.5b`.
- Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them; GPU inference under load is slow.
- `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional; Bifrost cannot manage VRAM.
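The Bifrost-only, single-semaphore, per-tier-timeout rules above can be sketched together. This is a minimal illustration, not the project's actual code: `BIFROST_URL`, `_reply_semaphore`, and the timeout values come from the rules, while the client interface, the `infer` helper, and the tier→model mapping are assumptions for the example.

```python
import asyncio

BIFROST_URL = "http://localhost:8080/v1"  # assumed endpoint; the real value comes from config

# Per-tier timeouts from the rules (seconds). Never reduce these.
TIER_TIMEOUTS = {"router": 30, "medium": 180, "complex": 600}

# Illustrative tier -> model mapping; every name keeps the ollama/ prefix.
TIER_MODELS = {
    "router": "ollama/qwen2.5:1.5b",
    "medium": "ollama/qwen3:4b",
    "complex": "ollama/qwen3:8b",
}

# The single semaphore that serializes all GPU inference.
_reply_semaphore = asyncio.Semaphore(1)

async def infer(client, tier: str, prompt: str) -> str:
    """Route one completion through Bifrost, one GPU request at a time."""
    async with _reply_semaphore:  # never bypass, never add a second semaphore
        return await asyncio.wait_for(
            client.complete(model=TIER_MODELS[tier], prompt=prompt),
            timeout=TIER_TIMEOUTS[tier],
        )
```

Any OpenAI-compatible async client pointed at `BIFROST_URL` would slot into the `client` parameter; the key invariant is that every call path funnels through the same semaphore.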
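A hypothetical sketch of the `VRAMManager` role: the endpoint paths (`/api/generate` with `keep_alive`, `/api/ps`) match Ollama's public HTTP API, but the class shape, method names, and the injectable `opener` are illustrative assumptions, not the project's implementation.

```python
import json
import urllib.request

class VRAMManager:
    """The only component that talks to Ollama directly (flush/prewarm/poll)."""

    def __init__(self, ollama_url="http://localhost:11434", opener=urllib.request.urlopen):
        self.ollama_url = ollama_url
        self._open = opener  # injectable for testing; defaults to real HTTP

    @staticmethod
    def _bare(model: str) -> str:
        # Ollama itself takes the bare tag; the ollama/ prefix is Bifrost routing.
        return model.split("/", 1)[-1]

    def _post(self, path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            self.ollama_url + path,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with self._open(req) as resp:
            return json.loads(resp.read() or b"{}")

    def flush(self, model: str) -> dict:
        # keep_alive=0 asks Ollama to unload the model from VRAM immediately.
        return self._post("/api/generate", {"model": self._bare(model), "keep_alive": 0})

    def prewarm(self, model: str) -> dict:
        # A promptless generate with a positive keep_alive loads the model into VRAM.
        return self._post("/api/generate", {"model": self._bare(model), "keep_alive": "5m"})

    def poll(self) -> list:
        # /api/ps lists the models currently loaded in memory.
        with self._open(self.ollama_url + "/api/ps") as resp:
            return json.loads(resp.read())["models"]
```

Keeping these three operations in one class makes the Bifrost exception auditable: any other direct Ollama call is a rule violation.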