CLAUDE.md: 178→25 lines - commands + @ARCHITECTURE.md import only

Rules split into .claude/rules/ (load at startup, topic-scoped):
  llm-inference.md - Bifrost-only, semaphore, model name format, timeouts
  agent-pipeline.md - tier rules, no tools in medium, memory outside loop
  fast-tools.md - extension guide (path-scoped: fast_tools.py + agent.py)
  secrets.md - .env keys, Vaultwarden, no hardcoding

Path-scoped rule: fast-tools.md only loads when editing fast_tools.py or agent.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8 lines
640 B
Markdown
# LLM Inference Rules
- All LLM calls must use `base_url=BIFROST_URL` with model name `ollama/<model>`. Never call Ollama directly for inference.
- `_reply_semaphore` (`asyncio.Semaphore(1)`) serializes all GPU inference. Never bypass it or add a second semaphore.
- Model names in code always use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen3:8b`, `ollama/qwen2.5:1.5b`.
- Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them — GPU inference under load is slow.
- `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional — Bifrost cannot manage VRAM.