CLAUDE.md: 178→25 lines (commands + @ARCHITECTURE.md import only)

Rules split into .claude/rules/ (load at startup, topic-scoped):
- llm-inference.md: Bifrost-only, semaphore, model name format, timeouts
- agent-pipeline.md: tier rules, no tools in medium, memory outside loop
- fast-tools.md: extension guide (path-scoped: fast_tools.py + agent.py)
- secrets.md: .env keys, Vaultwarden, no hardcoding

Path-scoped rule: fast-tools.md only loads when editing fast_tools.py or agent.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# LLM Inference Rules
- All LLM calls must use `base_url=BIFROST_URL` with model name `ollama/<model>`. Never call Ollama directly for inference.
- `_reply_semaphore` (`asyncio.Semaphore(1)`) serializes all GPU inference. Never bypass it or add a second semaphore.
- Model names in code always use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen3:8b`, `ollama/qwen2.5:1.5b`.
- Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them; GPU inference under load is slow.
- `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional; Bifrost cannot manage VRAM.
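The Bifrost-only, single-semaphore, per-tier-timeout rules above can be sketched together. This is a minimal illustration, not the project's actual code: `BIFROST_URL`, `_reply_semaphore`, and the timeout values come from the rules, while the client interface, the `infer` helper, and the tier→model mapping are assumptions for the example.

```python
import asyncio

BIFROST_URL = "http://localhost:8080/v1"  # assumed endpoint; the real value comes from config

# Per-tier timeouts from the rules (seconds). Never reduce these.
TIER_TIMEOUTS = {"router": 30, "medium": 180, "complex": 600}

# Illustrative tier -> model mapping; every name keeps the ollama/ prefix.
TIER_MODELS = {
    "router": "ollama/qwen2.5:1.5b",
    "medium": "ollama/qwen3:4b",
    "complex": "ollama/qwen3:8b",
}

# The single semaphore that serializes all GPU inference.
_reply_semaphore = asyncio.Semaphore(1)

async def infer(client, tier: str, prompt: str) -> str:
    """Route one completion through Bifrost, one GPU request at a time."""
    async with _reply_semaphore:  # never bypass, never add a second semaphore
        return await asyncio.wait_for(
            client.complete(model=TIER_MODELS[tier], prompt=prompt),
            timeout=TIER_TIMEOUTS[tier],
        )
```

Any OpenAI-compatible async client pointed at `BIFROST_URL` would slot into the `client` parameter; the key invariant is that every call path funnels through the same semaphore.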
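A hypothetical sketch of the `VRAMManager` role: the endpoint paths (`/api/generate` with `keep_alive`, `/api/ps`) match Ollama's public HTTP API, but the class shape, method names, and the injectable `opener` are illustrative assumptions, not the project's implementation.

```python
import json
import urllib.request

class VRAMManager:
    """The only component that talks to Ollama directly (flush/prewarm/poll)."""

    def __init__(self, ollama_url="http://localhost:11434", opener=urllib.request.urlopen):
        self.ollama_url = ollama_url
        self._open = opener  # injectable for testing; defaults to real HTTP

    @staticmethod
    def _bare(model: str) -> str:
        # Ollama itself takes the bare tag; the ollama/ prefix is Bifrost routing.
        return model.split("/", 1)[-1]

    def _post(self, path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            self.ollama_url + path,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with self._open(req) as resp:
            return json.loads(resp.read() or b"{}")

    def flush(self, model: str) -> dict:
        # keep_alive=0 asks Ollama to unload the model from VRAM immediately.
        return self._post("/api/generate", {"model": self._bare(model), "keep_alive": 0})

    def prewarm(self, model: str) -> dict:
        # A promptless generate with a positive keep_alive loads the model into VRAM.
        return self._post("/api/generate", {"model": self._bare(model), "keep_alive": "5m"})

    def poll(self) -> list:
        # /api/ps lists the models currently loaded in memory.
        with self._open(self.ollama_url + "/api/ps") as resp:
            return json.loads(resp.read())["models"]
```

Keeping these three operations in one class makes the Bifrost exception auditable: any other direct Ollama call is a rule violation.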