# LLM Inference Rules

- All LLM calls must use `base_url=BIFROST_URL` with a model name carrying the `ollama/` prefix. Never call Ollama directly for inference.
- `_reply_semaphore` (`asyncio.Semaphore(1)`) serializes all GPU inference. Never bypass it or add a second semaphore.
- Model names in code always use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen3:8b`, `ollama/qwen2.5:1.5b`.
- Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them; GPU inference under load is slow.
- `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional: Bifrost cannot manage VRAM.