adolf/.claude/rules/llm-inference.md
Alvis 1f5e272600 Switch from Bifrost to LiteLLM; add Matrix channel; update rules
Infrastructure:
- docker-compose.yml: replace bifrost container with LiteLLM proxy
  (host.docker.internal:4000); complex model → deepseek-r1:free via
  OpenRouter; add Matrix URL env var; mount logs volume
- bifrost-config.json: add auth_config + postgres config_store (archived)

Routing:
- router.py: full semantic 3-tier classifier rewrite — nomic-embed-text
  centroids for light/medium/complex; regex pre-classifiers for all tiers;
  Russian utterance sets expanded
- agent.py: wire LiteLLM URL; add dry_run support; add Matrix channel

Channels:
- channels.py: add Matrix adapter (_matrix_send via mx- session prefix)

Rules / docs:
- agent-pipeline.md: remove /think prefix requirement; document automatic
  complex tier classification
- llm-inference.md: update BIFROST_URL → LITELLM_URL references; add
  remote model note for complex tier
- ARCHITECTURE.md: deleted (superseded by README.md)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:14:13 +00:00

LLM Inference Rules

  • All LLM calls must use base_url=LITELLM_URL (points to LiteLLM at host.docker.internal:4000/v1). Never call Ollama directly for inference.
  • _reply_semaphore (asyncio.Semaphore(1)) serializes all GPU inference. Never bypass it or add a second semaphore.
  • Local Ollama models use the ollama/ prefix: ollama/qwen3:4b, ollama/qwen2.5:1.5b. Remote models (e.g. OpenRouter) use their full LiteLLM name: openrouter/deepseek-r1.
  • Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them.
  • VRAMManager is the only component that contacts Ollama directly (for flush/prewarm/poll). This is intentional — LiteLLM cannot manage VRAM.
  • Complex tier uses a remote model (DEEPAGENTS_COMPLEX_MODEL) — no VRAM management is needed for it.