# LLM Inference Rules
- All LLM calls must use `base_url=LITELLM_URL` (points to LiteLLM at `host.docker.internal:4000/v1`). Never call Ollama directly for inference; a minimal call sketch follows this list.
- `_reply_semaphore` (`asyncio.Semaphore(1)`) serializes all GPU inference. Never bypass it or add a second semaphore.
- Local Ollama models use the `ollama/` prefix: `ollama/qwen3:4b`, `ollama/qwen2.5:1.5b`. Remote models (e.g. OpenRouter) use their full LiteLLM name: `openrouter/deepseek-r1`.
- Timeout values: router=30s, medium=180s, complex=600s. Do not reduce them.
- `VRAMManager` is the only component that contacts Ollama directly (for flush/prewarm/poll; see the second sketch below). This is intentional: LiteLLM cannot manage VRAM.
- The complex tier uses a remote model (`DEEPAGENTS_COMPLEX_MODEL`); no VRAM management is needed for it.
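The sketch below shows how a call that follows these rules might look: one shared client pointed at LiteLLM, the single `_reply_semaphore` around every GPU call, `ollama/`-prefixed local models versus the remote complex model, and the per-tier timeouts. `LITELLM_URL`, `_reply_semaphore`, `DEEPAGENTS_COMPLEX_MODEL`, and the model names come from the rules above; the `reply()` helper and the OpenAI-client usage are illustrative assumptions, not the actual `agent.py` code.

```python
# Minimal sketch: routing one chat completion through the LiteLLM proxy.
import asyncio
import os

from openai import AsyncOpenAI

LITELLM_URL = os.environ.get("LITELLM_URL", "http://host.docker.internal:4000/v1")

# Single shared client pointed at LiteLLM, never at Ollama directly.
_client = AsyncOpenAI(base_url=LITELLM_URL, api_key=os.environ.get("LITELLM_API_KEY", "dummy"))

# One semaphore serializes all GPU inference.
_reply_semaphore = asyncio.Semaphore(1)

# Per-tier timeouts from the rules (seconds).
TIER_TIMEOUTS = {"router": 30, "medium": 180, "complex": 600}


async def reply(prompt: str, tier: str = "medium") -> str:
    # Local models keep the ollama/ prefix; the complex tier uses the remote model.
    model = (
        os.environ.get("DEEPAGENTS_COMPLEX_MODEL", "openrouter/deepseek-r1")
        if tier == "complex"
        else "ollama/qwen3:4b"
    )
    async with _reply_semaphore:  # never bypass this
        resp = await _client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            timeout=TIER_TIMEOUTS[tier],
        )
    return resp.choices[0].message.content or ""
```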
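For the `VRAMManager` exception, a sketch of the kind of direct Ollama traffic it is allowed to make (flush/prewarm/poll) is shown next. The class and method names mirror the rule above but the implementation is assumed, as is `OLLAMA_URL`; the endpoints themselves (`/api/generate` with `keep_alive`, `/api/ps`) are Ollama's documented API.

```python
# Sketch of direct Ollama calls for VRAM management (flush / prewarm / poll).
import os

import httpx

OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://host.docker.internal:11434")


class VRAMManager:
    async def prewarm(self, model: str) -> None:
        # An empty prompt loads the model into VRAM and keeps it resident.
        async with httpx.AsyncClient() as client:
            await client.post(
                f"{OLLAMA_URL}/api/generate",
                json={"model": model, "prompt": "", "keep_alive": "10m"},
                timeout=120,
            )

    async def flush(self, model: str) -> None:
        # keep_alive=0 asks Ollama to unload the model immediately.
        async with httpx.AsyncClient() as client:
            await client.post(
                f"{OLLAMA_URL}/api/generate",
                json={"model": model, "prompt": "", "keep_alive": 0},
                timeout=120,
            )

    async def poll(self) -> list[str]:
        # /api/ps lists the models currently loaded in memory.
        async with httpx.AsyncClient() as client:
            resp = await client.get(f"{OLLAMA_URL}/api/ps", timeout=10)
            return [m["name"] for m in resp.json().get("models", [])]
```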