From 0479842e179f36083f9831e0c8a0702ee8978e08 Mon Sep 17 00:00:00 2001
From: alvis
Date: Mon, 20 Apr 2026 12:24:56 +0000
Subject: [PATCH] Add AI page; document LiteLLM + Langfuse + Ollama stack; cross-link Open-WebUI

---
 AI.md         | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++
 Home.md       |   3 +-
 Open-WebUI.md |  30 +++++---------
 3 files changed, 119 insertions(+), 22 deletions(-)
 create mode 100644 AI.md

diff --git a/AI.md b/AI.md
new file mode 100644
index 0000000..0dd8bfe
--- /dev/null
+++ b/AI.md
@@ -0,0 +1,108 @@
+# AI
+
+LLM infrastructure for Agap and consumers (oO, Adolf, Omo, Open WebUI, Pipecat voice).
+
+Configuration is in `agap_git/openai/` (compose + litellm config).
+
+## Stack
+
+| Service | Purpose | Port |
+|---------|---------|------|
+| `ollama` | GPU LLM runtime (qwen, nomic-embed-text) | `11436` → `11434` |
+| `ollama-cpu` | CPU-only runtime for small models | `11435` → `11434` |
+| `litellm` | OpenAI-compatible proxy, model aliases, rate-limits | `4000` |
+| `litellm-db` | Postgres — LiteLLM key/usage persistence | internal |
+| `langfuse` | LLM observability (traces, cost, feedback) | `3200` → `3000` |
+| `langfuse-db` | Postgres — Langfuse | internal |
+| `qdrant` | Vector DB | `6333`, `6334` |
+| `searxng` | Meta-search, pluggable to agents | `11437` → `8080` |
+| `open-webui` | Chat UI (see [[Open-WebUI]]) | `3125` |
+| `faster-whisper` | STT (Whisper v3-turbo, CUDA) | `8880` |
+| `silero-tts` | TTS (Silero v4) | `8881` |
+| `pipecat` | Voice pipeline (LiveKit + STT + TTS + LLM) | `8882` |
+
+## Public endpoints
+
+- `llm.alogins.net` → LiteLLM (`localhost:4000`)
+- `lf.alogins.net` → Langfuse (`localhost:3200`)
+- `ai.alogins.net` → Open WebUI (`localhost:3125`)
+- `voice.alogins.net` → Pipecat (`localhost:8882`)
+
+## LiteLLM
+
+OpenAI-compatible endpoint. All LLM traffic for Agap apps routes through it so model swaps are a config change, not a code change.
+
+- Config: `agap_git/openai/litellm-config.yaml`
+- Master key: set as `LITELLM_MASTER_KEY` env on the container
+- Langfuse callbacks for every request (success + failure)
+- Fallbacks defined per model (e.g. `deepseek/deepseek-r1:free` → `ollama/qwen3.5:4b`)
+
+### Model aliases
+
+**oO recommender** (used by `ml/serving` — see oO `CLAUDE.md`):
+
+| Alias | Backend | Notes |
+|-------|---------|-------|
+| `tip-generator` | `ollama/qwen2.5:1.5b` on host ollama `:11434` | Tip candidate generation |
+| `embedder` | `ollama/nomic-embed-text` on host ollama `:11434` | Task clustering, dedup |
+| `judge` | `anthropic/claude-haiku-4-5-20251001` | Offline eval only |
+
+**Raw models** (full list in `litellm-config.yaml`): qwen2.5 / qwen3 / gemma3 via ollama, plus OpenRouter free tier (llama, deepseek, gemma, mistral, nvidia nemotron, gpt-oss, minimax, hermes).
+
+### Quick test
+
+```bash
+curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
+  -H "Content-Type: application/json" \
+  https://llm.alogins.net/v1/chat/completions \
+  -d '{"model":"tip-generator","messages":[{"role":"user","content":"hi"}]}'
+```
+
+## Ollama
+
+Two runtimes: a GPU one (main, `:11436`) and a CPU fallback (`:11435`). Models live under `/mnt/ssd/ai/ollama` and `/mnt/ssd/ai/ollama-cpu`.
+
+GPU config (compose):
+- `OLLAMA_MAX_LOADED_MODELS=2` — two models resident in VRAM
+- `OLLAMA_NUM_PARALLEL=1` — serialise inference to avoid GPU contention
+- `OLLAMA_NUM_GPU=999` — force all layers to GPU, fail rather than fall back to CPU
+- `mem_limit: 4g`
+
+### Pull / list
+
+```bash
+docker exec ollama ollama pull qwen2.5:1.5b
+docker exec ollama ollama list
+```
+
+## Langfuse
+
+Observability for LiteLLM. Traces every completion with prompt, response, tokens, cost, latency. Signup disabled (`AUTH_DISABLE_SIGNUP=true`) — create users via admin.
+
+- Backend DB: `langfuse-db` (Postgres)
+- Callbacks wired through LiteLLM (`litellm_settings.success_callback: [langfuse]`)
+
+## Data
+
+| Path | Contents |
+|------|----------|
+| `/mnt/ssd/ai/ollama` | GPU ollama models |
+| `/mnt/ssd/ai/ollama-cpu` | CPU ollama models |
+| `/mnt/ssd/ai/open-webui` | Open WebUI state |
+| `/mnt/ssd/ai/searxng` | Searxng config + cache |
+| `/mnt/ssd/ai/faster-whisper` | Whisper model cache |
+| `/mnt/ssd/ai/silero-tts` | Silero torch cache |
+| `/mnt/ssd/dbs/litellm/postgres` | LiteLLM keys + usage |
+| `/mnt/ssd/dbs/langfuse/postgres` | Langfuse traces |
+| `/mnt/ssd/dbs/qdrant` | Vector store |
+
+## Start / stop
+
+```bash
+cd ~/agap_git/openai
+docker compose up -d                   # everything
+docker compose up -d litellm           # minimal (brings up its deps)
+docker compose stop litellm langfuse   # stop subset
+```
+
+Most services are `restart: always`.
diff --git a/Home.md b/Home.md
index f062064..2e0af85 100644
--- a/Home.md
+++ b/Home.md
@@ -11,7 +11,8 @@
 
 - [[Immich]] — Photo management and backup
 - [[Gitea]] — Git hosting
-- [[Open-WebUI]] — AI chat interface
+- [[AI]] — LLM stack (LiteLLM, Langfuse, Ollama, STT/TTS)
+- [[Open-WebUI]] — AI chat interface (part of the AI stack)
 - [[Home-Assistant]] — KVM virtual machine
 - [[3X-UI]] — VPN proxy
 - [[Zabbix]] — Monitoring (Zabbix 7.4, PostgreSQL, Apache)
diff --git a/Open-WebUI.md b/Open-WebUI.md
index 8785ba2..a467e67 100644
--- a/Open-WebUI.md
+++ b/Open-WebUI.md
@@ -1,35 +1,23 @@
 # Open WebUI
 
-AI chat interface with Ollama, GPU-accelerated via NVIDIA.
-
-## Configuration
-
-Configuration is in the `agap_git` repository:
-- Compose file: `openai/docker-compose.yml`
-
-Repository: https://github.com/alvis/agap_git
+Chat UI frontend to the [[AI]] stack. GPU-accelerated via NVIDIA Container Toolkit. Routes LLM calls through LiteLLM (`host.docker.internal:4000/v1`), STT via Faster-Whisper, TTS via Silero.
 
 ## Access
 
-- Web: `http://agap:3125`
+- `http://agap:3125`
+- `ai.alogins.net` (via Caddy)
+
+## Configuration
+
+`agap_git/openai/docker-compose.yml` (service `open-webui`).
 
 ## Data
 
-| Path | Contents |
-|------|----------|
-| Docker volume `ollama` | Ollama models |
-| Docker volume `open-webui` | Web UI data and settings |
+- `/mnt/ssd/ai/open-webui` — user DB, chats, settings
 
-## GPU
+## GPU prerequisites
 
-Requires NVIDIA GPU support:
 ```bash
 sudo ./nvidia-docker-install.sh
 ./install-cuda.sh
 ```
-
-## Stack
-
-- Open WebUI (NVIDIA/Ollama variant)
-- Ollama (LLM runtime)
-- NVIDIA Container Toolkit for GPU acceleration