Add AI page; document LiteLLM + Langfuse + Ollama stack; cross-link Open-WebUI

2026-04-20 12:24:56 +00:00
parent 2a5865d98a
commit 0479842e17
3 changed files with 119 additions and 22 deletions

108
AI.md Normal file

@@ -0,0 +1,108 @@
# AI
LLM infrastructure for Agap and consumers (oO, Adolf, Omo, Open WebUI, Pipecat voice).
Configuration is in `agap_git/openai/` (compose + litellm config).
## Stack
| Service | Purpose | Port (host → container) |
|---------|---------|------|
| `ollama` | GPU LLM runtime (qwen, nomic-embed-text) | `11436` → `11434` |
| `ollama-cpu` | CPU-only runtime for small models | `11435` → `11434` |
| `litellm` | OpenAI-compatible proxy, model aliases, rate-limits | `4000` |
| `litellm-db` | Postgres — LiteLLM key/usage persistence | internal |
| `langfuse` | LLM observability (traces, cost, feedback) | `3200` → `3000` |
| `langfuse-db` | Postgres — Langfuse | internal |
| `qdrant` | Vector DB | `6333`, `6334` |
| `searxng` | Meta-search, pluggable to agents | `11437` → `8080` |
| `open-webui` | Chat UI (see [[Open-WebUI]]) | `3125` |
| `faster-whisper` | STT (Whisper v3-turbo, CUDA) | `8880` |
| `silero-tts` | TTS (Silero v4) | `8881` |
| `pipecat` | Voice pipeline (LiveKit + STT + TTS + LLM) | `8882` |
## Public endpoints
- `llm.alogins.net` → LiteLLM (`localhost:4000`)
- `lf.alogins.net` → Langfuse (`localhost:3200`)
- `ai.alogins.net` → Open WebUI (`localhost:3125`)
- `voice.alogins.net` → Pipecat (`localhost:8882`)
## LiteLLM
OpenAI-compatible endpoint. All LLM traffic for Agap apps routes through it so model swaps are a config change, not a code change.
- Config: `agap_git/openai/litellm-config.yaml`
- Master key: set as `LITELLM_MASTER_KEY` env on the container
- Langfuse callbacks for every request (success + failure)
- Fallbacks defined per model (e.g. `deepseek/deepseek-r1:free` → `ollama/qwen3.5:4b`)
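To check which aliases and raw models the proxy currently serves, the OpenAI-compatible model listing can be queried. This is a minimal sketch: `LITELLM_BASE_URL` and the helper names `litellm_url` and `litellm_models` are illustrative assumptions, not something defined in `agap_git`.

```bash
# Hypothetical helpers; the names and LITELLM_BASE_URL are illustrative only.
# Default to the local LiteLLM port from the Stack table.
LITELLM_BASE_URL="${LITELLM_BASE_URL:-http://localhost:4000}"

# Build an OpenAI-style URL for a path, tolerating a trailing slash on the base.
litellm_url() {
  printf '%s/v1/%s\n' "${LITELLM_BASE_URL%/}" "$1"
}

# List the models the proxy exposes (needs LITELLM_MASTER_KEY in the environment).
litellm_models() {
  curl -fsS -H "Authorization: Bearer $LITELLM_MASTER_KEY" "$(litellm_url models)"
}
```

Running `litellm_models` after editing `litellm-config.yaml` and restarting the container is one way to confirm that a new alias was picked up.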
### Model aliases
**oO recommender** (used by `ml/serving` — see oO `CLAUDE.md`):
| Alias | Backend | Notes |
|-------|---------|-------|
| `tip-generator` | `ollama/qwen2.5:1.5b` on host ollama `:11434` | Tip candidate generation |
| `embedder` | `ollama/nomic-embed-text` on host ollama `:11434` | Task clustering, dedup |
| `judge` | `anthropic/claude-haiku-4-5-20251001` | Offline eval only |
**Raw models** (full list in `litellm-config.yaml`): qwen2.5 / qwen3 / gemma3 via ollama, plus OpenRouter free tier (llama, deepseek, gemma, mistral, nvidia nemotron, gpt-oss, minimax, hermes).
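The `embedder` alias should be reachable through the proxy's OpenAI-style embeddings route in the same way the chat aliases are. The sketch below assumes that route is enabled for this alias; the helper names `json_escape`, `embed_payload` and `embed` are hypothetical.

```bash
# Hypothetical helpers; the names are illustrative, not part of agap_git.

# Escape backslashes and double quotes so one-line text is safe inside a JSON string.
# Newlines and other control characters are NOT handled; use a real JSON tool for those.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for the `embedder` alias.
embed_payload() {
  printf '{"model":"embedder","input":"%s"}\n' "$(json_escape "$1")"
}

# Send it to the public LiteLLM endpoint (needs LITELLM_MASTER_KEY in the environment).
embed() {
  curl -fsS -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    https://llm.alogins.net/v1/embeddings \
    -d "$(embed_payload "$1")"
}
```

For example, `embed 'buy milk'` would post a single task title for embedding. Keeping the payload builder separate from the `curl` call makes it easy to inspect the body before sending it.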
### Quick test
```bash
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
https://llm.alogins.net/v1/chat/completions \
-d '{"model":"tip-generator","messages":[{"role":"user","content":"hi"}]}'
```
## Ollama
Two runtimes: a GPU one (main, `:11436`) and a CPU fallback (`:11435`). Models live under `/mnt/ssd/ai/ollama` and `/mnt/ssd/ai/ollama-cpu`.
GPU config (compose):
- `OLLAMA_MAX_LOADED_MODELS=2` — two models resident in VRAM
- `OLLAMA_NUM_PARALLEL=1` — serialise inference to avoid GPU contention
- `OLLAMA_NUM_GPU=999` — force all layers to GPU, fail rather than fall back to CPU
- `mem_limit: 4g`
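To see what each runtime actually has resident, Ollama's running-models endpoint (`GET /api/ps`) can be queried on the host ports listed above. A minimal sketch; the helper names `ollama_url` and `ollama_loaded` are hypothetical.

```bash
# Hypothetical helpers; the names are illustrative. Ports are the host ports from the Stack table.

# Map a runtime name to its local base URL.
ollama_url() {
  case "$1" in
    gpu) echo "http://localhost:11436" ;;
    cpu) echo "http://localhost:11435" ;;
    *)   echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}

# Show the models currently loaded in memory for the given runtime.
ollama_loaded() {
  local base
  base="$(ollama_url "$1")" || return 1
  curl -fsS "$base/api/ps"
}
```

With `OLLAMA_MAX_LOADED_MODELS=2`, `ollama_loaded gpu` would be expected to report at most two models. `docker exec ollama ollama ps` gives a similar view from inside the container.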
### Pull / list
```bash
docker exec ollama ollama pull qwen2.5:1.5b
docker exec ollama ollama list
```
## Langfuse
Observability for LiteLLM. Traces every completion with prompt, response, tokens, cost, latency. Signup disabled (`AUTH_DISABLE_SIGNUP=true`) — create users via admin.
- Backend DB: `langfuse-db` (Postgres)
- Callbacks wired through LiteLLM (`litellm_settings.success_callback: [langfuse]`)
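If traces stop appearing, a first check is whether Langfuse itself is answering. The sketch below assumes the deployed Langfuse version serves its public health route at `/api/public/health`; the helper names are hypothetical.

```bash
# Hypothetical helpers; the names are illustrative, not part of agap_git.

# Build the health URL; defaults to the local Langfuse port from the Stack table.
langfuse_health_url() {
  local base="${1:-http://localhost:3200}"
  printf '%s/api/public/health\n' "${base%/}"
}

# Exit non-zero if Langfuse does not answer with a success status.
langfuse_health() {
  curl -fsS "$(langfuse_health_url "$1")"
}
```

`langfuse_health` probes the local port, while `langfuse_health https://lf.alogins.net` goes through the public endpoint, which helps tell a reverse-proxy problem apart from a container problem.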
## Data
| Path | Contents |
|------|----------|
| `/mnt/ssd/ai/ollama` | GPU ollama models |
| `/mnt/ssd/ai/ollama-cpu` | CPU ollama models |
| `/mnt/ssd/ai/open-webui` | Open WebUI state |
| `/mnt/ssd/ai/searxng` | Searxng config + cache |
| `/mnt/ssd/ai/faster-whisper` | Whisper model cache |
| `/mnt/ssd/ai/silero-tts` | Silero torch cache |
| `/mnt/ssd/dbs/litellm/postgres` | LiteLLM keys + usage |
| `/mnt/ssd/dbs/langfuse/postgres` | Langfuse traces |
| `/mnt/ssd/dbs/qdrant` | Vector store |
## Start / stop
```bash
cd ~/agap_git/openai
docker compose up -d # everything
docker compose up -d litellm # minimal (brings up its deps)
docker compose stop litellm langfuse # stop subset
```
Most services set `restart: always`, so they come back on their own after a crash or a host reboot.
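After `docker compose up -d` the containers may need a few seconds before they accept requests. A small retry helper can block until a service answers. This is a sketch: `wait_for` is a hypothetical helper, and the LiteLLM health path in the usage line is an assumption about the deployed proxy version.

```bash
# Hypothetical helper; not part of agap_git.
# Usage: wait_for <attempts> <delay_seconds> <command...>
# Retries the command until it succeeds or the attempts run out.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Skip the sleep after the final failed attempt.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example (not executed here):
#   docker compose up -d litellm && wait_for 30 2 curl -fsS http://localhost:4000/health/liveliness
```

The helper returns 0 as soon as the command succeeds and 1 once the attempts are exhausted, so it can gate follow-up steps in a script.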

@@ -11,7 +11,8 @@
 - [[Immich]] — Photo management and backup
 - [[Gitea]] — Git hosting
-- [[Open-WebUI]] — AI chat interface
+- [[AI]] — LLM stack (LiteLLM, Langfuse, Ollama, STT/TTS)
+- [[Open-WebUI]] — AI chat interface (part of the AI stack)
 - [[Home-Assistant]] — KVM virtual machine
 - [[3X-UI]] — VPN proxy
 - [[Zabbix]] — Monitoring (Zabbix 7.4, PostgreSQL, Apache)

@@ -1,35 +1,23 @@
 # Open WebUI
-AI chat interface with Ollama, GPU-accelerated via NVIDIA.
+Chat UI frontend to the [[AI]] stack. GPU-accelerated via NVIDIA Container Toolkit. Routes LLM calls through LiteLLM (`host.docker.internal:4000/v1`), STT via Faster-Whisper, TTS via Silero.
-## Configuration
-Configuration is in the `agap_git` repository:
-- Compose file: `openai/docker-compose.yml`
-Repository: https://github.com/alvis/agap_git
 ## Access
-- Web: `http://agap:3125`
+- `http://agap:3125`
+- `ai.alogins.net` (via Caddy)
+## Configuration
+`agap_git/openai/docker-compose.yml` (service `open-webui`).
 ## Data
-| Path | Contents |
-|------|----------|
-| Docker volume `ollama` | Ollama models |
-| Docker volume `open-webui` | Web UI data and settings |
+- `/mnt/ssd/ai/open-webui` — user DB, chats, settings
-## GPU
-Requires NVIDIA GPU support:
+## GPU prerequisites
 ```bash
 sudo ./nvidia-docker-install.sh
 ./install-cuda.sh
 ```
-## Stack
-- Open WebUI (NVIDIA/Ollama variant)
-- Ollama (LLM runtime)
-- NVIDIA Container Toolkit for GPU acceleration