Add AI page; document LiteLLM + Langfuse + Ollama stack; cross-link Open-WebUI
108
AI.md
Normal file
@@ -0,0 +1,108 @@
# AI

LLM infrastructure for Agap and its consumers (oO, Adolf, Omo, Open WebUI, Pipecat voice).

Configuration is in `agap_git/openai/` (compose + litellm config).

## Stack

| Service | Purpose | Port (host → container) |
|---------|---------|-------------------------|
| `ollama` | GPU LLM runtime (qwen, nomic-embed-text) | `11436` → `11434` |
| `ollama-cpu` | CPU-only runtime for small models | `11435` → `11434` |
| `litellm` | OpenAI-compatible proxy, model aliases, rate limits | `4000` |
| `litellm-db` | Postgres — LiteLLM key/usage persistence | internal |
| `langfuse` | LLM observability (traces, cost, feedback) | `3200` → `3000` |
| `langfuse-db` | Postgres — Langfuse | internal |
| `qdrant` | Vector DB | `6333`, `6334` |
| `searxng` | Meta-search, pluggable into agents | `11437` → `8080` |
| `open-webui` | Chat UI (see [[Open-WebUI]]) | `3125` |
| `faster-whisper` | STT (Whisper v3-turbo, CUDA) | `8880` |
| `silero-tts` | TTS (Silero v4) | `8881` |
| `pipecat` | Voice pipeline (LiveKit + STT + TTS + LLM) | `8882` |
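
The first number in each port mapping is the port published on the host. As a rough sketch, the published ports can be probed from the Agap host as shown here. The helper names `stack_ports`, `port_open`, and `check_stack` are hypothetical (they are not part of `agap_git`), and the probe assumes `bash` and coreutils `timeout` are installed.

```bash
# Host-side ports copied from the Stack table (one "service port" pair per line).
stack_ports() {
  cat <<'EOF'
ollama 11436
ollama-cpu 11435
litellm 4000
langfuse 3200
qdrant 6333
qdrant 6334
searxng 11437
open-webui 3125
faster-whisper 8880
silero-tts 8881
pipecat 8882
EOF
}

# Succeed if a TCP connection to host:port opens within 2 seconds.
# Relies on bash's /dev/tcp pseudo-device, so it goes through an explicit bash.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" </dev/null 2>/dev/null
}

# Print "service port up|down" for every entry; the host defaults to localhost.
check_stack() {
  host=${1:-localhost}
  stack_ports | while read -r svc port; do
    if port_open "$host" "$port"; then state=up; else state=down; fi
    printf '%-15s %-6s %s\n' "$svc" "$port" "$state"
  done
}
```

Running `check_stack` on the host prints one line per published port. An open port only shows that something is listening, not that the service behind it is healthy.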

## Public endpoints

- `llm.alogins.net` → LiteLLM (`localhost:4000`)
- `lf.alogins.net` → Langfuse (`localhost:3200`)
- `ai.alogins.net` → Open WebUI (`localhost:3125`)
- `voice.alogins.net` → Pipecat (`localhost:8882`)

## LiteLLM

OpenAI-compatible endpoint. All LLM traffic for Agap apps routes through it so model swaps are a config change, not a code change.

- Config: `agap_git/openai/litellm-config.yaml`
- Master key: set via the `LITELLM_MASTER_KEY` environment variable on the container
- Langfuse callbacks fire on every request (success + failure)
- Fallbacks defined per model (e.g. `deepseek/deepseek-r1:free` → `ollama/qwen3.5:4b`)
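
Because every client goes through the proxy, the quickest way to see which aliases and raw models a config change exposes is to ask the proxy itself. The sketch below assumes the proxy serves the OpenAI-style `/v1/models` listing route. `litellm_get` and the `LITELLM_URL` override are hypothetical names introduced for illustration.

```bash
# GET a path on the LiteLLM proxy, authenticating with the master key.
# Fails with a message if LITELLM_MASTER_KEY is unset or empty
# (inside a script this aborts the script).
litellm_get() {
  : "${LITELLM_MASTER_KEY:?set LITELLM_MASTER_KEY first}"
  curl -sS -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
       "${LITELLM_URL:-https://llm.alogins.net}$1"
}
```

Usage: `litellm_get /v1/models` returns a JSON list of model names. After editing `litellm-config.yaml` and restarting `litellm`, a new alias should appear there before any client is pointed at it.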

### Model aliases

**oO recommender** (used by `ml/serving` — see oO `CLAUDE.md`):

| Alias | Backend | Notes |
|-------|---------|-------|
| `tip-generator` | `ollama/qwen2.5:1.5b` on host ollama `:11434` | Tip candidate generation |
| `embedder` | `ollama/nomic-embed-text` on host ollama `:11434` | Task clustering, dedup |
| `judge` | `anthropic/claude-haiku-4-5-20251001` | Offline eval only |

**Raw models** (full list in `litellm-config.yaml`): qwen2.5 / qwen3 / gemma3 via ollama, plus OpenRouter free tier (llama, deepseek, gemma, mistral, nvidia nemotron, gpt-oss, minimax, hermes).
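
Aliases are used exactly like model names in OpenAI-style requests. As an illustration, the `embedder` alias could be called through the standard `/v1/embeddings` route as sketched below. The helpers `json_escape`, `embed_payload`, and `embed` are hypothetical, and the escaping is deliberately minimal.

```bash
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Sketch only: control characters such as newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for the `embedder` alias.
embed_payload() {
  printf '{"model":"embedder","input":"%s"}' "$(json_escape "$1")"
}

# Send the text to the proxy and print the raw JSON response.
embed() {
  curl -sS -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
       -H "Content-Type: application/json" \
       https://llm.alogins.net/v1/embeddings \
       -d "$(embed_payload "$1")"
}
```

Usage: `embed 'water the plants'`. Only the alias appears in the request, so pointing `embedder` at a different backend would not require touching callers.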

### Quick test

```bash
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
     -H "Content-Type: application/json" \
     https://llm.alogins.net/v1/chat/completions \
     -d '{"model":"tip-generator","messages":[{"role":"user","content":"hi"}]}'
```

## Ollama

Two runtimes: a GPU one (main, `:11436`) and a CPU fallback (`:11435`). Models live under `/mnt/ssd/ai/ollama` and `/mnt/ssd/ai/ollama-cpu`.

GPU config (compose):
- `OLLAMA_MAX_LOADED_MODELS=2` — two models resident in VRAM
- `OLLAMA_NUM_PARALLEL=1` — serialise inference to avoid GPU contention
- `OLLAMA_NUM_GPU=999` — force all layers to GPU, fail rather than fall back to CPU
- `mem_limit: 4g`
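
To check what these settings do in practice, Ollama's HTTP API can report which models are currently loaded (`GET /api/ps`). The sketch below assumes the host ports listed above. `ollama_url` and `ollama_loaded` are hypothetical helper names.

```bash
# Map a runtime name to its base URL on the host (ports from the Stack table).
ollama_url() {
  case "$1" in
    gpu) echo "http://localhost:11436" ;;
    cpu) echo "http://localhost:11435" ;;
    *)   echo "usage: ollama_url gpu|cpu" >&2; return 1 ;;
  esac
}

# Print the JSON list of models currently loaded by the chosen runtime.
ollama_loaded() {
  base=$(ollama_url "$1") || return 1
  curl -sS "$base/api/ps"
}
```

Usage: `ollama_loaded gpu`. With `OLLAMA_MAX_LOADED_MODELS=2`, the GPU runtime would be expected to list at most two models at a time.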

### Pull / list

```bash
docker exec ollama ollama pull qwen2.5:1.5b
docker exec ollama ollama list
```

## Langfuse

Observability for LiteLLM. Traces every completion with its prompt, response, token counts, cost, and latency. Sign-up is disabled (`AUTH_DISABLE_SIGNUP=true`), so users are created by an admin.

- Backend DB: `langfuse-db` (Postgres)
- Callbacks wired through LiteLLM (`litellm_settings.success_callback: [langfuse]`)

## Data

| Path | Contents |
|------|----------|
| `/mnt/ssd/ai/ollama` | GPU ollama models |
| `/mnt/ssd/ai/ollama-cpu` | CPU ollama models |
| `/mnt/ssd/ai/open-webui` | Open WebUI state |
| `/mnt/ssd/ai/searxng` | SearXNG config + cache |
| `/mnt/ssd/ai/faster-whisper` | Whisper model cache |
| `/mnt/ssd/ai/silero-tts` | Silero torch cache |
| `/mnt/ssd/dbs/litellm/postgres` | LiteLLM keys + usage |
| `/mnt/ssd/dbs/langfuse/postgres` | Langfuse traces |
| `/mnt/ssd/dbs/qdrant` | Vector store |

## Start / stop

```bash
cd ~/agap_git/openai
docker compose up -d                   # everything
docker compose up -d litellm           # minimal (brings up its deps)
docker compose stop litellm langfuse   # stop subset
```

Most services are `restart: always`.
3
Home.md
@@ -11,7 +11,8 @@
 - [[Immich]] — Photo management and backup
 - [[Gitea]] — Git hosting
-- [[Open-WebUI]] — AI chat interface
+- [[AI]] — LLM stack (LiteLLM, Langfuse, Ollama, STT/TTS)
+- [[Open-WebUI]] — AI chat interface (part of the AI stack)
 - [[Home-Assistant]] — KVM virtual machine
 - [[3X-UI]] — VPN proxy
 - [[Zabbix]] — Monitoring (Zabbix 7.4, PostgreSQL, Apache)
@@ -1,35 +1,23 @@
 # Open WebUI

-AI chat interface with Ollama, GPU-accelerated via NVIDIA.

-## Configuration

-Configuration is in the `agap_git` repository:
-- Compose file: `openai/docker-compose.yml`

-Repository: https://github.com/alvis/agap_git
+Chat UI frontend to the [[AI]] stack. GPU-accelerated via NVIDIA Container Toolkit. Routes LLM calls through LiteLLM (`host.docker.internal:4000/v1`), STT via Faster-Whisper, TTS via Silero.

 ## Access

-- Web: `http://agap:3125`
+- `http://agap:3125`
+- `ai.alogins.net` (via Caddy)

+## Configuration

+`agap_git/openai/docker-compose.yml` (service `open-webui`).

 ## Data

-| Path | Contents |
-|------|----------|
-| Docker volume `ollama` | Ollama models |
-| Docker volume `open-webui` | Web UI data and settings |
+- `/mnt/ssd/ai/open-webui` — user DB, chats, settings

-## GPU
+## GPU prerequisites

-Requires NVIDIA GPU support:
 ```bash
 sudo ./nvidia-docker-install.sh
 ./install-cuda.sh
 ```

-## Stack

-- Open WebUI (NVIDIA/Ollama variant)
-- Ollama (LLM runtime)
-- NVIDIA Container Toolkit for GPU acceleration