From 304198ebbdf2705e9b0383574a678b7797c711a8 Mon Sep 17 00:00:00 2001
From: alvis
Date: Sun, 8 Mar 2026 06:51:40 +0000
Subject: [PATCH] Add GPU monitoring docs, Agap Installation page

---
 Agap-Installation.md | 85 ++++++++++++++++++++++++++++++++++++++++++++
 Home.md              |  1 +
 Zabbix.md            | 25 +++++++++++++
 3 files changed, 111 insertions(+)
 create mode 100644 Agap-Installation.md

diff --git a/Agap-Installation.md b/Agap-Installation.md
new file mode 100644
index 0000000..4bbebd5
--- /dev/null
+++ b/Agap-Installation.md
@@ -0,0 +1,85 @@
+# Agap Installation
+
+Steps to set up a fresh Agap server from scratch.
+
+## 1. GPU & Docker
+
+```bash
+sudo ./nvidia-docker-install.sh  # Docker + NVIDIA Container Toolkit
+./install-cuda.sh  # CUDA toolkit (no driver)
+```
+
+## 2. Zabbix Agent (host)
+
+Install agent and plugins:
+
+```bash
+# Add Zabbix repo
+wget https://repo.zabbix.com/zabbix/7.4/release/ubuntu/pool/main/z/zabbix-release/zabbix-release_latest_7.4+ubuntu24.04_all.deb
+sudo dpkg -i zabbix-release_latest_7.4+ubuntu24.04_all.deb
+sudo apt update
+
+# Install agent and GPU plugin
+sudo apt install zabbix-agent2 zabbix-agent2-plugin-nvidia-gpu
+```
+
+Configure `/etc/zabbix/zabbix_agent2.conf`:
+
+```ini
+Server=127.0.0.1
+ServerActive=127.0.0.1:10051
+Hostname=AgapHost
+PluginSocket=/run/zabbix/agent.plugin.sock
+ControlSocket=/run/zabbix/agent.sock
+Include=/etc/zabbix/zabbix_agent2.d/plugins.d/*.conf
+Include=/etc/zabbix/zabbix_agent2.d/*.conf
+```
+
+```bash
+sudo systemctl enable --now zabbix-agent2
+```
+
+In Zabbix UI, link these templates to the `AgapHost` host:
+- **Linux by Zabbix agent active**
+- **Nvidia by Zabbix agent 2 active**
+
+## 3. Custom Zabbix UserParameters
+
+Add backup monitoring to `/etc/zabbix/zabbix_agent2.d/gitea_backup.conf`:
+
+```ini
+# Gitea backup
+UserParameter=gitea.backup.status,grep -c "Finish dumping" /mnt/backups/gitea/backup.log 2>/dev/null | grep -qx 1 && echo 1 || echo 0
+UserParameter=gitea.backup.age,f=$(ls -t /mnt/backups/gitea/gitea-dump-*.zip 2>/dev/null | head -1); [ -n "$f" ] && echo $(( $(date +%s) - $(stat -c %Y "$f") )) || echo -1
+
+# DBS backup
+UserParameter=dbs.backup.age,f=/mnt/backups/dbs/.last_sync; [ -f "$f" ] && echo $(( $(date +%s) - $(stat -c %Y "$f") )) || echo -1
+
+# Immich backup
+UserParameter=immich.backup.age,f=/mnt/backups/media/.last_sync; [ -f "$f" ] && echo $(( $(date +%s) - $(stat -c %Y "$f") )) || echo -1
+```
+
+## 4. Root Cron Jobs
+
+```bash
+sudo crontab -e
+```
+
+Add:
+```
+0 3 * * * /home/alvis/agap_git/gitea/backup.sh >> /mnt/backups/gitea/cron.log 2>&1
+30 2 * * * /home/alvis/agap_git/immich-app/backup.sh >> /mnt/backups/media/cron.log 2>&1
+30 3 * * * rsync -a --delete /mnt/ssd/dbs/ /mnt/backups/dbs/ >> /mnt/backups/dbs/cron.log 2>&1 && touch /mnt/backups/dbs/.last_sync
+```
+
+## 5. Services
+
+Start all Docker services:
+
+```bash
+cd ~/agap_git
+docker compose up -d  # Immich
+cd gitea && docker compose up -d  # Gitea
+cd ../openai && docker compose up -d  # Open WebUI + Ollama
+cd ../zabbix && docker compose up -d  # Zabbix
+```
diff --git a/Home.md b/Home.md
index 78e010f..2bba45f 100644
--- a/Home.md
+++ b/Home.md
@@ -2,6 +2,7 @@
 
 ## Infrastructure
 
+- [[Agap-Installation]] — Fresh install guide
 - [[Network]] — Netplan, Caddy, port forwarding
 - [[Storage]] — LVM setup
 - [[Backups]] — Gitea and database backups
diff --git a/Zabbix.md b/Zabbix.md
index b9a283d..8e53daf 100644
--- a/Zabbix.md
+++ b/Zabbix.md
@@ -64,6 +64,31 @@ Host "HA Agap" receives alerts from Home Assistant via `history.push` API.
 
 To add a new HA alert: create a trapper item + trigger on "HA Agap", add `rest_command` in HA `configuration.yaml`, create HA automation to call it.
 
+## GPU Monitoring (AgapHost)
+
+The host agent monitors the GTX 1070 GPU via the `zabbix-agent2-plugin-nvidia-gpu` package.
+
+**Installed packages:**
+```bash
+apt install zabbix-agent2-plugin-nvidia-gpu
+```
+
+Plugin binary: `/usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu`
+Plugin config: `/etc/zabbix/zabbix_agent2.d/plugins.d/nvidia.conf`
+
+Template linked to AgapHost: **Nvidia by Zabbix agent 2 active** — uses `nvml.*` keys.
+
+Key metrics reported:
+| Item | Key |
+|------|-----|
+| GPU memory free | `nvml.device.memory.fb.free` |
+| GPU memory used | `nvml.device.memory.fb.used` |
+| GPU utilization | `nvml.device.utilization.gpu` |
+| Temperature | `nvml.device.temperature` |
+| Power usage | `nvml.device.power.usage` |
+
+> ECC memory error items show expected errors — GTX 1070 does not support ECC.
+
 ## Notes
 
 - Zabbix server port 10051 is exposed on the host for the host agent