Add Adolf architecture doc and integration test script

- ARCHITECTURE.md: comprehensive pipeline description (copied from Gitea wiki)
- test_pipeline.py: tests all services, memory, async timing, and recall

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Alvis
2026-02-23 04:52:40 +00:00
parent 90cb41ec53
commit f6714f9392
2 changed files with 408 additions and 0 deletions

adolf/ARCHITECTURE.md Normal file

@@ -0,0 +1,92 @@
# Adolf
Persistent AI assistant reachable via Telegram. GPU-accelerated inference with long-term memory and web search.
## Architecture
```
Telegram user
↕ (long-polling)
[grammy] Node.js — port 3001
- grammY bot polls Telegram
- on message: fire-and-forget POST /chat to deepagents
- exposes MCP SSE server: tool send_telegram_message(chat_id, text)
↕ fire-and-forget HTTP ↕ MCP SSE tool call
[deepagents] Python FastAPI — port 8000
- POST /chat → 202 Accepted immediately
- background task: run LangGraph react agent
- LLM: qwen3:8b via Ollama GPU (host port 11436)
- tools: search_memory, get_all_memories, web_search
- after reply: async fire-and-forget → store memory on CPU
↕ MCP SSE ↕ HTTP (SearXNG)
[openmemory] Python + mem0 — port 8765 [SearXNG — port 11437]
- MCP tools: add_memory, search_memory, get_all_memories
- mem0 backend: Qdrant (port 6333) + CPU Ollama (port 11435)
- embedder: nomic-embed-text (768 dims)
- extractor: gemma3:1b
- collection: adolf_memories
```
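On each incoming Telegram message, Grammy fires a single POST /chat and moves on without awaiting the agent. A sketch of that request shape (the real caller is the Node.js grammY bot; `build_chat_request` is a hypothetical helper, shown in Python to match the payload the integration test sends):

```python
import json
import urllib.request

def build_chat_request(base="http://deepagents:8000", message="", chat_id=""):
    """Build the fire-and-forget POST /chat request sent on each incoming
    Telegram message. deepagents answers 202 before doing any work."""
    data = json.dumps({"message": message, "chat_id": chat_id}).encode()
    return urllib.request.Request(
        f"{base}/chat",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# req = build_chat_request(message="hi", chat_id="346967270")
# urllib.request.urlopen(req, timeout=5)  # expect HTTP 202 Accepted
```

The sender never blocks on inference; delivery of the eventual reply goes the other way, through Grammy's `send_telegram_message` MCP tool.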
## Queuing and Concurrency
Two semaphores prevent resource contention:
| Semaphore | Guards | Notes |
|-----------|--------|-------|
| `_reply_semaphore(1)` | GPU Ollama (qwen3:8b) | One LLM inference at a time |
| `_memory_semaphore(1)` | CPU Ollama (gemma3:1b) | One memory store at a time |
**Reply-first pipeline:**
1. User message arrives via Telegram → Grammy forwards to deepagents (fire-and-forget)
2. Deepagents queues behind `_reply_semaphore`, runs agent, sends reply via Grammy MCP tool
3. After reply is sent, `asyncio.create_task` fires `store_memory_async` in background
4. Memory task queues behind `_memory_semaphore`, calls `add_memory` on openmemory
5. openmemory uses CPU Ollama: embedding (~0.3s) + extraction (~1.6s) → stored in Qdrant
Reply latency: ~10–18s (GPU qwen3:8b inference + tool calls).
Memory latency: ~5–16s (runs async, never blocks replies).
## External Services (from openai/ stack)
| Service | Host Port | Role |
|---------|-----------|------|
| Ollama GPU | 11436 | Main LLM (qwen3:8b) |
| Ollama CPU | 11435 | Memory embedding + extraction |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search |
## Compose Stack
Config: `agap_git/adolf/docker-compose.yml`
```bash
cd agap_git/adolf
docker compose up -d
```
Requires `TELEGRAM_BOT_TOKEN` in `adolf/.env`.
## Memory
- Stored per `chat_id` (Telegram user ID) as `user_id` in mem0
- Semantic search via Qdrant (cosine similarity, 768-dim nomic-embed-text vectors)
- mem0 uses gemma3:1b to extract structured facts before embedding
- Collection: `adolf_memories` in Qdrant
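`search_memory` ranks stored vectors by cosine similarity against the query embedding. Conceptually (a toy sketch of the scoring function, not mem0's or Qdrant's code):

```python
import math

def cosine_similarity(a, b):
    # Score Qdrant uses to rank a stored memory vector against the
    # query embedding; both are 768-dim nomic-embed-text vectors here.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Scores range from -1 to 1; identical directions score 1.0, which is why paraphrases of a stored fact still retrieve it.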
## Files
```
adolf/
├── docker-compose.yml Services: deepagents, openmemory, grammy
├── Dockerfile deepagents container (Python 3.12)
├── agent.py FastAPI + LangGraph react agent
├── .env TELEGRAM_BOT_TOKEN (not committed)
├── openmemory/
│ ├── server.py FastMCP + mem0 MCP tools
│ ├── requirements.txt
│ └── Dockerfile
└── grammy/
├── bot.mjs grammY bot + MCP SSE server
├── package.json
└── Dockerfile
```

adolf/test_pipeline.py Normal file

@@ -0,0 +1,316 @@
#!/usr/bin/env python3
"""
Adolf pipeline integration test.
Tests:
1. Service health (deepagents, openmemory, grammy MCP SSE)
2. GPU Ollama reachability and model availability
3. CPU Ollama reachability and model availability
4. Qdrant reachability and adolf_memories collection
5. SearXNG reachability and JSON results
6. Full chat pipeline — POST /chat returns 202 immediately
7. Async memory storage — memories appear in Qdrant after reply
8. Memory recall — agent retrieves stored facts on next query
9. Async timing — reply logged before memory stored (from deepagents logs)
Usage:
python3 test_pipeline.py [--chat-id CHAT_ID]
Does NOT send real Telegram messages — calls deepagents /chat directly and
reads Qdrant to verify memory. Grammy delivery is confirmed via the MCP tool
call visible in deepagents logs ('[agent] replied in Xs ... send=Ys').
Known limitation: gemma3:1b (CPU extraction model) may abstract or
deduplicate memories rather than storing raw text verbatim.
"""
import argparse
import http.client
import json
import random
import sys
import time
import urllib.error
import urllib.request
# ── config ────────────────────────────────────────────────────────────────────
DEEPAGENTS = "http://localhost:8000"
OPENMEMORY = "http://localhost:8765"
GRAMMY_HOST = "localhost"
GRAMMY_PORT = 3001
OLLAMA_GPU = "http://localhost:11436"
OLLAMA_CPU = "http://localhost:11435"
QDRANT = "http://localhost:6333"
SEARXNG = "http://localhost:11437"
DEFAULT_CHAT_ID = "346967270"
PASS = "\033[32mPASS\033[0m"
FAIL = "\033[31mFAIL\033[0m"
INFO = "\033[36mINFO\033[0m"
results = []
def report(name, ok, detail=""):
tag = PASS if ok else FAIL
line = f" [{tag}] {name}"
if detail:
        line += f" — {detail}"
print(line)
results.append((name, ok))
def get(url, timeout=5):
req = urllib.request.Request(url)
with urllib.request.urlopen(req, timeout=timeout) as r:
return r.status, r.read().decode()
def post_json(url, payload, timeout=30):
data = json.dumps(payload).encode()
req = urllib.request.Request(
url, data=data,
headers={"Content-Type": "application/json"},
method="POST"
)
with urllib.request.urlopen(req, timeout=timeout) as r:
return r.status, json.loads(r.read().decode())
def check_sse(host, port, path):
"""
SSE endpoints stream indefinitely — urlopen would hang waiting for body.
Use http.client directly to read just the response status line and headers.
"""
try:
conn = http.client.HTTPConnection(host, port, timeout=5)
conn.request("GET", path, headers={"Accept": "text/event-stream"})
r = conn.getresponse()
ok = r.status == 200
conn.close()
return ok, f"HTTP {r.status}"
except Exception as e:
return False, str(e)
# ── 1. service health ─────────────────────────────────────────────────────────
print(f"\n[{INFO}] 1. Service health")
try:
status, body = get(f"{DEEPAGENTS}/health")
data = json.loads(body)
ok = status == 200 and data.get("agent_ready") is True
report("deepagents /health — agent_ready", ok, f"agent_ready={data.get('agent_ready')}")
except Exception as e:
report("deepagents /health", False, str(e))
ok, detail = check_sse("localhost", 8765, "/sse")
report("openmemory /sse reachable (HTTP 200)", ok, detail)
ok, detail = check_sse(GRAMMY_HOST, GRAMMY_PORT, "/sse")
report("grammy /sse reachable (HTTP 200)", ok, detail)
# ── 2. GPU Ollama ─────────────────────────────────────────────────────────────
print(f"\n[{INFO}] 2. GPU Ollama (port 11436)")
try:
status, body = get(f"{OLLAMA_GPU}/api/tags")
models = [m["name"] for m in json.loads(body).get("models", [])]
has_qwen = any("qwen3" in m for m in models)
report("GPU Ollama reachable", True, f"models: {models}")
report("qwen3:8b present on GPU Ollama", has_qwen)
except Exception as e:
report("GPU Ollama reachable", False, str(e))
report("qwen3:8b present on GPU Ollama", False, "skipped")
# ── 3. CPU Ollama ─────────────────────────────────────────────────────────────
print(f"\n[{INFO}] 3. CPU Ollama (port 11435)")
try:
status, body = get(f"{OLLAMA_CPU}/api/tags")
models = [m["name"] for m in json.loads(body).get("models", [])]
has_embed = any("nomic-embed-text" in m for m in models)
has_gemma = any("gemma3:1b" in m for m in models)
report("CPU Ollama reachable", True, f"models: {models}")
report("nomic-embed-text present on CPU Ollama", has_embed)
report("gemma3:1b present on CPU Ollama", has_gemma)
except Exception as e:
report("CPU Ollama reachable", False, str(e))
report("nomic-embed-text present on CPU Ollama", False, "skipped")
report("gemma3:1b present on CPU Ollama", False, "skipped")
# ── 4. Qdrant ─────────────────────────────────────────────────────────────────
print(f"\n[{INFO}] 4. Qdrant (port 6333)")
try:
status, body = get(f"{QDRANT}/collections")
collections = [c["name"] for c in json.loads(body).get("result", {}).get("collections", [])]
has_col = "adolf_memories" in collections
report("Qdrant reachable", True, f"collections: {collections}")
report("adolf_memories collection exists", has_col)
except Exception as e:
report("Qdrant reachable", False, str(e))
report("adolf_memories collection exists", False, "skipped")
try:
status, body = get(f"{QDRANT}/collections/adolf_memories")
info = json.loads(body).get("result", {})
dims = info.get("config", {}).get("params", {}).get("vectors", {}).get("size")
report("adolf_memories vector dims = 768", dims == 768, f"got {dims}")
except Exception as e:
report("adolf_memories collection info", False, str(e))
# ── 5. SearXNG ────────────────────────────────────────────────────────────────
print(f"\n[{INFO}] 5. SearXNG (port 11437)")
try:
t0 = time.monotonic()
status, body = get(f"{SEARXNG}/search?q=test&format=json", timeout=15)
elapsed = time.monotonic() - t0
data = json.loads(body)
n_results = len(data.get("results", []))
report("SearXNG reachable + JSON format enabled", status == 200 and n_results > 0,
f"{n_results} results in {elapsed:.1f}s")
report("SearXNG response < 5s", elapsed < 5, f"{elapsed:.2f}s")
except Exception as e:
report("SearXNG reachable", False, str(e))
report("SearXNG response < 5s", False, "skipped")
# ── 6. POST /chat returns 202 immediately ─────────────────────────────────────
print(f"\n[{INFO}] 6–8. Full pipeline (chat → reply → memory → recall)")
print(f" Using chat_id={DEFAULT_CHAT_ID}")
marker_word = f"testword{random.randint(1000, 9999)}"
marker_msg = f"My test marker for this run is: {marker_word}. Please acknowledge."
# Record point count before test so we can verify new points are added
try:
_, col_body = get(f"{QDRANT}/collections/adolf_memories")
points_before = json.loads(col_body).get("result", {}).get("points_count", 0)
except Exception:
points_before = 0
print(f"\n [send] '{marker_msg}'")
print(f" Qdrant points before: {points_before}")
t_send = time.monotonic()
try:
status, resp = post_json(f"{DEEPAGENTS}/chat",
{"message": marker_msg, "chat_id": DEFAULT_CHAT_ID},
timeout=5)
t_accepted = time.monotonic() - t_send
report("POST /chat returns 202 immediately (< 1s)", status == 202 and t_accepted < 1,
f"status={status}, t={t_accepted:.3f}s")
except Exception as e:
report("POST /chat returns 202 immediately", False, str(e))
print(" Cannot continue pipeline tests.")
sys.exit(1)
# ── 7. Async memory storage ───────────────────────────────────────────────────
# Wait long enough for: GPU reply (~20s) + async CPU memory store (~20s) = ~40s
print(f" Waiting 50s for agent reply + async memory store…")
for i in range(10):
time.sleep(5)
print(f"{(i+1)*5}s", end="\r")
print()
try:
_, col_body = get(f"{QDRANT}/collections/adolf_memories")
points_after = json.loads(col_body).get("result", {}).get("points_count", 0)
new_points = points_after - points_before
report("New memory point(s) added to Qdrant after reply", new_points > 0,
           f"{points_before} → {points_after} (+{new_points})")
except Exception as e:
report("Qdrant points after reply", False, str(e))
# Inspect Qdrant payloads — the `data` field holds what mem0 stored
# Note: gemma3:1b may abstract/rewrite facts; raw marker_word may or may not appear
try:
_, scroll_body = post_json(
f"{QDRANT}/collections/adolf_memories/points/scroll",
{"limit": 50, "with_payload": True, "with_vector": False},
timeout=10
)
points = scroll_body.get("result", {}).get("points", [])
all_data = [str(p.get("payload", {}).get("data", "")) for p in points]
marker_in_data = any(marker_word in d for d in all_data)
report(
f"Marker '{marker_word}' found verbatim in Qdrant payloads",
marker_in_data,
"(gemma3:1b may abstract facts — check logs if FAIL)" if not marker_in_data else "found"
)
if not marker_in_data and all_data:
print(f" Most recent stored data: {all_data[-1][:120]!r}")
except Exception as e:
report("Qdrant payload inspection", False, str(e))
# ── 8. Memory recall ──────────────────────────────────────────────────────────
recall_msg = "What is the test marker word I just told you? (hint: it starts with 'testword')"
print(f"\n [recall] '{recall_msg}'")
try:
status, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": recall_msg, "chat_id": DEFAULT_CHAT_ID},
timeout=5)
report("Recall query accepted (202)", status == 202)
except Exception as e:
report("Recall query accepted", False, str(e))
print(f" Waiting 35s for recall reply (check Telegram for actual answer)…")
for i in range(7):
time.sleep(5)
print(f"{(i+1)*5}s", end="\r")
print()
print(f" NOTE: Check Telegram — the bot should reply with '{marker_word}'.")
print(f" Check deepagents logs for: search_memory tool call and correct result.")
# ── 9. Async timing verification ──────────────────────────────────────────────
print(f"\n[{INFO}] 9. Async pipeline timing")
# Verify two rapid POSTs both return 202 quickly (queuing, not blocking)
t0 = time.monotonic()
try:
s1, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": "async timing check one", "chat_id": DEFAULT_CHAT_ID},
timeout=3)
s2, _ = post_json(f"{DEEPAGENTS}/chat",
{"message": "async timing check two", "chat_id": DEFAULT_CHAT_ID},
timeout=3)
t_both = time.monotonic() - t0
report("Two consecutive POSTs both 202, total < 1s (fire-and-forget queue)",
s1 == 202 and s2 == 202 and t_both < 1, f"{t_both:.3f}s")
except Exception as e:
report("Consecutive POST queueing", False, str(e))
print()
print(f" To confirm reply-before-memory async ordering, run:")
print(f" docker compose -f adolf/docker-compose.yml logs deepagents | grep -E 'replied|stored'")
print(f" Expected pattern per message:")
print(f" [agent] replied in Xs ← GPU reply first")
print(f" [memory] stored in Ys ← CPU memory after (Y > X - reply_time)")
# ── summary ───────────────────────────────────────────────────────────────────
print(f"\n{'─' * 55}")
total = len(results)
passed = sum(1 for _, ok in results if ok)
failed = total - passed
print(f"Results: {passed}/{total} passed", end="")
if failed:
print(f" ({failed} failed)\n")
print("Failed checks:")
for name, ok in results:
if not ok:
print(f" - {name}")
else:
print(" — all good")
print()