# Adolf
Persistent AI assistant reachable via Telegram. GPU-accelerated inference with long-term memory and web search.
## Architecture
```
Telegram user
  ↕ (long-polling)
[grammy] Node.js — port 3001
  - grammY bot polls Telegram
  - on message: fire-and-forget POST /chat to deepagents
  - exposes MCP SSE server: tool send_telegram_message(chat_id, text)
  ↕ fire-and-forget HTTP          ↕ MCP SSE tool call
[deepagents] Python FastAPI — port 8000
  - POST /chat → 202 Accepted immediately
  - background task: run LangGraph react agent
  - LLM: qwen3:8b via Ollama GPU (host port 11436)
  - tools: search_memory, get_all_memories, web_search
  - after reply: async fire-and-forget → store memory on CPU
  ↕ MCP SSE                       ↕ HTTP (SearXNG)
[openmemory] Python + mem0 — port 8765        [SearXNG — port 11437]
  - MCP tools: add_memory, search_memory, get_all_memories
  - mem0 backend: Qdrant (port 6333) + CPU Ollama (port 11435)
  - embedder: nomic-embed-text (768 dims)
  - extractor: gemma3:1b
  - collection: adolf_memories
```
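The fire-and-forget handoff in the diagram boils down to a FastAPI route that returns 202 before any inference starts. A minimal sketch, with illustrative names rather than code copied from `agent.py`:
```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    chat_id: int
    text: str

@app.post("/chat", status_code=202)
async def chat(req: ChatRequest):
    # Return 202 immediately; the agent runs as a background task, so
    # grammy's fire-and-forget POST never waits on GPU inference.
    asyncio.create_task(run_agent(req.chat_id, req.text))
    return {"status": "accepted"}

async def run_agent(chat_id: int, text: str):
    ...  # LangGraph react agent call (serialized; see Queuing below)
```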
## Queuing and Concurrency
Two semaphores prevent resource contention:
| Semaphore | Guards | Notes |
|-----------|--------|-------|
| `_reply_semaphore(1)` | GPU Ollama (qwen3:8b) | One LLM inference at a time |
| `_memory_semaphore(1)` | CPU Ollama (gemma3:1b) | One memory store at a time |
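Both guards are single-permit `asyncio.Semaphore` instances. A minimal sketch; the semaphore names come from this table, the wrapper function is hypothetical:
```python
import asyncio

# Single permit each: GPU inference and CPU memory extraction are
# serialized independently, so a slow memory store never delays a reply.
_reply_semaphore = asyncio.Semaphore(1)
_memory_semaphore = asyncio.Semaphore(1)

async def generate_reply(prompt: str):
    async with _reply_semaphore:   # one qwen3:8b inference at a time
        ...                        # LangGraph react agent call goes here
```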
**Reply-first pipeline:**
1. User message arrives via Telegram → grammy forwards it to deepagents (fire-and-forget)
2. deepagents queues behind `_reply_semaphore`, runs the agent, sends the reply via the grammy MCP tool
3. After the reply is sent, `asyncio.create_task` fires `store_memory_async` in the background
4. The memory task queues behind `_memory_semaphore` and calls `add_memory` on openmemory (sketched below)
5. openmemory uses CPU Ollama: embedding (~0.3s) + extraction (~1.6s) → stored in Qdrant
Reply latency: ~10–18s (GPU qwen3:8b inference + tool calls).
Memory latency: ~5–16s (runs async, never blocks replies).
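Steps 3 and 4 in code form: a hedged sketch where `store_memory_async` and `_memory_semaphore` are the names used above, and `call_add_memory` is a hypothetical stand-in for the openmemory MCP call.
```python
import asyncio

_memory_semaphore = asyncio.Semaphore(1)

async def call_add_memory(user_id: str, text: str):
    ...  # hypothetical wrapper around openmemory's add_memory MCP tool

async def store_memory_async(chat_id: int, user_msg: str, reply: str):
    # Created with asyncio.create_task only after the reply has been
    # sent, so memory extraction can never block the user-facing path.
    async with _memory_semaphore:   # one gemma3:1b extraction at a time
        await call_add_memory(
            user_id=str(chat_id),
            text=f"user: {user_msg}\nassistant: {reply}",
        )
```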
## External Services (from openai/ stack)
| Service | Host Port | Role |
|---------|-----------|------|
| Ollama GPU | 11436 | Main LLM (qwen3:8b) |
| Ollama CPU | 11435 | Memory embedding + extraction |
| Qdrant | 6333 | Vector store for memories |
| SearXNG | 11437 | Web search |
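A quick reachability check against these host ports; an illustrative sketch assuming the stack runs on localhost, using each project's stock endpoints (Ollama's `/api/tags`, Qdrant's `/collections`, SearXNG's root page):
```python
import httpx

CHECKS = [
    ("Ollama GPU", "http://localhost:11436/api/tags"),
    ("Ollama CPU", "http://localhost:11435/api/tags"),
    ("Qdrant",     "http://localhost:6333/collections"),
    ("SearXNG",    "http://localhost:11437/"),
]

for name, url in CHECKS:
    try:
        status = httpx.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except httpx.HTTPError as exc:
        print(f"{name}: unreachable ({exc})")
```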
## Compose Stack
Config: `agap_git/adolf/docker-compose.yml`
```bash
cd agap_git/adolf
docker compose up -d
```
Requires `TELEGRAM_BOT_TOKEN` in `adolf/.env`.
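The file holds that single variable; placeholder shown (the real token comes from Telegram's BotFather):
```
TELEGRAM_BOT_TOKEN=123456789:AAExampleTokenNotReal
```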
## Memory
- Stored per `chat_id` (Telegram user ID) as `user_id` in mem0
- Semantic search via Qdrant (cosine similarity, 768-dim nomic-embed-text vectors)
- mem0 uses gemma3:1b to extract structured facts before embedding
- Collection: `adolf_memories` in Qdrant
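In mem0 terms the wiring above looks roughly like this; a hedged sketch against mem0's config schema, since the authoritative settings live in `openmemory/server.py` and may differ:
```python
from mem0 import Memory

config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333,
                   "collection_name": "adolf_memories",
                   "embedding_model_dims": 768},
    },
    "embedder": {  # nomic-embed-text on the CPU Ollama instance
        "provider": "ollama",
        "config": {"model": "nomic-embed-text",
                   "ollama_base_url": "http://localhost:11435"},
    },
    "llm": {  # gemma3:1b extracts structured facts before embedding
        "provider": "ollama",
        "config": {"model": "gemma3:1b",
                   "ollama_base_url": "http://localhost:11435"},
    },
}

m = Memory.from_config(config)
m.add("I prefer metric units", user_id="123456789")      # chat_id as user_id
print(m.search("unit preferences", user_id="123456789"))
```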
## Files
```
adolf/
├── docker-compose.yml    Services: deepagents, openmemory, grammy
├── Dockerfile            deepagents container (Python 3.12)
├── agent.py              FastAPI + LangGraph react agent
├── .env                  TELEGRAM_BOT_TOKEN (not committed)
├── openmemory/
│   ├── server.py         FastMCP + mem0 MCP tools
│   ├── requirements.txt
│   └── Dockerfile
└── grammy/
    ├── bot.mjs           grammY bot + MCP SSE server
    ├── package.json
    └── Dockerfile
```