From 8cd41940f0a0d66568b0d554f2443f8334452bc0 Mon Sep 17 00:00:00 2001
From: Alvis
Date: Thu, 12 Mar 2026 17:31:36 +0000
Subject: [PATCH] Update docs: streaming, CLI container, use_cases tests

- /stream/{session_id} SSE endpoint replaces /reply/ for CLI
- Medium tier streams per-token via astream() with in_think filtering
- CLI now runs as Docker container (Dockerfile.cli, profile:tools)
- Correct medium model to qwen3:4b with real-time think block filtering
- Add use_cases/ test category to commands section
- Update files tree and services table

Co-Authored-By: Claude Sonnet 4.6
---
 ARCHITECTURE.md | 27 +++++++++++++++++----------
 CLAUDE.md       | 31 +++++++++++++++++++------------
 2 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index c0504a0..3713b14 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -18,7 +18,8 @@ Autonomous personal assistant with a multi-channel gateway. Three-tier model rou
 │ │                      │ │
 │ │ POST /message        │ ← all inbound │
 │ │ POST /chat (legacy)  │ │
-│ │ GET /reply/{id} SSE  │ ← CLI polling │
+│ │ GET /stream/{id} SSE │ ← token stream│
+│ │ GET /reply/{id} SSE  │ ← legacy poll │
 │ │ GET /health          │ │
 │ │                      │ │
 │ │ channels.py registry │ │
@@ -42,7 +43,7 @@ Autonomous personal assistant with a multi-channel gateway. Three-tier model rou
 | Channel | session_id | Inbound | Outbound |
 |---------|-----------|---------|---------|
 | Telegram | `tg-` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
-| CLI | `cli-` | POST /message directly | GET /reply/{id} SSE stream |
+| CLI | `cli-` | POST /message directly | GET /stream/{id} SSE — Rich Live streaming |
 | Voice | `voice-` | (future) | (future) |
 
 ## Unified Message Flow
@@ -58,11 +59,13 @@ Autonomous personal assistant with a multi-channel gateway. Three-tier model rou
 6. router.route() with enriched history (url_context + memories as system msgs)
    - if URL content fetched and tier=light → upgrade to medium
 7. Invoke agent for tier with url_context + memories in system prompt
-8. channels.deliver(session_id, channel, reply_text)
-   - always puts reply in pending_replies[session_id] queue (for SSE)
-   - calls channel-specific send callback
-9. _store_memory() background task — stores turn in openmemory
-10. GET /reply/{session_id} SSE clients receive the reply
+8. Token streaming:
+   - medium: astream() pushes per-token chunks to _stream_queues[session_id]; <think> blocks filtered in real time
+   - light/complex: full reply pushed as single chunk after completion
+   - _end_stream() sends [DONE] sentinel
+9. channels.deliver(session_id, channel, reply_text) — Telegram callback
+10. _store_memory() background task — stores turn in openmemory
+11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown
 ```
 
 ## Tool Handling
@@ -132,15 +135,19 @@ Conversation history is keyed by session_id (5-turn buffer).
 
 ```
 adolf/
-├── docker-compose.yml   Services: bifrost, deepagents, openmemory, grammy, crawl4ai
+├── docker-compose.yml   Services: bifrost, deepagents, openmemory, grammy, crawl4ai, cli (profile:tools)
 ├── Dockerfile           deepagents container (Python 3.12)
-├── agent.py             FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, memory pipeline
+├── Dockerfile.cli       CLI container (python:3.12-slim + rich)
+├── agent.py             FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, memory pipeline, /stream/ SSE
 ├── channels.py          Channel registry + deliver() + pending_replies
 ├── router.py            Router class — regex + LLM tier classification
 ├── vram_manager.py      VRAMManager — flush/prewarm/poll Ollama VRAM
 ├── agent_factory.py     _DirectModel (medium) / create_deep_agent (complex)
-├── cli.py               Interactive CLI REPL client
+├── cli.py               Interactive CLI REPL — Rich Live streaming + Markdown render
 ├── wiki_research.py     Batch wiki research pipeline (uses /message + SSE)
+├── tests/
+│   ├── integration/     Standalone integration test scripts (common.py + test_*.py)
+│   └── use_cases/       Claude Code skill markdown files — Claude acts as user + evaluator
 ├── .env                 TELEGRAM_BOT_TOKEN (not committed)
 ├── openmemory/
 │   ├── server.py        FastMCP + mem0: add_memory, search_memory, get_all_memories
diff --git a/CLAUDE.md b/CLAUDE.md
index 0e077c4..f7271d9 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -9,9 +9,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 docker compose up --build
 ```
 
-**Interactive CLI (requires gateway running):**
+**Interactive CLI (Docker container, requires gateway running):**
 ```bash
-python3 cli.py [--url http://localhost:8000] [--session cli-alvis] [--timeout 400]
+docker compose --profile tools run --rm -it cli
+# or with options:
+docker compose --profile tools run --rm -it cli python3 cli.py --url http://deepagents:8000 --session cli-alvis
 ```
 
 **Run integration tests** (from `tests/integration/`, require all Docker services running):
@@ -31,6 +33,8 @@ python3 test_routing.py --hard-only           # complex-tier + VRAM flush benc
 
 Shared config and helpers are in `tests/integration/common.py`.
 
+**Use case tests** (`tests/use_cases/`) — markdown skill files executed by Claude Code, which acts as mock user and quality evaluator. Run by reading the `.md` file and following its steps with tools (Bash, WebFetch, etc.).
+
 ## Architecture
 
 Adolf is a multi-channel personal assistant. All LLM inference is routed through **Bifrost**, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.
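Editor's note (illustrative, not part of the diff): the `in_think` filtering this patch names can be sketched as below. The function name and the assumption that `<think>`/`</think>` arrive as whole tokens are hypothetical; the patch only names the flag, and a real tokenizer may split tags across chunks, so production code would buffer partial tags.

```python
from typing import Iterable, Iterator


def filter_think(tokens: Iterable[str]) -> Iterator[str]:
    """Yield only tokens outside <think>...</think> blocks.

    Sketch of in_think state-machine filtering; assumes the tags arrive
    as standalone tokens (a simplification, not the real agent.py code).
    """
    in_think = False
    for tok in tokens:
        if tok == "<think>":
            in_think = True        # entering a think block: suppress output
        elif tok == "</think>":
            in_think = False       # leaving the block: resume output
        elif not in_think:
            yield tok              # pass through visible tokens only
```

For example, `list(filter_think(["Hi", "<think>", "hmm", "</think>", "!"]))` returns `["Hi", "!"]`.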
@@ -49,11 +53,13 @@ Channel adapter → POST /message {text, session_id, channel, user_id}
     if URL content fetched → upgrade light→medium
 → invoke agent for tier via Bifrost (url_context + memories in system prompt)
     deepagents:8000 → bifrost:8080/v1 → ollama:11436
+    → _push_stream_chunk() per token (medium streaming) / full reply (light, complex)
+    → _stream_queues[session_id] asyncio.Queue
+    → _end_stream() sends [DONE] sentinel
 → channels.deliver(session_id, channel, reply)
-    → pending_replies[session_id] queue (SSE)
-    → channel-specific callback (Telegram POST, CLI no-op)
+    → channel-specific callback (Telegram POST)
 → _store_memory() background task (openmemory)
-CLI/wiki polling → GET /reply/{session_id} (SSE, blocks until reply)
+CLI streaming → GET /stream/{session_id} (SSE, per-token for medium, single-chunk for others)
 ```
 
 ### Bifrost integration
@@ -76,15 +82,15 @@ The router does regex pre-classification first, then LLM classification. Complex
 
 A global `asyncio.Semaphore(1)` (`_reply_semaphore`) serializes all LLM inference — one request at a time.
 
-### Thinking mode
+### Thinking mode and streaming
 
-qwen3 models produce chain-of-thought `<think>...</think>` tokens via Ollama's OpenAI-compatible endpoint. Adolf controls this via system prompt prefixes:
+qwen3 models produce chain-of-thought `<think>...</think>` tokens. Handling differs by tier:
 
-- **Medium** (`qwen2.5:1.5b`): no thinking mode in this model; fast ~3s calls
-- **Complex** (`qwen3:8b`): no prefix — thinking enabled by default, used for deep research
-- **Router** (`qwen2.5:1.5b`): no thinking support in this model
+- **Medium** (`qwen3:4b`): streams via `astream()`. A state machine (`in_think` flag) filters `<think>` blocks in real time — only non-think tokens are pushed to `_stream_queues` and displayed to the user.
+- **Complex** (`qwen3:8b`): `create_deep_agent` returns a complete reply; `_strip_think()` filters think blocks before the reply is pushed as a single chunk.
+- **Router/light** (`qwen2.5:1.5b`): no thinking support; `_strip_think()` used defensively.
 
-`_strip_think()` in `agent.py` and `router.py` strips any `<think>` blocks from model output before returning to users.
+`_strip_think()` in `agent.py` and `router.py` strips any `<think>` blocks from non-streaming output.
 
 ### VRAM management (`vram_manager.py`)
 
@@ -93,7 +99,7 @@ Hardware: GTX 1070 (8 GB). Before running the 8b model, medium models are flushe
 ### Channel adapters (`channels.py`)
 
 - **Telegram**: Grammy Node.js bot (`grammy/bot.mjs`) long-polls Telegram → `POST /message`; replies delivered via `POST grammy:3001/send`
-- **CLI**: `cli.py` posts to `/message`, then blocks on `GET /reply/{session_id}` SSE
+- **CLI**: `cli.py` (Docker container, `profiles: [tools]`) posts to `/message`, then streams from `GET /stream/{session_id}` SSE with Rich `Live` display and final Markdown render.
 
 Session IDs: `tg-` for Telegram, `cli-` for CLI. Conversation history: 5-turn buffer per session.
@@ -106,6 +112,7 @@ Session IDs: `tg-` for Telegram, `cli-` for CLI. Conversation
 | `openmemory` | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
 | `grammy` | 3001 | grammY Telegram bot + `/send` HTTP endpoint |
 | `crawl4ai` | 11235 | JS-rendered page fetching |
+| `cli` | — | Interactive CLI container (`profiles: [tools]`), Rich streaming display |
 
 External (from `openai/` stack, host ports):
 - Ollama GPU: `11436` — all reply inference (via Bifrost) + VRAM management (direct)