Update docs: streaming, CLI container, use_cases tests

- /stream/{session_id} SSE endpoint replaces /reply/ for CLI
- Medium tier streams per-token via astream() with in_think filtering
- CLI now runs as Docker container (Dockerfile.cli, profile:tools)
- Correct medium model to qwen3:4b with real-time think block filtering
- Add use_cases/ test category to commands section
- Update files tree and services table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Alvis
2026-03-12 17:31:36 +00:00
parent b04e8a0925
commit 8cd41940f0
2 changed files with 36 additions and 22 deletions


@@ -9,9 +9,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
docker compose up --build
```
-**Interactive CLI (requires gateway running):**
+**Interactive CLI (Docker container, requires gateway running):**
```bash
-python3 cli.py [--url http://localhost:8000] [--session cli-alvis] [--timeout 400]
+docker compose --profile tools run --rm -it cli
+# or with options:
+docker compose --profile tools run --rm -it cli python3 cli.py --url http://deepagents:8000 --session cli-alvis
```
**Run integration tests** (from `tests/integration/`, require all Docker services running):
@@ -31,6 +33,8 @@ python3 test_routing.py --hard-only # complex-tier + VRAM flush benc
Shared config and helpers are in `tests/integration/common.py`.
+**Use case tests** (`tests/use_cases/`) — markdown skill files executed by Claude Code, which acts as mock user and quality evaluator. Run by reading the `.md` file and following its steps with tools (Bash, WebFetch, etc.).
## Architecture
Adolf is a multi-channel personal assistant. All LLM inference is routed through **Bifrost**, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.
@@ -49,11 +53,13 @@ Channel adapter → POST /message {text, session_id, channel, user_id}
if URL content fetched → upgrade light→medium
→ invoke agent for tier via Bifrost (url_context + memories in system prompt)
deepagents:8000 → bifrost:8080/v1 → ollama:11436
+→ _push_stream_chunk() per token (medium streaming) / full reply (light, complex)
+→ _stream_queues[session_id] asyncio.Queue
+→ _end_stream() sends [DONE] sentinel
→ channels.deliver(session_id, channel, reply)
-pending_replies[session_id] queue (SSE)
-→ channel-specific callback (Telegram POST, CLI no-op)
+→ channel-specific callback (Telegram POST)
→ _store_memory() background task (openmemory)
-CLI/wiki polling → GET /reply/{session_id} (SSE, blocks until reply)
+CLI streaming → GET /stream/{session_id} (SSE, per-token for medium, single-chunk for others)
```
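The streaming leg of this flow can be pictured with a short sketch. This is illustrative, not the actual deepagents code: FastAPI is assumed, and `_stream_queues`, `_push_stream_chunk()`, `_end_stream()`, and the `[DONE]` sentinel are taken from the flow above.

```python
# Minimal sketch: an SSE endpoint draining a per-session asyncio.Queue
# until the [DONE] sentinel arrives. Names follow the flow diagram above.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
_stream_queues: dict[str, asyncio.Queue] = {}

def _push_stream_chunk(session_id: str, chunk: str) -> None:
    # Called per token (medium) or once with the full reply (light, complex).
    _stream_queues.setdefault(session_id, asyncio.Queue()).put_nowait(chunk)

def _end_stream(session_id: str) -> None:
    _push_stream_chunk(session_id, "[DONE]")

@app.get("/stream/{session_id}")
async def stream(session_id: str) -> StreamingResponse:
    queue = _stream_queues.setdefault(session_id, asyncio.Queue())

    async def events():
        while True:
            chunk = await queue.get()
            yield f"data: {chunk}\n\n"  # SSE wire framing
            if chunk == "[DONE]":       # sentinel from _end_stream()
                break

    return StreamingResponse(events(), media_type="text/event-stream")
```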
### Bifrost integration
@@ -76,15 +82,15 @@ The router does regex pre-classification first, then LLM classification. Complex
A global `asyncio.Semaphore(1)` (`_reply_semaphore`) serializes all LLM inference — one request at a time.
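As a sketch (the agent call is stubbed; only the semaphore pattern is the point):

```python
import asyncio

_reply_semaphore = asyncio.Semaphore(1)  # one in-flight LLM call, globally

async def _invoke_agent(prompt: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for the real tiered call via Bifrost
    return f"reply to: {prompt}"

async def generate_reply(prompt: str) -> str:
    # Every channel and tier funnels through this; concurrent requests queue here.
    async with _reply_semaphore:
        return await _invoke_agent(prompt)
```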
-### Thinking mode
+### Thinking mode and streaming
-qwen3 models produce chain-of-thought `<think>...</think>` tokens via Ollama's OpenAI-compatible endpoint. Adolf controls this via system prompt prefixes:
+qwen3 models produce chain-of-thought `<think>...</think>` tokens. Handling differs by tier:
-- **Medium** (`qwen2.5:1.5b`): no thinking mode in this model; fast ~3s calls
-- **Complex** (`qwen3:8b`): no prefix — thinking enabled by default, used for deep research
-- **Router** (`qwen2.5:1.5b`): no thinking support in this model
+- **Medium** (`qwen3:4b`): streams via `astream()`. A state machine (`in_think` flag) filters `<think>` blocks in real time — only non-think tokens are pushed to `_stream_queues` and displayed to the user (see the sketch below).
+- **Complex** (`qwen3:8b`): `create_deep_agent` returns a complete reply; `_strip_think()` filters think blocks before the reply is pushed as a single chunk.
+- **Router/light** (`qwen2.5:1.5b`): no thinking support; `_strip_think()` used defensively.
-`_strip_think()` in `agent.py` and `router.py` strips any `<think>` blocks from model output before returning to users.
+`_strip_think()` in `agent.py` and `router.py` strips any `<think>` blocks from non-streaming output.
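A minimal sketch of both filters, assuming each `<think>`/`</think>` tag arrives as its own token (a tag split across tokens would need buffering); this is illustrative, not the `agent.py` code:

```python
import re
from typing import Iterable, Iterator

def filter_think(tokens: Iterable[str]) -> Iterator[str]:
    # in_think state machine: drop everything between <think> and </think>.
    in_think = False
    for tok in tokens:
        if tok == "<think>":
            in_think = True
        elif tok == "</think>":
            in_think = False
        elif not in_think:
            yield tok

def strip_think(text: str) -> str:
    # Non-streaming equivalent: remove whole think blocks after the fact.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

assert "".join(filter_think(["<think>", "plan...", "</think>", "Hi", "!"])) == "Hi!"
```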
### VRAM management (`vram_manager.py`)
@@ -93,7 +99,7 @@ Hardware: GTX 1070 (8 GB). Before running the 8b model, medium models are flushe
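A plausible implementation of the flush, assuming Ollama's documented `keep_alive: 0` behavior (an empty generate request with `keep_alive: 0` unloads the model); whether `vram_manager.py` does exactly this is not shown here:

```python
import httpx

async def flush_model(model: str, base_url: str = "http://localhost:11436") -> None:
    # keep_alive=0 tells Ollama to unload the model right after this request.
    async with httpx.AsyncClient() as client:
        await client.post(f"{base_url}/api/generate",
                          json={"model": model, "keep_alive": 0})
```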
### Channel adapters (`channels.py`)
- **Telegram**: Grammy Node.js bot (`grammy/bot.mjs`) long-polls Telegram → `POST /message`; replies delivered via `POST grammy:3001/send`
-- **CLI**: `cli.py` posts to `/message`, then blocks on `GET /reply/{session_id}` SSE
+- **CLI**: `cli.py` (Docker container, `profiles: [tools]`) posts to `/message`, then streams from `GET /stream/{session_id}` SSE with Rich `Live` display and final Markdown render (sketched below).
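A rough sketch of that client loop, assuming `httpx` and Rich; exact flags and rendering details in `cli.py` may differ:

```python
import httpx
from rich.console import Console
from rich.live import Live
from rich.markdown import Markdown

def stream_reply(base_url: str, session_id: str) -> None:
    console, buf = Console(), ""
    with httpx.stream("GET", f"{base_url}/stream/{session_id}", timeout=None) as resp:
        with Live("", console=console, refresh_per_second=8) as live:
            for line in resp.iter_lines():
                if not line.startswith("data: "):
                    continue              # skip SSE keep-alives / blank lines
                chunk = line[len("data: "):]
                if chunk == "[DONE]":     # end-of-stream sentinel
                    break
                buf += chunk
                live.update(buf)          # incremental plain-text view
    console.print(Markdown(buf))          # final Markdown render
```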
Session IDs: `tg-<chat_id>` for Telegram, `cli-<username>` for CLI. Conversation history: 5-turn buffer per session.
@@ -106,6 +112,7 @@ Session IDs: `tg-<chat_id>` for Telegram, `cli-<username>` for CLI. Conversation
| `openmemory` | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
| `grammy` | 3001 | grammY Telegram bot + `/send` HTTP endpoint |
| `crawl4ai` | 11235 | JS-rendered page fetching |
+| `cli` | — | Interactive CLI container (`profiles: [tools]`), Rich streaming display |
External (from `openai/` stack, host ports):
- Ollama GPU: `11436` — all reply inference (via Bifrost) + VRAM management (direct)