Update docs: streaming, CLI container, use_cases tests
- /stream/{session_id} SSE endpoint replaces /reply/ for CLI
- Medium tier streams per-token via astream() with in_think filtering
- CLI now runs as Docker container (Dockerfile.cli, profile:tools)
- Correct medium model to qwen3:4b with real-time think block filtering
- Add use_cases/ test category to commands section
- Update files tree and services table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -18,7 +18,8 @@ Autonomous personal assistant with a multi-channel gateway. Three-tier model rou
 │ │                      │               │
 │ │ POST /message        │ ← all inbound │
 │ │ POST /chat (legacy)  │               │
-│ │ GET /reply/{id} SSE  │ ← CLI polling │
+│ │ GET /stream/{id} SSE │ ← token stream│
+│ │ GET /reply/{id} SSE  │ ← legacy poll │
 │ │ GET /health          │               │
 │ │                      │               │
 │ │ channels.py registry │               │
@@ -42,7 +43,7 @@ Autonomous personal assistant with a multi-channel gateway. Three-tier model rou
 | Channel | session_id | Inbound | Outbound |
 |---------|-----------|---------|---------|
 | Telegram | `tg-<chat_id>` | Grammy long-poll → POST /message | channels.py → POST grammy:3001/send |
-| CLI | `cli-<user>` | POST /message directly | GET /reply/{id} SSE stream |
+| CLI | `cli-<user>` | POST /message directly | GET /stream/{id} SSE — Rich Live streaming |
 | Voice | `voice-<device>` | (future) | (future) |

 ## Unified Message Flow
@@ -58,11 +59,13 @@ Autonomous personal assistant with a multi-channel gateway. Three-tier model rou
 6. router.route() with enriched history (url_context + memories as system msgs)
    - if URL content fetched and tier=light → upgrade to medium
 7. Invoke agent for tier with url_context + memories in system prompt
-8. channels.deliver(session_id, channel, reply_text)
-   - always puts reply in pending_replies[session_id] queue (for SSE)
-   - calls channel-specific send callback
-9. _store_memory() background task — stores turn in openmemory
-10. GET /reply/{session_id} SSE clients receive the reply
+8. Token streaming:
+   - medium: astream() pushes per-token chunks to _stream_queues[session_id]; <think> blocks filtered in real time
+   - light/complex: full reply pushed as single chunk after completion
+   - _end_stream() sends [DONE] sentinel
+9. channels.deliver(session_id, channel, reply_text) — Telegram callback
+10. _store_memory() background task — stores turn in openmemory
+11. GET /stream/{session_id} SSE clients receive chunks; CLI renders with Rich Live + final Markdown
 ```

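The real-time think-block filtering in step 8 can be sketched as a small stateful filter. This is an illustrative assumption: only the `in_think`-flag idea comes from the commit, and the function and variable names below are hypothetical, not the actual `agent.py` code.

```python
def make_think_filter():
    """Stateful filter for a token stream: feed() raw chunks, get back only
    text outside <think>...</think> blocks; call flush() at end of stream.
    Tags may arrive split across chunks, so a small buffer is kept.
    (Hypothetical sketch — names and wiring are assumptions.)"""
    state = {"in_think": False, "buf": ""}

    def feed(chunk: str) -> str:
        state["buf"] += chunk
        out = []
        while True:
            if state["in_think"]:
                end = state["buf"].find("</think>")
                if end == -1:
                    # discard think content, keep a possible partial "</think>"
                    state["buf"] = state["buf"][-7:]
                    break
                state["buf"] = state["buf"][end + len("</think>"):]
                state["in_think"] = False
            else:
                start = state["buf"].find("<think>")
                if start == -1:
                    # emit all but a possible partial "<think>" suffix
                    safe = len(state["buf"]) - 6
                    if safe > 0:
                        out.append(state["buf"][:safe])
                        state["buf"] = state["buf"][safe:]
                    break
                out.append(state["buf"][:start])
                state["buf"] = state["buf"][start + len("<think>"):]
                state["in_think"] = True
        return "".join(out)

    def flush() -> str:
        # An unterminated <think> block at end-of-stream is dropped.
        tail = "" if state["in_think"] else state["buf"]
        state["buf"], state["in_think"] = "", False
        return tail

    return feed, flush
```

Because the filter only ever withholds a few characters (a possible partial tag), non-think tokens reach the SSE queue with at most a handful of characters of latency.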
 ## Tool Handling
@@ -132,15 +135,19 @@ Conversation history is keyed by session_id (5-turn buffer).

 ```
 adolf/
-├── docker-compose.yml   Services: bifrost, deepagents, openmemory, grammy, crawl4ai
+├── docker-compose.yml   Services: bifrost, deepagents, openmemory, grammy, crawl4ai, cli (profile:tools)
 ├── Dockerfile           deepagents container (Python 3.12)
-├── agent.py             FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, memory pipeline
+├── Dockerfile.cli       CLI container (python:3.12-slim + rich)
+├── agent.py             FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, memory pipeline, /stream/ SSE
 ├── channels.py          Channel registry + deliver() + pending_replies
 ├── router.py            Router class — regex + LLM tier classification
 ├── vram_manager.py      VRAMManager — flush/prewarm/poll Ollama VRAM
 ├── agent_factory.py     _DirectModel (medium) / create_deep_agent (complex)
-├── cli.py               Interactive CLI REPL client
+├── cli.py               Interactive CLI REPL — Rich Live streaming + Markdown render
 ├── wiki_research.py     Batch wiki research pipeline (uses /message + SSE)
+├── tests/
+│   ├── integration/     Standalone integration test scripts (common.py + test_*.py)
+│   └── use_cases/       Claude Code skill markdown files — Claude acts as user + evaluator
 ├── .env                 TELEGRAM_BOT_TOKEN (not committed)
 ├── openmemory/
 │   ├── server.py        FastMCP + mem0: add_memory, search_memory, get_all_memories
CLAUDE.md
@@ -9,9 +9,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 docker compose up --build
 ```

-**Interactive CLI (requires gateway running):**
+**Interactive CLI (Docker container, requires gateway running):**
 ```bash
-python3 cli.py [--url http://localhost:8000] [--session cli-alvis] [--timeout 400]
+docker compose --profile tools run --rm -it cli
+# or with options:
+docker compose --profile tools run --rm -it cli python3 cli.py --url http://deepagents:8000 --session cli-alvis
 ```

 **Run integration tests** (from `tests/integration/`, require all Docker services running):
@@ -31,6 +33,8 @@ python3 test_routing.py --hard-only # complex-tier + VRAM flush benc

 Shared config and helpers are in `tests/integration/common.py`.

+**Use case tests** (`tests/use_cases/`) — markdown skill files executed by Claude Code, which acts as mock user and quality evaluator. Run by reading the `.md` file and following its steps with tools (Bash, WebFetch, etc.).
+
 ## Architecture

 Adolf is a multi-channel personal assistant. All LLM inference is routed through **Bifrost**, an open-source Go-based LLM gateway that adds retry logic, failover, and observability in front of Ollama.
@@ -49,11 +53,13 @@ Channel adapter → POST /message {text, session_id, channel, user_id}
 if URL content fetched → upgrade light→medium
 → invoke agent for tier via Bifrost (url_context + memories in system prompt)
   deepagents:8000 → bifrost:8080/v1 → ollama:11436
+→ _push_stream_chunk() per token (medium streaming) / full reply (light, complex)
+  → _stream_queues[session_id] asyncio.Queue
+  → _end_stream() sends [DONE] sentinel
 → channels.deliver(session_id, channel, reply)
-  → pending_replies[session_id] queue (SSE)
-  → channel-specific callback (Telegram POST)
+  → channel-specific callback (Telegram POST, CLI no-op)
 → _store_memory() background task (openmemory)
-CLI/wiki polling → GET /reply/{session_id} (SSE, blocks until reply)
+CLI streaming → GET /stream/{session_id} (SSE, per-token for medium, single-chunk for others)
 ```

 ### Bifrost integration
@@ -76,15 +82,15 @@ The router does regex pre-classification first, then LLM classification. Complex

 A global `asyncio.Semaphore(1)` (`_reply_semaphore`) serializes all LLM inference — one request at a time.

-### Thinking mode
+### Thinking mode and streaming

-qwen3 models produce chain-of-thought `<think>...</think>` tokens via Ollama's OpenAI-compatible endpoint. Adolf controls this via system prompt prefixes:
+qwen3 models produce chain-of-thought `<think>...</think>` tokens. Handling differs by tier:

-- **Medium** (`qwen2.5:1.5b`): no thinking mode in this model; fast ~3s calls
-- **Complex** (`qwen3:8b`): no prefix — thinking enabled by default, used for deep research
-- **Router** (`qwen2.5:1.5b`): no thinking support in this model
+- **Medium** (`qwen3:4b`): streams via `astream()`. A state machine (`in_think` flag) filters `<think>` blocks in real time — only non-think tokens are pushed to `_stream_queues` and displayed to the user.
+- **Complex** (`qwen3:8b`): `create_deep_agent` returns a complete reply; `_strip_think()` filters think blocks before the reply is pushed as a single chunk.
+- **Router/light** (`qwen2.5:1.5b`): no thinking support; `_strip_think()` used defensively.

-`_strip_think()` in `agent.py` and `router.py` strips any `<think>` blocks from model output before returning to users.
+`_strip_think()` in `agent.py` and `router.py` strips any `<think>` blocks from non-streaming output.

 ### VRAM management (`vram_manager.py`)

@@ -93,7 +99,7 @@ Hardware: GTX 1070 (8 GB). Before running the 8b model, medium models are flushe
 ### Channel adapters (`channels.py`)

 - **Telegram**: Grammy Node.js bot (`grammy/bot.mjs`) long-polls Telegram → `POST /message`; replies delivered via `POST grammy:3001/send`
-- **CLI**: `cli.py` posts to `/message`, then blocks on `GET /reply/{session_id}` SSE
+- **CLI**: `cli.py` (Docker container, `profiles: [tools]`) posts to `/message`, then streams from `GET /stream/{session_id}` SSE with Rich `Live` display and final Markdown render.

 Session IDs: `tg-<chat_id>` for Telegram, `cli-<username>` for CLI. Conversation history: 5-turn buffer per session.
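On the CLI side, consuming the stream boils down to parsing SSE `data:` lines until `[DONE]`. A dependency-free sketch of that parsing step — the Rich `Live`/`Markdown` wiring is described in the commit, but this parser and its names are assumptions about `cli.py`, not its actual code:

```python
def consume_sse(lines):
    """Accumulate token chunks from SSE lines ('data: <chunk>') until the
    [DONE] sentinel, yielding the running text after each chunk. In cli.py
    each yielded frame would be re-rendered via rich.live.Live, with a final
    rich.markdown.Markdown render once the stream ends (assumed wiring)."""
    text = ""
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore SSE comments / keep-alive lines
        chunk = line[len("data: "):]
        if chunk == "[DONE]":
            break
        text += chunk
        yield text
```

Yielding the running text (rather than raw chunks) matches how a `Live` display works: each frame replaces the previous one, so a medium-tier reply appears to grow token by token.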
@@ -106,6 +112,7 @@ Session IDs: `tg-<chat_id>` for Telegram, `cli-<username>` for CLI. Conversation
 | `openmemory` | 8765 | FastMCP server + mem0 memory tools (Qdrant-backed) |
 | `grammy` | 3001 | grammY Telegram bot + `/send` HTTP endpoint |
 | `crawl4ai` | 11235 | JS-rendered page fetching |
+| `cli` | — | Interactive CLI container (`profiles: [tools]`), Rich streaming display |

 External (from `openai/` stack, host ports):
 - Ollama GPU: `11436` — all reply inference (via Bifrost) + VRAM management (direct)