Embed Crawl4AI at all tiers, restore qwen3:4b medium, update docs

- Pre-routing URL fetch: any message with URLs gets content fetched async (httpx.AsyncClient) before routing via _fetch_urls_from_message()
- URL context and memories gathered concurrently with asyncio.gather
- Light tier upgraded to medium when URL content is present
- url_context injected into system prompt for medium and complex agents
- Complex agent retains web_search/fetch_url tools + receives pre-fetched content
- Medium model restored to qwen3:4b (was temporarily qwen2.5:1.5b)
- Unit tests added for _extract_urls
- ARCHITECTURE.md: added Tool Handling, Crawl4AI Integration, Memory Pipeline sections
- CLAUDE.md: updated request flow and Crawl4AI integration docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:49:34 +00:00
parent f9618a9bbf
commit 50097d6092
8 changed files with 183 additions and 31 deletions
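
Reviewer sketch (not part of the commit): the pre-routing step described in the commit message could look roughly like the following. The agent.py changes are not shown in this excerpt, so every name here (`URL_RE`, `extract_urls`, `fake_fetch`, `fake_memories`, `gather_context`) is a hypothetical stand-in, and the regex is an assumed pattern rather than the real `_URL_RE`. The Crawl4AI and openmemory calls are replaced with stubs so the example runs on the standard library alone.

```python
import asyncio
import re

# Assumed pattern; the real _URL_RE is not visible in this diff.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
MAX_URLS = 3      # the docs say up to 3 URLs are fetched concurrently
MAX_CHARS = 3000  # the docs say up to 3000 chars are kept per URL


def extract_urls(text: str) -> list[str]:
    """Return unique URLs in order of appearance, capped at MAX_URLS."""
    seen: dict[str, None] = {}
    for match in URL_RE.finditer(text):
        # Trim trailing sentence punctuation that the pattern swallows.
        seen.setdefault(match.group(0).rstrip(".,;:!?)"), None)
    return list(seen)[:MAX_URLS]


async def fake_fetch(url: str) -> str:
    """Stub for the Crawl4AI request (hypothetical; no network access)."""
    await asyncio.sleep(0)
    return f"content of {url}"


async def fake_memories(text: str) -> str:
    """Stub for the openmemory search (hypothetical; always empty)."""
    await asyncio.sleep(0)
    return ""


async def fetch_url_context(text: str, fetch=fake_fetch) -> str:
    """Fetch all URLs in the message concurrently and join the page texts."""
    urls = extract_urls(text)
    if not urls:
        return ""
    pages = await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)
    # Failed fetches come back as exception objects and are skipped.
    parts = [
        f"[{url}]\n{page[:MAX_CHARS]}"
        for url, page in zip(urls, pages)
        if isinstance(page, str) and page
    ]
    return "\n\n".join(parts)


async def gather_context(text: str) -> tuple[str, str]:
    """Run URL fetching and memory retrieval in parallel before routing."""
    url_context, memories = await asyncio.gather(
        fetch_url_context(text), fake_memories(text)
    )
    return url_context, memories
```

Passing `return_exceptions=True` keeps one failed page from cancelling the others, which seems to match the intent of a best-effort pre-fetch; whether the real code does this is an assumption.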
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -52,23 +52,66 @@ Autonomous personal assistant with a multi-channel gateway. Three-tier model rou
 2. POST /message {text, session_id, channel, user_id}
 3. 202 Accepted immediately
 4. Background: run_agent_task(message, session_id, channel)
-5. Route → run agent tier → get reply text
-6. channels.deliver(session_id, channel, reply_text)
+5. Parallel IO (asyncio.gather):
+   a. _fetch_urls_from_message() — Crawl4AI fetches any URLs in message
+   b. _retrieve_memories()       — openmemory semantic search for context
+6. router.route() with enriched history (url_context + memories as system msgs)
+   - if URL content fetched and tier=light → upgrade to medium
+7. Invoke agent for tier with url_context + memories in system prompt
+8. channels.deliver(session_id, channel, reply_text)
   - always puts reply in pending_replies[session_id] queue (for SSE)
   - calls channel-specific send callback
-7. GET /reply/{session_id} SSE clients receive the reply
+9. _store_memory() background task — stores turn in openmemory
+10. GET /reply/{session_id} SSE clients receive the reply
 ```

+## Tool Handling
+
+Adolf uses LangChain's tool interface but only the complex agent actually invokes tools at runtime.
+
+**Complex agent (`/think` prefix):** `web_search` and `fetch_url` are defined as `langchain_core.tools.Tool` objects and passed to `create_deep_agent()`. The deepagents library runs an agentic loop (LangGraph `create_react_agent` under the hood) that sends the tool schema to the model via OpenAI function-calling format and handles tool dispatch.
+
+**Medium agent (default):** `_DirectModel` makes a single `model.ainvoke(messages)` call with no tool schema. Context (memories, fetched URL content) is injected via the system prompt instead. This is intentional — `qwen3:4b` behaves unreliably when a tool array is present.
+
+**Memory tools (out-of-loop):** `add_memory` and `search_memory` are LangChain MCP tool objects (via `langchain_mcp_adapters`) but are excluded from both agents' tool lists. They are called directly — `await _memory_add_tool.ainvoke(...)` — outside the agent loop, before and after each turn.
+
 ## Three-Tier Model Routing

-| Tier | Model | VRAM | Trigger | Latency |
-|------|-------|------|---------|---------|
-| Light | qwen2.5:1.5b (router answers) | ~1.2 GB | Router classifies as light | ~2–4s |
-| Medium | qwen3:4b | ~2.5 GB | Default | ~20–40s |
-| Complex | qwen3:8b | ~6.0 GB | `/think` prefix | ~60–120s |
+| Tier | Model | Agent | Trigger | Latency |
+|------|-------|-------|---------|---------|
+| Light | `qwen2.5:1.5b` (router answers directly) | — | Regex pre-match or LLM classifies "light" | ~2–4s |
+| Medium | `qwen3:4b` (`DEEPAGENTS_MODEL`) | `_DirectModel` — single LLM call, no tools | Default; also forced when message contains URLs | ~10–20s |
+| Complex | `qwen3:8b` (`DEEPAGENTS_COMPLEX_MODEL`) | `create_deep_agent` — agentic loop with tools | `/think` prefix only | ~60–120s |

 **`/think` prefix**: forces complex tier, stripped before sending to agent.

+Complex tier is locked out unless the message starts with `/think` — any LLM classification of "complex" is downgraded to medium.
+
+## Crawl4AI Integration
+
+Crawl4AI runs as a Docker service (`crawl4ai:11235`) providing JS-rendered, bot-bypass page fetching.
+
+**Pre-routing fetch (all tiers):**
+- `_URL_RE` detects `https?://` URLs in any incoming message
+- `_crawl4ai_fetch_async()` uses `httpx.AsyncClient` to POST `{urls: [...]}` to `/crawl`
+- Up to 3 URLs fetched concurrently via `asyncio.gather`
+- Fetched content (up to 3000 chars/URL) injected as a system context block into enriched history before routing and into medium/complex system prompts
+- If fetch succeeds and router returns light → tier upgraded to medium
+
+**Complex agent tools:**
+- `web_search`: SearXNG query + Crawl4AI auto-fetch of top 2 result URLs → combined snippet + page text
+- `fetch_url`: Crawl4AI single-URL fetch for any specific URL
+
+## Memory Pipeline
+
+openmemory runs as a FastMCP server (`openmemory:8765`) backed by mem0 + Qdrant + nomic-embed-text.
+
+**Retrieval (before routing):** `_retrieve_memories()` calls `search_memory` MCP tool with the user message as query. Results (threshold ≥ 0.5) are prepended to enriched history so all tiers benefit.
+
+**Storage (after reply):** `_store_memory()` runs as an asyncio background task, calling `add_memory` with `"User: ...\nAssistant: ..."`. The extraction LLM (`qwen2.5:1.5b` on GPU Ollama) pulls facts; dedup is handled by mem0's update prompt.
+
+Memory tools (`add_memory`, `search_memory`, `get_all_memories`) are excluded from agent tool lists — memory management happens outside the agent loop.
+
 ## VRAM Management

 GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on CPU).
@@ -76,7 +119,7 @@ GTX 1070 — 8 GB. Ollama must be restarted if CUDA init fails (model loads on C
 1. Flush explicitly before loading qwen3:8b (`keep_alive=0`)
 2. Verify eviction via `/api/ps` poll (15s timeout) before proceeding
 3. Fallback: timeout → run medium agent instead
-4. Post-complex: flush 8b, pre-warm 4b + router
+4. Post-complex: flush 8b, pre-warm medium + router

 ## Session ID Convention

@@ -89,18 +132,18 @@ Conversation history is keyed by session_id (5-turn buffer).

 ```
 adolf/
-├── docker-compose.yml      Services: deepagents, openmemory, grammy
+├── docker-compose.yml      Services: bifrost, deepagents, openmemory, grammy, crawl4ai
 ├── Dockerfile              deepagents container (Python 3.12)
-├── agent.py                FastAPI gateway + three-tier routing
+├── agent.py                FastAPI gateway, run_agent_task, Crawl4AI pre-fetch, memory pipeline
 ├── channels.py             Channel registry + deliver() + pending_replies
-├── router.py               Router class — qwen2.5:1.5b routing
+├── router.py               Router class — regex + LLM tier classification
 ├── vram_manager.py         VRAMManager — flush/prewarm/poll Ollama VRAM
-├── agent_factory.py        build_medium_agent / build_complex_agent
+├── agent_factory.py        _DirectModel (medium) / create_deep_agent (complex)
 ├── cli.py                  Interactive CLI REPL client
 ├── wiki_research.py        Batch wiki research pipeline (uses /message + SSE)
 ├── .env                    TELEGRAM_BOT_TOKEN (not committed)
 ├── openmemory/
-│   ├── server.py           FastMCP + mem0 MCP tools
+│   ├── server.py           FastMCP + mem0: add_memory, search_memory, get_all_memories
 │   └── Dockerfile
 └── grammy/
    ├── bot.mjs             grammY Telegram bot + POST /send HTTP endpoint
@@ -108,11 +151,11 @@ adolf/
    └── Dockerfile
 ```

-## External Services (from openai/ stack)
+## External Services (host ports, from openai/ stack)

 | Service | Host Port | Role |
 |---------|-----------|------|
-| Ollama GPU | 11436 | All reply inference |
-| Ollama CPU | 11435 | Memory embedding (nomic-embed-text) |
+| Ollama GPU | 11436 | All LLM inference (via Bifrost) + VRAM management (direct) + memory extraction |
+| Ollama CPU | 11435 | nomic-embed-text embeddings for openmemory |
 | Qdrant | 6333 | Vector store for memories |
-| SearXNG | 11437 | Web search |
+| SearXNG | 11437 | Web search (used by `web_search` tool) |
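
Reviewer sketch (not part of the commit): the tier rules stated in the Three-Tier Model Routing and Crawl4AI Integration sections above can be condensed into one small function. This is a minimal illustration under stated assumptions, not the code in router.py or agent.py; `resolve_tier` and its parameters are hypothetical names, and the VRAM-timeout fallback to medium is left out.

```python
def resolve_tier(message: str, router_tier: str, url_context: str = "") -> tuple[str, str]:
    """Return (tier, text_for_agent) following the documented rules.

    router_tier is whatever the router proposed: "light", "medium" or "complex".
    url_context is the pre-fetched page text, empty if nothing was fetched.
    """
    # Rule 1: the /think prefix forces complex and is stripped before the agent sees it.
    if message.startswith("/think"):
        return "complex", message[len("/think"):].lstrip()

    tier = router_tier
    # Rule 2: without /think, a "complex" classification is downgraded to medium.
    if tier == "complex":
        tier = "medium"
    # Rule 3: light is upgraded to medium when URL content was fetched.
    if tier == "light" and url_context:
        tier = "medium"
    return tier, message
```

For example, `resolve_tier("/think plan a trip", "light")` yields the complex tier with the prefix removed, while a plain greeting classified as light stays light unless fetched URL content is present.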