wiki search people tested pipeline

This commit is contained in:
Alvis
2026-03-05 11:22:34 +00:00
parent ea77b2308b
commit ec45d255f0
19 changed files with 1717 additions and 257 deletions

langgraph.md
# LangGraph: Multi-Model Routing Architecture
## Problem
`create_react_agent` uses one model for all steps in the ReAct loop:
```
qwen3:4b → decide to call tool ~37s
→ run tool ~1s
qwen3:4b → final answer ~37s
─────────────────────────────────────
Total ~75s
```
The routing step is classification + argument extraction — low entropy, constrained output.
It does not need the same model as answer generation.
---
## Is the Pattern Established? (2025 Research)
Yes. The 2025 consensus from multiple papers treats heterogeneous model architectures
(small for routing, large for generation) as **settled production engineering**, not research:
- SLMs (1-2B) match or exceed LLMs on schema-constrained tasks (tool calls, JSON, function
calling) at 10-100× lower compute cost (arXiv 2510.03847, arXiv 2506.02153)
- MasRouter (ACL 2025): routing in multi-agent graphs reduces costs 2× without quality loss
- Cascade routing (ICLR 2025): 4% accuracy improvement, 30-92% cost reduction vs naive routing
- NVIDIA research (2025): "Small Language Models are the Future of Agentic AI"
**Limitations acknowledged in literature:**
- Bad router defeats the purpose — classifier quality is critical
- Cascade (try small, escalate if uncertain) adds latency on queries that escalate
- Pre-trained routers (RouteLLM, etc.) are calibrated for specific model pairs; local model
pairs need independent validation
---
## Three-Tier Architecture (small → medium → large)
### Concept
```
Incoming query
[Router: tiny model or embedding classifier] ~1-2s
├── simple/conversational → [Medium: qwen3:4b] ~20s
├── needs tool call → [Medium: qwen3:4b + tools] ~20-40s
└── complex/multi-step → [Large: qwen3:8b + sub-agents] ~60s+
```
### When to route to large
Signals that justify loading a larger model:
- Multi-step reasoning required (math, code, planning)
- Sub-agent orchestration (the agent needs to call other agents)
- Explicit reasoning request ("think through", "analyze carefully")
- Low confidence from medium model (cascade pattern)
### Trade-offs of three-tier vs two-tier
| | Two-tier | Three-tier |
|--|---------|-----------|
| Simple queries | small router + medium answer | small router + medium answer (same) |
| Complex queries | medium (may struggle) | swap to large (better quality) |
| GPU constraint | manageable | hard — see below |
| Routing error cost | low | high (wrong tier = much slower) |
---
## The 8GB GPU Constraint — Core Problem
This is the central issue. Research numbers on model swapping (2025):
**Cold swap from disk (no optimization)**
- TTFT exceeds 140s for 7B-class models on HDD; 5-15s on NVMe SSD
- Not viable for interactive use at any tier
**vLLM Sleep Mode (offload to CPU RAM, not disk)**
- 18-200× faster than cold start; TTFT 2-3s per switch
- vLLM-specific — not available in Ollama
**Ollama behavior on 8GB VRAM**
- Default `keep_alive`: 5 minutes — model stays warm after use
- Two models simultaneously: qwen3:4b (~2.5GB) + qwen2.5:1.5b (~1.2GB) = ~3.7GB — fits
- qwen3:4b + qwen3:8b = ~8GB — does not reliably fit; eviction required
- Sequential swap: Ollama evicts the old model and loads the new one from SSD (~5-15s on NVMe)
- Known Ollama bug: model spills from VRAM to RAM → all subsequent loads stay on CPU until restart
**Conclusion for three-tier on single 8GB GPU:**
| Tier switch | Cost | Viable? |
|------------|------|---------|
| tiny router → medium (qwen3:4b) | model swap ~5-15s if router is separate | borderline |
| medium → large (qwen3:8b) | evict qwen3:4b, load qwen3:8b = ~5-15s additional | no, for interactive |
| Keep medium always warm, route to large on demand | 5-15s swap overhead per complex query | acceptable if complex queries are rare |
**Honest verdict: three-tier with model swapping is not viable for interactive per-turn latency
on 8GB VRAM with Ollama.** vLLM with Sleep Mode would make it viable (2-3s switch) but
requires replacing Ollama.
---
## Practical Architecture for 8GB GPU (Ollama)
### Option 1: Two-tier, both models always in VRAM (recommended)
Keep two small models loaded simultaneously:
```
qwen2.5:0.5b (~0.4GB) — router: tool call decision + arg extraction
qwen3:4b (~2.5GB) — answer: all generation
nomic-embed-text (CPU) — embedding: search and store
qwen2.5:1.5b (~1.2GB) — extraction: mem0 fact extraction (GPU)
─────────────────────────────────────────────────────────
Total VRAM: ~4.1GB — well within 8GB
```
No swapping needed. Router runs first (~1-2s), answer model runs after (~20s).
```
Router → tool call JSON or "no tool" ~1-2s
→ tool runs (if needed) ~1s
→ Answer model generates reply ~20s
─────────────────────────────────────────────
Total ~22-23s
```
vs current two-call approach: ~75s.
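Ollama's 5-minute default `keep_alive` would still evict an idle model mid-conversation. Each GPU model can be pinned in VRAM by pre-loading it once at startup with `keep_alive: -1` ("never unload"). A minimal sketch, with the endpoint and model tags assumed from this setup; the actual POST is left as a comment:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local endpoint

def preload_payload(model: str) -> str:
    # keep_alive = -1 asks Ollama to keep the model resident indefinitely
    return json.dumps({"model": model, "prompt": "", "keep_alive": -1})

# POST one empty-prompt request per model at startup, e.g.:
#   requests.post(OLLAMA_URL, data=preload_payload("qwen3:4b"))
for model in ("qwen2.5:0.5b", "qwen3:4b", "qwen2.5:1.5b"):
    print(preload_payload(model))
```

Re-sending any request to a pinned model refreshes nothing and evicts nothing, so this only needs to run once per daemon start.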
### Option 2: Semantic routing (encoder-only, free)
Use nomic-embed-text (already running on CPU) as the router:
```python
import numpy as np
import ollama  # client for the nomic-embed-text model already running locally

def embed(text: str) -> np.ndarray:
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

def cosine(a, b) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embed(query)  # topic vecs below: precomputed centroids per tool
sims = {
    "search_memory": cosine(query_vec, memory_topic_vec),
    "web_search": cosine(query_vec, web_topic_vec),
}
# If max sim > threshold → call that tool directly
# Then pass result + original query to answer model
```
Zero VRAM overhead. ~50ms routing. Can't extract tool args from embedding alone —
needs hardcoded arg construction (e.g. query = original user message).
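The dispatch on those similarities is pure arithmetic; a minimal sketch, with an illustrative threshold that would need tuning on real queries:

```python
def route_by_similarity(sims: dict[str, float], threshold: float = 0.55):
    """Return the best-matching tool name, or None to skip tools entirely.

    `sims` maps tool name -> cosine similarity with the query embedding;
    the 0.55 threshold is a placeholder, not a measured value.
    """
    tool, score = max(sims.items(), key=lambda kv: kv[1])
    return tool if score > threshold else None

# A memory-ish query clears the threshold; small talk does not
print(route_by_similarity({"search_memory": 0.71, "web_search": 0.40}))  # search_memory
print(route_by_similarity({"search_memory": 0.22, "web_search": 0.18}))  # None
```

Returning `None` sends the query straight to the answer model, so a badly calibrated threshold degrades to the no-tool path rather than a wrong tool call.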
### Option 3: Three-tier with rare large-model escalation
Keep qwen3:4b warm. Route to qwen3:8b only for explicitly complex tasks.
Accept ~10s swap overhead for those queries. qwen3:8b gets unloaded after.
```
Router → simple → qwen3:4b ~20s (no swap)
Router → complex → evict 4b, load 8b → ~30s (10s swap + 20s inference)
```
Works if <20% of queries are "complex" and users accept occasional slow responses.
Best implemented with explicit user trigger ("think about this carefully") rather than
automatic classification, to avoid swap overhead on misclassified queries.
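That explicit trigger can be plain substring matching; the phrases below are illustrative, not a fixed list:

```python
# Hypothetical trigger phrases; escalate to qwen3:8b only on an explicit ask
ESCALATION_TRIGGERS = ("think about this carefully", "think through", "analyze carefully")

def needs_large_model(user_message: str) -> bool:
    # Case-insensitive substring check — cheap, and never misfires on
    # ordinary queries the way an automatic classifier can
    text = user_message.lower()
    return any(trigger in text for trigger in ESCALATION_TRIGGERS)

print(needs_large_model("Please think through the migration plan"))  # True
print(needs_large_model("what's the weather like"))                  # False
```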
---
## LangGraph Implementation
`create_react_agent` locks to one model. Explicit graph supports per-node models:
```python
from langgraph.graph import StateGraph, MessagesState, END
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama

# `tools` is the existing tool list (search_memory, web_search, ...)
router_model = ChatOllama(model="qwen2.5:0.5b", base_url=OLLAMA_GPU_URL)
answer_model = ChatOllama(model="qwen3:4b", base_url=OLLAMA_GPU_URL)
# For Option 3: large_model = ChatOllama(model="qwen3:8b", ...)

def router_node(state):
    # Small model only outputs a tool call (or nothing) — never the final reply
    return {"messages": [router_model.bind_tools(tools).invoke(state["messages"])]}

def answer_node(state):
    # Large(r) model generates the human-facing reply — no tools bound
    return {"messages": [answer_model.invoke(state["messages"])]}

def route(state) -> str:
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", []) else "answer"

graph = StateGraph(MessagesState)
graph.add_node("router", router_node)
graph.add_node("tools", ToolNode(tools))
graph.add_node("answer", answer_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route)
graph.add_edge("tools", "answer")
graph.add_edge("answer", END)
agent = graph.compile()
```
For three-tier, add a complexity classifier node before the router that selects
`answer_model = medium_model or large_model` based on query signals.
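One way to sketch that pattern is to extend the state with a `tier` field the classifier writes and the answer node reads. The field name, cue list, and model mapping are assumptions, and plain strings stand in for message objects:

```python
from typing import Literal, TypedDict

class RoutedState(TypedDict):
    messages: list
    tier: Literal["medium", "large"]

def classify_node(state: RoutedState) -> dict:
    # Cheap heuristic stand-in for a real complexity classifier,
    # reusing the "when to route to large" signals from above
    text = state["messages"][-1].lower() if state["messages"] else ""
    complex_query = any(cue in text for cue in ("plan", "prove", "step by step"))
    return {"tier": "large" if complex_query else "medium"}

def select_model_tag(state: RoutedState) -> str:
    # The answer node would construct/choose its ChatOllama from this tag
    return {"medium": "qwen3:4b", "large": "qwen3:8b"}[state["tier"]]

state = {"messages": ["plan the migration step by step"], "tier": "medium"}
state.update(classify_node(state))
print(select_model_tag(state))  # qwen3:8b
```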
---
## Open Source Routing Tools
| Tool | Ollama support | Status | Notes |
|------|---------------|--------|-------|
| LiteLLM | First-class | Active 2025 | Proxy with tiered routing, fallbacks, load balancing |
| RouteLLM (LMSYS) | Yes (documented) | Stale (last commit Aug 2024) | Calibrated for GPT-4 vs Mixtral pair |
| Router-R1 | No | Active (NeurIPS 2025) | RL-based, open-sourced on HuggingFace |
| LLMRouter (ulab) | No | Research 2025 | 16+ routing methods, fair comparison framework |
| FrugalGPT | No direct | Algorithm only | Portkey.ai has implementation guide |
**Most practical for Ollama**: LiteLLM proxy with tiered model config. Handles routing,
fallbacks, and load balancing without changing agent code.
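A sketch of the tiered model list such a proxy would take. The tier names and local URL are this document's assumptions; the commented lines show how it would plug into `litellm.Router`:

```python
# Tiered config mapping logical names to local Ollama models
model_list = [
    {
        "model_name": "router-tier",
        "litellm_params": {"model": "ollama/qwen2.5:0.5b",
                           "api_base": "http://localhost:11434"},
    },
    {
        "model_name": "answer-tier",
        "litellm_params": {"model": "ollama/qwen3:4b",
                           "api_base": "http://localhost:11434"},
    },
]
# With litellm installed, this plugs straight into the Router/proxy:
# from litellm import Router
# router = Router(model_list=model_list,
#                 fallbacks=[{"answer-tier": ["router-tier"]}])
print([m["model_name"] for m in model_list])
```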
---
## Summary: What to Do for Adolf
| | Recommendation |
|--|---------------|
| Quick win (zero risk) | Remove "always call search_memory" from system prompt — history buffer covers conversational recall, saves ~37s |
| Best architecture for 8GB | Two-tier: qwen2.5:0.5b router + qwen3:4b answer, both in VRAM, ~22s total |
| Three-tier feasibility | Not viable for interactive use with Ollama model swapping; viable with vLLM Sleep Mode (~3s swap) if Ollama is replaced |
| Complex task routing | Use explicit user trigger or keyword detection rather than automatic classifier — avoids swap penalty on misclassification |
---
## References
- arXiv 2510.03847 — Small Language Models for Agentic Systems: A Survey
- arXiv 2506.02153 — Small Language Models are the Future of Agentic AI
- arXiv 2406.04692 — Mixture-of-Agents Enhances LLM Capabilities (original MoA paper)
- arXiv 2410.10347 — A Unified Approach to Routing and Cascading for LLMs (ICLR 2025)
- MasRouter — ACL 2025: https://aclanthology.org/2025.acl-long.757.pdf
- Router-R1 — NeurIPS 2025: https://github.com/ulab-uiuc/Router-R1
- vLLM Sleep Mode: https://blog.vllm.ai/2025/10/26/sleep-mode.html
- NVIDIA GPU Memory Swap: https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
- LangGraph multi-agent: https://langchain-ai.github.io/langgraph/tutorials/multi_agent/
- LangGraph custom ReAct: https://langchain-ai.github.io/langgraph/how-tos/react-agent-from-scratch/
- LiteLLM Ollama docs: https://docs.litellm.ai/docs/providers/ollama
- RouteLLM + Ollama example: https://github.com/lm-sys/RouteLLM/blob/main/examples/routing_to_local_models.md
- LLMRouter framework: https://ulab-uiuc.github.io/LLMRouter/
- Functionary (tool-call fine-tuned): https://github.com/MeetKai/functionary
- Constrained generation (outlines): https://github.com/dottxt-ai/outlines