wiki search people tested pipeline

This commit is contained in:
Alvis
2026-03-05 11:22:34 +00:00
parent ea77b2308b
commit ec45d255f0
19 changed files with 1717 additions and 257 deletions

langgraph.md
# LangGraph: Multi-Model Routing Architecture
## Problem
`create_react_agent` uses one model for all steps in the ReAct loop:
```
qwen3:4b → decide to call tool ~37s
→ run tool ~1s
qwen3:4b → final answer ~37s
─────────────────────────────────────
Total ~75s
```
The routing step is classification + argument extraction — low entropy, constrained output.
It does not need the same model as answer generation.
---
## Is the Pattern Established? (2025 Research)
Yes. The 2025 consensus from multiple papers treats heterogeneous model architectures
(small for routing, large for generation) as **settled production engineering**, not research:
- SLMs (1-2B) match or exceed LLMs on schema-constrained tasks (tool calls, JSON, function
calling) at 10-100× lower compute cost (arXiv 2510.03847, arXiv 2506.02153)
- MasRouter (ACL 2025): routing in multi-agent graphs reduces costs 2× without quality loss
- Cascade routing (ICLR 2025): 4% accuracy improvement, 30-92% cost reduction vs naive routing
- NVIDIA research (2025): "Small Language Models are the Future of Agentic AI"
**Limitations acknowledged in literature:**
- Bad router defeats the purpose — classifier quality is critical
- Cascade (try small, escalate if uncertain) adds latency on queries that escalate
- Pre-trained routers (RouteLLM, etc.) are calibrated for specific model pairs; local model
pairs need independent validation
---
## Three-Tier Architecture (small → medium → large)
### Concept
```
Incoming query
[Router: tiny model or embedding classifier] ~1-2s
├── simple/conversational → [Medium: qwen3:4b] ~20s
├── needs tool call → [Medium: qwen3:4b + tools] ~20-40s
└── complex/multi-step → [Large: qwen3:8b + sub-agents] ~60s+
```
### When to route to large
Signals that justify loading a larger model:
- Multi-step reasoning required (math, code, planning)
- Sub-agent orchestration (the agent needs to call other agents)
- Explicit reasoning request ("think through", "analyze carefully")
- Low confidence from medium model (cascade pattern)
### Trade-offs of three-tier vs two-tier
| | Two-tier | Three-tier |
|--|---------|-----------|
| Simple queries | small router + medium answer | small router + medium answer (same) |
| Complex queries | medium (may struggle) | swap to large (better quality) |
| GPU constraint | manageable | hard — see below |
| Routing error cost | low | high (wrong tier = much slower) |
---
## The 8GB GPU Constraint — Core Problem
This is the central issue. Research numbers on model swapping (2025):
**Cold swap from disk (no optimization)**
- TTFT exceeds 140s for 7B-class models on HDD; 5-15s on NVMe SSD
- Not viable for interactive use at any tier
**vLLM Sleep Mode (offload to CPU RAM, not disk)**
- 18-200× faster than cold start; TTFT 2-3s per switch
- vLLM-specific — not available in Ollama
**Ollama behavior on 8GB VRAM**
- Default `keep_alive`: 5 minutes — model stays warm after use
- Two models simultaneously: qwen3:4b (~2.5GB) + qwen2.5:1.5b (~1.2GB) = ~3.7GB — fits
- qwen3:4b + qwen3:8b = ~8GB — does not reliably fit; eviction required
- Sequential swap: Ollama evicts the old model and loads the new one from SSD (~5-15s on NVMe)
- Known Ollama bug: model spills from VRAM to RAM → all subsequent loads stay on CPU until restart
**Conclusion for three-tier on single 8GB GPU:**
| Tier switch | Cost | Viable? |
|------------|------|---------|
| tiny router → medium (qwen3:4b) | model swap ~5-15s if router is separate | borderline |
| medium → large (qwen3:8b) | evict qwen3:4b, load qwen3:8b = ~5-15s additional | no, for interactive |
| Keep medium always warm, route to large on demand | 5-15s swap overhead per complex query | acceptable if complex queries are rare |
**Honest verdict: three-tier with model swapping is not viable for interactive per-turn latency
on 8GB VRAM with Ollama.** vLLM with Sleep Mode would make it viable (2-3s switch) but
requires replacing Ollama.
---
## Practical Architecture for 8GB GPU (Ollama)
### Option 1: Two-tier, both models always in VRAM (recommended)
Keep two small models loaded simultaneously:
```
qwen2.5:0.5b (~0.4GB) — router: tool call decision + arg extraction
qwen3:4b (~2.5GB) — answer: all generation
nomic-embed-text (CPU) — embedding: search and store
qwen2.5:1.5b (~1.2GB) — extraction: mem0 fact extraction (GPU)
─────────────────────────────────────────────────────────
Total VRAM: ~4.1GB — well within 8GB
```
No swapping needed. Router runs first (~1-2s), answer model runs after (~20s).
```
Router → tool call JSON or "no tool" ~1-2s
→ tool runs (if needed) ~1s
→ Answer model generates reply ~20s
─────────────────────────────────────────────
Total ~22-23s
```
vs current two-call approach: ~75s.
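Ollama's 5-minute default `keep_alive` would still evict an idle model mid-conversation. Each GPU model can be pinned in VRAM by pre-loading it once at startup with `keep_alive: -1` ("never unload"). A minimal sketch, with the endpoint and model tags assumed from this setup; the actual POST is left as a comment:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local endpoint

def preload_payload(model: str) -> str:
    # keep_alive = -1 asks Ollama to keep the model resident indefinitely
    return json.dumps({"model": model, "prompt": "", "keep_alive": -1})

# POST one empty-prompt request per model at startup, e.g.:
#   requests.post(OLLAMA_URL, data=preload_payload("qwen3:4b"))
for model in ("qwen2.5:0.5b", "qwen3:4b", "qwen2.5:1.5b"):
    print(preload_payload(model))
```

Re-sending any request to a pinned model refreshes nothing and evicts nothing, so this only needs to run once per daemon start.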
### Option 2: Semantic routing (encoder-only, free)
Use nomic-embed-text (already running on CPU) as the router:
```python
import numpy as np
import ollama  # client for the nomic-embed-text model already running locally

def embed(text: str) -> np.ndarray:
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

def cosine(a, b) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embed(query)  # topic vecs below: precomputed centroids per tool
sims = {
    "search_memory": cosine(query_vec, memory_topic_vec),
    "web_search": cosine(query_vec, web_topic_vec),
}
# If max sim > threshold → call that tool directly
# Then pass result + original query to answer model
```
Zero VRAM overhead. ~50ms routing. Can't extract tool args from embedding alone —
needs hardcoded arg construction (e.g. query = original user message).
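The dispatch on those similarities is pure arithmetic; a minimal sketch, with an illustrative threshold that would need tuning on real queries:

```python
def route_by_similarity(sims: dict[str, float], threshold: float = 0.55):
    """Return the best-matching tool name, or None to skip tools entirely.

    `sims` maps tool name -> cosine similarity with the query embedding;
    the 0.55 threshold is a placeholder, not a measured value.
    """
    tool, score = max(sims.items(), key=lambda kv: kv[1])
    return tool if score > threshold else None

# A memory-ish query clears the threshold; small talk does not
print(route_by_similarity({"search_memory": 0.71, "web_search": 0.40}))  # search_memory
print(route_by_similarity({"search_memory": 0.22, "web_search": 0.18}))  # None
```

Returning `None` sends the query straight to the answer model, so a badly calibrated threshold degrades to the no-tool path rather than a wrong tool call.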
### Option 3: Three-tier with rare large-model escalation
Keep qwen3:4b warm. Route to qwen3:8b only for explicitly complex tasks.
Accept ~10s swap overhead for those queries. qwen3:8b gets unloaded after.
```
Router → simple → qwen3:4b ~20s (no swap)
Router → complex → evict 4b, load 8b → ~30s (10s swap + 20s inference)
```
Works if <20% of queries are "complex" and users accept occasional slow responses.
Best implemented with explicit user trigger ("think about this carefully") rather than
automatic classification, to avoid swap overhead on misclassified queries.
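That explicit trigger can be plain substring matching; the phrases below are illustrative, not a fixed list:

```python
# Hypothetical trigger phrases; escalate to qwen3:8b only on an explicit ask
ESCALATION_TRIGGERS = ("think about this carefully", "think through", "analyze carefully")

def needs_large_model(user_message: str) -> bool:
    # Case-insensitive substring check — cheap, and never misfires on
    # ordinary queries the way an automatic classifier can
    text = user_message.lower()
    return any(trigger in text for trigger in ESCALATION_TRIGGERS)

print(needs_large_model("Please think through the migration plan"))  # True
print(needs_large_model("what's the weather like"))                  # False
```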
---
## LangGraph Implementation
`create_react_agent` locks to one model. Explicit graph supports per-node models:
```python
from langgraph.graph import StateGraph, MessagesState, END
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama

# `tools` is the existing tool list (search_memory, web_search, ...)
router_model = ChatOllama(model="qwen2.5:0.5b", base_url=OLLAMA_GPU_URL)
answer_model = ChatOllama(model="qwen3:4b", base_url=OLLAMA_GPU_URL)
# For Option 3: large_model = ChatOllama(model="qwen3:8b", ...)

def router_node(state):
    # Small model only outputs a tool call (or nothing) — never the final reply
    return {"messages": [router_model.bind_tools(tools).invoke(state["messages"])]}

def answer_node(state):
    # Large(r) model generates the human-facing reply — no tools bound
    return {"messages": [answer_model.invoke(state["messages"])]}

def route(state) -> str:
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", []) else "answer"

graph = StateGraph(MessagesState)
graph.add_node("router", router_node)
graph.add_node("tools", ToolNode(tools))
graph.add_node("answer", answer_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route)
graph.add_edge("tools", "answer")
graph.add_edge("answer", END)
agent = graph.compile()
```
For three-tier, add a complexity classifier node before the router that selects
`answer_model = medium_model or large_model` based on query signals.
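One way to sketch that pattern is to extend the state with a `tier` field the classifier writes and the answer node reads. The field name, cue list, and model mapping are assumptions, and plain strings stand in for message objects:

```python
from typing import Literal, TypedDict

class RoutedState(TypedDict):
    messages: list
    tier: Literal["medium", "large"]

def classify_node(state: RoutedState) -> dict:
    # Cheap heuristic stand-in for a real complexity classifier,
    # reusing the "when to route to large" signals from above
    text = state["messages"][-1].lower() if state["messages"] else ""
    complex_query = any(cue in text for cue in ("plan", "prove", "step by step"))
    return {"tier": "large" if complex_query else "medium"}

def select_model_tag(state: RoutedState) -> str:
    # The answer node would construct/choose its ChatOllama from this tag
    return {"medium": "qwen3:4b", "large": "qwen3:8b"}[state["tier"]]

state = {"messages": ["plan the migration step by step"], "tier": "medium"}
state.update(classify_node(state))
print(select_model_tag(state))  # qwen3:8b
```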
---
## Open Source Routing Tools
| Tool | Ollama support | Status | Notes |
|------|---------------|--------|-------|
| LiteLLM | First-class | Active 2025 | Proxy with tiered routing, fallbacks, load balancing |
| RouteLLM (LMSYS) | Yes (documented) | Stale (last commit Aug 2024) | Calibrated for GPT-4 vs Mixtral pair |
| Router-R1 | No | Active (NeurIPS 2025) | RL-based, open-sourced on HuggingFace |
| LLMRouter (ulab) | No | Research 2025 | 16+ routing methods, fair comparison framework |
| FrugalGPT | No direct | Algorithm only | Portkey.ai has implementation guide |
**Most practical for Ollama**: LiteLLM proxy with tiered model config. Handles routing,
fallbacks, and load balancing without changing agent code.
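A sketch of the tiered model list such a proxy would take. The tier names and local URL are this document's assumptions; the commented lines show how it would plug into `litellm.Router`:

```python
# Tiered config mapping logical names to local Ollama models
model_list = [
    {
        "model_name": "router-tier",
        "litellm_params": {"model": "ollama/qwen2.5:0.5b",
                           "api_base": "http://localhost:11434"},
    },
    {
        "model_name": "answer-tier",
        "litellm_params": {"model": "ollama/qwen3:4b",
                           "api_base": "http://localhost:11434"},
    },
]
# With litellm installed, this plugs straight into the Router/proxy:
# from litellm import Router
# router = Router(model_list=model_list,
#                 fallbacks=[{"answer-tier": ["router-tier"]}])
print([m["model_name"] for m in model_list])
```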
---
## Summary: What to Do for Adolf
| | Recommendation |
|--|---------------|
| Quick win (zero risk) | Remove "always call search_memory" from system prompt — history buffer covers conversational recall, saves ~37s |
| Best architecture for 8GB | Two-tier: qwen2.5:0.5b router + qwen3:4b answer, both in VRAM, ~22s total |
| Three-tier feasibility | Not viable for interactive use with Ollama model swapping; viable with vLLM Sleep Mode (~3s swap) if Ollama is replaced |
| Complex task routing | Use explicit user trigger or keyword detection rather than automatic classifier — avoids swap penalty on misclassification |
---
## References
- arXiv 2510.03847 — Small Language Models for Agentic Systems: A Survey
- arXiv 2506.02153 — Small Language Models are the Future of Agentic AI
- arXiv 2406.04692 — Mixture-of-Agents Enhances LLM Capabilities (original MoA paper)
- arXiv 2410.10347 — A Unified Approach to Routing and Cascading for LLMs (ICLR 2025)
- MasRouter — ACL 2025: https://aclanthology.org/2025.acl-long.757.pdf
- Router-R1 — NeurIPS 2025: https://github.com/ulab-uiuc/Router-R1
- vLLM Sleep Mode: https://blog.vllm.ai/2025/10/26/sleep-mode.html
- NVIDIA GPU Memory Swap: https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
- LangGraph multi-agent: https://langchain-ai.github.io/langgraph/tutorials/multi_agent/
- LangGraph custom ReAct: https://langchain-ai.github.io/langgraph/how-tos/react-agent-from-scratch/
- LiteLLM Ollama docs: https://docs.litellm.ai/docs/providers/ollama
- RouteLLM + Ollama example: https://github.com/lm-sys/RouteLLM/blob/main/examples/routing_to_local_models.md
- LLMRouter framework: https://ulab-uiuc.github.io/LLMRouter/
- Functionary (tool-call fine-tuned): https://github.com/MeetKai/functionary
- Constrained generation (outlines): https://github.com/dottxt-ai/outlines