# LangGraph: Multi-Model Routing Architecture

## Problem

`create_react_agent` uses one model for all steps in the ReAct loop:

```
qwen3:4b → decide to call tool   ~37s
         → run tool              ~1s
qwen3:4b → final answer          ~37s
─────────────────────────────────────
Total                            ~75s
```

The routing step is classification + argument extraction — low entropy, constrained output. It does not need the same model as answer generation.

---
## Is the Pattern Established? (2025 Research)

Yes. The 2025 consensus across multiple papers treats heterogeneous model architectures (small for routing, large for generation) as **settled production engineering**, not open research:

- SLMs (1–12B) match or exceed LLMs on schema-constrained tasks (tool calls, JSON, function calling) at 10×–100× lower compute cost (arXiv 2510.03847, arXiv 2506.02153)
- MasRouter (ACL 2025): routing in multi-agent graphs reduces cost 2× without quality loss
- Cascade routing (ICLR 2025): 4% accuracy improvement and 30–92% cost reduction vs naive routing
- NVIDIA research (2025): "Small Language Models are the Future of Agentic AI"

**Limitations acknowledged in the literature:**

- A bad router defeats the purpose — classifier quality is critical
- Cascading (try small, escalate if uncertain) adds latency on queries that escalate
- Pre-trained routers (RouteLLM, etc.) are calibrated for specific model pairs; local model pairs need independent validation

---
## Three-Tier Architecture (small → medium → large)

### Concept

```
Incoming query
      ↓
[Router: tiny model or embedding classifier]                   ~1–2s
      ├── simple/conversational → [Medium: qwen3:4b]           ~20s
      ├── needs tool call       → [Medium: qwen3:4b + tools]   ~20–40s
      └── complex/multi-step    → [Large: qwen3:8b + agents]   ~60s+
```
### When to route to large

Signals that justify loading a larger model:

- Multi-step reasoning required (math, code, planning)
- Sub-agent orchestration (the agent needs to call other agents)
- Explicit reasoning request ("think through", "analyze carefully")
- Low confidence from the medium model (cascade pattern)
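The cascade bullet above can be made concrete. A minimal sketch, assuming the medium model signals uncertainty in-band (the model functions and marker phrases are stubs for illustration, not real Ollama calls):

```python
# Cascade pattern sketch: try the medium model first, escalate to the
# large model only when the reply looks uncertain. Stubs stand in for
# qwen3:4b / qwen3:8b; a real version would call Ollama.
UNCERTAIN_MARKERS = ("i'm not sure", "i am not sure", "cannot determine")

def medium_model(query: str) -> str:   # stub for qwen3:4b
    return "I'm not sure how to approach this." if "prove" in query else "42."

def large_model(query: str) -> str:    # stub for qwen3:8b
    return "Detailed multi-step answer."

def answer(query: str) -> tuple[str, str]:
    """Return (tier_used, reply); escalate on self-reported uncertainty."""
    reply = medium_model(query)
    if any(m in reply.lower() for m in UNCERTAIN_MARKERS):
        return "large", large_model(query)   # this path pays the swap latency
    return "medium", reply
```

The escalation check is exactly where cascade latency comes from: the medium model's full reply is produced (and discarded) before the large model runs.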
### Trade-offs of three-tier vs two-tier

| Dimension | Two-tier | Three-tier |
|--|---------|-----------|
| Simple queries | small router + medium answer | small router + medium answer (same) |
| Complex queries | medium (may struggle) | swap to large (better quality) |
| GPU constraint | manageable | hard — see below |
| Routing error cost | low | high (wrong tier = much slower) |

---
## The 8GB GPU Constraint — Core Problem

This is the central issue. Research numbers on model swapping (2025):

**Cold swap from disk (no optimization)**

- TTFT exceeds 140s for 7B-class models on HDD; 5–15s on NVMe SSD
- Not viable for interactive use at any tier

**vLLM Sleep Mode (offload to CPU RAM, not disk)**

- 18–200× faster than cold start; TTFT 2–3s per switch
- vLLM-specific — not available in Ollama

**Ollama behavior on 8GB VRAM**

- Default `keep_alive` is 5 minutes — the model stays warm after use
- Two small models simultaneously: qwen3:4b (~2.5GB) + qwen2.5:1.5b (~1.2GB) = ~3.7GB — fits
- qwen3:4b + qwen3:8b = ~8GB — does not reliably fit; eviction required
- Sequential swap: Ollama evicts the old model and loads the new one from SSD (~5–15s on NVMe)
- Known Ollama bug: a model that spills from VRAM to RAM causes all subsequent loads to stay on CPU until restart
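For reference, pinning a model warm beyond the 5-minute default is a single request against Ollama's `/api/generate` endpoint: a body with no prompt just loads the model, and `keep_alive: -1` means never unload (a duration string such as `"10m"` also works). A sketch of the request body:

```python
import json

# Request body for POST http://localhost:11434/api/generate.
# No prompt → Ollama only loads the model into memory;
# keep_alive=-1 pins it there until explicitly evicted.
payload = {"model": "qwen3:4b", "keep_alive": -1}
body = json.dumps(payload)
```

The same parameter can be set per-request during normal inference, so the answer model can be pinned while scratch models keep the default eviction timer.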
**Conclusion for three-tier on a single 8GB GPU:**

| Tier switch | Cost | Viable? |
|------------|------|---------|
| tiny router → medium (qwen3:4b) | ~5–15s model swap if the router is a separate model | borderline |
| medium → large (qwen3:8b) | evict qwen3:4b, load qwen3:8b = ~5–15s additional | no, for interactive use |
| keep medium always warm, route to large on demand | ~5–15s swap overhead per complex query | acceptable if complex queries are rare |

**Honest verdict: three-tier with model swapping is not viable for interactive per-turn latency on 8GB VRAM with Ollama.** vLLM with Sleep Mode would make it viable (2–3s per switch) but requires replacing Ollama.

---
## Practical Architecture for 8GB GPU (Ollama)

### Option 1: Two-tier, both models always in VRAM (recommended)

Keep two small models loaded simultaneously:

```
qwen2.5:0.5b (~0.4GB)  — router: tool call decision + arg extraction
qwen3:4b (~2.5GB)      — answer: all generation
nomic-embed-text (CPU) — embedding: search and store
qwen2.5:1.5b (~1.2GB)  — extraction: mem0 fact extraction (GPU)
─────────────────────────────────────────────────────────
Total VRAM: ~4.1GB — well within 8GB
```

No swapping needed. The router runs first (~1–2s); the answer model runs after (~20s).

```
Router → tool call JSON or "no tool"     ~1–2s
       → tool runs (if needed)           ~1s
       → answer model generates reply    ~20s
─────────────────────────────────────────────
Total                                    ~22–23s
```

vs. the current two-call approach: ~75s.
### Option 2: Semantic routing (encoder-only, free)

Use nomic-embed-text (already running on CPU) as the router:

```python
query_vec = embed(query)
sims = {
    "search_memory": cosine(query_vec, memory_topic_vec),
    "web_search": cosine(query_vec, web_topic_vec),
}
# If max sim > threshold → call that tool directly,
# then pass the result + original query to the answer model
```

Zero VRAM overhead; ~50ms routing. An embedding alone can't extract tool args — this needs hardcoded arg construction (e.g. query = original user message).
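A runnable version of the sketch above, with a toy bag-of-words `embed()` standing in for nomic-embed-text — the vocabulary, anchor phrases, and 0.3 threshold are all illustrative assumptions to be replaced with real embeddings and a tuned cutoff:

```python
import math

def cosine(a, b):
    # Plain cosine similarity; returns 0.0 for a zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in for nomic-embed-text: bag-of-words over a tiny fixed vocabulary,
# just to make the routing logic executable.
VOCAB = ["remember", "said", "earlier", "news", "latest", "today", "weather"]

def embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

# Topic anchors: one example phrase per tool (real anchors would be
# centroids over many phrases).
TOPIC_VECS = {
    "search_memory": embed("remember what i said earlier"),
    "web_search": embed("latest news today weather"),
}

def route(query, threshold=0.3):
    """Return (tool_name, args), or (None, None) if no tool clears the bar."""
    qv = embed(query)
    tool, sim = max(((t, cosine(qv, v)) for t, v in TOPIC_VECS.items()),
                    key=lambda p: p[1])
    if sim < threshold:
        return None, None
    # Args can't come from the embedding — pass the raw query through.
    return tool, {"query": query}
```

The hardcoded-args limitation shows up in the last line: the only argument the router can construct is the user's message itself.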
### Option 3: Three-tier with rare large-model escalation

Keep qwen3:4b warm. Route to qwen3:8b only for explicitly complex tasks, accepting ~10s swap overhead on those queries; qwen3:8b is unloaded afterwards.

```
Router → simple  → qwen3:4b               ~20s (no swap)
Router → complex → evict 4b, load 8b      ~30s (10s swap + 20s inference)
```

This works if <20% of queries are "complex" and users accept occasional slow responses. It is best implemented with an explicit user trigger ("think about this carefully") rather than automatic classification, to avoid swap overhead on misclassified queries.
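The explicit-trigger idea can be as simple as substring matching. The trigger phrases and model tags below are examples, not a fixed list:

```python
# Hypothetical trigger phrases for Option 3: escalate to the large model
# only on an explicit user request, never on a classifier guess.
ESCALATION_TRIGGERS = (
    "think about this carefully",
    "think through",
    "analyze carefully",
    "step by step",
)

def pick_model(query: str) -> str:
    """Return the Ollama model tag to answer with."""
    q = query.lower()
    if any(t in q for t in ESCALATION_TRIGGERS):
        return "qwen3:8b"   # accept the ~10s swap for this turn
    return "qwen3:4b"       # stays warm, no swap
```

Because the trigger is explicit, a false positive costs only one deliberate slow turn — there is no silent misclassification penalty.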
---
## LangGraph Implementation

`create_react_agent` locks the whole loop to one model. An explicit graph supports per-node models:
```python
from langgraph.graph import END, StateGraph, MessagesState
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama

# OLLAMA_GPU_URL and `tools` are defined elsewhere in the codebase
router_model = ChatOllama(model="qwen2.5:0.5b", base_url=OLLAMA_GPU_URL)
answer_model = ChatOllama(model="qwen3:4b", base_url=OLLAMA_GPU_URL)
# For Option 3: large_model = ChatOllama(model="qwen3:8b", ...)

def router_node(state):
    # Small model only outputs a tool call (or nothing)
    return {"messages": [router_model.bind_tools(tools).invoke(state["messages"])]}

def answer_node(state):
    # Large(r) model generates the human reply — no tools bound
    return {"messages": [answer_model.invoke(state["messages"])]}

def route(state) -> str:
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", []) else "answer"

graph = StateGraph(MessagesState)
graph.add_node("router", router_node)
graph.add_node("tools", ToolNode(tools))
graph.add_node("answer", answer_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route)
graph.add_edge("tools", "answer")
graph.add_edge("answer", END)
agent = graph.compile()
```
For three-tier, add a complexity classifier node ahead of the router that selects the medium or large model as `answer_model` based on query signals.

---
## Open Source Routing Tools

| Tool | Ollama support | Status | Notes |
|------|---------------|--------|-------|
| LiteLLM | First-class | Active 2025 | Proxy with tiered routing, fallbacks, load balancing |
| RouteLLM (LMSYS) | Yes (documented) | Stale (last commit Aug 2024) | Calibrated for the GPT-4 vs Mixtral pair |
| Router-R1 | No | Active (NeurIPS 2025) | RL-based, open-sourced on HuggingFace |
| LLMRouter (ulab) | No | Research 2025 | 16+ routing methods, fair-comparison framework |
| FrugalGPT | No direct | Algorithm only | Portkey.ai has an implementation guide |

**Most practical for Ollama**: a LiteLLM proxy with a tiered model config. It handles routing, fallbacks, and load balancing without changing agent code.

---
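A tiered LiteLLM proxy config for the two Ollama tiers might look like the sketch below; the model aliases are made up, while the `ollama/` prefix and `api_base` key follow LiteLLM's provider docs:

```yaml
# config.yaml — sketch, not a verified production config
model_list:
  - model_name: router-tier        # alias the agent code calls
    litellm_params:
      model: ollama/qwen2.5:0.5b
      api_base: http://localhost:11434
  - model_name: answer-tier
    litellm_params:
      model: ollama/qwen3:4b
      api_base: http://localhost:11434
```

Started with `litellm --config config.yaml`, the proxy exposes both tiers behind one OpenAI-compatible endpoint, so the agent selects a tier by model alias alone.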
## Summary: What to Do for Adolf

| Topic | Recommendation |
|--|---------------|
| Quick win (zero risk) | Remove "always call search_memory" from the system prompt — the history buffer covers conversational recall, saving ~37s |
| Best architecture for 8GB | Two-tier: qwen2.5:0.5b router + qwen3:4b answer, both in VRAM, ~22s total |
| Three-tier feasibility | Not viable for interactive use with Ollama model swapping; viable with vLLM Sleep Mode (~3s swap) if Ollama is replaced |
| Complex task routing | Use an explicit user trigger or keyword detection rather than an automatic classifier — avoids the swap penalty on misclassification |

---
## References

- arXiv 2510.03847 — Small Language Models for Agentic Systems: A Survey
- arXiv 2506.02153 — Small Language Models are the Future of Agentic AI
- arXiv 2406.04692 — Mixture-of-Agents Enhances LLM Capabilities (original MoA paper)
- arXiv 2410.10347 — A Unified Approach to Routing and Cascading for LLMs (ICLR 2025)
- MasRouter (ACL 2025): https://aclanthology.org/2025.acl-long.757.pdf
- Router-R1 (NeurIPS 2025): https://github.com/ulab-uiuc/Router-R1
- vLLM Sleep Mode: https://blog.vllm.ai/2025/10/26/sleep-mode.html
- NVIDIA GPU Memory Swap: https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
- LangGraph multi-agent: https://langchain-ai.github.io/langgraph/tutorials/multi_agent/
- LangGraph custom ReAct: https://langchain-ai.github.io/langgraph/how-tos/react-agent-from-scratch/
- LiteLLM Ollama docs: https://docs.litellm.ai/docs/providers/ollama
- RouteLLM + Ollama example: https://github.com/lm-sys/RouteLLM/blob/main/examples/routing_to_local_models.md
- LLMRouter framework: https://ulab-uiuc.github.io/LLMRouter/
- Functionary (tool-call fine-tuned): https://github.com/MeetKai/functionary
- Constrained generation (outlines): https://github.com/dottxt-ai/outlines