LangGraph: Multi-Model Routing Architecture
Problem
create_react_agent uses one model for all steps in the ReAct loop:
qwen3:4b → decide to call tool ~37s
→ run tool ~1s
qwen3:4b → final answer ~37s
─────────────────────────────────────
Total ~75s
The routing step is classification + argument extraction — low entropy, constrained output. It does not need the same model as answer generation.
Is the Pattern Established? (2025 Research)
Yes. The 2025 consensus from multiple papers treats heterogeneous model architectures (small for routing, large for generation) as settled production engineering, not research:
- SLMs (1–12B) match or exceed LLMs on schema-constrained tasks (tool calls, JSON, function calling) at 10×–100× lower compute cost (arXiv 2510.03847, arXiv 2506.02153)
- MasRouter (ACL 2025): routing in multi-agent graphs reduces costs 2× without quality loss
- Cascade routing (ICLR 2025): 4% accuracy improvement, 30–92% cost reduction vs naive routing
- NVIDIA research (2025): "Small Language Models are the Future of Agentic AI"
Limitations acknowledged in literature:
- Bad router defeats the purpose — classifier quality is critical
- Cascade (try small, escalate if uncertain) adds latency on queries that escalate
- Pre-trained routers (RouteLLM, etc.) are calibrated for specific model pairs; local model pairs need independent validation
Three-Tier Architecture (small → medium → large)
Concept
Incoming query
↓
[Router: tiny model or embedding classifier] ~1-2s
├── simple/conversational → [Medium: qwen3:4b] ~20s
├── needs tool call → [Medium: qwen3:4b + tools] ~20-40s
└── complex/multi-step → [Large: qwen3:8b + sub-agents] ~60s+
When to route to large
Signals that justify loading a larger model:
- Multi-step reasoning required (math, code, planning)
- Sub-agent orchestration (the agent needs to call other agents)
- Explicit reasoning request ("think through", "analyze carefully")
- Low confidence from medium model (cascade pattern)
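The last signal (cascade on low confidence) can be approximated without logprobs, which Ollama does not expose directly, by scanning the medium model's draft for uncertainty markers. A minimal sketch; the marker list and the idea of keyword-based confidence are illustrative assumptions, not from the cited papers:

```python
# Heuristic escalation check for a cascade router.
# ASSUMPTION: keyword markers stand in for a real confidence
# signal (e.g. token logprobs), which Ollama does not expose.
UNCERTAINTY_MARKERS = (
    "i'm not sure", "i am not sure", "i don't know",
    "cannot determine", "it depends",
)

def should_escalate(draft_answer: str) -> bool:
    """Return True if the medium model's draft looks low-confidence."""
    text = draft_answer.lower()
    return any(marker in text for marker in UNCERTAINTY_MARKERS)
```

Escalating only on an explicit marker keeps the common path on the medium model and confines the swap penalty to genuinely uncertain turns.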
Trade-offs of three-tier vs two-tier
| | Two-tier | Three-tier |
|---|---|---|
| Simple queries | small router + medium answer | small router + medium answer (same) |
| Complex queries | medium (may struggle) | swap to large (better quality) |
| GPU constraint | manageable | hard — see below |
| Routing error cost | low | high (wrong tier = much slower) |
The 8GB GPU Constraint — Core Problem
This is the central issue. Research numbers on model swapping (2025):
Cold swap from disk (no optimization)
- TTFT exceeds 140s for 7B-class models on HDD; 5–15s on NVMe SSD
- Not viable for interactive use at any tier
vLLM Sleep Mode (offload to CPU RAM, not disk)
- 18–200× faster than cold start; TTFT 2–3s per switch
- vLLM-specific — not available in Ollama
Ollama behavior on 8GB VRAM
- Default keep_alive: 5 minutes — model stays warm after use
- Two models simultaneously: qwen3:4b (~2.5GB) + qwen2.5:1.5b (~1.2GB) = ~3.7GB — fits
- qwen3:4b + qwen3:8b = ~8GB — does not reliably fit; eviction required
- Sequential swap in Ollama: Ollama evicts old model, loads new one from SSD (~5–15s on NVMe)
- Known Ollama bug: model spills from VRAM to RAM → all subsequent loads stay on CPU until restart
Conclusion for three-tier on single 8GB GPU:
| Tier switch | Cost | Viable? |
|---|---|---|
| tiny router → medium (qwen3:4b) | model swap ~5-15s if router is separate | borderline |
| medium → large (qwen3:8b) | evict qwen3:4b, load qwen3:8b = ~5-15s additional | no, for interactive |
| Keep medium always warm, route to large on demand | 5-15s swap overhead per complex query | acceptable if complex queries are rare |
Honest verdict: three-tier with model swapping is not viable for interactive per-turn latency on 8GB VRAM with Ollama. vLLM with Sleep Mode would make it viable (2–3s switch) but requires replacing Ollama.
Practical Architecture for 8GB GPU (Ollama)
Option 1: Two-tier, both models always in VRAM (recommended)
Keep two small models loaded simultaneously:
qwen2.5:0.5b (~0.4GB) — router: tool call decision + arg extraction
qwen3:4b (~2.5GB) — answer: all generation
nomic-embed-text (CPU) — embedding: search and store
qwen2.5:1.5b (~1.2GB) — extraction: mem0 fact extraction (GPU)
─────────────────────────────────────────────────────────
Total VRAM: ~4.1GB — well within 8GB
No swapping needed. Router runs first (~1-2s), answer model runs after (~20s).
Router → tool call JSON or "no tool" ~1-2s
→ tool runs (if needed) ~1s
→ Answer model generates reply ~20s
─────────────────────────────────────────────
Total ~22-23s
vs current two-call approach: ~75s.
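Keeping both models resident relies on Ollama's keep_alive field, where -1 means "never evict" (the default is 5 minutes). A sketch of the warm-up requests one would POST to Ollama's /api/chat endpoint at startup, shown as plain dicts for clarity:

```python
# Request payloads that pin both models in VRAM via Ollama's
# keep_alive field (-1 = keep loaded indefinitely). These dicts
# would be POSTed to http://<host>:11434/api/chat at startup.
def warmup_payload(model: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "keep_alive": -1,  # never evict; Ollama default is 5 minutes
    }

payloads = [warmup_payload(m) for m in ("qwen2.5:0.5b", "qwen3:4b")]
```

The same keep_alive value can also be set per request on normal chat calls, so a crashed warm-up does not silently reintroduce the 5-minute eviction.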
Option 2: Semantic routing (encoder-only, free)
Use nomic-embed-text (already running on CPU) as the router:
query_vec = embed(query)
sims = {
    "search_memory": cosine(query_vec, memory_topic_vec),
    "web_search": cosine(query_vec, web_topic_vec),
}
# If max sim > threshold → call that tool directly
# Then pass result + original query to answer model
Zero VRAM overhead. ~50ms routing. Can't extract tool args from embedding alone — needs hardcoded arg construction (e.g. query = original user message).
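The sketch above can be made concrete. Here the topic vectors are toy 3-dimensional stand-ins for real nomic-embed-text embeddings of a few exemplar phrases per tool, and the threshold value is an assumption to tune on real traffic:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# ASSUMPTION: toy 3-dim vectors; in practice these come from
# nomic-embed-text over exemplar phrases for each tool.
TOPIC_VECS = {
    "search_memory": [0.9, 0.1, 0.0],
    "web_search": [0.1, 0.9, 0.0],
}
THRESHOLD = 0.7  # illustrative; tune against real queries

def route_query(query_vec):
    """Return the best-matching tool name, or None if below threshold."""
    tool, sim = max(
        ((t, cosine(query_vec, v)) for t, v in TOPIC_VECS.items()),
        key=lambda pair: pair[1],
    )
    return tool if sim > THRESHOLD else None
```

A None result falls through to the answer model directly, which is also the safe default when the classifier is uncertain.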
Option 3: Three-tier with rare large-model escalation
Keep qwen3:4b warm. Route to qwen3:8b only for explicitly complex tasks. Accept ~10s swap overhead for those queries. qwen3:8b gets unloaded after.
Router → simple → qwen3:4b ~20s (no swap)
Router → complex → evict 4b, load 8b → ~30s (10s swap + 20s inference)
Works if <20% of queries are "complex" and users accept occasional slow responses. Best implemented with explicit user trigger ("think about this carefully") rather than automatic classification, to avoid swap overhead on misclassified queries.
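The explicit trigger can be a plain phrase match run before the router; the phrase list below is an illustrative assumption to extend from real usage:

```python
# Explicit user triggers that justify the ~10s swap to qwen3:8b.
# ASSUMPTION: phrase list is illustrative, not exhaustive.
LARGE_MODEL_TRIGGERS = (
    "think about this carefully",
    "think through",
    "analyze carefully",
    "step by step",
)

def wants_large_model(user_message: str) -> bool:
    """True only when the user explicitly asks for deep reasoning."""
    text = user_message.lower()
    return any(trigger in text for trigger in LARGE_MODEL_TRIGGERS)
```

Because the check only fires on an explicit request, a false negative costs nothing (the medium model answers as usual) and a false positive cannot occur without the user opting in.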
LangGraph Implementation
create_react_agent locks to one model. Explicit graph supports per-node models:
from langgraph.graph import StateGraph, MessagesState, END
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama
router_model = ChatOllama(model="qwen2.5:0.5b", base_url=OLLAMA_GPU_URL)
answer_model = ChatOllama(model="qwen3:4b", base_url=OLLAMA_GPU_URL)
# For Option 3: large_model = ChatOllama(model="qwen3:8b", ...)
def router_node(state):
    # Small model only outputs tool call JSON or nothing
    return {"messages": [router_model.bind_tools(tools).invoke(state["messages"])]}

def answer_node(state):
    # Large(r) model generates human reply — no tools bound
    return {"messages": [answer_model.invoke(state["messages"])]}

def route(state) -> str:
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", []) else "answer"
graph = StateGraph(MessagesState)
graph.add_node("router", router_node)
graph.add_node("tools", ToolNode(tools))
graph.add_node("answer", answer_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route)
graph.add_edge("tools", "answer")
graph.add_edge("answer", END)
agent = graph.compile()
For three-tier, add a complexity classifier node before the router that selects
answer_model = medium_model or large_model based on query signals.
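A minimal version of that classifier node. Messages are shown as plain dicts here for brevity (in the real graph they are LangChain message objects), and both the "model_tier" state key and the keyword heuristic are illustrative assumptions:

```python
# Complexity classifier node that picks the answer model per turn.
# ASSUMPTION: messages are plain dicts here; "model_tier" is an
# illustrative extra key added to the graph state.
COMPLEX_HINTS = ("plan", "prove", "multi-step", "write code")

def classify_node(state: dict) -> dict:
    """Tag the turn as 'large' or 'medium' based on query signals."""
    query = state["messages"][-1]["content"].lower()
    is_complex = any(hint in query for hint in COMPLEX_HINTS)
    return {"model_tier": "large" if is_complex else "medium"}

def pick_model(state: dict) -> str:
    # answer_node would use this to choose qwen3:8b vs qwen3:4b
    return "qwen3:8b" if state.get("model_tier") == "large" else "qwen3:4b"
```

Wiring classify_node in as the entry point (before "router") keeps the routing decision out of the answer path, so a "medium" verdict never touches the large model.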
Open Source Routing Tools
| Tool | Ollama support | Status | Notes |
|---|---|---|---|
| LiteLLM | First-class | Active 2025 | Proxy with tiered routing, fallbacks, load balancing |
| RouteLLM (LMSYS) | Yes (documented) | Stale (last commit Aug 2024) | Calibrated for GPT-4 vs Mixtral pair |
| Router-R1 | No | Active (NeurIPS 2025) | RL-based, open-sourced on HuggingFace |
| LLMRouter (ulab) | No | Research 2025 | 16+ routing methods, fair comparison framework |
| FrugalGPT | No direct | Algorithm only | Portkey.ai has implementation guide |
Most practical for Ollama: LiteLLM proxy with tiered model config. Handles routing, fallbacks, and load balancing without changing agent code.
Summary: What to Do for Adolf
| | Recommendation |
|---|---|
| Quick win (zero risk) | Remove "always call search_memory" from system prompt — history buffer covers conversational recall, saves ~37s |
| Best architecture for 8GB | Two-tier: qwen2.5:0.5b router + qwen3:4b answer, both in VRAM, ~22s total |
| Three-tier feasibility | Not viable for interactive use with Ollama model swapping; viable with vLLM Sleep Mode (~3s swap) if Ollama is replaced |
| Complex task routing | Use explicit user trigger or keyword detection rather than automatic classifier — avoids swap penalty on misclassification |
References
- arXiv 2510.03847 — Small Language Models for Agentic Systems: A Survey
- arXiv 2506.02153 — Small Language Models are the Future of Agentic AI
- arXiv 2406.04692 — Mixture-of-Agents Enhances LLM Capabilities (original MoA paper)
- arXiv 2410.10347 — A Unified Approach to Routing and Cascading for LLMs (ICLR 2025)
- MasRouter — ACL 2025: https://aclanthology.org/2025.acl-long.757.pdf
- Router-R1 — NeurIPS 2025: https://github.com/ulab-uiuc/Router-R1
- vLLM Sleep Mode: https://blog.vllm.ai/2025/10/26/sleep-mode.html
- NVIDIA GPU Memory Swap: https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
- LangGraph multi-agent: https://langchain-ai.github.io/langgraph/tutorials/multi_agent/
- LangGraph custom ReAct: https://langchain-ai.github.io/langgraph/how-tos/react-agent-from-scratch/
- LiteLLM Ollama docs: https://docs.litellm.ai/docs/providers/ollama
- RouteLLM + Ollama example: https://github.com/lm-sys/RouteLLM/blob/main/examples/routing_to_local_models.md
- LLMRouter framework: https://ulab-uiuc.github.io/LLMRouter/
- Functionary (tool-call fine-tuned): https://github.com/MeetKai/functionary
- Constrained generation (outlines): https://github.com/dottxt-ai/outlines