# LangGraph: Multi-Model Routing Architecture

## Problem

`create_react_agent` uses one model for all steps in the ReAct loop:

```
qwen3:4b → decide to call tool   ~37s
         → run tool              ~1s
qwen3:4b → final answer          ~37s
─────────────────────────────────────
Total                            ~75s
```

The routing step is classification + argument extraction — low entropy, constrained output. It does not need the same model as answer generation.

---

## Is the Pattern Established? (2025 Research)

Yes. The 2025 consensus from multiple papers treats heterogeneous model architectures (small for routing, large for generation) as **settled production engineering**, not research:

- SLMs (1–12B) match or exceed LLMs on schema-constrained tasks (tool calls, JSON, function calling) at 10×–100× lower compute cost (arXiv 2510.03847, arXiv 2506.02153)
- MasRouter (ACL 2025): routing in multi-agent graphs reduces costs 2× without quality loss
- Cascade routing (ICLR 2025): 4% accuracy improvement, 30–92% cost reduction vs naive routing
- NVIDIA research (2025): "Small Language Models are the Future of Agentic AI"

**Limitations acknowledged in the literature:**

- A bad router defeats the purpose — classifier quality is critical
- Cascade (try small, escalate if uncertain) adds latency on queries that escalate
- Pre-trained routers (RouteLLM, etc.)
are calibrated for specific model pairs; local model pairs need independent validation

---

## Three-Tier Architecture (small → medium → large)

### Concept

```
Incoming query
      ↓
[Router: tiny model or embedding classifier]                  ~1-2s
  ├── simple/conversational → [Medium: qwen3:4b]              ~20s
  ├── needs tool call       → [Medium: qwen3:4b + tools]      ~20-40s
  └── complex/multi-step    → [Large: qwen3:8b + sub-agents]  ~60s+
```

### When to route to large

Signals that justify loading a larger model:

- Multi-step reasoning required (math, code, planning)
- Sub-agent orchestration (the agent needs to call other agents)
- Explicit reasoning request ("think through", "analyze carefully")
- Low confidence from the medium model (cascade pattern)

### Trade-offs of three-tier vs two-tier

| | Two-tier | Three-tier |
|--|---------|-----------|
| Simple queries | small router + medium answer | small router + medium answer (same) |
| Complex queries | medium (may struggle) | swap to large (better quality) |
| GPU constraint | manageable | hard — see below |
| Routing error cost | low | high (wrong tier = much slower) |

---

## The 8GB GPU Constraint — Core Problem

This is the central issue.
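A back-of-envelope check makes the constraint concrete. The footprints below are the approximate figures used in this doc; the qwen3:8b size and the usable-VRAM threshold (reserving headroom for KV cache and runtime overhead) are assumptions:

```python
# Approximate resident footprints (GB) for the quantized models in this doc.
# qwen3:8b at ~5.5GB is inferred from "qwen3:4b + qwen3:8b = ~8GB".
SIZES_GB = {
    "qwen2.5:0.5b": 0.4,
    "qwen2.5:1.5b": 1.2,
    "qwen3:4b": 2.5,
    "qwen3:8b": 5.5,
}
USABLE_VRAM_GB = 7.0  # assumption: ~1GB of the 8GB reserved for KV cache / overhead

def fits(*models: str) -> bool:
    """True if the given models can be resident in VRAM simultaneously."""
    return sum(SIZES_GB[m] for m in models) <= USABLE_VRAM_GB

print(fits("qwen2.5:0.5b", "qwen3:4b", "qwen2.5:1.5b"))  # two-tier stack: fits
print(fits("qwen3:4b", "qwen3:8b"))                      # medium + large: does not
```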
Research numbers on model swapping (2025):

**Cold swap from disk (no optimization)**

- TTFT exceeds 140s for 7B-class models on HDD; 5–15s on NVMe SSD
- Not viable for interactive use at any tier

**vLLM Sleep Mode (offload to CPU RAM, not disk)**

- 18–200× faster than cold start; TTFT 2–3s per switch
- vLLM-specific — not available in Ollama

**Ollama behavior on 8GB VRAM**

- Default `keep_alive`: 5 minutes — model stays warm after use
- Two models simultaneously: qwen3:4b (~2.5GB) + qwen2.5:1.5b (~1.2GB) = ~3.7GB — fits
- qwen3:4b + qwen3:8b = ~8GB — does not reliably fit; eviction required
- Sequential swap in Ollama: Ollama evicts the old model, loads the new one from SSD (~5–15s on NVMe)
- Known Ollama bug: model spills from VRAM to RAM → all subsequent loads stay on CPU until restart

**Conclusion for three-tier on a single 8GB GPU:**

| Tier switch | Cost | Viable? |
|------------|------|---------|
| tiny router → medium (qwen3:4b) | model swap ~5-15s if router is separate | borderline |
| medium → large (qwen3:8b) | evict qwen3:4b, load qwen3:8b = ~5-15s additional | no, for interactive |
| Keep medium always warm, route to large on demand | 5-15s swap overhead per complex query | acceptable if complex queries are rare |

**Honest verdict: three-tier with model swapping is not viable for interactive per-turn latency on 8GB VRAM with Ollama.** vLLM with Sleep Mode would make it viable (2–3s per switch) but requires replacing Ollama.

---

## Practical Architecture for 8GB GPU (Ollama)

### Option 1: Two-tier, both models always in VRAM (recommended)

Keep two small models loaded simultaneously:

```
qwen2.5:0.5b     (~0.4GB) — router: tool call decision + arg extraction
qwen3:4b         (~2.5GB) — answer: all generation
nomic-embed-text (CPU)    — embedding: search and store
qwen2.5:1.5b     (~1.2GB) — extraction: mem0 fact extraction (GPU)
─────────────────────────────────────────────────────────
Total VRAM: ~4.1GB — well within 8GB
```

No swapping needed.
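Both models can be pinned resident so Ollama's default 5-minute `keep_alive` never evicts them. A minimal sketch using Ollama's documented `/api/generate` endpoint with `keep_alive: -1` (an empty generate call loads the model without producing output; the URL is a placeholder):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # placeholder; use the agent's GPU host

def pin_payload(model: str) -> dict:
    # An /api/generate request with no prompt and keep_alive=-1 loads the
    # model into VRAM and keeps it resident until the server restarts.
    return {"model": model, "keep_alive": -1}

def pin(model: str) -> None:
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(pin_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # requires a running Ollama server

# pin("qwen2.5:0.5b")
# pin("qwen3:4b")
```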
The router runs first (~1-2s); the answer model runs after (~20s).

```
Router → tool call JSON or "no tool"   ~1-2s
       → tool runs (if needed)         ~1s
→ Answer model generates reply         ~20s
─────────────────────────────────────────────
Total                                  ~22-23s
```

vs the current two-call approach: ~75s.

### Option 2: Semantic routing (encoder-only, free)

Use nomic-embed-text (already running on CPU) as the router:

```python
query_vec = embed(query)  # nomic-embed-text embedding of the user message
sims = {
    "search_memory": cosine(query_vec, memory_topic_vec),
    "web_search": cosine(query_vec, web_topic_vec),
}
tool, score = max(sims.items(), key=lambda kv: kv[1])
# If score > threshold → call that tool directly,
# then pass result + original query to the answer model
```

Zero VRAM overhead. ~50ms routing. Tool args can't be extracted from the embedding alone — arg construction must be hardcoded (e.g. query = original user message).

### Option 3: Three-tier with rare large-model escalation

Keep qwen3:4b warm. Route to qwen3:8b only for explicitly complex tasks. Accept ~10s swap overhead for those queries; qwen3:8b gets unloaded afterwards.

```
Router → simple  → qwen3:4b              ~20s (no swap)
Router → complex → evict 4b, load 8b     ~30s (10s swap + 20s inference)
```

Works if <20% of queries are "complex" and users accept occasional slow responses. Best implemented with an explicit user trigger ("think about this carefully") rather than automatic classification, to avoid swap overhead on misclassified queries.

---

## LangGraph Implementation

`create_react_agent` locks to one model. An explicit graph supports per-node models:

```python
from langgraph.graph import StateGraph, MessagesState, END
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama

router_model = ChatOllama(model="qwen2.5:0.5b", base_url=OLLAMA_GPU_URL)
answer_model = ChatOllama(model="qwen3:4b", base_url=OLLAMA_GPU_URL)
# For Option 3: large_model = ChatOllama(model="qwen3:8b", ...)
def router_node(state):
    # Small model only outputs tool call JSON or nothing
    return {"messages": [router_model.bind_tools(tools).invoke(state["messages"])]}

def answer_node(state):
    # Large(r) model generates the human reply — no tools bound
    return {"messages": [answer_model.invoke(state["messages"])]}

def route(state) -> str:
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", []) else "answer"

graph = StateGraph(MessagesState)
graph.add_node("router", router_node)
graph.add_node("tools", ToolNode(tools))
graph.add_node("answer", answer_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route)
graph.add_edge("tools", "answer")
graph.add_edge("answer", END)
agent = graph.compile()
```

For three-tier, add a complexity classifier node before the router that selects `answer_model = medium_model or large_model` based on query signals.

---

## Open Source Routing Tools

| Tool | Ollama support | Status | Notes |
|------|---------------|--------|-------|
| LiteLLM | First-class | Active 2025 | Proxy with tiered routing, fallbacks, load balancing |
| RouteLLM (LMSYS) | Yes (documented) | Stale (last commit Aug 2024) | Calibrated for GPT-4 vs Mixtral pair |
| Router-R1 | No | Active (NeurIPS 2025) | RL-based, open-sourced on HuggingFace |
| LLMRouter (ulab) | No | Research 2025 | 16+ routing methods, fair comparison framework |
| FrugalGPT | No direct | Algorithm only | Portkey.ai has implementation guide |

**Most practical for Ollama**: LiteLLM proxy with a tiered model config. Handles routing, fallbacks, and load balancing without changing agent code.
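For the complexity classifier mentioned above, a keyword-trigger check is enough to implement the explicit-trigger approach; the trigger phrases and function name here are illustrative, not a tuned classifier:

```python
# Illustrative trigger phrases for escalating to the large model (Option 3).
# Substring matching is crude, so prefer explicit, unambiguous phrases.
COMPLEX_TRIGGERS = (
    "think through",
    "analyze carefully",
    "think about this carefully",
    "step by step",
)

def needs_large_model(query: str) -> bool:
    """True only on an explicit reasoning request, so a misclassified
    simple query never pays the ~10s model-swap penalty."""
    q = query.lower()
    return any(trigger in q for trigger in COMPLEX_TRIGGERS)

# In the classifier node: pick large_model (qwen3:8b) when
# needs_large_model(query) is True, otherwise keep qwen3:4b warm.
```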
---

## Summary: What to Do for Adolf

| | Recommendation |
|--|---------------|
| Quick win (zero risk) | Remove "always call search_memory" from the system prompt — the history buffer covers conversational recall, saves ~37s |
| Best architecture for 8GB | Two-tier: qwen2.5:0.5b router + qwen3:4b answer, both in VRAM, ~22s total |
| Three-tier feasibility | Not viable for interactive use with Ollama model swapping; viable with vLLM Sleep Mode (~3s swap) if Ollama is replaced |
| Complex task routing | Use an explicit user trigger or keyword detection rather than an automatic classifier — avoids the swap penalty on misclassification |

---

## References

- arXiv 2510.03847 — Small Language Models for Agentic Systems: A Survey
- arXiv 2506.02153 — Small Language Models are the Future of Agentic AI
- arXiv 2406.04692 — Mixture-of-Agents Enhances LLM Capabilities (original MoA paper)
- arXiv 2410.10347 — A Unified Approach to Routing and Cascading for LLMs (ICLR 2025)
- MasRouter — ACL 2025: https://aclanthology.org/2025.acl-long.757.pdf
- Router-R1 — NeurIPS 2025: https://github.com/ulab-uiuc/Router-R1
- vLLM Sleep Mode: https://blog.vllm.ai/2025/10/26/sleep-mode.html
- NVIDIA GPU Memory Swap: https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
- LangGraph multi-agent: https://langchain-ai.github.io/langgraph/tutorials/multi_agent/
- LangGraph custom ReAct: https://langchain-ai.github.io/langgraph/how-tos/react-agent-from-scratch/
- LiteLLM Ollama docs: https://docs.litellm.ai/docs/providers/ollama
- RouteLLM + Ollama example: https://github.com/lm-sys/RouteLLM/blob/main/examples/routing_to_local_models.md
- LLMRouter framework: https://ulab-uiuc.github.io/LLMRouter/
- Functionary (tool-call fine-tuned): https://github.com/MeetKai/functionary
- Constrained generation (outlines): https://github.com/dottxt-ai/outlines