adolf/langgraph.md
2026-03-05 11:22:34 +00:00

LangGraph: Multi-Model Routing Architecture

Problem

create_react_agent uses one model for all steps in the ReAct loop:

qwen3:4b → decide to call tool    ~37s
→ run tool                         ~1s
qwen3:4b → final answer           ~37s
─────────────────────────────────────
Total                             ~75s

The routing step is classification + argument extraction — low entropy, constrained output. It does not need the same model as answer generation.


Is the Pattern Established? (2025 Research)

Yes. The 2025 consensus from multiple papers treats heterogeneous model architectures (small for routing, large for generation) as settled production engineering, not research:

  • SLMs (1–12B) match or exceed LLMs on schema-constrained tasks (tool calls, JSON, function calling) at 10–100× lower compute cost (arXiv 2510.03847, arXiv 2506.02153)
  • MasRouter (ACL 2025): routing in multi-agent graphs reduces costs 2× without quality loss
  • Cascade routing (ICLR 2025): 4% accuracy improvement, 30–92% cost reduction vs naive routing
  • NVIDIA research (2025): "Small Language Models are the Future of Agentic AI"

Limitations acknowledged in literature:

  • Bad router defeats the purpose — classifier quality is critical
  • Cascade (try small, escalate if uncertain) adds latency on queries that escalate
  • Pre-trained routers (RouteLLM, etc.) are calibrated for specific model pairs; local model pairs need independent validation

Three-Tier Architecture (small → medium → large)

Concept

Incoming query
    ↓
[Router: tiny model or embedding classifier]  ~1-2s
    ├── simple/conversational → [Medium: qwen3:4b]           ~20s
    ├── needs tool call       → [Medium: qwen3:4b + tools]   ~20-40s
    └── complex/multi-step    → [Large: qwen3:8b + sub-agents] ~60s+

When to route to large

Signals that justify loading a larger model:

  • Multi-step reasoning required (math, code, planning)
  • Sub-agent orchestration (the agent needs to call other agents)
  • Explicit reasoning request ("think through", "analyze carefully")
  • Low confidence from medium model (cascade pattern)
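The signals above can be folded into a single escalation check. A minimal sketch — the trigger phrases, complexity markers, and confidence cutoff are illustrative assumptions, not a tuned list:

```python
# Illustrative heuristic for the "route to large" decision.
EXPLICIT_TRIGGERS = ("think through", "analyze carefully", "step by step")
COMPLEX_MARKERS = ("prove", "refactor", "derive", "plan out")

def needs_large_model(query: str, medium_confidence: float = 1.0) -> bool:
    """Return True when the signals justify loading the larger model."""
    q = query.lower()
    if any(t in q for t in EXPLICIT_TRIGGERS):
        return True  # explicit reasoning request
    if any(m in q for m in COMPLEX_MARKERS):
        return True  # likely multi-step reasoning
    return medium_confidence < 0.5  # cascade: medium model was unsure
```

In practice the marker lists would be tuned against logged queries; the cascade branch assumes the medium model exposes some confidence proxy (e.g. self-reported or logprob-based).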

Trade-offs of three-tier vs two-tier

|  | Two-tier | Three-tier |
|---|---|---|
| Simple queries | small router + medium answer | small router + medium answer (same) |
| Complex queries | medium (may struggle) | swap to large (better quality) |
| GPU constraint | manageable | hard (see below) |
| Routing error cost | low | high (wrong tier = much slower) |

The 8GB GPU Constraint — Core Problem

This is the central issue. Research numbers on model swapping (2025):

Cold swap from disk (no optimization)

  • TTFT exceeds 140s for 7B-class models on HDD; 5–15s on NVMe SSD
  • Not viable for interactive use at any tier

vLLM Sleep Mode (offload to CPU RAM, not disk)

  • 18–200× faster than cold start; TTFT 2–3s per switch
  • vLLM-specific — not available in Ollama

Ollama behavior on 8GB VRAM

  • Default keep_alive: 5 minutes — model stays warm after use
  • Two models simultaneously: qwen3:4b (~2.5GB) + qwen2.5:1.5b (~1.2GB) = ~3.7GB — fits
  • qwen3:4b + qwen3:8b = ~8GB — does not reliably fit; eviction required
  • Sequential swap in Ollama: Ollama evicts the old model, loads the new one from SSD (~5–15s on NVMe)
  • Known Ollama bug: model spills from VRAM to RAM → all subsequent loads stay on CPU until restart

Conclusion for three-tier on single 8GB GPU:

| Tier switch | Cost | Viable? |
|---|---|---|
| tiny router → medium (qwen3:4b) | model swap ~5-15s if router is separate | borderline |
| medium → large (qwen3:8b) | evict qwen3:4b, load qwen3:8b = ~5-15s additional | no, for interactive |
| keep medium always warm, route to large on demand | ~5-15s swap overhead per complex query | acceptable if complex queries are rare |

Honest verdict: three-tier with model swapping is not viable for interactive per-turn latency on 8GB VRAM with Ollama. vLLM with Sleep Mode would make it viable (2–3s switch) but requires replacing Ollama.


Practical Architecture for 8GB GPU (Ollama)

Option 1: Two-tier with all models resident (recommended)

Keep the router and answer models loaded simultaneously, alongside the existing embedding and extraction models:

qwen2.5:0.5b  (~0.4GB) — router: tool call decision + arg extraction
qwen3:4b      (~2.5GB) — answer: all generation
nomic-embed-text (CPU) — embedding: search and store
qwen2.5:1.5b  (~1.2GB) — extraction: mem0 fact extraction (GPU)
─────────────────────────────────────────────────────────
Total VRAM: ~4.1GB — well within 8GB

No swapping needed. Router runs first (~1-2s), answer model runs after (~20s).

Router → tool call JSON or "no tool"    ~1-2s
→ tool runs (if needed)                  ~1s
→ Answer model generates reply          ~20s
─────────────────────────────────────────────
Total                                   ~22-23s

vs current two-call approach: ~75s.

Option 2: Semantic routing (encoder-only, free)

Use nomic-embed-text (already running on CPU) as the router:

query_vec = embed(query)
sims = {
    "search_memory": cosine(query_vec, memory_topic_vec),
    "web_search":    cosine(query_vec, web_topic_vec),
}
# If max sim > threshold → call that tool directly
# Then pass result + original query to answer model

Zero VRAM overhead. ~50ms routing. Can't extract tool args from embedding alone — needs hardcoded arg construction (e.g. query = original user message).

Option 3: Three-tier with rare large-model escalation

Keep qwen3:4b warm. Route to qwen3:8b only for explicitly complex tasks. Accept ~10s swap overhead for those queries. qwen3:8b gets unloaded after.

Router → simple → qwen3:4b              ~20s (no swap)
Router → complex → evict 4b, load 8b → ~30s (10s swap + 20s inference)

Works if <20% of queries are "complex" and users accept occasional slow responses. Best implemented with explicit user trigger ("think about this carefully") rather than automatic classification, to avoid swap overhead on misclassified queries.
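The "unloaded after" behavior can be expressed through Ollama's keep_alive setting rather than manual eviction. A configuration sketch, assuming langchain_ollama's `keep_alive` parameter with standard Ollama semantics (-1 = keep resident indefinitely, 0 = unload immediately after the request); values here are illustrative:

```python
from langchain_ollama import ChatOllama

# Medium model stays warm for every turn; large model frees its
# VRAM as soon as the complex query finishes.
medium_model = ChatOllama(model="qwen3:4b", keep_alive=-1)  # always resident
large_model = ChatOllama(model="qwen3:8b", keep_alive=0)    # evict after use
```

This keeps the swap penalty confined to the rare complex queries and avoids qwen3:8b lingering in VRAM and crowding out the medium model.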


LangGraph Implementation

create_react_agent locks to one model. Explicit graph supports per-node models:

from langgraph.graph import StateGraph, MessagesState, END
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama

router_model = ChatOllama(model="qwen2.5:0.5b", base_url=OLLAMA_GPU_URL)
answer_model = ChatOllama(model="qwen3:4b",     base_url=OLLAMA_GPU_URL)
# For Option 3: large_model = ChatOllama(model="qwen3:8b", ...)

def router_node(state):
    # Small model only outputs tool call JSON or nothing
    return {"messages": [router_model.bind_tools(tools).invoke(state["messages"])]}

def answer_node(state):
    # Large(r) model generates human reply — no tools bound
    return {"messages": [answer_model.invoke(state["messages"])]}

def route(state) -> str:
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", []) else "answer"

graph = StateGraph(MessagesState)
graph.add_node("router", router_node)
graph.add_node("tools", ToolNode(tools))
graph.add_node("answer", answer_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route)
graph.add_edge("tools", "answer")
graph.add_edge("answer", END)
agent = graph.compile()

For three-tier, add a complexity classifier node before the router that selects answer_model = medium_model or large_model based on query signals.
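A minimal sketch of that classifier node, written against plain dicts with string messages so the selection logic is visible; in the real graph the node would read `state["messages"][-1].content` and be wired in with `graph.add_node` like the others. The trigger list is an illustrative assumption:

```python
# Hypothetical trigger phrases for escalation; tune against real queries.
TRIGGERS = ("think through", "analyze carefully", "step by step")

def classifier_node(state: dict) -> dict:
    """Runs before the router; tags the turn with a model tier."""
    text = state["messages"][-1].lower()
    return {"tier": "large" if any(t in text for t in TRIGGERS) else "medium"}

def pick_answer_model(state: dict, medium_model, large_model):
    """answer_node uses this to choose which ChatOllama instance to invoke."""
    return large_model if state.get("tier") == "large" else medium_model
```

The `tier` key implies extending MessagesState with an extra field; the rest of the graph is unchanged.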


Open Source Routing Tools

| Tool | Ollama support | Status | Notes |
|---|---|---|---|
| LiteLLM | First-class | Active 2025 | Proxy with tiered routing, fallbacks, load balancing |
| RouteLLM (LMSYS) | Yes (documented) | Stale (last commit Aug 2024) | Calibrated for GPT-4 vs Mixtral pair |
| Router-R1 | No | Active (NeurIPS 2025) | RL-based, open-sourced on HuggingFace |
| LLMRouter (ulab) | No | Research 2025 | 16+ routing methods, fair comparison framework |
| FrugalGPT | No direct | Algorithm only | Portkey.ai has implementation guide |

Most practical for Ollama: LiteLLM proxy with tiered model config. Handles routing, fallbacks, and load balancing without changing agent code.
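A sketch of what that LiteLLM proxy config could look like, assuming LiteLLM's standard `model_list` format with the `ollama/` provider prefix; the tier aliases and fallback mapping are illustrative assumptions, not a tested config:

```yaml
model_list:
  - model_name: tier-router          # hypothetical alias
    litellm_params:
      model: ollama/qwen2.5:0.5b
      api_base: http://localhost:11434
  - model_name: tier-answer
    litellm_params:
      model: ollama/qwen3:4b
      api_base: http://localhost:11434

litellm_settings:
  fallbacks:
    - tier-router: [tier-answer]     # escalate if the small tier fails
```

The agent then targets the alias (`tier-router`, `tier-answer`) and the proxy handles model selection and fallback.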


Summary: What to Do for Adolf

|  | Recommendation |
|---|---|
| Quick win (zero risk) | Remove "always call search_memory" from system prompt — history buffer covers conversational recall, saves ~37s |
| Best architecture for 8GB | Two-tier: qwen2.5:0.5b router + qwen3:4b answer, both in VRAM, ~22s total |
| Three-tier feasibility | Not viable for interactive use with Ollama model swapping; viable with vLLM Sleep Mode (~3s swap) if Ollama is replaced |
| Complex task routing | Use explicit user trigger or keyword detection rather than automatic classifier — avoids swap penalty on misclassification |

References