# LangGraph: Multi-Model Routing Architecture

## Problem

`create_react_agent` uses one model for all steps in the ReAct loop:

```
qwen3:4b → decide to call tool   ~37s
         → run tool              ~1s
qwen3:4b → final answer          ~37s
─────────────────────────────────────
Total                            ~75s
```

The routing step is classification + argument extraction — low entropy, constrained output. It does not need the same model as answer generation.

---

## Is the Pattern Established? (2025 Research)

Yes. The 2025 consensus from multiple papers treats heterogeneous model architectures (small for routing, large for generation) as **settled production engineering**, not research:

- SLMs (1–12B) match or exceed LLMs on schema-constrained tasks (tool calls, JSON, function calling) at 10×–100× lower compute cost (arXiv 2510.03847, arXiv 2506.02153)
- MasRouter (ACL 2025): routing in multi-agent graphs reduces costs 2× without quality loss
- Cascade routing (ICLR 2025): 4% accuracy improvement, 30–92% cost reduction vs naive routing
- NVIDIA research (2025): "Small Language Models are the Future of Agentic AI"

**Limitations acknowledged in the literature:**

- A bad router defeats the purpose — classifier quality is critical
- Cascade (try small, escalate if uncertain) adds latency on queries that escalate
- Pre-trained routers (RouteLLM, etc.)
are calibrated for specific model pairs; local model pairs need independent validation

---

## Three-Tier Architecture (small → medium → large)

### Concept

```
Incoming query
      ↓
[Router: tiny model or embedding classifier]                  ~1-2s
  ├── simple/conversational → [Medium: qwen3:4b]              ~20s
  ├── needs tool call       → [Medium: qwen3:4b + tools]      ~20-40s
  └── complex/multi-step    → [Large: qwen3:8b + sub-agents]  ~60s+
```

### When to route to large

Signals that justify loading a larger model:

- Multi-step reasoning required (math, code, planning)
- Sub-agent orchestration (the agent needs to call other agents)
- Explicit reasoning request ("think through", "analyze carefully")
- Low confidence from the medium model (cascade pattern)

### Trade-offs of three-tier vs two-tier

| | Two-tier | Three-tier |
|--|---------|-----------|
| Simple queries | small router + medium answer | small router + medium answer (same) |
| Complex queries | medium (may struggle) | swap to large (better quality) |
| GPU constraint | manageable | hard — see below |
| Routing error cost | low | high (wrong tier = much slower) |

---

## The 8GB GPU Constraint — Core Problem

This is the central issue.
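A back-of-envelope check makes the constraint concrete. The footprints below are the approximate figures used in this doc; the qwen3:8b size and the usable-VRAM threshold (reserving headroom for KV cache and runtime overhead) are assumptions:

```python
# Approximate resident footprints (GB) for the quantized models in this doc.
# qwen3:8b at ~5.5GB is inferred from "qwen3:4b + qwen3:8b = ~8GB".
SIZES_GB = {
    "qwen2.5:0.5b": 0.4,
    "qwen2.5:1.5b": 1.2,
    "qwen3:4b": 2.5,
    "qwen3:8b": 5.5,
}
USABLE_VRAM_GB = 7.0  # assumption: ~1GB of the 8GB reserved for KV cache / overhead

def fits(*models: str) -> bool:
    """True if the given models can be resident in VRAM simultaneously."""
    return sum(SIZES_GB[m] for m in models) <= USABLE_VRAM_GB

print(fits("qwen2.5:0.5b", "qwen3:4b", "qwen2.5:1.5b"))  # two-tier stack: fits
print(fits("qwen3:4b", "qwen3:8b"))                      # medium + large: does not
```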
Research numbers on model swapping (2025):

**Cold swap from disk (no optimization)**

- TTFT exceeds 140s for 7B-class models on HDD; 5–15s on NVMe SSD
- Not viable for interactive use at any tier

**vLLM Sleep Mode (offload to CPU RAM, not disk)**

- 18–200× faster than cold start; TTFT 2–3s per switch
- vLLM-specific — not available in Ollama

**Ollama behavior on 8GB VRAM**

- Default `keep_alive`: 5 minutes — model stays warm after use
- Two models simultaneously: qwen3:4b (~2.5GB) + qwen2.5:1.5b (~1.2GB) = ~3.7GB — fits
- qwen3:4b + qwen3:8b = ~8GB — does not reliably fit; eviction required
- Sequential swap in Ollama: Ollama evicts the old model, loads the new one from SSD (~5–15s on NVMe)
- Known Ollama bug: model spills from VRAM to RAM → all subsequent loads stay on CPU until restart

**Conclusion for three-tier on a single 8GB GPU:**

| Tier switch | Cost | Viable? |
|------------|------|---------|
| tiny router → medium (qwen3:4b) | model swap ~5-15s if router is separate | borderline |
| medium → large (qwen3:8b) | evict qwen3:4b, load qwen3:8b = ~5-15s additional | no, for interactive |
| Keep medium always warm, route to large on demand | 5-15s swap overhead per complex query | acceptable if complex queries are rare |

**Honest verdict: three-tier with model swapping is not viable for interactive per-turn latency on 8GB VRAM with Ollama.** vLLM with Sleep Mode would make it viable (2–3s per switch) but requires replacing Ollama.

---

## Practical Architecture for 8GB GPU (Ollama)

### Option 1: Two-tier, both models always in VRAM (recommended)

Keep two small models loaded simultaneously:

```
qwen2.5:0.5b     (~0.4GB) — router: tool call decision + arg extraction
qwen3:4b         (~2.5GB) — answer: all generation
nomic-embed-text (CPU)    — embedding: search and store
qwen2.5:1.5b     (~1.2GB) — extraction: mem0 fact extraction (GPU)
─────────────────────────────────────────────────────────
Total VRAM: ~4.1GB — well within 8GB
```

No swapping needed.
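Both models can be pinned resident so Ollama's default 5-minute `keep_alive` never evicts them. A minimal sketch using Ollama's documented `/api/generate` endpoint with `keep_alive: -1` (an empty generate call loads the model without producing output; the URL is a placeholder):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # placeholder; use the agent's GPU host

def pin_payload(model: str) -> dict:
    # An /api/generate request with no prompt and keep_alive=-1 loads the
    # model into VRAM and keeps it resident until the server restarts.
    return {"model": model, "keep_alive": -1}

def pin(model: str) -> None:
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(pin_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # requires a running Ollama server

# pin("qwen2.5:0.5b")
# pin("qwen3:4b")
```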
The router runs first (~1-2s); the answer model runs after (~20s).

```
Router → tool call JSON or "no tool"   ~1-2s
       → tool runs (if needed)         ~1s
→ Answer model generates reply         ~20s
─────────────────────────────────────────────
Total                                  ~22-23s
```

vs the current two-call approach: ~75s.

### Option 2: Semantic routing (encoder-only, free)

Use nomic-embed-text (already running on CPU) as the router:

```python
query_vec = embed(query)  # nomic-embed-text embedding of the user message
sims = {
    "search_memory": cosine(query_vec, memory_topic_vec),
    "web_search": cosine(query_vec, web_topic_vec),
}
tool, score = max(sims.items(), key=lambda kv: kv[1])
# If score > threshold → call that tool directly,
# then pass result + original query to the answer model
```

Zero VRAM overhead. ~50ms routing. Tool args can't be extracted from the embedding alone — arg construction must be hardcoded (e.g. query = original user message).

### Option 3: Three-tier with rare large-model escalation

Keep qwen3:4b warm. Route to qwen3:8b only for explicitly complex tasks. Accept ~10s swap overhead for those queries; qwen3:8b gets unloaded afterwards.

```
Router → simple  → qwen3:4b              ~20s (no swap)
Router → complex → evict 4b, load 8b     ~30s (10s swap + 20s inference)
```

Works if <20% of queries are "complex" and users accept occasional slow responses. Best implemented with an explicit user trigger ("think about this carefully") rather than automatic classification, to avoid swap overhead on misclassified queries.

---

## LangGraph Implementation

`create_react_agent` locks to one model. An explicit graph supports per-node models:

```python
from langgraph.graph import StateGraph, MessagesState, END
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama

router_model = ChatOllama(model="qwen2.5:0.5b", base_url=OLLAMA_GPU_URL)
answer_model = ChatOllama(model="qwen3:4b", base_url=OLLAMA_GPU_URL)
# For Option 3: large_model = ChatOllama(model="qwen3:8b", ...)
def router_node(state):
    # Small model only outputs tool call JSON or nothing
    return {"messages": [router_model.bind_tools(tools).invoke(state["messages"])]}

def answer_node(state):
    # Large(r) model generates the human reply — no tools bound
    return {"messages": [answer_model.invoke(state["messages"])]}

def route(state) -> str:
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", []) else "answer"

graph = StateGraph(MessagesState)
graph.add_node("router", router_node)
graph.add_node("tools", ToolNode(tools))
graph.add_node("answer", answer_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route)
graph.add_edge("tools", "answer")
graph.add_edge("answer", END)
agent = graph.compile()
```

For three-tier, add a complexity classifier node before the router that selects `answer_model = medium_model or large_model` based on query signals.

---

## Open Source Routing Tools

| Tool | Ollama support | Status | Notes |
|------|---------------|--------|-------|
| LiteLLM | First-class | Active 2025 | Proxy with tiered routing, fallbacks, load balancing |
| RouteLLM (LMSYS) | Yes (documented) | Stale (last commit Aug 2024) | Calibrated for GPT-4 vs Mixtral pair |
| Router-R1 | No | Active (NeurIPS 2025) | RL-based, open-sourced on HuggingFace |
| LLMRouter (ulab) | No | Research 2025 | 16+ routing methods, fair comparison framework |
| FrugalGPT | No direct | Algorithm only | Portkey.ai has implementation guide |

**Most practical for Ollama**: LiteLLM proxy with a tiered model config. Handles routing, fallbacks, and load balancing without changing agent code.
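For the complexity classifier mentioned above, a keyword-trigger check is enough to implement the explicit-trigger approach; the trigger phrases and function name here are illustrative, not a tuned classifier:

```python
# Illustrative trigger phrases for escalating to the large model (Option 3).
# Substring matching is crude, so prefer explicit, unambiguous phrases.
COMPLEX_TRIGGERS = (
    "think through",
    "analyze carefully",
    "think about this carefully",
    "step by step",
)

def needs_large_model(query: str) -> bool:
    """True only on an explicit reasoning request, so a misclassified
    simple query never pays the ~10s model-swap penalty."""
    q = query.lower()
    return any(trigger in q for trigger in COMPLEX_TRIGGERS)

# In the classifier node: pick large_model (qwen3:8b) when
# needs_large_model(query) is True, otherwise keep qwen3:4b warm.
```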
---

## Summary: What to Do for Adolf

| | Recommendation |
|--|---------------|
| Quick win (zero risk) | Remove "always call search_memory" from the system prompt — the history buffer covers conversational recall, saves ~37s |
| Best architecture for 8GB | Two-tier: qwen2.5:0.5b router + qwen3:4b answer, both in VRAM, ~22s total |
| Three-tier feasibility | Not viable for interactive use with Ollama model swapping; viable with vLLM Sleep Mode (~3s swap) if Ollama is replaced |
| Complex task routing | Use an explicit user trigger or keyword detection rather than an automatic classifier — avoids the swap penalty on misclassification |

---

## References

- arXiv 2510.03847 — Small Language Models for Agentic Systems: A Survey
- arXiv 2506.02153 — Small Language Models are the Future of Agentic AI
- arXiv 2406.04692 — Mixture-of-Agents Enhances LLM Capabilities (original MoA paper)
- arXiv 2410.10347 — A Unified Approach to Routing and Cascading for LLMs (ICLR 2025)
- MasRouter — ACL 2025: https://aclanthology.org/2025.acl-long.757.pdf
- Router-R1 — NeurIPS 2025: https://github.com/ulab-uiuc/Router-R1
- vLLM Sleep Mode: https://blog.vllm.ai/2025/10/26/sleep-mode.html
- NVIDIA GPU Memory Swap: https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
- LangGraph multi-agent: https://langchain-ai.github.io/langgraph/tutorials/multi_agent/
- LangGraph custom ReAct: https://langchain-ai.github.io/langgraph/how-tos/react-agent-from-scratch/
- LiteLLM Ollama docs: https://docs.litellm.ai/docs/providers/ollama
- RouteLLM + Ollama example: https://github.com/lm-sys/RouteLLM/blob/main/examples/routing_to_local_models.md
- LLMRouter framework: https://ulab-uiuc.github.io/LLMRouter/
- Functionary (tool-call fine-tuned): https://github.com/MeetKai/functionary
- Constrained generation (outlines): https://github.com/dottxt-ai/outlines