adolf/langgraph.md
2026-03-05 11:22:34 +00:00

LangGraph: Multi-Model Routing Architecture

Problem

create_react_agent uses one model for all steps in the ReAct loop:

qwen3:4b → decide to call tool    ~37s
→ run tool                         ~1s
qwen3:4b → final answer           ~37s
─────────────────────────────────────
Total                             ~75s

The routing step is classification + argument extraction — low entropy, constrained output. It does not need the same model as answer generation.


Is the Pattern Established? (2025 Research)

Yes. The 2025 consensus from multiple papers treats heterogeneous model architectures (small for routing, large for generation) as settled production engineering, not research:

  • SLMs (1–12B) match or exceed LLMs on schema-constrained tasks (tool calls, JSON, function calling) at 10–100× lower compute cost (arXiv 2510.03847, arXiv 2506.02153)
  • MasRouter (ACL 2025): routing in multi-agent graphs reduces costs 2× without quality loss
  • Cascade routing (ICLR 2025): 4% accuracy improvement, 30–92% cost reduction vs naive routing
  • NVIDIA research (2025): "Small Language Models are the Future of Agentic AI"

Limitations acknowledged in literature:

  • Bad router defeats the purpose — classifier quality is critical
  • Cascade (try small, escalate if uncertain) adds latency on queries that escalate
  • Pre-trained routers (RouteLLM, etc.) are calibrated for specific model pairs; local model pairs need independent validation

Three-Tier Architecture (small → medium → large)

Concept

Incoming query
    ↓
[Router: tiny model or embedding classifier]  ~1-2s
    ├── simple/conversational → [Medium: qwen3:4b]           ~20s
    ├── needs tool call       → [Medium: qwen3:4b + tools]   ~20-40s
    └── complex/multi-step    → [Large: qwen3:8b + sub-agents] ~60s+

When to route to large

Signals that justify loading a larger model:

  • Multi-step reasoning required (math, code, planning)
  • Sub-agent orchestration (the agent needs to call other agents)
  • Explicit reasoning request ("think through", "analyze carefully")
  • Low confidence from medium model (cascade pattern)
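The signals above can be folded into a single escalation check. A minimal sketch — the trigger phrases, complexity markers, and confidence cutoff are illustrative assumptions, not a tuned list:

```python
# Illustrative heuristic for the "route to large" decision.
EXPLICIT_TRIGGERS = ("think through", "analyze carefully", "step by step")
COMPLEX_MARKERS = ("prove", "refactor", "derive", "plan out")

def needs_large_model(query: str, medium_confidence: float = 1.0) -> bool:
    """Return True when the signals justify loading the larger model."""
    q = query.lower()
    if any(t in q for t in EXPLICIT_TRIGGERS):
        return True  # explicit reasoning request
    if any(m in q for m in COMPLEX_MARKERS):
        return True  # likely multi-step reasoning
    return medium_confidence < 0.5  # cascade: medium model was unsure
```

In practice the marker lists would be tuned against logged queries; the cascade branch assumes the medium model exposes some confidence proxy (e.g. self-reported or logprob-based).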

Trade-offs of three-tier vs two-tier

|  | Two-tier | Three-tier |
|---|---|---|
| Simple queries | small router + medium answer | small router + medium answer (same) |
| Complex queries | medium (may struggle) | swap to large (better quality) |
| GPU constraint | manageable | hard (see below) |
| Routing error cost | low | high (wrong tier = much slower) |

The 8GB GPU Constraint — Core Problem

This is the central issue. Research numbers on model swapping (2025):

Cold swap from disk (no optimization)

  • TTFT exceeds 140s for 7B-class models on HDD; 5–15s on NVMe SSD
  • Not viable for interactive use at any tier

vLLM Sleep Mode (offload to CPU RAM, not disk)

  • 18–200× faster than cold start; TTFT 2–3s per switch
  • vLLM-specific — not available in Ollama

Ollama behavior on 8GB VRAM

  • Default keep_alive: 5 minutes — model stays warm after use
  • Two models simultaneously: qwen3:4b (~2.5GB) + qwen2.5:1.5b (~1.2GB) = ~3.7GB — fits
  • qwen3:4b + qwen3:8b = ~8GB — does not reliably fit; eviction required
  • Sequential swap in Ollama: Ollama evicts the old model, loads the new one from SSD (~5–15s on NVMe)
  • Known Ollama bug: model spills from VRAM to RAM → all subsequent loads stay on CPU until restart

Conclusion for three-tier on single 8GB GPU:

| Tier switch | Cost | Viable? |
|---|---|---|
| tiny router → medium (qwen3:4b) | model swap ~5-15s if router is separate | borderline |
| medium → large (qwen3:8b) | evict qwen3:4b, load qwen3:8b = ~5-15s additional | no, for interactive |
| keep medium always warm, route to large on demand | ~5-15s swap overhead per complex query | acceptable if complex queries are rare |

Honest verdict: three-tier with model swapping is not viable for interactive per-turn latency on 8GB VRAM with Ollama. vLLM with Sleep Mode would make it viable (2–3s switch) but requires replacing Ollama.


Practical Architecture for 8GB GPU (Ollama)

Option 1: Two-tier with all models resident (recommended)

Keep the router and answer models loaded simultaneously, alongside the existing embedding and extraction models:

qwen2.5:0.5b  (~0.4GB) — router: tool call decision + arg extraction
qwen3:4b      (~2.5GB) — answer: all generation
nomic-embed-text (CPU) — embedding: search and store
qwen2.5:1.5b  (~1.2GB) — extraction: mem0 fact extraction (GPU)
─────────────────────────────────────────────────────────
Total VRAM: ~4.1GB — well within 8GB

No swapping needed. Router runs first (~1-2s), answer model runs after (~20s).

Router → tool call JSON or "no tool"    ~1-2s
→ tool runs (if needed)                  ~1s
→ Answer model generates reply          ~20s
─────────────────────────────────────────────
Total                                   ~22-23s

vs current two-call approach: ~75s.

Option 2: Semantic routing (encoder-only, free)

Use nomic-embed-text (already running on CPU) as the router:

query_vec = embed(query)
sims = {
    "search_memory": cosine(query_vec, memory_topic_vec),
    "web_search":    cosine(query_vec, web_topic_vec),
}
# If max sim > threshold → call that tool directly
# Then pass result + original query to answer model

Zero VRAM overhead. ~50ms routing. Can't extract tool args from embedding alone — needs hardcoded arg construction (e.g. query = original user message).

Option 3: Three-tier with rare large-model escalation

Keep qwen3:4b warm. Route to qwen3:8b only for explicitly complex tasks. Accept ~10s swap overhead for those queries. qwen3:8b gets unloaded after.

Router → simple → qwen3:4b              ~20s (no swap)
Router → complex → evict 4b, load 8b → ~30s (10s swap + 20s inference)

Works if <20% of queries are "complex" and users accept occasional slow responses. Best implemented with explicit user trigger ("think about this carefully") rather than automatic classification, to avoid swap overhead on misclassified queries.
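The "unloaded after" behavior can be expressed through Ollama's keep_alive setting rather than manual eviction. A configuration sketch, assuming langchain_ollama's `keep_alive` parameter with standard Ollama semantics (-1 = keep resident indefinitely, 0 = unload immediately after the request); values here are illustrative:

```python
from langchain_ollama import ChatOllama

# Medium model stays warm for every turn; large model frees its
# VRAM as soon as the complex query finishes.
medium_model = ChatOllama(model="qwen3:4b", keep_alive=-1)  # always resident
large_model = ChatOllama(model="qwen3:8b", keep_alive=0)    # evict after use
```

This keeps the swap penalty confined to the rare complex queries and avoids qwen3:8b lingering in VRAM and crowding out the medium model.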


LangGraph Implementation

create_react_agent locks to one model. Explicit graph supports per-node models:

from langgraph.graph import StateGraph, MessagesState, END
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama

router_model = ChatOllama(model="qwen2.5:0.5b", base_url=OLLAMA_GPU_URL)
answer_model = ChatOllama(model="qwen3:4b",     base_url=OLLAMA_GPU_URL)
# For Option 3: large_model = ChatOllama(model="qwen3:8b", ...)

def router_node(state):
    # Small model only outputs tool call JSON or nothing
    return {"messages": [router_model.bind_tools(tools).invoke(state["messages"])]}

def answer_node(state):
    # Large(r) model generates human reply — no tools bound
    return {"messages": [answer_model.invoke(state["messages"])]}

def route(state) -> str:
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", []) else "answer"

graph = StateGraph(MessagesState)
graph.add_node("router", router_node)
graph.add_node("tools", ToolNode(tools))
graph.add_node("answer", answer_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route)
graph.add_edge("tools", "answer")
graph.add_edge("answer", END)
agent = graph.compile()

For three-tier, add a complexity classifier node before the router that selects answer_model = medium_model or large_model based on query signals.
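A minimal sketch of that classifier node, written against plain dicts with string messages so the selection logic is visible; in the real graph the node would read `state["messages"][-1].content` and be wired in with `graph.add_node` like the others. The trigger list is an illustrative assumption:

```python
# Hypothetical trigger phrases for escalation; tune against real queries.
TRIGGERS = ("think through", "analyze carefully", "step by step")

def classifier_node(state: dict) -> dict:
    """Runs before the router; tags the turn with a model tier."""
    text = state["messages"][-1].lower()
    return {"tier": "large" if any(t in text for t in TRIGGERS) else "medium"}

def pick_answer_model(state: dict, medium_model, large_model):
    """answer_node uses this to choose which ChatOllama instance to invoke."""
    return large_model if state.get("tier") == "large" else medium_model
```

The `tier` key implies extending MessagesState with an extra field; the rest of the graph is unchanged.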


Open Source Routing Tools

| Tool | Ollama support | Status | Notes |
|---|---|---|---|
| LiteLLM | First-class | Active 2025 | Proxy with tiered routing, fallbacks, load balancing |
| RouteLLM (LMSYS) | Yes (documented) | Stale (last commit Aug 2024) | Calibrated for GPT-4 vs Mixtral pair |
| Router-R1 | No | Active (NeurIPS 2025) | RL-based, open-sourced on HuggingFace |
| LLMRouter (ulab) | No | Research 2025 | 16+ routing methods, fair comparison framework |
| FrugalGPT | No direct | Algorithm only | Portkey.ai has implementation guide |

Most practical for Ollama: LiteLLM proxy with tiered model config. Handles routing, fallbacks, and load balancing without changing agent code.
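A sketch of what that LiteLLM proxy config could look like, assuming LiteLLM's standard `model_list` format with the `ollama/` provider prefix; the tier aliases and fallback mapping are illustrative assumptions, not a tested config:

```yaml
model_list:
  - model_name: tier-router          # hypothetical alias
    litellm_params:
      model: ollama/qwen2.5:0.5b
      api_base: http://localhost:11434
  - model_name: tier-answer
    litellm_params:
      model: ollama/qwen3:4b
      api_base: http://localhost:11434

litellm_settings:
  fallbacks:
    - tier-router: [tier-answer]     # escalate if the small tier fails
```

The agent then targets the alias (`tier-router`, `tier-answer`) and the proxy handles model selection and fallback.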


Summary: What to Do for Adolf

|  | Recommendation |
|---|---|
| Quick win (zero risk) | Remove "always call search_memory" from system prompt — history buffer covers conversational recall, saves ~37s |
| Best architecture for 8GB | Two-tier: qwen2.5:0.5b router + qwen3:4b answer, both in VRAM, ~22s total |
| Three-tier feasibility | Not viable for interactive use with Ollama model swapping; viable with vLLM Sleep Mode (~3s swap) if Ollama is replaced |
| Complex task routing | Use explicit user trigger or keyword detection rather than automatic classifier — avoids swap penalty on misclassification |

References