# Reasoning & Self-Reflection in Local LLM Agents

Research-backed notes on implementing multi-stage reasoning for local 4-8B models (2025).

---

## TL;DR

For local 4-8B models, **programmatic self-critique loops rarely justify their cost**. Native thinking tokens (Qwen3 `enable_thinking=True`) or external verifiers give better results at lower complexity. See the bottom of this file for recommendations.

---

## Reasoning Patterns

### Chain-of-Thought (CoT)

Single forward pass; the model thinks step by step before answering. Zero implementation cost — just a prompt change. Typical gain: +5-10pp on multi-step tasks vs no CoT. No latency overhead beyond the extra output tokens.

### Reflexion (Shinn et al., NeurIPS 2023)

Multiple complete attempts. After each attempt, the model writes a textual critique of what went wrong and stores it in episodic memory. The next attempt is conditioned on that memory.

```
attempt 1 → fail → write critique → attempt 2 (reads critique) → ...
```

Key result (GPT-4): HumanEval 80% → 91% pass@1. Cost: N complete task executions. At 30s/attempt, 5 trials = 2.5 minutes.

Implementation: https://github.com/noahshinn/reflexion

### Reflection Loop (in-turn revision)

Within a single turn: generate → critique → revise → [repeat]. Simpler than Reflexion and more common in practice.

```
Generate → Critique → Revise → [stop condition]
```

Stop condition options: max iterations, score threshold, external verifier passes.

### ReAct + Reflect

Standard ReAct (Reason + Act) with an added Reflect step after failed tool calls. The most common production pattern. Adds 1-3 extra LLM calls per failed action.

### Tree of Thoughts (ToT)

Explore N reasoning branches simultaneously, evaluate each node, and search with BFS/DFS. Branching factor 3 at depth 3 = 54 LLM calls per problem — prohibitive for local models. Works only if the model has strong self-evaluation capability (typically ≥32B). ToTRL-trained Qwen3-8B achieved 0.633 on AIME 2025 — but that required training-time RL, not a prompt trick.
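The search skeleton behind ToT is model-agnostic and worth seeing concretely. A minimal breadth-first sketch, with `propose` and `score` as hypothetical stand-ins for the two LLM calls (stubbed here with a toy digit-matching task, not a real evaluator):

```python
from typing import Callable, List

def tot_bfs(
    root: str,
    propose: Callable[[str], List[str]],  # stand-in for an LLM "expand" call
    score: Callable[[str], float],        # stand-in for an LLM "evaluate" call
    breadth: int = 3,   # candidates taken per expansion
    depth: int = 3,     # levels of the tree
    beam: int = 3,      # states kept per level
) -> str:
    """Breadth-first ToT: expand each state, score all candidates, keep the best."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for s in frontier for c in propose(s)[:breadth]]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return max(frontier, key=score)

# Toy stand-ins: grow a digit string toward "202"; score = matching positions.
propose = lambda s: [s + d for d in "0123456789"]
score = lambda s: sum(a == b for a, b in zip(s, "202"))

best = tot_bfs("", propose, score)
print(best)  # → 202
```

Even this toy run makes a `score` call for every candidate at every level; with a real LLM behind `propose` and `score`, that call count is exactly where the cost blows up on local hardware.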
### Graph of Thoughts (GoT)

Generalizes ToT to arbitrary DAGs: thoughts can merge, split, or loop. Reported results: 62% improvement on sorting vs ToT, at 31% lower cost. Higher complexity than ToT; the graph structure is problem-specific.

Implementation: https://github.com/spcl/graph-of-thoughts

---

## Native Thinking Tokens (vs Programmatic Reflection)

Open models with built-in reasoning scratchpads:

| Model | Size | Ollama | Toggle | Notes |
|-------|------|--------|--------|-------|
| Qwen3 | 0.6B–235B | Yes | `enable_thinking` / `think=True/False` | Best option for local use |
| Qwen3-4B-Thinking-2507 | 4B | Yes | Always on | Dedicated thinking variant |
| QwQ-32B | 32B | Yes | Always on | Strong reasoning, needs VRAM |
| DeepSeek-R1 distills | 1.5B–70B | Yes | Always on | Llama/Qwen base |

### Qwen3 thinking toggle in Ollama / LangChain

```python
# LangChain
from langchain_ollama import ChatOllama

model = ChatOllama(model="qwen3:4b", think=True, num_ctx=8192)

# Prompt-level (Ollama API)
# /think    — enable per-request
# /no_think — disable per-request
```

Latency: thinking mode is 2-3x slower in wall-clock time (the model generates internal `<think>…</think>` tokens before answering). Qwen3-VL 8B Thinking: 262s vs 65s on a complex visual reasoning task — but with meaningfully better output.

### Native thinking vs programmatic loop

| | Native thinking | Programmatic multi-stage |
|--|----------------|------------------------|
| API calls | 1 | N (rounds × 2) |
| Implementation | Zero | Significant |
| Quality on 4-8B | Good (capability in weights) | Poor (weak model critiques itself) |
| Transparency | Opaque (one streamed block) | Inspectable per stage |
| Controllability | `thinking_budget` only | Full control |
| Latency | 2-3x tokens, 1 call | N × base latency |

**For local 4-8B: native thinking almost always beats a hand-coded reflection loop.**

---

## Does Programmatic Reflection Work on Small Models?

Short answer: **mostly no without external verification**.
From the research (2024-2025):

- **"When Hindsight is Not 20/20" (arXiv 2404.09129)**: Self-reflection often makes small models worse. A model that generated an error usually also lacks the capability to identify it, so it confidently accepts flawed reasoning on re-reading.
- **THINKSLM (EMNLP 2025)**: Inference-time self-critique on Llama-3.1-8B is unreliable. Training-time distilled reasoning traces help; prompt-based self-critique does not.
- **Nature 2025 study**: Large gains (GPT-4: +18.5pp) diminish sharply for smaller models.
- **Latency cost**: Each reflection round on a local 8B adds 5-30s. A 3-round loop = 3x latency for a 0-5% gain (or a regression) on most tasks.

### When it actually helps on small models

1. **External verifier**: the model doesn't self-evaluate — it reads objective pass/fail feedback (unit tests, JSON schema checker, math verifier, search result grader). The most reliable pattern; no self-evaluation capability required.
2. **Stronger critic**: generate with 4B, critique with 32B or an API model. Hybrid approach.
3. **Native thinking weights**: reflection happens in a single forward pass with trained weights. Far more reliable than prompt-based self-critique.
4. **Structured error types**: code syntax, JSON validity, regex match — a computable error signal, not linguistic self-assessment.

---

## LangGraph Reflection Loop Implementation

LangGraph is suited for this because it supports cyclic graphs with state.

### Minimal reflection graph

```python
from langgraph.graph import StateGraph, START, END, MessagesState
from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3:4b", think=False)
critic_llm = ChatOllama(model="qwen3:4b", think=True)  # or a stronger model

MAX_REFLECTIONS = 2

class ReflectionState(MessagesState):
    # MessagesState only carries "messages"; add the loop counter explicitly.
    iterations: int

def generate(state):
    response = llm.invoke(state["messages"])
    return {"messages": [response], "iterations": state.get("iterations", 0)}

def reflect(state):
    critique = critic_llm.invoke(
        [{"role": "system", "content": "Critique this response. Be specific about errors."}]
        + state["messages"]
    )
    return {
        "messages": [{"role": "user", "content": critique.content}],
        "iterations": state["iterations"] + 1,
    }

def should_reflect(state) -> str:
    if state.get("iterations", 0) >= MAX_REFLECTIONS:
        return END
    # Optionally: check an external verifier here
    return "reflect"

graph = StateGraph(ReflectionState)
graph.add_node("generate", generate)
graph.add_node("reflect", reflect)
graph.add_edge(START, "generate")
graph.add_conditional_edges("generate", should_reflect)
graph.add_edge("reflect", "generate")
agent = graph.compile()
```

### Self-Correcting RAG (CRAG pattern)

```
Retrieve → Grade documents → [rewrite query if bad] → Generate → Grade answer → [loop or END]
```

The document grader and answer grader are the "external verifiers" — they do objective quality checks rather than linguistic self-critique.

Tutorial: https://learnopencv.com/langgraph-self-correcting-agent-code-generation/

---

## Alternative Tooling

### DSPy (recommended for pipeline optimization)

DSPy treats prompts as learnable parameters. Define input/output signatures, run an optimizer on examples, and DSPy auto-tunes the prompts for your specific model.

```python
import dspy

lm = dspy.LM('ollama_chat/qwen3:4b', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

class Reflect(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen = dspy.ChainOfThought("question -> answer")
        self.critique = dspy.ChainOfThought("question, answer -> critique, improved_answer")

    def forward(self, question):
        first = self.gen(question=question)
        return self.critique(question=question, answer=first.answer).improved_answer
```

Works with Ollama. Optimizers (BootstrapFewShot, MIPRO) tune prompts automatically but require multiple LLM calls per training example — slow on local hardware.

### Outlines (structured output)

Constrained decoding — guarantees valid JSON/regex output from any model.
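Without constrained decoding, the same guarantee takes an explicit validate-and-retry loop: exactly the objective, computable error signal the research above favors over linguistic self-critique. A minimal stdlib sketch (`call_llm` and the key names are hypothetical stand-ins, not any library's API):

```python
import json

REQUIRED_KEYS = {"verdict", "errors"}  # hypothetical schema the critic must satisfy

def validate_critique(raw: str):
    """Objective verifier: JSON-parse and key-check, no LLM self-assessment."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"

def critique_with_retry(call_llm, prompt: str, max_tries: int = 3) -> dict:
    """Feed the verifier's error message back to the model as the retry signal."""
    for _ in range(max_tries):
        raw = call_llm(prompt)
        ok, reason = validate_critique(raw)
        if ok:
            return json.loads(raw)
        prompt = f"{prompt}\n\nYour last output was rejected ({reason}). Return only valid JSON."
    raise ValueError("critic never produced valid JSON")

# Stub model: fails once, then complies, simulating a flaky 4B critic.
replies = iter(['not json', '{"verdict": "fail", "errors": ["off-by-one"]}'])
result = critique_with_retry(lambda p: next(replies), "Critique the answer.")
print(result["verdict"])  # → fail
```

Outlines makes the retry loop unnecessary by constraining decoding itself, but the verifier-driven shape above is what you fall back to with a plain Ollama endpoint.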
Use this inside a reflection loop to ensure the critic always returns structured feedback. Works with Ollama via the OpenAI-compatible API.

https://dottxt-ai.github.io/outlines/

### SGLang

High-performance GPU serving runtime (replaces Ollama for GPU inference). Natively understands `<think>…</think>` tokens and caches the KV prefix across reflection rounds (RadixAttention). If you replace Ollama with SGLang, reflection loops become significantly cheaper because repeated prompt prefixes are cache hits.

https://github.com/sgl-project/sglang

---

## Benchmarks Summary

| Setup | Task | Quality Gain | Latency Cost |
|-------|------|-------------|-------------|
| GPT-4 + Reflexion | HumanEval | +11pp (80→91%) | ~5x |
| GPT-4 + reflection | Problem solving | +18.5pp | ~3x |
| Llama-7B + programmatic self-critique | Math | +7.1% | ~3x |
| Local 8B + same-model critique (typical) | General | 0-5% (often regression) | 2-3x |
| Qwen3-8B + native thinking | AIME 2025 | Matches models 10x larger | 2-3x tokens |
| Any model + external verifier (tests) | Code | +15-26pp | 1.5-2x |

---

## Practical Recommendations for Adolf (local qwen3:4b / 8b)

| Goal | Approach | Cost |
|------|----------|------|
| Better reasoning on hard questions | `think=True` in ChatOllama | 2-3x latency, zero code |
| Code/JSON correctness | External verifier (schema check, exec) + retry loop | +1 LLM call on failure |
| Complex multi-step tasks | Route to qwen3:8b with `think=True` | Model swap + 2-3x tokens |
| Full reflection loop | Only with a stronger critic model or an external verifier | Significant complexity |
| Avoid | Programmatic self-critique using the same 4-8B model as critic | Adds latency, no gain |

---

## References

- Reflexion (Shinn et al., NeurIPS 2023): https://arxiv.org/abs/2303.11366
- Tree of Thoughts: https://arxiv.org/abs/2305.10601
- ToTRL (Qwen3 RL training): https://arxiv.org/html/2505.12717v1
- Graph of Thoughts: https://arxiv.org/abs/2308.09687
- Adaptive GoT (2025): https://arxiv.org/pdf/2502.05078
- When Hindsight is Not 20/20: https://arxiv.org/html/2404.09129v1
- THINKSLM (EMNLP 2025): https://aclanthology.org/2025.emnlp-main.1659.pdf
- MAR — Multi-Agent Reflexion: https://arxiv.org/html/2512.20845
- Qwen3 technical report: https://arxiv.org/pdf/2505.09388
- Qwen3 thinking cost measurement: https://medium.com/@frankmorales_91352/the-computational-cost-of-cognitive-depth-qwen3-vl-8b-instruct-vs-thinking-2517b677ba29
- DeepSeek-R1: https://arxiv.org/abs/2501.12948
- LangChain reflection blog: https://blog.langchain.com/reflection-agents/
- LangGraph CRAG: https://learnopencv.com/langgraph-self-correcting-agent-code-generation/
- DSPy: https://dspy.ai/
- Outlines: https://dottxt-ai.github.io/outlines/
- SGLang: https://github.com/sgl-project/sglang
- graph-of-thoughts: https://github.com/spcl/graph-of-thoughts
- reflexion (original code): https://github.com/noahshinn/reflexion