adolf/reasoning.md
2026-03-05 11:22:34 +00:00


Reasoning & Self-Reflection in Local LLM Agents

Research-backed notes on implementing multi-stage reasoning for local 4-8B models (2025).


TL;DR

For local 4-8B models, programmatic self-critique loops rarely justify their cost. Native thinking tokens (Qwen3 enable_thinking=True) or external verifiers give better results at lower complexity. See bottom of this file for recommendations.


Reasoning Patterns

Chain-of-Thought (CoT)

Single forward pass, model thinks step-by-step before answering. Zero implementation cost — just a prompt change. Typical gain: +5-10pp on multi-step tasks vs no CoT. No latency overhead beyond the extra output tokens.
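Since CoT is purely a prompt change, a minimal sketch is just message construction (the instruction wording below is illustrative, not from any specific paper — tune it per model):

```python
def with_cot(question: str) -> list[dict]:
    # Zero-shot CoT: one added instruction, no loops, no extra LLM calls.
    return [
        {"role": "system",
         "content": "Think step by step, then state the final answer on the last line."},
        {"role": "user", "content": question},
    ]
```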

Reflexion (Shinn et al., NeurIPS 2023)

Multiple complete attempts. After each attempt, the model writes a textual critique of what went wrong and stores it in episodic memory. Next attempt is conditioned on that memory.

attempt 1 → fail → write critique → attempt 2 (reads critique) → ...

Key results (GPT-4): HumanEval 80% → 91% pass@1. Cost: N complete task executions. At 30s/attempt, 5 trials = 2.5 minutes. Implementation: https://github.com/noahshinn/reflexion
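The control flow can be sketched as below; `attempt_fn`, `evaluate_fn`, and `critique_fn` are placeholders for the actor model, the pass/fail evaluator, and the critic (not names from the Reflexion codebase):

```python
def reflexion(task, attempt_fn, evaluate_fn, critique_fn, max_trials=5):
    # Episodic memory: textual critiques of failed trials,
    # fed back into every subsequent attempt.
    memory = []
    attempt = None
    for _ in range(max_trials):
        attempt = attempt_fn(task, memory)        # conditioned on past critiques
        ok, feedback = evaluate_fn(task, attempt)
        if ok:
            return attempt
        memory.append(critique_fn(task, attempt, feedback))
    return attempt  # best effort after max_trials
```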

Reflection Loop (in-turn revision)

Within a single turn: generate → critique → revise → [repeat]. Simpler than Reflexion. More common in practice.

Generate → Critique → Revise → [stop condition]

Stop condition options: max iterations, score threshold, external verifier passes.
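The stop conditions above compose naturally as a single predicate (a sketch; thresholds are arbitrary defaults):

```python
def should_stop(iteration: int, score: float, verifier_passed: bool,
                max_iters: int = 3, score_threshold: float = 0.8) -> bool:
    # Any one condition ends the generate -> critique -> revise loop;
    # combine whichever subset fits the task.
    return iteration >= max_iters or score >= score_threshold or verifier_passed
```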

ReAct + Reflect

Standard ReAct (Reason + Act) with an added Reflect step after failed tool calls. Most common production pattern. Adds 1-3 extra LLM calls per failed action.
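A minimal sketch of the Reflect step — `call_tool` and `llm_repair` are placeholders for the tool executor and the LLM call that diagnoses the error:

```python
def act_with_reflect(call_tool, llm_repair, action, max_repairs=2):
    # After a failed tool call, ask the model to diagnose the error and
    # propose a repaired action (the 1-3 extra LLM calls noted above).
    last_err = None
    for _ in range(max_repairs + 1):
        try:
            return call_tool(action)
        except Exception as err:                   # failed action
            last_err = err
            action = llm_repair(action, str(err))  # Reflect step
    raise RuntimeError(f"gave up after repairs: {last_err}")
```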

Tree of Thoughts (ToT)

Explore N reasoning branches simultaneously, evaluate each node, BFS/DFS search. Branching factor 3, depth 3 = 54 LLM calls per problem. Prohibitive for local models. Works only if the model has strong self-evaluation capability (typically ≥32B). ToTRL-trained Qwen3-8B achieved 0.633 on AIME 2025 — but required training-time RL, not a prompt trick.

Graph of Thoughts (GoT)

Generalizes ToT to arbitrary DAGs: thoughts can merge, split, or loop. 62% improvement in sorting vs ToT, 31% cost reduction. Implementation: https://github.com/spcl/graph-of-thoughts Higher complexity than ToT; graph structure is problem-specific.


Native Thinking Tokens (vs Programmatic Reflection)

Open models with built-in reasoning scratchpads:

| Model | Size | Ollama | Toggle | Notes |
|---|---|---|---|---|
| Qwen3 | 0.6B-235B | Yes | enable_thinking / think=True/False | Best option for local use |
| Qwen3-4B-Thinking-2507 | 4B | Yes | Always on | Dedicated thinking variant |
| QwQ-32B | 32B | Yes | Always on | Strong reasoning, needs VRAM |
| DeepSeek-R1 distills | 1.5B-70B | Yes | Always on | Llama/Qwen base |

Qwen3 thinking toggle in Ollama / LangChain

```python
# LangChain
from langchain_ollama import ChatOllama
model = ChatOllama(model="qwen3:4b", think=True, num_ctx=8192)

# Prompt-level (Ollama API):
#   /think     enable per-request
#   /no_think  disable per-request
```

Latency: thinking mode is 2-3x slower in wall-clock time (model generates internal <think>...</think> tokens before answering). Qwen3-VL 8B Thinking: 262s vs 65s on a complex visual reasoning task — but meaningfully better output.

Native thinking vs programmatic loop

| | Native thinking | Programmatic multi-stage |
|---|---|---|
| API calls | 1 | N (rounds × 2) |
| Implementation | Zero | Significant |
| Quality on 4-8B | Good (capability in weights) | Poor (weak model critiques itself) |
| Transparency | Opaque (one streamed block) | Inspectable per stage |
| Controllability | thinking_budget only | Full control |
| Latency | 2-3x tokens, 1 call | N × base latency |

For local 4-8B: native thinking almost always beats a hand-coded reflection loop.


Does Programmatic Reflection Work on Small Models?

Short answer: mostly no without external verification.

From the research (2024-2025):

  • "When Hindsight is Not 20/20" (arXiv 2404.09129): Self-reflection often makes small models worse. A model that generated an error also lacks the capability to identify it. It confidently accepts flawed reasoning on re-reading.

  • THINKSLM (EMNLP 2025): Inference-time self-critique on Llama-3.1-8B is unreliable. Training-time distilled reasoning traces help; prompt-based self-critique does not.

  • Nature 2025 study: Large gains (GPT-4: +18.5pp) diminish sharply for smaller models.

  • Latency cost: Each reflection round on a local 8B adds 5-30s. A 3-round loop = 3x latency for 0-5% gain (or regression) on most tasks.

When it actually helps on small models

  1. External verifier: model doesn't self-evaluate — it reads objective pass/fail feedback (unit tests, JSON schema checker, math verifier, search result grader). Most reliable pattern. No self-evaluation capability required.

  2. Stronger critic: generate with 4B, critique with 32B or API model. Hybrid approach.

  3. Native thinking weights: reflection happens in a single forward pass with trained weights. Far more reliable than prompt-based self-critique.

  4. Structured error types: code syntax, JSON validity, regex match — computable error signal, not linguistic self-assessment.
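Pattern 1 can be sketched as a retry loop where the only feedback the model sees is an objective check — here a hypothetical `generate_fn` paired with a JSON-validity verifier:

```python
import json

def verify_json(text):
    # Objective verifier: valid JSON with an "answer" key, or a precise error.
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"
    if "answer" not in obj:
        return False, 'missing "answer" key'
    return True, None

def generate_with_verifier(generate_fn, verify_fn, max_retries=2):
    # The model never judges itself; it only reads verify_fn's pass/fail
    # feedback, which the retry prompt should include verbatim.
    output, feedback = None, None
    for _ in range(max_retries + 1):
        output = generate_fn(feedback)
        ok, feedback = verify_fn(output)
        if ok:
            return output
    return output  # last attempt, even if still failing
```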


LangGraph Reflection Loop Implementation

LangGraph suits this pattern because it supports cyclic graphs with shared state.

Minimal reflection graph

```python
from langgraph.graph import StateGraph, START, END, MessagesState
from langchain_ollama import ChatOllama

class ReflectionState(MessagesState):
    iterations: int  # MessagesState alone has no counter field

llm = ChatOllama(model="qwen3:4b", think=False)
critic_llm = ChatOllama(model="qwen3:4b", think=True)  # or a stronger model

MAX_REFLECTIONS = 2

def generate(state: ReflectionState):
    response = llm.invoke(state["messages"])
    return {"messages": [response], "iterations": state.get("iterations", 0)}

def reflect(state: ReflectionState):
    critique = critic_llm.invoke(
        [{"role": "system", "content": "Critique this response. Be specific about errors."}]
        + state["messages"]
    )
    return {
        "messages": [{"role": "user", "content": critique.content}],
        "iterations": state["iterations"] + 1,
    }

def should_reflect(state: ReflectionState) -> str:
    if state.get("iterations", 0) >= MAX_REFLECTIONS:
        return END
    # Optionally: check an external verifier here instead of a fixed count
    return "reflect"

graph = StateGraph(ReflectionState)
graph.add_node("generate", generate)
graph.add_node("reflect", reflect)
graph.add_edge(START, "generate")
graph.add_conditional_edges("generate", should_reflect)
graph.add_edge("reflect", "generate")
agent = graph.compile()
```

Self-Correcting RAG (CRAG pattern)

Retrieve → Grade documents → [rewrite query if bad] → Generate → Grade answer → [loop or END]

The document grader and answer grader are the "external verifiers" — they do objective quality checks rather than linguistic self-critique. Tutorial: https://learnopencv.com/langgraph-self-correcting-agent-code-generation/
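The grading step can be sketched as follows; `grader_llm` is a placeholder for an LLM call that returns "yes"/"no" relevance verdicts:

```python
def grade_documents(docs, question, grader_llm):
    # CRAG-style objective filter: keep only docs the grader marks relevant;
    # an empty result signals a query rewrite rather than a bad generation.
    relevant = [d for d in docs if grader_llm(question, d) == "yes"]
    needs_rewrite = len(relevant) == 0
    return relevant, needs_rewrite
```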


Alternative Tooling

DSPy

DSPy treats prompts as learnable parameters. Define input/output signatures, run an optimizer on examples, and DSPy auto-tunes prompts for your specific model.

```python
import dspy

lm = dspy.LM('ollama_chat/qwen3:4b', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

class Reflect(dspy.Module):
    def __init__(self):
        super().__init__()  # required for dspy.Module subclasses
        self.gen = dspy.ChainOfThought("question -> answer")
        self.critique = dspy.ChainOfThought("question, answer -> critique, improved_answer")

    def forward(self, question):
        first = self.gen(question=question)
        return self.critique(question=question, answer=first.answer).improved_answer
```

Works with Ollama. Optimizer (BootstrapFewShot, MIPRO) tunes prompts automatically but requires multiple LLM calls per training example — slow on local hardware.

Outlines (structured output)

Constrained decoding — guarantees valid JSON/regex output from any model. Use this inside a reflection loop to ensure the critic always returns structured feedback. Works with Ollama via OpenAI-compatible API. https://dottxt-ai.github.io/outlines/

SGLang

High-performance GPU serving runtime (replaces Ollama for GPU inference). Natively understands <think>...</think> tokens, caches KV-prefix across reflection rounds (RadixAttention). If you replace Ollama with SGLang: reflection loops become significantly cheaper because repeated prompt prefixes are cache-hit. https://github.com/sgl-project/sglang


Benchmarks Summary

| Setup | Task | Quality Gain | Latency Cost |
|---|---|---|---|
| GPT-4 + Reflexion | HumanEval | +11pp (80→91%) | ~5x |
| GPT-4 + reflection | Problem solving | +18.5pp | ~3x |
| Llama-7B + programmatic self-critique | Math | +7.1% | ~3x |
| Local 8B + same-model critique (typical) | General | 0-5% (often regression) | 2-3x |
| Qwen3-8B + native thinking | AIME 2025 | Matches models 10x larger | 2-3x tokens |
| Any model + external verifier (tests) | Code | +15-26pp | 1.5-2x |

Practical Recommendations for Adolf (local qwen3:4b / 8b)

| Goal | Approach | Cost |
|---|---|---|
| Better reasoning on hard questions | think=True in ChatOllama | 2-3x latency, zero code |
| Code/JSON correctness | External verifier (schema check, exec) + retry loop | +1 LLM call on failure |
| Complex multi-step tasks | Route to qwen3:8b with think=True | Model swap + 2-3x tokens |
| Full reflection loop | Only with a stronger critic model or external verifier | Significant complexity |
| Avoid | Programmatic self-critique with the same 4-8B model as critic | Adds latency, no gain |

References