AgapHost/adolf/reasoning.md
2026-03-05 11:22:34 +00:00
# Reasoning & Self-Reflection in Local LLM Agents
Research-backed notes on implementing multi-stage reasoning for local 4-8B models (2025).
---
## TL;DR
For local 4-8B models, **programmatic self-critique loops rarely justify their cost**.
Native thinking tokens (Qwen3 `enable_thinking=True`) or external verifiers
give better results at lower complexity. See bottom of this file for recommendations.
---
## Reasoning Patterns
### Chain-of-Thought (CoT)
Single forward pass, model thinks step-by-step before answering.
Zero implementation cost — just a prompt change.
Typical gain: +5-10pp on multi-step tasks vs no CoT.
No latency overhead beyond the extra output tokens.
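Since CoT is purely a prompt change, the entire "implementation" is a suffix on the prompt. A minimal sketch — the exact wording is illustrative, not canonical:

```python
# CoT as a pure prompt change: same model, same single call, just an
# added instruction before the answer slot.
def make_prompt(question: str, cot: bool = True) -> str:
    prompt = f"Question: {question}\n"
    if cot:
        prompt += "Think through the problem step by step, then state the final answer.\n"
    return prompt + "Answer:"
```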
### Reflexion (Shinn et al., NeurIPS 2023)
Multiple complete attempts. After each attempt, the model writes a textual critique
of what went wrong and stores it in episodic memory. Next attempt is conditioned on
that memory.
```
attempt 1 → fail → write critique → attempt 2 (reads critique) → ...
```
Key results (GPT-4): HumanEval 80% → 91% pass@1.
Cost: N complete task executions. At 30s/attempt, 5 trials = 2.5 minutes.
Implementation: https://github.com/noahshinn/reflexion
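The loop above can be sketched framework-free. Here `attempt`, `critique`, and `passes` are hypothetical stand-ins for the real LLM calls and evaluator (e.g. unit tests); only the control flow and episodic memory are the point:

```python
# Reflexion control flow: each failed attempt adds a critique to episodic
# memory, and the next attempt is conditioned on that memory.
def reflexion(task, attempt, critique, passes, max_trials=5):
    memory = []  # episodic memory: critiques of past failures
    for trial in range(max_trials):
        answer = attempt(task, memory)  # conditioned on prior critiques
        if passes(answer):
            return answer, trial + 1
        memory.append(critique(task, answer))
    return answer, max_trials  # best effort after exhausting trials
```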
### Reflection Loop (in-turn revision)
Within a single turn: generate → critique → revise → [repeat].
Simpler than Reflexion. More common in practice.
```
Generate → Critique → Revise → [stop condition]
```
Stop condition options: max iterations, score threshold, external verifier passes.
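A sketch of the in-turn loop combining two of those stop conditions (max iterations plus a score threshold). `generate`, `score`, and `revise` are hypothetical stand-ins for LLM calls and a verifier:

```python
# generate → critique/score → revise, stopping on a score threshold or
# after a fixed number of revision rounds.
def reflect_loop(prompt, generate, score, revise, max_iters=3, threshold=0.9):
    draft = generate(prompt)
    for _ in range(max_iters):
        if score(draft) >= threshold:  # verifier-based stop condition
            break
        draft = revise(prompt, draft)
    return draft
```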
### ReAct + Reflect
Standard ReAct (Reason + Act) with an added Reflect step after failed tool calls.
Most common production pattern. Adds 1-3 extra LLM calls per failed action.
### Tree of Thoughts (ToT)
Explore N reasoning branches simultaneously, evaluate each node, BFS/DFS search.
Branching factor 3, depth 3 = 54 LLM calls per problem. Prohibitive for local models.
Works only if the model has strong self-evaluation capability (typically ≥32B).
ToTRL-trained Qwen3-8B achieved 0.633 on AIME 2025 — but required training-time RL, not
a prompt trick.
### Graph of Thoughts (GoT)
Generalizes ToT to arbitrary DAGs: thoughts can merge, split, or loop.
62% improvement in sorting vs ToT, 31% cost reduction.
Implementation: https://github.com/spcl/graph-of-thoughts
Higher complexity than ToT; graph structure is problem-specific.
---
## Native Thinking Tokens (vs Programmatic Reflection)
Open models with built-in reasoning scratchpads:
| Model | Size | Ollama | Toggle | Notes |
|-------|------|--------|--------|-------|
| Qwen3 | 0.6B–235B | Yes | enable_thinking / think=True/False | Best option for local use |
| Qwen3-4B-Thinking-2507 | 4B | Yes | Always on | Dedicated thinking variant |
| QwQ-32B | 32B | Yes | Always on | Strong reasoning, needs VRAM |
| DeepSeek-R1 distills | 1.5B–70B | Yes | Always on | Llama/Qwen base |
### Qwen3 thinking toggle in Ollama / LangChain
```python
# LangChain
model = ChatOllama(model="qwen3:4b", think=True, num_ctx=8192)
# Prompt-level (Ollama API)
# /think — enable per-request
# /no_think — disable per-request
```
Latency: thinking mode is 2-3x slower in wall-clock time (model generates internal
`<think>...</think>` tokens before answering). Qwen3-VL 8B Thinking: 262s vs 65s on a
complex visual reasoning task — but meaningfully better output.
### Native thinking vs programmatic loop
| | Native thinking | Programmatic multi-stage |
|--|----------------|------------------------|
| API calls | 1 | N (rounds × 2) |
| Implementation | Zero | Significant |
| Quality on 4-8B | Good (capability in weights) | Poor (weak model critiques itself) |
| Transparency | Opaque (one streamed block) | Inspectable per stage |
| Controllability | thinking_budget only | Full control |
| Latency | 2-3x tokens, 1 call | N × base latency |
**For local 4-8B: native thinking almost always beats a hand-coded reflection loop.**
---
## Does Programmatic Reflection Work on Small Models?
Short answer: **mostly no without external verification**.
From the research (2024-2025):
- **"When Hindsight is Not 20/20" (arXiv 2404.09129)**: Self-reflection often makes
small models worse. A model that generated an error also lacks the capability to
identify it. It confidently accepts flawed reasoning on re-reading.
- **THINKSLM (EMNLP 2025)**: Inference-time self-critique on Llama-3.1-8B is unreliable.
Training-time distilled reasoning traces help; prompt-based self-critique does not.
- **Nature 2025 study**: Large gains (GPT-4: +18.5pp) diminish sharply for smaller models.
- **Latency cost**: Each reflection round on a local 8B adds 5-30s. A 3-round loop
= 3x latency for 0-5% gain (or regression) on most tasks.
### When it actually helps on small models
1. **External verifier**: model doesn't self-evaluate — it reads objective pass/fail
feedback (unit tests, JSON schema checker, math verifier, search result grader).
Most reliable pattern. No self-evaluation capability required.
2. **Stronger critic**: generate with 4B, critique with 32B or API model. Hybrid approach.
3. **Native thinking weights**: reflection happens in a single forward pass with
trained weights. Far more reliable than prompt-based self-critique.
4. **Structured error types**: code syntax, JSON validity, regex match — computable
error signal, not linguistic self-assessment.
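Pattern 1 (and 4) in miniature: the model never judges itself; a JSON parser is the objective verifier, and the parse error is fed back as retry feedback. `call_model` is a hypothetical stand-in for the real LLM call:

```python
import json

# External-verifier retry loop: json.loads() gives an objective pass/fail
# signal, so no self-evaluation capability is required from the model.
def generate_json(prompt, call_model, max_retries=2):
    msg = prompt
    for _ in range(max_retries + 1):
        raw = call_model(msg)
        try:
            return json.loads(raw)  # verifier: computable, not linguistic
        except json.JSONDecodeError as e:
            msg = f"{prompt}\nPrevious output was invalid JSON ({e}). Retry."
    raise ValueError("no valid JSON after retries")
```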
---
## LangGraph Reflection Loop Implementation
LangGraph is suited for this because it supports cyclic graphs with state.
### Minimal reflection graph
```python
from langgraph.graph import StateGraph, START, END, MessagesState
from langchain_ollama import ChatOllama

# MessagesState only tracks messages; extend it with an iteration counter,
# since LangGraph rejects state updates for keys outside the schema.
class ReflectionState(MessagesState):
    iterations: int

llm = ChatOllama(model="qwen3:4b", think=False)
critic_llm = ChatOllama(model="qwen3:4b", think=True)  # or a stronger model

MAX_REFLECTIONS = 2

def generate(state: ReflectionState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

def reflect(state: ReflectionState):
    critique = critic_llm.invoke(
        [{"role": "system", "content": "Critique this response. Be specific about errors."}]
        + state["messages"]
    )
    # Feed the critique back in as a user turn for the next generate() pass
    return {
        "messages": [{"role": "user", "content": critique.content}],
        "iterations": state.get("iterations", 0) + 1,
    }

def should_reflect(state: ReflectionState) -> str:
    if state.get("iterations", 0) >= MAX_REFLECTIONS:
        return END
    # Optionally: check an external verifier here instead of a fixed count
    return "reflect"

graph = StateGraph(ReflectionState)
graph.add_node("generate", generate)
graph.add_node("reflect", reflect)
graph.add_edge(START, "generate")
graph.add_conditional_edges("generate", should_reflect)
graph.add_edge("reflect", "generate")
agent = graph.compile()
```
### Self-Correcting RAG (CRAG pattern)
```
Retrieve → Grade documents → [rewrite query if bad] → Generate → Grade answer → [loop or END]
```
The document grader and answer grader are the "external verifiers" — they do
objective quality checks rather than linguistic self-critique.
LangChain tutorial: https://learnopencv.com/langgraph-self-correcting-agent-code-generation/
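The CRAG control flow above, sketched framework-free. All five callables (`retrieve`, `grade_docs`, `rewrite_query`, `generate`, `grade_answer`) are hypothetical stand-ins; in practice the two graders are small LLM calls or heuristics returning pass/fail:

```python
# CRAG-style loop: bad retrieval triggers a query rewrite and retry; the
# answer grader gates the exit. Graders act as external verifiers.
def crag(question, retrieve, grade_docs, rewrite_query,
         generate, grade_answer, max_loops=2):
    query = question
    for _ in range(max_loops):
        docs = retrieve(query)
        if not grade_docs(question, docs):
            query = rewrite_query(question)  # bad docs → rewrite, retry
            continue
        answer = generate(question, docs)
        if grade_answer(question, answer):
            return answer
    return generate(question, retrieve(query))  # best effort fallback
```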
---
## Alternative Tooling
### DSPy (recommended for pipeline optimization)
DSPy treats prompts as learnable parameters. Define input/output signatures, run an
optimizer on examples, and DSPy auto-tunes prompts for your specific model.
```python
import dspy

lm = dspy.LM('ollama_chat/qwen3:4b', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

class Reflect(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen = dspy.ChainOfThought("question -> answer")
        self.critique = dspy.ChainOfThought("question, answer -> critique, improved_answer")

    def forward(self, question):
        first = self.gen(question=question)
        return self.critique(question=question, answer=first.answer).improved_answer
```
Works with Ollama. Optimizer (BootstrapFewShot, MIPRO) tunes prompts automatically
but requires multiple LLM calls per training example — slow on local hardware.
### Outlines (structured output)
Constrained decoding — guarantees valid JSON/regex output from any model.
Use this inside a reflection loop to ensure the critic always returns structured feedback.
Works with Ollama via OpenAI-compatible API.
https://dottxt-ai.github.io/outlines/
### SGLang
High-performance GPU serving runtime (replaces Ollama for GPU inference).
Natively understands `<think>...</think>` tokens, caches KV-prefix across reflection
rounds (RadixAttention). If you replace Ollama with SGLang: reflection loops become
significantly cheaper because repeated prompt prefixes are cache-hit.
https://github.com/sgl-project/sglang
---
## Benchmarks Summary
| Setup | Task | Quality Gain | Latency Cost |
|-------|------|-------------|-------------|
| GPT-4 + Reflexion | HumanEval | +11pp (80→91%) | ~5x |
| GPT-4 + reflection | Problem solving | +18.5pp | ~3x |
| Llama-7B + programmatic self-critique | Math | +7.1% | ~3x |
| Local 8B + same-model critique (typical) | General | 0-5% (often regression) | 2-3x |
| Qwen3-8B + native thinking | AIME 2025 | Matches models 10x larger | 2-3x tokens |
| Any model + external verifier (tests) | Code | +15-26pp | 1.5-2x |
---
## Practical Recommendations for Adolf (local qwen3:4b / 8b)
| Goal | Approach | Cost |
|------|----------|------|
| Better reasoning on hard questions | `think=True` in ChatOllama | 2-3x latency, zero code |
| Code/JSON correctness | External verifier (schema check, exec) + retry loop | +1 LLM call on failure |
| Complex multi-step tasks | Route to qwen3:8b with `think=True` | model swap + 2-3x tokens |
| Full reflection loop | Only if using stronger critic model or external verifier | significant complexity |
| Avoid | Programmatic self-critique using same 4-8B model as critic | adds latency, no gain |
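The "route to qwen3:8b" row above can be as simple as a heuristic router. A sketch — `is hard` is decided here by a hypothetical keyword/length heuristic; swap in whatever classifier fits:

```python
# Routing sketch: cheap default model for easy questions, escalate hard
# ones to the larger model with thinking enabled. The markers and length
# cutoff are illustrative assumptions, not tuned values.
def pick_model(question: str) -> dict:
    hard_markers = ("prove", "step by step", "debug", "why does")
    is_hard = len(question) > 400 or any(m in question.lower() for m in hard_markers)
    if is_hard:
        return {"model": "qwen3:8b", "think": True}
    return {"model": "qwen3:4b", "think": False}
```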
---
## References
- Reflexion (Shinn et al., NeurIPS 2023): https://arxiv.org/abs/2303.11366
- Tree of Thoughts: https://arxiv.org/abs/2305.10601
- ToTRL (Qwen3 RL training): https://arxiv.org/html/2505.12717v1
- Graph of Thoughts: https://arxiv.org/abs/2308.09687
- Adaptive GoT (2025): https://arxiv.org/pdf/2502.05078
- When Hindsight is Not 20/20: https://arxiv.org/html/2404.09129v1
- THINKSLM (EMNLP 2025): https://aclanthology.org/2025.emnlp-main.1659.pdf
- MAR — Multi-Agent Reflexion: https://arxiv.org/html/2512.20845
- Qwen3 technical report: https://arxiv.org/pdf/2505.09388
- Qwen3 thinking cost measurement: https://medium.com/@frankmorales_91352/the-computational-cost-of-cognitive-depth-qwen3-vl-8b-instruct-vs-thinking-2517b677ba29
- DeepSeek-R1: https://arxiv.org/abs/2501.12948
- LangChain reflection blog: https://blog.langchain.com/reflection-agents/
- LangGraph CRAG: https://learnopencv.com/langgraph-self-correcting-agent-code-generation/
- DSPy: https://dspy.ai/
- Outlines: https://dottxt-ai.github.io/outlines/
- SGLang: https://github.com/sgl-project/sglang
- graph-of-thoughts: https://github.com/spcl/graph-of-thoughts
- reflexion (original code): https://github.com/noahshinn/reflexion