# Reasoning & Self-Reflection in Local LLM Agents

Research-backed notes on implementing multi-stage reasoning for local 4-8B models (2025).

---

## TL;DR

For local 4-8B models, **programmatic self-critique loops rarely justify their cost**.
Native thinking tokens (Qwen3 `enable_thinking=True`) or external verifiers
give better results at lower complexity. See the recommendations at the bottom of this file.

---

## Reasoning Patterns

### Chain-of-Thought (CoT)

Single forward pass: the model thinks step by step before answering.
Zero implementation cost — just a prompt change.
Typical gain: +5-10pp on multi-step tasks vs. no CoT.
No latency overhead beyond the extra output tokens.

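As a sketch, the "prompt change" is nothing more than a step-by-step instruction prepended to the question (the wording here is illustrative, not a benchmarked prompt):

```python
def with_cot(question: str) -> list[dict]:
    """Wrap a question in a minimal chain-of-thought prompt (illustrative only)."""
    return [
        {"role": "system", "content": "Think step by step, then give the final answer."},
        {"role": "user", "content": question},
    ]

messages = with_cot("A train travels 120 km in 1.5 h. What is its average speed?")
```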
### Reflexion (Shinn et al., NeurIPS 2023)

Multiple complete attempts. After each attempt, the model writes a textual critique
of what went wrong and stores it in episodic memory. The next attempt is conditioned on
that memory.

```
attempt 1 → fail → write critique → attempt 2 (reads critique) → ...
```

Key result (GPT-4): HumanEval 80% → 91% pass@1.
Cost: N complete task executions. At 30s per attempt, 5 trials ≈ 2.5 minutes.
Implementation: https://github.com/noahshinn/reflexion

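The loop above can be sketched in a few lines. `run_attempt`, `evaluate`, and `write_critique` are hypothetical stand-ins for real LLM calls and an evaluator; the toy stand-ins at the bottom only exist to exercise the control flow:

```python
def reflexion(task, run_attempt, evaluate, write_critique, max_trials=5):
    """Reflexion-style loop: retry, conditioning on an episodic memory of critiques."""
    memory = []                          # episodic memory of textual self-critiques
    for trial in range(max_trials):
        attempt = run_attempt(task, memory)
        if evaluate(attempt):            # ideally an external check, not self-judgment
            return attempt, trial + 1
        memory.append(write_critique(task, attempt))
    return attempt, max_trials

# Toy stand-ins: the "model" only succeeds once the memory is non-empty.
answer, trials = reflexion(
    task="demo",
    run_attempt=lambda t, mem: "good" if mem else "bad",
    evaluate=lambda a: a == "good",
    write_critique=lambda t, a: f"attempt '{a}' was wrong",
)
```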
### Reflection Loop (in-turn revision)

Within a single turn: generate → critique → revise → [repeat].
Simpler than Reflexion. More common in practice.

```
Generate → Critique → Revise → [stop condition]
```

Stop condition options: max iterations, score threshold, or an external verifier passing.

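The stop conditions combine naturally in a loop skeleton. Here `generate`, `critique`, and `revise` are hypothetical stand-ins for LLM calls, and `verifier` is any objective check; the toy run below only demonstrates the control flow:

```python
def reflection_loop(prompt, generate, critique, revise, verifier, max_iterations=3):
    """Generate → critique → revise until a verifier passes or the cap is hit."""
    draft = generate(prompt)
    for _ in range(max_iterations):
        if verifier(draft):              # an external check beats self-assessment
            break
        draft = revise(prompt, draft, critique(prompt, draft))
    return draft

# Toy run: each revision appends "!", the verifier wants at least two.
result = reflection_loop(
    "demo",
    generate=lambda p: "draft",
    critique=lambda p, d: "needs more emphasis",
    revise=lambda p, d, c: d + "!",
    verifier=lambda d: d.count("!") >= 2,
)
```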
### ReAct + Reflect

Standard ReAct (Reason + Act) with an added Reflect step after failed tool calls.
The most common production pattern. Adds 1-3 extra LLM calls per failed action.

### Tree of Thoughts (ToT)

Explore N reasoning branches simultaneously, evaluate each node, and search with BFS/DFS.
Branching factor 3, depth 3 = 54 LLM calls per problem. Prohibitive for local models.
Works only if the model has strong self-evaluation capability (typically ≥32B).
ToTRL-trained Qwen3-8B achieved 0.633 on AIME 2025 — but that required training-time RL, not
a prompt trick.

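A minimal BFS sketch makes the call count easy to see. `propose` and `score` are stand-ins for LLM calls (the toy versions below are arithmetic, just to exercise the search); with a beam this narrow the count is lower than the unpruned estimate above, and it grows quickly with branching and depth:

```python
def tot_bfs(root, propose, score, branching=3, depth=3, beam=2):
    """Breadth-first Tree of Thoughts: expand, score, keep the best `beam` nodes."""
    frontier, calls = [root], 0
    for _ in range(depth):
        children = []
        for node in frontier:
            calls += 1                                  # one proposal call per node
            for thought in propose(node, branching):
                calls += 1                              # one evaluation call per thought
                children.append((score(thought), thought))
        frontier = [t for _, t in sorted(children, reverse=True)[:beam]]
    return frontier[0], calls

# Toy stand-ins: a "thought" is a number; higher value = better thought.
best, n_calls = tot_bfs(
    root=0,
    propose=lambda node, b: [node + i for i in range(1, b + 1)],
    score=lambda t: t,
)
```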
### Graph of Thoughts (GoT)

Generalizes ToT to arbitrary DAGs: thoughts can merge, split, or loop.
Reported: 62% quality improvement on sorting vs. ToT, with a 31% cost reduction.
Implementation: https://github.com/spcl/graph-of-thoughts
Higher complexity than ToT, and the graph structure is problem-specific.

---

## Native Thinking Tokens (vs Programmatic Reflection)

Open models with built-in reasoning scratchpads:

| Model | Size | Ollama | Toggle | Notes |
|-------|------|--------|--------|-------|
| Qwen3 | 0.6B–235B | Yes | `enable_thinking` / `think=True/False` | Best option for local use |
| Qwen3-4B-Thinking-2507 | 4B | Yes | Always on | Dedicated thinking variant |
| QwQ-32B | 32B | Yes | Always on | Strong reasoning, needs VRAM |
| DeepSeek-R1 distills | 1.5B–70B | Yes | Always on | Llama/Qwen base |

### Qwen3 thinking toggle in Ollama / LangChain

```python
# LangChain
from langchain_ollama import ChatOllama

model = ChatOllama(model="qwen3:4b", think=True, num_ctx=8192)

# Prompt-level (Ollama API)
# /think — enable per-request
# /no_think — disable per-request
```

Latency: thinking mode is 2-3x slower in wall-clock time (the model generates internal
`<think>...</think>` tokens before answering). Qwen3-VL 8B Thinking: 262s vs. 65s on a
complex visual reasoning task — but with meaningfully better output.

### Native thinking vs. programmatic loop

| | Native thinking | Programmatic multi-stage |
|--|----------------|------------------------|
| API calls | 1 | N (rounds × 2) |
| Implementation | Zero | Significant |
| Quality on 4-8B | Good (capability in weights) | Poor (weak model critiques itself) |
| Transparency | Opaque (one streamed block) | Inspectable per stage |
| Controllability | `thinking_budget` only | Full control |
| Latency | 2-3x tokens, 1 call | N × base latency |

**For local 4-8B: native thinking almost always beats a hand-coded reflection loop.**

---

## Does Programmatic Reflection Work on Small Models?

Short answer: **mostly no, without external verification**.

From the research (2024-2025):

- **"When Hindsight is Not 20/20" (arXiv 2404.09129)**: Self-reflection often makes
  small models worse. A model that generated an error usually also lacks the capability
  to identify it, and confidently accepts its own flawed reasoning on re-reading.

- **THINKSLM (EMNLP 2025)**: Inference-time self-critique on Llama-3.1-8B is unreliable.
  Training-time distilled reasoning traces help; prompt-based self-critique does not.

- **Nature 2025 study**: Large gains (GPT-4: +18.5pp) diminish sharply for smaller models.

- **Latency cost**: Each reflection round on a local 8B adds 5-30s. A 3-round loop
  means 3x latency for a 0-5% gain (or a regression) on most tasks.

### When it actually helps on small models

1. **External verifier**: the model doesn't self-evaluate — it reads objective pass/fail
   feedback (unit tests, JSON schema checker, math verifier, search result grader).
   The most reliable pattern; no self-evaluation capability required.

2. **Stronger critic**: generate with 4B, critique with 32B or an API model. A hybrid approach.

3. **Native thinking weights**: reflection happens in a single forward pass with
   trained weights. Far more reliable than prompt-based self-critique.

4. **Structured error types**: code syntax, JSON validity, regex match — a computable
   error signal, not linguistic self-assessment.

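Patterns 1 and 4 combine into a simple retry loop. As a sketch: the verifier is plain `json.loads` plus a required-key check (no self-assessment anywhere), and the generator lambda is a toy stand-in for an LLM call that is re-prompted with the verifier's feedback:

```python
import json

REQUIRED_KEYS = {"name", "score"}

def verify(text: str) -> bool:
    """Objective check: valid JSON carrying the required keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

def generate_with_verifier(generate, max_retries=2):
    """Retry generation until the external verifier passes (at most +N LLM calls)."""
    output = generate(feedback=None)
    for _ in range(max_retries):
        if verify(output):
            break
        output = generate(feedback="Output was not valid JSON with keys name, score.")
    return output

# Toy stand-in: the first attempt is malformed; the retry (with feedback) is valid.
result = generate_with_verifier(
    lambda feedback: '{"name": "x", "score": 1}' if feedback else '{"name": "x", '
)
```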
---

## LangGraph Reflection Loop Implementation

LangGraph is suited for this because it supports cyclic graphs with state.

### Minimal reflection graph

```python
from langgraph.graph import StateGraph, START, END, MessagesState
from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3:4b", think=False)
critic_llm = ChatOllama(model="qwen3:4b", think=True)  # or a stronger model

MAX_REFLECTIONS = 2

class ReflectionState(MessagesState):
    # MessagesState only defines "messages"; the loop counter must be in the schema.
    iterations: int

def generate(state):
    response = llm.invoke(state["messages"])
    return {"messages": [response], "iterations": state.get("iterations", 0)}

def reflect(state):
    critique = critic_llm.invoke(
        [{"role": "system", "content": "Critique this response. Be specific about errors."}]
        + state["messages"]
    )
    return {
        "messages": [{"role": "user", "content": critique.content}],
        "iterations": state["iterations"] + 1,
    }

def should_reflect(state) -> str:
    if state.get("iterations", 0) >= MAX_REFLECTIONS:
        return END
    # Optionally: check an external verifier here
    return "reflect"

graph = StateGraph(ReflectionState)
graph.add_node("generate", generate)
graph.add_node("reflect", reflect)
graph.add_edge(START, "generate")
graph.add_conditional_edges("generate", should_reflect)
graph.add_edge("reflect", "generate")
agent = graph.compile()
```

### Self-Correcting RAG (CRAG pattern)

```
Retrieve → Grade documents → [rewrite query if bad] → Generate → Grade answer → [loop or END]
```

The document grader and answer grader are the "external verifiers" — they perform
objective quality checks rather than linguistic self-critique.
Tutorial: https://learnopencv.com/langgraph-self-correcting-agent-code-generation/

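The control flow above can be sketched with stub graders. All the function arguments here are hypothetical stand-ins for retrieval, LLM, or heuristic steps; the toy run exists only to walk both correction paths:

```python
def crag(question, retrieve, grade_docs, rewrite, generate, grade_answer, max_loops=2):
    """CRAG-style loop: re-retrieve on bad documents, stop on a graded-good answer."""
    query = question
    for _ in range(max_loops):
        docs = retrieve(query)
        if not grade_docs(question, docs):   # external check on retrieval quality
            query = rewrite(question, query)
            continue
        answer = generate(question, docs)
        if grade_answer(question, answer):   # external check on the answer
            return answer
    return None                              # give up after max_loops

# Toy stand-ins: the first query retrieves nothing useful; the rewrite fixes it.
answer = crag(
    "q",
    retrieve=lambda q: ["relevant doc"] if q == "q (rewritten)" else [],
    grade_docs=lambda q, docs: bool(docs),
    rewrite=lambda question, query: f"{question} (rewritten)",
    generate=lambda q, docs: "grounded answer",
    grade_answer=lambda q, a: True,
)
```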
---

## Alternative Tooling

### DSPy (recommended for pipeline optimization)

DSPy treats prompts as learnable parameters. Define input/output signatures, run an
optimizer on examples, and DSPy auto-tunes the prompts for your specific model.

```python
import dspy

lm = dspy.LM('ollama_chat/qwen3:4b', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

class Reflect(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen = dspy.ChainOfThought("question -> answer")
        self.critique = dspy.ChainOfThought("question, answer -> critique, improved_answer")

    def forward(self, question):
        first = self.gen(question=question)
        return self.critique(question=question, answer=first.answer).improved_answer
```

Works with Ollama. Optimizers (BootstrapFewShot, MIPRO) tune prompts automatically
but require multiple LLM calls per training example — slow on local hardware.

### Outlines (structured output)

Constrained decoding — guarantees valid JSON/regex output from any model.
Use it inside a reflection loop to ensure the critic always returns structured feedback.
Works with Ollama via the OpenAI-compatible API.
https://dottxt-ai.github.io/outlines/

### SGLang

High-performance GPU serving runtime (a replacement for Ollama on GPU inference).
It natively understands `<think>...</think>` tokens and caches the KV prefix across
reflection rounds (RadixAttention). With SGLang in place of Ollama, reflection loops
become significantly cheaper because repeated prompt prefixes are cache hits.
https://github.com/sgl-project/sglang

---

## Benchmarks Summary

| Setup | Task | Quality Gain | Latency Cost |
|-------|------|-------------|-------------|
| GPT-4 + Reflexion | HumanEval | +11pp (80→91%) | ~5x |
| GPT-4 + reflection | Problem solving | +18.5pp | ~3x |
| Llama-7B + programmatic self-critique | Math | +7.1% | ~3x |
| Local 8B + same-model critique (typical) | General | 0-5% (often a regression) | 2-3x |
| Qwen3-8B + native thinking | AIME 2025 | Matches models 10x larger | 2-3x tokens |
| Any model + external verifier (tests) | Code | +15-26pp | 1.5-2x |

---

## Practical Recommendations for Adolf (local qwen3:4b / 8b)

| Goal | Approach | Cost |
|------|----------|------|
| Better reasoning on hard questions | `think=True` in ChatOllama | 2-3x latency, zero code |
| Code/JSON correctness | External verifier (schema check, exec) + retry loop | +1 LLM call on failure |
| Complex multi-step tasks | Route to qwen3:8b with `think=True` | Model swap + 2-3x tokens |
| Full reflection loop | Only with a stronger critic model or an external verifier | Significant complexity |
| Avoid | Programmatic self-critique with the same 4-8B model as critic | Adds latency, no gain |

---

## References

- Reflexion (Shinn et al., NeurIPS 2023): https://arxiv.org/abs/2303.11366
- Tree of Thoughts: https://arxiv.org/abs/2305.10601
- ToTRL (Qwen3 RL training): https://arxiv.org/html/2505.12717v1
- Graph of Thoughts: https://arxiv.org/abs/2308.09687
- Adaptive GoT (2025): https://arxiv.org/pdf/2502.05078
- When Hindsight is Not 20/20: https://arxiv.org/html/2404.09129v1
- THINKSLM (EMNLP 2025): https://aclanthology.org/2025.emnlp-main.1659.pdf
- MAR — Multi-Agent Reflexion: https://arxiv.org/html/2512.20845
- Qwen3 technical report: https://arxiv.org/pdf/2505.09388
- Qwen3 thinking cost measurement: https://medium.com/@frankmorales_91352/the-computational-cost-of-cognitive-depth-qwen3-vl-8b-instruct-vs-thinking-2517b677ba29
- DeepSeek-R1: https://arxiv.org/abs/2501.12948
- LangChain reflection blog: https://blog.langchain.com/reflection-agents/
- LangGraph CRAG: https://learnopencv.com/langgraph-self-correcting-agent-code-generation/
- DSPy: https://dspy.ai/
- Outlines: https://dottxt-ai.github.io/outlines/
- SGLang: https://github.com/sgl-project/sglang
- graph-of-thoughts: https://github.com/spcl/graph-of-thoughts
- reflexion (original code): https://github.com/noahshinn/reflexion