adolf/tests/use_cases/apple_pie_research.md
Alvis edc9a96f7a Add use_cases test category as Claude Code skill instructions
Use cases are markdown files that Claude Code reads, executes step by step
using its tools, and evaluates with its own judgment — not assertion scripts.

- cli_startup.md: pipe EOF into cli.py, verify banner and exit code 0
- apple_pie_research.md: /think query → complex tier → web_search + fetch →
  evaluate recipe quality, sources, and structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 17:01:13 +00:00


Use Case: Apple Pie Research

Verify that a deep research query triggers the complex tier, uses web search and page fetching, and produces a substantive, well-sourced recipe response.

Steps

1. Send the research query (the /think prefix forces the complex tier):

curl -s -X POST http://localhost:8000/message \
  -H "Content-Type: application/json" \
  -d '{"text": "/think what is the best recipe for an apple pie?", "session_id": "use-case-apple-pie", "channel": "cli", "user_id": "claude"}'

2. Wait for the reply via SSE (the complex tier can take up to 5 minutes, hence the 300-second timeout):

curl -s -N --max-time 300 "http://localhost:8000/reply/use-case-apple-pie"

3. Confirm tier and tool usage in agent logs:

docker compose -f /home/alvis/adolf/docker-compose.yml logs deepagents \
  --since=600s --no-log-prefix | grep -E "tier=complex|web_search|fetch_url|crawl4ai"
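
The output of steps 2 and 3 is easier to quote in a report once it is saved to files. The following is a minimal sketch of two helpers for that. Both function names are hypothetical and not part of adolf. It assumes the /reply endpoint emits standard SSE `data:` fields, which the steps above do not confirm, and it reuses the search strings from the grep in step 3. Splitting the grep in two matters because a single alternation pattern matches when only one of the strings is present.

```shell
# Hypothetical helpers for gathering evidence; the names are illustrative only.

# Keep only the payload of SSE "data:" lines read from stdin.
# Assumption: the /reply endpoint emits standard "data:" fields.
sse_data() {
  sed -n 's/^data: \{0,1\}//p'
}

# Report the tier check and the tool-use check separately for a saved log file.
# The search strings are the ones used in the grep from step 3.
check_logs() {
  logfile=$1
  if grep -q "tier=complex" "$logfile"; then
    echo "tier: found"
  else
    echo "tier: missing"
  fi
  if grep -qE "web_search|fetch_url" "$logfile"; then
    echo "tools: found"
  else
    echo "tools: missing"
  fi
}
```

One possible use: pipe the curl command from step 2 into `sse_data > reply.txt`, redirect the `docker compose ... logs` output from step 3 (without the grep) into `logs.txt`, then run `check_logs logs.txt`. These helpers only collect evidence; the evaluation below still relies on judgment.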

Evaluate (use your judgment)

Check each of the following:

  • Tier: logs show tier=complex for this session
  • Tool use: logs show web_search or fetch_url calls during the request
  • Ingredients: response lists specific apple pie ingredients (apples, flour, butter, sugar, etc.)
  • Method: response includes preparation or baking steps
  • Sources: response cites real URLs it fetched, not invented links
  • Quality: response is structured and practical — not a refusal, stub, or generic placeholder

Report PASS only if all six criteria are met. For any failure, state which criterion failed and quote the relevant part of the response or logs.
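
For the Sources criterion, a quick cross-check can narrow down which links deserve a closer look. The sketch below lists URLs that appear in a saved reply but nowhere in the saved logs. The function name and both file arguments are hypothetical. It also assumes the agent logs record each fetched URL verbatim, which nothing above confirms. An empty result is therefore a hint rather than proof that every link was fetched.

```shell
# Hypothetical helper: print URLs cited in the reply that never appear in the logs.
# Assumption: the agent logs contain each fetched URL verbatim.
unfetched_urls() {
  reply=$1
  logs=$2
  # Extract candidate URLs, de-duplicate, then look each one up as a fixed string.
  grep -oE 'https?://[^[:space:])>"]+' "$reply" | sort -u | while IFS= read -r url; do
    grep -qF -- "$url" "$logs" || printf '%s\n' "$url"
  done
}
```

For example, `unfetched_urls reply.txt logs.txt` prints one URL per line for manual review. The URL pattern is deliberately simple and may keep trailing punctuation such as a period, and the lookup is a substring match, so flagged or unflagged links should still be checked by reading the response and logs.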