Alvis edc9a96f7a Add use_cases test category as Claude Code skill instructions

Use cases are markdown files that Claude Code reads, executes step by step
using its tools, and evaluates with its own judgment — not assertion scripts.

- cli_startup.md: pipe EOF into cli.py, verify banner and exit code 0
- apple_pie_research.md: /think query → complex tier → web_search + fetch →
  evaluate recipe quality, sources, and structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-12 17:01:13 +00:00

Use Case: Apple Pie Research

Verify that a deep research query triggers the complex tier, uses web search and page fetching, and produces a substantive, well-sourced recipe response.

Steps

1. Send the research query (the /think prefix forces complex tier):

curl -s -X POST http://localhost:8000/message \
  -H "Content-Type: application/json" \
  -d '{"text": "/think what is the best recipe for an apple pie?", "session_id": "use-case-apple-pie", "channel": "cli", "user_id": "claude"}'

2. Wait for the reply via SSE (complex tier can take up to 5 minutes):

curl -s -N --max-time 300 "http://localhost:8000/reply/use-case-apple-pie"
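To keep the reply text around for the evaluation step, the stream can be reduced to its payload. This is a minimal sketch, assuming the endpoint emits standard SSE `data:` lines; the function name `sse_data` and the file `reply.txt` are hypothetical, and the actual payload format of `/reply` is not specified here.

```shell
# Hypothetical helper: keep only the payload of SSE "data:" lines read from stdin.
# Per the SSE format, one optional space after "data:" is not part of the payload.
# Other fields (event:, id:) and comment lines (starting with ":") are dropped.
sse_data() {
  sed -n 's/^data: \{0,1\}//p'
}
```

Possible usage: `curl -s -N --max-time 300 "http://localhost:8000/reply/use-case-apple-pie" | sse_data > reply.txt`. If the server keeps the stream open after the reply, curl may still stop only at the 300-second limit, so a timeout exit does not by itself mean the reply is missing.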

3. Confirm tier and tool usage in agent logs:

docker compose -f /home/alvis/adolf/docker-compose.yml logs deepagents \
  --since=600s --no-log-prefix | grep -E "tier=complex|web_search|fetch_url|crawl4ai"
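The grep output above can be summarized before it is read by eye. The following is a sketch only, assuming the log lines contain the literal strings `tier=complex`, `web_search`, and `fetch_url` as the grep pattern suggests; `check_logs` is a hypothetical name, and the result supports the Tier and Tool use criteria without replacing judgment.

```shell
# Hypothetical helper: read agent log text on stdin and print one line per check.
check_logs() {
  logs=$(cat)
  # Tier criterion: at least one line mentions tier=complex.
  if printf '%s\n' "$logs" | grep -q 'tier=complex'; then
    echo "tier: found"
  else
    echo "tier: missing"
  fi
  # Tool-use criterion: count lines that mention either tool name.
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  tool_lines=$(printf '%s\n' "$logs" | grep -cE 'web_search|fetch_url' || true)
  echo "tool lines: $tool_lines"
}
```

Possible usage: pipe the `docker compose ... logs deepagents --since=600s --no-log-prefix` command from step 3 into `check_logs`. The count covers the whole log window, so lines from other sessions in the last 600 seconds may be included.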

Evaluate (use your judgment)

Check each of the following:

- Tier: logs show tier=complex for this session
- Tool use: logs show web_search or fetch_url calls during the request
- Ingredients: response lists specific apple pie ingredients (apples, flour, butter, sugar, etc.)
- Method: response includes preparation or baking steps
- Sources: response cites real URLs it fetched, not invented links
- Quality: response is structured and practical — not a refusal, stub, or generic placeholder

Report PASS only if all six criteria are met. For any failure, state which criterion failed and quote the relevant part of the response or logs.
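For the Sources criterion, listing the cited URLs and comparing them against the logs can help. The sketch below assumes the response has been saved to a file and that the agent logs record fetched URLs verbatim, which is not confirmed here; `extract_urls` and `unlogged_urls` are hypothetical helper names. An empty result supports the criterion but does not settle it.

```shell
# Hypothetical helper: print each distinct http(s) URL found on stdin, one per line.
# Trailing sentence punctuation (. , ; :) is stripped from each match.
extract_urls() {
  grep -oE 'https?://[^[:space:])>"]+' | sed 's/[.,;:]*$//' | sort -u
}

# Hypothetical helper: read URLs on stdin and print those that do not
# appear verbatim anywhere in the log file given as "$1".
unlogged_urls() {
  while IFS= read -r url; do
    grep -qF -- "$url" "$1" || printf '%s\n' "$url"
  done
}
```

Possible usage, with `reply.txt` and `agent.log` as assumed file names: `extract_urls < reply.txt | unlogged_urls agent.log`. Any URL printed is a candidate invented link to quote in the failure report.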
