# openmemory
FastMCP server wrapping mem0 for persistent per-session memory, backed by Qdrant + nomic-embed-text.
## Tools exposed (MCP)
- `add_memory(text, user_id)` — extract facts from a conversation turn and store in Qdrant
- `search_memory(query, user_id)` — semantic search, returns results with score ≥ 0.5
- `get_all_memories(user_id)` — dump all stored memories for a session
These are called directly by `agent.py` (outside the agent loop), never exposed to the LLM as tools.
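Since the tools are plain functions from `agent.py`'s point of view, the call pattern is ordinary Python rather than an LLM tool invocation. A minimal sketch of the direct-call shape, with the score cutoff mentioned above (the `memory` client and field names here are illustrative stand-ins for the real mem0 objects):

```python
# Hypothetical sketch: agent.py calls the memory helpers directly,
# bypassing the agent loop. `memory` stands in for the mem0 client.

SCORE_THRESHOLD = 0.5  # search_memory drops weaker matches


def search_memory(memory, query: str, user_id: str) -> list[dict]:
    """Semantic search; keep only results with score >= 0.5."""
    results = memory.search(query, user_id=user_id)
    return [r for r in results if r.get("score", 0.0) >= SCORE_THRESHOLD]
```

The threshold filter is the only logic layered on top of the vector search; everything below it is delegated to mem0/Qdrant.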
## Two Ollama instances
- **GPU** (`OLLAMA_GPU_URL`, port 11436) — extraction model (`qwen2.5:1.5b`): pulls facts from conversation text
- **CPU** (`OLLAMA_CPU_URL`, port 11435) — embedding model (`nomic-embed-text`): 50–150 ms per query
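The split roughly follows mem0's provider config: the LLM block points at the GPU instance, the embedder block at the CPU instance. A sketch of that wiring, assuming mem0's documented Ollama/Qdrant config shape (treat exact field names as an approximation, and the Qdrant host/port as placeholders):

```python
import os

# Sketch of a mem0-style config routing extraction and embedding to the
# two Ollama instances. Defaults below mirror the ports named above.
config = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "qwen2.5:1.5b",  # extraction model on the GPU instance
            "ollama_base_url": os.environ.get(
                "OLLAMA_GPU_URL", "http://localhost:11436"
            ),
        },
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",  # embeddings on the CPU instance
            "ollama_base_url": os.environ.get(
                "OLLAMA_CPU_URL", "http://localhost:11435"
            ),
        },
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333},  # placeholder address
    },
}
```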
## Prompts
Custom `EXTRACTION_PROMPT` starts with `/no_think` to suppress qwen3 chain-of-thought and get clean JSON output. Custom `UPDATE_MEMORY_PROMPT` handles deduplication — mem0 merges new facts with existing ones rather than creating duplicates.
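The actual prompt wording lives in the repo; the sketch below only shows the structural point, that the extraction prompt leads with `/no_think` and pins the model to a JSON-only output contract (the body text is invented for illustration):

```python
# Illustrative shape of the custom extraction prompt; real wording differs.
EXTRACTION_PROMPT = (
    "/no_think\n"  # suppress chain-of-thought so the model emits clean JSON
    "Extract discrete, durable facts from the conversation below.\n"
    'Respond with JSON only: {"facts": ["..."]}\n'
)
```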
## Notes
- Qdrant collection is created automatically on first use
- Memory is keyed by `user_id` which equals `session_id` in `agent.py`
- Extraction runs after the reply is sent (as a background task); GPU contention with the medium model is avoided because the semaphore is released before `_store_memory()` is scheduled
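That ordering can be sketched with `asyncio`: the semaphore guards reply generation, and the extraction task is only scheduled after the guarded block exits. Function names besides `_store_memory()` are hypothetical:

```python
import asyncio

# Hedged sketch of the scheduling order described above: the background
# extraction task is created only after the GPU semaphore is released,
# so it never overlaps the guarded reply-generation section.
gpu_semaphore = asyncio.Semaphore(1)
order: list[str] = []


async def _store_memory(text: str, user_id: str) -> None:
    order.append("store")  # stand-in for the mem0 add_memory() call


async def handle_turn(text: str, user_id: str) -> str:
    async with gpu_semaphore:
        order.append("reply")  # generate and send the reply (GPU-bound)
        reply = f"reply to {text!r}"
    # Semaphore released; now schedule extraction in the background.
    asyncio.create_task(_store_memory(text, user_id))
    return reply


async def main() -> None:
    await handle_turn("I like green tea", "sess-1")
    await asyncio.sleep(0)  # yield once so the background task runs


asyncio.run(main())
```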