feat: embedding-based task clustering — nomic-embed-text for semantic dedup + pattern extraction #97

alvis · 2026-04-17T08:16:39Z

Goal

Use nomic-embed-text embeddings (same approach as Taskpile) to:

1. Detect duplicate / near-duplicate tasks before generation (avoid showing similar tips twice)
2. Extract user task patterns ("lots of work tasks", "overdue personal errands") for richer context
3. Identify stale tasks the user consistently ignores (signal for the bandit)

Implementation

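# ml/features/embeddings.py
import numpy as np

async def embed_tasks(tasks: list[Task]) -> list[np.ndarray]:
    # Calls the LiteLLM embedder alias, which routes to nomic-embed-text.
    ...

def cosine_similarity_matrix(embeddings: list[np.ndarray]) -> np.ndarray:
    # Pairwise cosine similarity, shape (n_tasks, n_tasks).
    ...

def cluster_tasks(embeddings: list[np.ndarray], threshold: float = 0.8) -> list[list[int]]:
    # Groups near-duplicates; each cluster is a list of task indices.
    ...

def extract_theme_labels(clusters: list[list[int]], tasks: list[Task]) -> list[str]:
    # An LLM labels each cluster, e.g. "work meetings", "home errands".
    ...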
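
A minimal sketch of how the two middle stubs could be filled in, to make the threshold semantics concrete. It assumes single-linkage grouping via union-find, which the issue does not specify; the real implementation may use a different clustering strategy. Only `cosine_similarity_matrix`, `cluster_tasks`, and `threshold=0.8` come from the outline above; everything else is illustrative.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: list[np.ndarray]) -> np.ndarray:
    """Pairwise cosine similarity, shape (n, n)."""
    matrix = np.vstack(embeddings).astype(np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against zero vectors so the division never produces NaN.
    unit = matrix / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def cluster_tasks(embeddings: list[np.ndarray], threshold: float = 0.8) -> list[list[int]]:
    """Group task indices whose similarity is >= threshold.

    Assumption: single-linkage grouping (union-find), not confirmed by the issue.
    """
    n = len(embeddings)
    if n == 0:
        return []
    sims = cosine_similarity_matrix(embeddings)
    parent = list(range(n))

    def find(i: int) -> int:
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so output order is stable.
                    parent[max(ri, rj)] = min(ri, rj)

    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

One consequence of single-linkage worth keeping in mind: if A is similar to B and B is similar to C, all three land in one cluster even when A and C fall below the threshold. That may be acceptable for dedup but could over-merge when the clusters are used for theme labels.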

Storage

CREATE TABLE task_embeddings (
  task_id TEXT PRIMARY KEY,
  embed_model TEXT,
  embedding BLOB,  -- numpy float32 array, serialized
  content_hash TEXT,
  computed_at TIMESTAMP
);
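
A possible round trip for the `embedding` BLOB column, as a sketch only. It assumes SQLite as the database and raw `float32` bytes as the serialization format; neither is stated in the issue. `save_embedding` and `load_embedding` are hypothetical helper names.

```python
import sqlite3
from datetime import datetime, timezone

import numpy as np

# Same columns as the table above; IF NOT EXISTS added so the sketch is re-runnable.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_embeddings (
  task_id TEXT PRIMARY KEY,
  embed_model TEXT,
  embedding BLOB,
  content_hash TEXT,
  computed_at TIMESTAMP
)
"""


def save_embedding(conn, task_id, vector, content_hash, embed_model="nomic-embed-text"):
    """Hypothetical helper: upsert one embedding as raw float32 bytes."""
    blob = np.asarray(vector, dtype=np.float32).tobytes()
    conn.execute(
        # INSERT OR REPLACE is SQLite syntax (assumed backend).
        "INSERT OR REPLACE INTO task_embeddings VALUES (?, ?, ?, ?, ?)",
        (task_id, embed_model, blob, content_hash,
         datetime.now(timezone.utc).isoformat()),
    )


def load_embedding(conn, task_id):
    """Hypothetical helper: return the stored vector, or None if absent."""
    row = conn.execute(
        "SELECT embedding FROM task_embeddings WHERE task_id = ?", (task_id,)
    ).fetchone()
    if row is None:
        return None
    # frombuffer returns a read-only view; copy() it before mutating.
    return np.frombuffer(row[0], dtype=np.float32)
```

Raw bytes carry no shape or dtype information, so this only works if every reader agrees on `float32` and a 1-D vector. Storing `embed_model` alongside, as the schema already does, makes it possible to tell vectors from different models apart.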

Invalidation

Same pattern as Taskpile: content_hash is computed over the task's title, description, and due_date. If the hash changes, the embedding is recomputed.
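
The hash check could look roughly like the sketch below. The choice of SHA-256, the JSON encoding of the fields, and the `needs_recompute` helper are all assumptions for illustration; Taskpile's actual hashing scheme is not described here.

```python
import hashlib
import json
from typing import Optional


def content_hash(title: str, description: Optional[str], due_date: Optional[str]) -> str:
    """Hash the fields that feed the embedding (SHA-256 is an assumed choice)."""
    # Encoding the fields as a JSON list keeps field boundaries unambiguous,
    # so ("ab", "c") and ("a", "bc") do not collide the way plain concatenation would.
    payload = json.dumps([title, description or "", due_date or ""], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_recompute(stored_hash: Optional[str], title: str,
                    description: Optional[str], due_date: Optional[str]) -> bool:
    """Hypothetical helper: True when there is no stored hash or the content changed."""
    return stored_hash != content_hash(title, description, due_date)
```

In this sketch a missing description or due date hashes the same as an empty string, so clearing an already-empty field does not trigger a recompute.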

Notes

- Depends on #88 (context assembler will use cluster labels as features)
- Batch computation in Airflow DAG (#94)
- Individual recompute inline when task changes (like Taskpile worker pattern)

alvis added this to the M4 — MLOps at scale milestone 2026-04-17 08:16:39 +00:00

alvis referenced this issue 2026-05-05 09:26:02 +00:00
agents/focus-area: auto-infer focus areas from task clustering #113

alvis closed this issue 2026-05-06 06:54:36 +00:00

alvis referenced this issue from a commit 2026-05-06 06:55:01 +00:00
feat(agents): semantic task clustering + focus-area inferred preferred_areas (#97, #113)

alvis referenced this issue from a commit 2026-05-06 06:56:22 +00:00
docs: update CLAUDE.md with session learnings (#97, #113)