feat: embedding-based task clustering — nomic-embed-text for semantic dedup + pattern extraction #97

Closed
opened 2026-04-17 08:16:39 +00:00 by alvis · 0 comments
Owner

Goal

Use nomic-embed-text embeddings (same approach as Taskpile) to:

  1. Detect duplicate / near-duplicate tasks before generation (avoid showing similar tips twice)
  2. Extract user task patterns ("lots of work tasks", "overdue personal errands") for richer context
  3. Identify stale tasks the user consistently ignores (signal for the bandit)

Implementation

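# ml/features/embeddings.py
from __future__ import annotations  # Task is the project's task model, imported elsewhere

import numpy as np

async def embed_tasks(tasks: list[Task]) -> list[np.ndarray]:
    # Calls the LiteLLM embedder alias, which routes to nomic-embed-text.
    ...

def cosine_similarity_matrix(embeddings: list[np.ndarray]) -> np.ndarray:
    # Pairwise cosine similarity between all task embeddings.
    ...

def cluster_tasks(embeddings: list[np.ndarray], threshold: float = 0.8) -> list[list[int]]:
    # Groups near-duplicates; each cluster is a list of task indices.
    ...

def extract_theme_labels(clusters: list[list[int]], tasks: list[Task]) -> list[str]:
    # An LLM labels each cluster, e.g. "work meetings", "home errands".
    ...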
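
The issue does not say how `cluster_tasks` should group near-duplicates. One possible reading, sketched below, links any two tasks whose cosine similarity is at or above the threshold and returns the connected components (single-linkage grouping via union-find). This is an assumption, not the confirmed design. The sketch uses plain Python sequences so it runs without dependencies; the real module would presumably operate on `np.ndarray` values.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_tasks(embeddings: Sequence[Vector], threshold: float = 0.8) -> list[list[int]]:
    """Group task indices whose embeddings are linked by similarity >= threshold."""
    n = len(embeddings)
    parent = list(range(n))  # union-find forest, one node per task

    def find(i: int) -> int:
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so output order is stable.
                    parent[max(ri, rj)] = min(ri, rj)

    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.05, 1.0]]
    print(cluster_tasks(vectors))
```

Single-linkage grouping is transitive: if A is close to B and B is close to C, all three land in one cluster even when A and C fall below the threshold. For strict deduplication, a complete-linkage rule or a higher threshold may be a better fit.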

Storage

CREATE TABLE task_embeddings (
  task_id TEXT PRIMARY KEY,
  embed_model TEXT,
  embedding BLOB,  -- numpy float32 array, serialized
  content_hash TEXT,
  computed_at TIMESTAMP
);
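
A minimal sketch of writing to and reading from this table, assuming SQLite as the database (the issue does not name the engine). The helpers `save_embedding` and `load_embedding` are hypothetical names. The standard-library `array("f")` type stands in for a numpy float32 array here; on the same machine its bytes should match `ndarray.astype(np.float32).tobytes()`, since both use native 4-byte floats.

```python
import sqlite3
from array import array
from datetime import datetime, timezone
from typing import Optional, Sequence

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_embeddings (
  task_id TEXT PRIMARY KEY,
  embed_model TEXT,
  embedding BLOB,
  content_hash TEXT,
  computed_at TIMESTAMP
);
"""


def save_embedding(
    conn: sqlite3.Connection,
    task_id: str,
    embed_model: str,
    vector: Sequence[float],
    content_hash: str,
) -> None:
    """Insert or overwrite the stored embedding for one task."""
    blob = array("f", vector).tobytes()  # float32, native byte order
    conn.execute(
        "INSERT OR REPLACE INTO task_embeddings "
        "(task_id, embed_model, embedding, content_hash, computed_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (task_id, embed_model, blob, content_hash,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def load_embedding(conn: sqlite3.Connection, task_id: str) -> Optional[list[float]]:
    """Return the stored vector as a list of floats, or None if absent."""
    row = conn.execute(
        "SELECT embedding FROM task_embeddings WHERE task_id = ?", (task_id,)
    ).fetchone()
    if row is None:
        return None
    vector = array("f")
    vector.frombytes(row[0])
    return vector.tolist()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    save_embedding(conn, "t1", "nomic-embed-text", [0.5, -1.0, 0.25], "abc")
    print(load_embedding(conn, "t1"))
```

Because the blob carries no shape or dtype information, readers must already know it holds a flat float32 vector; the `embed_model` column is the only hint about its expected length.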

Invalidation

Same pattern as Taskpile: content_hash is computed over the task's title, description, and due_date. If the hash changes, the embedding is recomputed.
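
The invalidation check could look roughly like this. The `Task` dataclass, the SHA-256 choice, and the JSON encoding of the three fields are all assumptions for illustration; the issue only states which fields feed the hash. Encoding the fields as a JSON list avoids collisions that plain string concatenation would allow (for example, title "ab" + description "c" versus title "a" + description "bc").

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    # Hypothetical stand-in for the project's task model.
    id: str
    title: str
    description: str = ""
    due_date: Optional[str] = None  # ISO date string, or None if unset


def content_hash(task: Task) -> str:
    """Stable hash over the fields that affect the embedding."""
    payload = json.dumps(
        [task.title, task.description, task.due_date], ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_recompute(task: Task, stored_hash: Optional[str]) -> bool:
    """True when no embedding is stored yet or the task content has changed."""
    return stored_hash != content_hash(task)


if __name__ == "__main__":
    task = Task(id="1", title="Buy milk", description="2%", due_date="2026-05-01")
    stored = content_hash(task)
    print(needs_recompute(task, stored))  # unchanged task
    task.title = "Buy oat milk"
    print(needs_recompute(task, stored))  # edited task
```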

Notes

  • Depends on #88 (context assembler will use cluster labels as features)
  • Batch computation in Airflow DAG (#94)
  • Individual recompute inline when task changes (like Taskpile worker pattern)
alvis added this to the M4 — MLOps at scale milestone 2026-04-17 08:16:39 +00:00
alvis closed this issue 2026-05-06 06:54:36 +00:00
Reference: alvis/oO#97