taskpile/docs/mlops/pipeline.md
Alvis 9b77d6ea67 Add MLOps feature store, fix UI layout, add docs and Gitea remote
Backend:
- Replace on-the-fly Ollama calls with versioned feature store (task_features, task_edges)
- Background Tokio worker drains pending rows; write path returns immediately
- MLConfig versioning: changing model IDs triggers automatic backfill via next_stale()
- AppState with FromRef; new GET /api/ml/status observability endpoint
- Idempotent mark_pending (content hash guards), retry failed rows after 30s
- Remove tracked build artifacts (backend/target/, frontend/.next/, node_modules/)

Frontend:
- TaskItem: items-center alignment (fixes checkbox/text offset), break-words for overflow
- TaskDetailPanel: fix invisible AI context (text-gray-700→text-gray-400), show all fields
- TaskDetailPanel: pending placeholder when latent_desc not yet computed, show task ID
- GraphView: surface pending_count as amber pulsing "analyzing N tasks…" hint in legend
- Fix Task.created_at type (number/Unix seconds, not string)
- Auth gate: LoginPage + sessionStorage; fix e2e tests to bypass gate in jsdom
- Fix deleteTask test assertion and '1 remaining'→'1 left' stale text

Docs:
- VitePress docs in docs/ with guide, MLOps pipeline, and API reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 06:16:28 +00:00

# Feature Pipeline
## Worker lifecycle
```
startup
  notify.notify_one()        ← wake immediately to drain any pending rows
loop:
  drain loop:
    next_stale() ──► None → break
    generate_description(title)
      │ error → set_failed(current model IDs), sleep 5s, break
    get_embedding(description)
      │ error → set_failed(current model IDs), sleep 5s, break
    UPDATE task_features SET status='ready', embedding=blob, …
    recompute_for_task(task_id)
      DELETE task_edges WHERE source=id OR target=id
      load all other 'ready' embeddings
      INSERT pairs with cosine_sim ≥ min_similarity
  tokio::select!
    notified()               ← new task created/updated
    sleep(60s)               ← retry failed rows
```
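The recompute step inserts an edge only when the cosine similarity of two embeddings clears `min_similarity`. As a reference point (a sketch, not the repository's actual code), the comparison can be written in plain Rust:

```rust
/// Cosine similarity between two embedding vectors.
/// Sketch only: the real `recompute_for_task` reads its vectors
/// out of the `task_features` BLOB column first.
fn cosine_sim(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Guard against zero vectors rather than dividing by zero.
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```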
## Content hash and cache invalidation
```
content_hash = sha256( prompt_version || "\0" || title )
```
A `task_features` row is considered **stale** when:
- `status = 'pending'` — explicitly queued
- `status = 'failed'` and `updated_at < now − 30s` — retry after backoff
- `status = 'ready'` and `desc_model ≠ config.desc_model` — model changed
- `status = 'ready'` and `embed_model ≠ config.embed_model`
- `status = 'ready'` and `prompt_version ≠ config.prompt_version`
A stale-but-ready row serves its existing data until the worker overwrites it, so the graph never shows a "hole" during recomputation.
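The rules above collapse into a single predicate. A hypothetical sketch (the struct and field names are assumptions mirroring the `task_features` columns and `MLConfig`, not the actual schema):

```rust
// Illustrative shapes only; the real types live in the backend crate.
struct FeatureRow {
    status: String,
    updated_at: i64, // Unix seconds
    desc_model: String,
    embed_model: String,
    prompt_version: String,
}

struct MlConfig {
    desc_model: String,
    embed_model: String,
    prompt_version: String,
}

/// Sketch of the staleness predicate behind `next_stale()`.
fn is_stale(row: &FeatureRow, cfg: &MlConfig, now: i64) -> bool {
    match row.status.as_str() {
        "pending" => true,                              // explicitly queued
        "failed" => now - row.updated_at >= 30,         // retry after backoff
        "ready" => {
            row.desc_model != cfg.desc_model            // model changed
                || row.embed_model != cfg.embed_model
                || row.prompt_version != cfg.prompt_version
        }
        _ => false,
    }
}
```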
## Changing models
Edit `backend/src/ml/config.rs`:
```rust
pub fn default() -> Self {
    Self {
        desc_model: "qwen2.5:7b".to_string(),        // upgraded
        embed_model: "nomic-embed-text".to_string(),
        prompt_version: "v2".to_string(),            // bump when prompt changes
        min_similarity: 0.75,                        // wider edges
        ..
    }
}
```
On the next startup (or `notify_one()`), `next_stale()` returns every row whose stored config no longer matches. The worker re-runs them in oldest-first order.
## Prompt versioning
The prompt template is matched on `prompt_version` in `ml/ollama.rs::render_prompt`. Old versions remain compilable — bumping the version adds a new match arm rather than overwriting the old one, so descriptions produced by `v1` can always be reproduced.
```rust
fn render_prompt(prompt_version: &str, task_title: &str) -> String {
    match prompt_version {
        "v2" => format!("…new prompt…{task_title}"),
        _ => format!("…v1 prompt…{task_title}"), // "v1" and legacy
    }
}
```
## Embedding storage
Embeddings are stored as raw little-endian f32 bytes in a `BLOB` column:
```
[f32 LE] [f32 LE] [f32 LE] … (768 floats for nomic-embed-text = 3072 bytes)
```
`encode_embedding` / `decode_embedding` in `ml/features.rs` handle the conversion. The `embed_dim` column records the dimension so readers don't have to hard-code the model's output size.
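The round trip amounts to reinterpreting the float slice as little-endian bytes and back. A minimal stdlib-only sketch (the real `encode_embedding` / `decode_embedding` in `ml/features.rs` may differ in detail):

```rust
/// Pack f32 values as little-endian bytes for the BLOB column (sketch).
fn encode_embedding(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|f| f.to_le_bytes()).collect()
}

/// Unpack little-endian bytes back into f32 values (sketch).
/// Trailing bytes that don't form a full f32 are ignored.
fn decode_embedding(blob: &[u8]) -> Vec<f32> {
    blob.chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```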